Can I use atomic operations on PCIe host/device shared memory?

Some PCIe devices (for example, an FPGA card) can expose segments of their physical memory through BARs, and the host can access that memory region through the corresponding device files (on Linux, we can memory-map the device into virtual memory). I suppose the device itself could also access this part of memory through a /dev/mem mapping if it runs Linux too.
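For reference, the host-side mapping I have in mind looks roughly like the sketch below. It assumes the BAR is exposed as a Linux sysfs resource file (something like /sys/bus/pci/devices/<BDF>/resource0); the helper name map_shared is hypothetical and the error handling is minimal.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map `len` bytes of `path` read/write and shared with other mappers.
 * For a PCIe BAR, `path` would be a sysfs resource file (assumption);
 * any regular file of at least `len` bytes works the same way.
 * Returns NULL on failure. */
static volatile void *map_shared(const char *path, size_t len)
{
    assert(path != NULL && len > 0);

    int fd = open(path, O_RDWR | O_SYNC);
    if (fd < 0)
        return NULL;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping remains valid after the descriptor is closed */

    return p == MAP_FAILED ? NULL : p;
}
```

The pointer is volatile so the compiler does not cache or reorder away accesses to the mapped region; release it with munmap when done.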

One thing a program can do on (virtual) memory is atomic operations such as "__atomic_sub_fetch", which can be very useful when writing high-performance code.
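To make the kind of operation concrete, here is a minimal sketch on ordinary RAM. __atomic_sub_fetch is the GCC/Clang builtin; the function name release_ref and the reference-counting use case are only illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Drop one reference from a shared counter and return the new count.
 * The subtraction and the read-back happen as one indivisible step
 * with respect to other CPU threads using the same counter. */
static uint32_t release_ref(uint32_t *counter)
{
    assert(counter != NULL);
    return __atomic_sub_fetch(counter, 1, __ATOMIC_ACQ_REL);
}
```

A caller would typically free the shared object when release_ref returns 0.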

My question is: what if the memory is the PCIe shared memory described above (mapped into the user's virtual address space)? Do the atomic operations still hold? I do not know whether PCIe can guarantee atomicity, considering that the atomic operations could come from both the host's and the device's CPUs at the same time. If yes, how does the performance compare to the same atomic operation on regular memory?

I have seen a related question asked here, but it has no direct answer: PCI Express BAR memory mapping basic understanding

Thanks a lot!


Solution 1:

OP Question 1: My question is what if the memory comes from the above PCIe shared memory (and mapped to user's virtual memory space)? Does the atomic operation still hold?

  • Yes. Both the FPGA and the host CPU software can request a lock for exclusive access to a memory region in order to perform atomic operations. For example, OpenCL shared virtual memory (SVM) introduces fine-grained host-device synchronization, which allows the host and the device to access shared data structures concurrently and to synchronize at the granularity of atomic load/store instructions. This enables true concurrency between software threads and FPGA kernels in the presence of shared data structures.

  • Having said that, synchronizing concurrent memory accesses through atomic load/store operations requires a mechanism that guards an access to shared data by the CPU or by an FPGA hardware kernel/accelerator against an interfering access to the same location from the other side until that access has completed (atomicity of the access).

  • Furthermore, the answer on SO here says that PCIe 3.0 does support certain "Locked Transactions".

  • Furthermore, since your question mentions FPGAs, let's take a concrete example. You can read about atomic operations in the documentation for the 7 Series FPGAs Integrated Block for PCI Express v3.3. It states that the 7 Series FPGAs Integrated Block for PCI Express supports both sending and receiving atomic operations (Atomic Ops) as defined in the PCI Express Base Specification v2.1. The specification defines three TLP types that allow advanced synchronization mechanisms among multiple producers and/or consumers. The integrated block treats Atomic Ops TLPs as Non-Posted Memory Transactions. The three TLP types are:

    • FetchAdd
    • Swap
    • CAS (Compare And Set)
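As a rough illustration of what these three request types compute, the sketch below models their semantics on a 32-bit word in ordinary memory using GCC/Clang __atomic builtins. The model_* names are hypothetical. This is only a CPU-side model of the read-modify-write behaviour (each request returns the original value of the target); it does not generate TLPs, and it should not be read as showing that a CPU atomic instruction on a BAR mapping is turned into one of these requests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* FetchAdd: add `addend` to the target and return the original value. */
static uint32_t model_fetch_add(uint32_t *target, uint32_t addend)
{
    assert(target != NULL);
    return __atomic_fetch_add(target, addend, __ATOMIC_SEQ_CST);
}

/* Swap: store `value` unconditionally and return the original value. */
static uint32_t model_swap(uint32_t *target, uint32_t value)
{
    assert(target != NULL);
    return __atomic_exchange_n(target, value, __ATOMIC_SEQ_CST);
}

/* CAS: store `desired` only if the target equals `expected`;
 * return the original value in either case. */
static uint32_t model_cas(uint32_t *target, uint32_t expected, uint32_t desired)
{
    assert(target != NULL);
    /* On failure the builtin writes the current value into `expected`;
     * on success `expected` already equals the original value. */
    __atomic_compare_exchange_n(target, &expected, desired, false,
                                __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    return expected;
}
```

A caller can tell whether model_cas succeeded by comparing the returned value with the `expected` value it passed in.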

OP Question 2: If yes, how is its perf compare to the same atomic operation on the regular memory?

This depends. One significant factor is the size of the data. For example, in some applications the same atomic operation performs better on the regular memory system when the array size is small, whereas it can perform better with SVM for larger array sizes. At times, merely matching the runtime performance of regular memory with SVM can be considered a performance gain, since SVM itself has overheads.