@@ -33,11 +33,12 @@
performance computing applications. To connect a fast data acquisition stage
with a GPU's processing power, we developed an architecture consisting of an
FPGA that includes a Direct Memory Access (DMA) engine compatible with the
- Xilinx PCI-Express core, a Linux driver for register access and high-level
+ Xilinx PCI-Express core, a Linux driver for register access, and high-level
software to manage direct memory transfers using AMD's DirectGMA technology.
Measurements with a Gen3 x8 link show a throughput of up to 6.4 GB/s. Our
implementation is suitable for real-time DAQ system applications ranging
- photon science and medical imaging to HEP experiment triggers.
+ from photon science and medical imaging to High Energy Physics (HEP)
+ trigger systems.
}
@@ -70,12 +71,12 @@ performance. In particular, latency becomes the most stringent specification if
time-deterministic feedback is required, \emph{e.g.} Low/High-level Triggers.
To address these problems we propose a complete hardware/software stack
-architecture based on our own Direct Memory Access (DMA) design and integration
+architecture based on our own DMA design and the integration
of AMD's DirectGMA technology into our processing pipeline. In our solution,
PCI-Express (PCIe) has been chosen as a data link between FPGA boards and the
host computer. Due to its high bandwidth and modularity, PCIe quickly became the
commercial standard for connecting high-throughput peripherals such as GPUs or
-solid state disks. Optical PCIe networks have been demonstrated
+solid state disks. Moreover, optical PCIe networks have been demonstrated
a decade ago~\cite{optical_pcie}, opening the possibility of using PCIe
as a communication bus over long distances. In particular, in HEP DAQ systems,
optical links are preferred over electrical ones because of their superior
@@ -274,7 +275,7 @@ conducted the following protocol: 1) the host issues continuous data transfers
of a 4 KB buffer, initialized with a fixed value, to the FPGA using the
\texttt{cl\-Enqueue\-Copy\-Buffer} call. 2) When the FPGA receives data in its
input FIFO, it moves the data directly to the output FIFO, which feeds the outgoing DMA
-engine thus pushing back the data back to the GPU. 3) At some point, the host
+engine, thus pushing the data back to the GPU. 3) At some point, the host
enables generation of data different from the initial value, which also starts an
internal FPGA counter with 4 ns resolution. 4) When the generated data is
received again at the FPGA, the counter is stopped. 5) The host program reads
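As an illustration of steps 1) to 5), a minimal host-side sketch is given below. It is indicative only: \texttt{fpga\_buf} is assumed to be a \texttt{cl\_mem} object exposed to the FPGA through the vendor's DirectGMA setup (not shown), and \texttt{read\_latency\_counter()} and \texttt{start\_fpga\_data\_generation()} are hypothetical wrappers around the driver's register interface; only \texttt{clCreateBuffer}, \texttt{clEnqueueCopyBuffer}, \texttt{clFinish} and \texttt{clReleaseMemObject} are standard OpenCL calls.

\begin{verbatim}
/* Sketch of the latency measurement loop (steps 1-5); see assumptions above. */
#include <CL/cl.h>
#include <stdint.h>
#include <string.h>

#define BUF_SIZE       4096   /* 4 KB transfer buffer            */
#define FIXED_PATTERN  0xA5   /* fixed fill value used in step 1 */

extern cl_mem fpga_buf;                        /* DirectGMA buffer (assumed)          */
extern uint64_t read_latency_counter(void);    /* hypothetical driver register read   */
extern void start_fpga_data_generation(void);  /* hypothetical driver register write  */

static double measure_latency_ns(cl_context ctx, cl_command_queue queue)
{
    char pattern[BUF_SIZE];
    memset(pattern, FIXED_PATTERN, sizeof pattern);

    cl_int err;
    cl_mem host_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                     BUF_SIZE, pattern, &err);
    if (err != CL_SUCCESS)
        return -1.0;

    /* 1)-2) keep the FPGA loopback path busy with fixed-value buffers */
    for (int i = 0; i < 1000; ++i)
        clEnqueueCopyBuffer(queue, host_buf, fpga_buf, 0, 0, BUF_SIZE, 0, NULL, NULL);
    clFinish(queue);

    /* 3) enable data different from the fixed value; the FPGA starts its counter */
    start_fpga_data_generation();

    /* 4)-5) the counter stops when the new data arrives back at the FPGA;
       the host reads it and converts ticks (4 ns resolution) to nanoseconds */
    uint64_t ticks = read_latency_counter();

    clReleaseMemObject(host_buf);
    return ticks * 4.0;
}
\end{verbatim}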
@@ -293,11 +294,11 @@ We developed a complete hardware and software solution that enables DMA
transfers between FPGA-based readout boards and GPU computing clusters with
reasonable performance characteristics. The net throughput is primarily limited
by the PCIe bus, reaching 6.4 GB/s for a 256 B payload and surpassing our
-CPU-based data transfer. Moreover, by writing directly into GPU memory instead
+CPU-based data transfer. Furthermore, by writing directly into GPU memory instead
of routing data through system main memory, the overall latency can be reduced
by a factor of two, allowing closely coupled, massively parallel computation on GPUs.
Moreover, the software solution that we proposed allows seamless multi-GPU
-processing of the incoming data due to the integration in our streamed computing
+processing of the incoming data, due to its integration into our streamed computing
framework. This allows straightforward integration with different DAQ systems
and introduction of custom data processing algorithms.
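As a rough cross-check that the PCIe bus is indeed the limiting factor, assume a Gen3 x8 link (8 GT/s per lane with 128b/130b encoding) and roughly 24 bytes of packet overhead per 256 B payload; both figures are nominal values, not taken from our measurements:
\[
8 \times 8~\mathrm{GT/s} \times \tfrac{128}{130} \approx 63~\mathrm{Gbit/s} \approx 7.9~\mathrm{GB/s},
\qquad
7.9~\mathrm{GB/s} \times \tfrac{256}{256 + 24} \approx 7.2~\mathrm{GB/s}.
\]
The measured 6.4 GB/s thus corresponds to roughly 90\% of this estimated protocol limit.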
@@ -316,7 +317,7 @@ a single x16 device by using an external PCIe switch. With two cores operating
in parallel, we foresee an increase in the data throughput by a factor of 2 (as
demonstrated in~\cite{rota2015dma}). Further improvements are expected by
generalizing the transfer mechanism and including InfiniBand support besides the
-existing PCIe connect. This allows speeds of up to 290 Gbit/s and latencies as
+existing PCIe connection. This allows speeds of up to 290 Gbit/s and latencies as
low as 0.5 \textmu s.
Our goal is to develop a unique hybrid solution, based on commercial standards,
|