|
@@ -19,22 +19,22 @@
|
|
|
}
|
|
|
|
|
|
\abstract{%
|
|
|
- %% Old
|
|
|
- \emph{A growing number of physics experiments requires DAQ systems with multi-GB/s
|
|
|
- data links.}
|
|
|
+ %% Old
|
|
|
+ % \emph{A growing number of physics experiments requires DAQ systems with multi-GB/s
|
|
|
+ % data links.}
|
|
|
%proposal for new abstract, including why do we need GPUs
|
|
|
- Modern physics experiments have reached multi-GB/s data rates.
|
|
|
- Fast data links and high performance computing stages are required
|
|
|
- to enable continuous acquisition. Because of their intrinsic parallelism
|
|
|
- and high computational power, GPUs emerged as the ideal computing solution.
|
|
|
- We developed an architecture consisting of a hardware/software
|
|
|
- stack which includes a Direct Memory Access (DMA) engine compatible with the Xilinx
|
|
|
- PCI-Express core and an accompanying Linux driver. Measurements with a
|
|
|
- Gen3 x8 link show a throughput of up to 6.7 GB/s. We also enable direct
|
|
|
- communication between FPGA-based DAQ electronics and second-level data
|
|
|
- processing GPUs via AMD DirectGMA technology. Our implementation finds its
|
|
|
- application in real-time DAQ systems for photon science and medical imaging
|
|
|
- and in triggers for HEP experiments.
|
|
|
+ Modern physics experiments have reached multi-GB/s data rates. Fast data
|
|
|
+ links and high performance computing stages are required for continuous
|
|
|
+ acquisition and processing. Because of their intrinsic parallelism and
|
|
|
+ computational power, GPUs have emerged as an ideal solution for high
|
|
|
+ performance computing applications. To connect a fast data acquisition stage
|
|
|
+ with a GPU's processing power, we developed an architecture consisting of an
|
|
|
+ FPGA that includes a Direct Memory Access (DMA) engine compatible with the
|
|
|
+ Xilinx PCI-Express core, a Linux driver for register access, and high-level
|
|
|
+ software to manage direct memory transfers using AMD's DirectGMA technology.
|
|
|
+ Measurements with a Gen3 x8 link show a throughput of up to 6.7 GB/s. Our
|
|
|
+ implementation is suitable for real-time DAQ applications ranging from
|
|
|
+ photon science and medical imaging to HEP experiment triggers.
|
|
|
}
|
|
|
|
|
|
|
|
@@ -48,7 +48,7 @@
|
|
|
|
|
|
\section{Motivation}
|
|
|
|
|
|
-GPU computing has become the main driving force for high-performance computing
|
|
|
+GPU computing has become the main driving force for high performance computing
|
|
|
due to an unprecedented parallelism and a low cost-benefit factor. GPU
|
|
|
acceleration has found its way into numerous applications, ranging from
|
|
|
simulation to image processing. Recent years have also seen an increasing
|
|
@@ -59,7 +59,7 @@ PANDA~\cite{panda_gpu}). Moreover, the volumes of data produced in recent photon
|
|
|
science facilities have become comparable to those traditionally associated with
|
|
|
HEP.
|
|
|
|
|
|
-In such experiments data is acquired by one or more read-out boards and then
|
|
|
+In HEP experiments data is acquired by one or more read-out boards and then
|
|
|
transmitted to GPUs in short bursts or in a continuous streaming mode. With
|
|
|
expected data rates of several GB/s, the data transmission link between the
|
|
|
read-out boards and the host system may partially limit the overall system
|
|
@@ -67,31 +67,33 @@ performance. In particular, latency becomes the most stringent specification if
|
|
|
a time-deterministic feedback is required, \emph{e.g.} Low/High-level Triggers.
|
|
|
|
|
|
To address these problems we propose a complete hardware/software stack
|
|
|
-architecture based on our own DMA design and integration of AMD's DirectGMA
|
|
|
-technology into our processing pipeline. In our solution, PCIe has been chosen
|
|
|
-as data link between FPGA boards and external computing. Due to its high
|
|
|
-bandwidth and modularity, it quickly became the commercial standard for
|
|
|
-connecting high-performance peripherals such as GPUs or solid state disks.
|
|
|
-Optical PCIe networks have been demonstrated since nearly a
|
|
|
-decade~\cite{optical_pcie}, opening the possibility of using PCIe as a
|
|
|
-communication bus over long distances. In particular, in HEP DAQ systems,
|
|
|
+architecture based on our own Direct Memory Access (DMA) design and integration
|
|
|
+of AMD's DirectGMA technology into our processing pipeline. In our solution,
|
|
|
+PCI-Express (PCIe) has been chosen as the data link between FPGA boards and the
|
|
|
+host computer. Due to its high bandwidth and modularity, PCIe quickly became the
|
|
|
+commercial standard for connecting high-throughput peripherals such as GPUs or
|
|
|
+solid state disks. Optical PCIe networks have been demonstrated
|
|
|
+% JESUS: time span -> for, point in time -> since ...
|
|
|
+for nearly a decade~\cite{optical_pcie}, opening the possibility of using PCIe
|
|
|
+as a communication bus over long distances. In particular, in HEP DAQ systems,
|
|
|
optical links are preferred over electrical ones because of their superior
|
|
|
radiation hardness, lower power consumption and higher density.
|
|
|
|
|
|
%% Added some more here, I need better internet to find the correct references
|
|
|
Lonardo et~al.\ lifted this limitation with their NaNet design, an FPGA-based
|
|
|
-PCIe network interface card with NVIDIA's GPUdirect integration~\cite{lonardo2015nanet}.
|
|
|
-Due to its design, the bandwidth saturates at 120 MB/s for a 1472 byte large UDP
|
|
|
-datagram. Moreover, the system is based on a commercial PCIe engine.
|
|
|
-Other solutions achieve higher throughput based on Xilinx (CITE TWEPP DMA WURTT??)
|
|
|
-or Altera devices (CITENICHOLASPAPER TNS), but they do not provide support for direct
|
|
|
-FPGA-GPU communication.
|
|
|
+PCIe network interface card with NVIDIA's GPUDirect
|
|
|
+integration~\cite{lonardo2015nanet}. Due to its design, the bandwidth saturates
|
|
|
+at 120 MB/s for a 1472 byte UDP datagram. Moreover, the system is based on
|
|
|
+a commercial PCIe engine. Other solutions achieve higher throughput based on
|
|
|
+Xilinx (CITE TWEPP DMA WURTT??) or Altera devices (CITENICHOLASPAPER TNS), but
|
|
|
+they do not provide support for direct FPGA-GPU communication.
|
|
|
+
|
|
|
|
|
|
\section{Architecture}
|
|
|
|
|
|
-Direct Memory Access (DMA) data transfers are handled by dedicated hardware,
|
|
|
-and, when compared with Programmed Input Output (PIO) access, they offer
|
|
|
-lower latency and higher throughput at the cost of a higher system's complexity.
|
|
|
+DMA data transfers are handled by dedicated hardware and, compared with
|
|
|
+Programmed Input Output (PIO) access, offer lower latency and higher throughput
|
|
|
+at the cost of higher system complexity.
|
|
|
|
|
|
|
|
|
\begin{figure}[t]
|
|
@@ -106,49 +108,59 @@ lower latency and higher throughput at the cost of a higher system's complexity.
|
|
|
\label{fig:trad-vs-dgpu}
|
|
|
\end{figure}
|
|
|
|
|
|
-As shown in \figref{fig:trad-vs-dgpu}, traditional FPGA-GPU architectures
|
|
|
-route data through system main memory. The main memory is involved in a certain
|
|
|
-number of read/write operations, depending on the specific implementation.
|
|
|
-The total throughput of the system is therefore limited by the main memory
|
|
|
-bandwidth. NVIDIA's GPUDirect and AMD's DirectGMA technlogies allow direct
|
|
|
-communication between different devices over the PCIe bus.
|
|
|
-By combining this technology with a DMA data
|
|
|
-transfer, the overall latency of the system is reduced and total throughput
|
|
|
-increased. Moreover, the CPU and main system memory are relieved because they
|
|
|
-are not directly involved in the data transfer anymore.
|
|
|
+As shown in \figref{fig:trad-vs-dgpu} (a), traditional FPGA-GPU systems route
|
|
|
+data through system main memory by copying it from the FPGA into intermediate
|
|
|
+buffers and finally into the GPU memory. Thus, the total throughput
|
|
|
+of the system is limited by the main memory bandwidth. NVIDIA's GPUDirect and
|
|
|
+AMD's DirectGMA technologies allow direct communication between GPUs and
|
|
|
+auxiliary devices over the PCIe bus. By combining this technology with a DMA
|
|
|
+data transfer (see \figref{fig:trad-vs-dgpu} (b)), the overall latency of the
|
|
|
+system is reduced and total throughput increased. Moreover, the CPU and main
|
|
|
+system memory are relieved from processing because they are not directly
|
|
|
+involved in the data transfer anymore.
|
|
|
+
|
|
|
|
|
|
-\subsection{Implementation of the DMA engine on FPGA}
|
|
|
+\subsection{DMA engine implementation on the FPGA}
|
|
|
|
|
|
We have developed a DMA architecture that minimizes resource utilization while
|
|
|
maintaining the flexibility of a Scatter-Gather memory
|
|
|
policy~\cite{rota2015dma}. The engine is compatible with the Xilinx PCIe Gen2/3
|
|
|
-IP-Core~\cite{xilinxgen3} for Xilinx FPGA families 6 and 7. DMA transmissions to main
|
|
|
-system memory and GPU memory are both supported. Two FIFOs, with a data width of 256
|
|
|
-bits and operating at 250 MHz, act as user-friendly interfaces with the custom logic.
|
|
|
-The resulting input bandwidth is 7.8 GB/s, enough to saturate a PCIe Gen3 x8
|
|
|
-link\footnote{The theoretical net bandwidth of a PCIe 3.0 x8 link with a payload
|
|
|
-of 1024 B is 7.6 GB/s}. The user logic and the DMA engine are configured by the
|
|
|
-host through PIO registers.
|
|
|
+IP-Core~\cite{xilinxgen3} for Xilinx 6 and 7 series FPGAs. DMA transmissions to
|
|
|
+main system memory and GPU memory are both supported. Two FIFOs, with a data
|
|
|
+width of 256 bits and operating at 250 MHz, act as user-friendly interfaces with
|
|
|
+the custom logic. The resulting input bandwidth of 7.8 GB/s is enough to saturate
|
|
|
+a PCIe Gen3 x8 link\footnote{The theoretical net bandwidth of a PCIe 3.0 x8 link
|
|
|
+with a payload of 1024 B is 7.6 GB/s}. The user logic and the DMA engine are
|
|
|
+configured by the host through PIO registers.
|
|
|
|
|
|
The physical addresses of the host's memory buffers are stored into an internal
|
|
|
-memory and are dynamically updated by the driver, allowing highly efficient
|
|
|
-zero-copy data transfers. The maximum size associated with each address is 2 GB.
|
|
|
+memory and are dynamically updated by the driver or user, allowing highly
|
|
|
+efficient zero-copy data transfers. The maximum size associated with each
|
|
|
+address is 2 GB.
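+
+The following minimal sketch illustrates such a configuration from the host
+side, assuming a hypothetical device node and register layout; the actual
+register map is defined by the FPGA design and our Linux driver.
+
+\begin{verbatim}
+/* Sketch: host-side DMA configuration through PIO registers.
+ * Device node and register offsets are hypothetical placeholders. */
+#include <fcntl.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <sys/mman.h>
+#include <unistd.h>
+
+#define REG_DST_ADDR_LO 0x00  /* lower 32 bit of destination address */
+#define REG_DST_ADDR_HI 0x04  /* upper 32 bit of destination address */
+#define REG_XFER_LENGTH 0x08  /* transfer size in bytes (up to 2 GB) */
+#define REG_DMA_START   0x0c  /* writing 1 starts the transfer       */
+
+int main(void)
+{
+    /* BAR with the control registers, exposed by the driver as a file */
+    int fd = open("/dev/fpga0_bar0", O_RDWR | O_SYNC);
+    if (fd < 0) { perror("open"); return 1; }
+
+    volatile uint32_t *regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
+                                   MAP_SHARED, fd, 0);
+    if (regs == MAP_FAILED) { perror("mmap"); return 1; }
+
+    /* Placeholder bus address of a pinned destination buffer */
+    uint64_t bus_addr = 0x100000000ULL;
+
+    regs[REG_DST_ADDR_LO / 4] = (uint32_t) bus_addr;
+    regs[REG_DST_ADDR_HI / 4] = (uint32_t) (bus_addr >> 32);
+    regs[REG_XFER_LENGTH / 4] = 8 << 20;  /* 8 MB transfer */
+    regs[REG_DMA_START / 4]   = 1;        /* trigger the DMA engine */
+
+    munmap((void *) regs, 4096);
+    close(fd);
+    return 0;
+}
+\end{verbatim}
+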
|
|
|
|
|
|
|
|
|
\subsection{OpenCL management on host side}
|
|
|
\label{sec:host}
|
|
|
|
|
|
On the host side, AMD's DirectGMA technology, an implementation of the
|
|
|
-bus-addressable memory extension for OpenCL 1.1 and later, is used to write to
|
|
|
-GPU memory from the FPGA and FPGA control registers from the GPU.
|
|
|
+bus-addressable memory extension for OpenCL 1.1 and later, is used to write from
|
|
|
+the FPGA to GPU memory and from the GPU to the FPGA's control registers.
|
|
|
\figref{fig:opencl-setup} illustrates the main mode of operation: To write into
|
|
|
the GPU, the physical bus address of the GPU buffer is determined with a call to
|
|
|
-\texttt{clEnqueue\-Make\-Buffers\-Resident\-AMD} and set in a control register
|
|
|
-of the FPGA (1). The FPGA then writes data blocks autonomously in DMA fashion
|
|
|
-(2). Due to hardware restrictions the largest possible GPU buffer sizes are
|
|
|
-about 95 MB but larger transfers can be achieved using a double buffering
|
|
|
-mechanism. To signal events to the FPGA (4), the control registers are mapped
|
|
|
-into the GPU's address space and seen as regular memory (3).
|
|
|
+\texttt{clEnqueue\-Make\-Buffers\-Resident\-AMD} and set by the host CPU in a
|
|
|
+control register of the FPGA (1). The FPGA then writes data blocks autonomously
|
|
|
+in DMA fashion (2). Due to hardware restrictions, the largest possible GPU buffer
|
|
|
+size is about 95 MB, but larger transfers can be achieved using a double
|
|
|
+buffering mechanism. Because the GPU provides a flat memory address space and
|
|
|
+our DMA engine allows multiple destination addresses to be set in advance, we
|
|
|
+can determine all addresses before the actual transfers, thus keeping the
|
|
|
+CPU out of the transfer loop.
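+
+A minimal sketch of this setup is given below, assuming the entry points of the
+\texttt{cl\_amd\_bus\_addressable\_memory} extension; the helper
+\texttt{fpga\_write\_register()} stands for a PIO write as in the previous
+listing and the register offset is again a placeholder.
+
+\begin{verbatim}
+/* Sketch: allocate a bus-addressable GPU buffer with DirectGMA, pin it
+ * and publish its physical bus address to the FPGA (steps 1 and 2).
+ * Error handling is omitted; names follow the
+ * cl_amd_bus_addressable_memory extension. */
+#include <CL/cl.h>
+#include <CL/cl_ext.h>
+
+void fpga_write_register(unsigned offset, cl_ulong value); /* placeholder */
+
+cl_mem setup_gpu_buffer(cl_context ctx, cl_command_queue queue, size_t size)
+{
+    cl_int err;
+
+    /* GPU buffer that can be addressed directly over the PCIe bus */
+    cl_mem buf = clCreateBuffer(ctx, CL_MEM_BUS_ADDRESSABLE_AMD,
+                                size, NULL, &err);
+
+    /* Pin the buffer and query its physical bus address; depending on
+     * the SDK the entry point may have to be resolved with
+     * clGetExtensionFunctionAddressForPlatform() */
+    cl_bus_address_amd addr;
+    clEnqueueMakeBuffersResidentAMD(queue, 1, &buf, CL_TRUE, &addr,
+                                    0, NULL, NULL);
+
+    /* Hand the destination address to the DMA engine; offset 0x00 is a
+     * placeholder, cf. the PIO sketch above */
+    fpga_write_register(0x00, addr.surface_bus_address);
+
+    return buf;
+}
+\end{verbatim}
+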
|
|
|
+
|
|
|
+To signal events to the FPGA (4), the control registers can be mapped into the
|
|
|
+GPU's address space by passing a special AMD-specific flag and the physical
|
|
|
+BAR address of the FPGA configuration memory to the \texttt{cl\-Create\-Buffer}
|
|
|
+function. From the GPU, this memory is seen transparently as regular GPU
|
|
|
+memory and can be written accordingly (3).
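+
+A minimal sketch of this mapping is shown below, assuming the
+\texttt{CL\_MEM\_EXTERNAL\_PHYSICAL\_AMD} flag of the same extension; the BAR
+address would be queried from the driver and appears here only as a
+placeholder.
+
+\begin{verbatim}
+/* Sketch: expose the FPGA control registers (BAR) as a GPU buffer so
+ * that kernels can write to them directly (steps 3 and 4). The BAR
+ * address is a placeholder. */
+#include <CL/cl.h>
+#include <CL/cl_ext.h>
+
+cl_mem map_fpga_registers(cl_context ctx, cl_command_queue queue,
+                          size_t size)
+{
+    cl_int err;
+
+    cl_bus_address_amd bar;
+    bar.surface_bus_address = 0xf7e00000ULL;  /* placeholder BAR address */
+    bar.marker_bus_address  = 0xf7e00000ULL;
+
+    /* The external-physical flag marks the buffer as backed by memory
+     * outside the GPU, located at the given bus address */
+    cl_mem regs = clCreateBuffer(ctx, CL_MEM_EXTERNAL_PHYSICAL_AMD,
+                                 size, &bar, &err);
+
+    /* Map the buffer on this queue; afterwards it behaves like regular
+     * global memory from the GPU's point of view */
+    clEnqueueMigrateMemObjects(queue, 1, &regs, 0, 0, NULL, NULL);
+
+    return regs;
+}
+\end{verbatim}
+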
|
|
|
|
|
|
\begin{figure}
|
|
|
\centering
|
|
@@ -163,16 +175,17 @@ To process the data, we encapsulated the DMA setup and memory mapping in a
|
|
|
plugin for our scalable GPU processing framework~\cite{vogelgesang2012ufo}. This
|
|
|
framework allows for an easy construction of streamed data processing on
|
|
|
heterogeneous multi-GPU systems. For example, to read data from the FPGA, decode
|
|
|
-it and run a Fourier transform on the GPU and write back the results, one can
|
|
|
-run \verb|ufo-launch direct-gma ! decode ! fft ! write filename=out.raw| on the
|
|
|
-command line. The accompanying framework application programming interface
|
|
|
-allows for straightforward integration into custom applications written in C or
|
|
|
-high-level languages such as Python. High throughput is achieved by the
|
|
|
-combination of fine- and coarse-grained data parallelism, \emph{i.e.} processing
|
|
|
-a single data item on a GPU using thousands of threads and by splitting the data
|
|
|
-stream and feeding individual data items to separate GPUs. None of this requires
|
|
|
-any user intervention and is solely determined by the framework in an
|
|
|
-automatized fashion.
|
|
|
+its specific format, run a Fourier transform on the GPU and write back the
|
|
|
+results to disk, one can run \texttt{ufo-launch direct-gma ! decode !
|
|
|
+fft ! write filename=out.raw} on the command line. The framework will take care
|
|
|
+of scheduling the tasks and distributing the data items accordingly. A
|
|
|
+complementary application programming interface allows users to develop custom
|
|
|
+applications written in C or high-level languages such as Python. High
|
|
|
+throughput is achieved by the combination of fine- and coarse-grained data
|
|
|
+parallelism, \emph{i.e.} processing a single data item on a GPU using thousands
|
|
|
+of threads and by splitting the data stream and feeding individual data items to
|
|
|
+separate GPUs. None of this requires any user intervention and is solely
|
|
|
+determined by the framework in an automated fashion.
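+
+For custom applications the same pipeline can be assembled programmatically. The
+following minimal sketch mirrors the \texttt{ufo-launch} example using the
+GObject-based C API; the task names are taken from the command line above,
+while the function names are assumed from the framework documentation and may
+differ between versions.
+
+\begin{verbatim}
+/* Sketch: build the direct-gma ! decode ! fft ! write pipeline with
+ * the UFO C API. Names may differ slightly between framework
+ * versions; error handling is kept minimal. */
+#include <ufo/ufo.h>
+
+int main(void)
+{
+    GError *error = NULL;
+
+    UfoPluginManager *manager = ufo_plugin_manager_new();
+    UfoTaskGraph *graph = UFO_TASK_GRAPH(ufo_task_graph_new());
+    UfoBaseScheduler *sched = ufo_scheduler_new();
+
+    /* Instantiate the tasks used in the command-line example */
+    UfoTaskNode *dma = ufo_plugin_manager_get_task(manager, "direct-gma", &error);
+    UfoTaskNode *dec = ufo_plugin_manager_get_task(manager, "decode", &error);
+    UfoTaskNode *fft = ufo_plugin_manager_get_task(manager, "fft", &error);
+    UfoTaskNode *wrt = ufo_plugin_manager_get_task(manager, "write", &error);
+
+    g_object_set(G_OBJECT(wrt), "filename", "out.raw", NULL);
+
+    /* Chain the tasks into a linear pipeline; scheduling and GPU
+     * assignment are handled by the framework */
+    ufo_task_graph_connect_nodes(graph, dma, dec);
+    ufo_task_graph_connect_nodes(graph, dec, fft);
+    ufo_task_graph_connect_nodes(graph, fft, wrt);
+    ufo_base_scheduler_run(sched, graph, &error);
+
+    g_object_unref(manager);
+    return error == NULL ? 0 : 1;
+}
+\end{verbatim}
+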
|
|
|
|
|
|
|
|
|
\section{Results}
|
|
@@ -226,7 +239,7 @@ compared to the previous setup, making it a more cost-effective solution.
|
|
|
\includegraphics[width=0.6\textwidth]{figures/through_plot}
|
|
|
\caption{
|
|
|
Writing from the FPGA to either system or GPU memory is primarily limited by
|
|
|
- the PCIe bus. Higher payloads introduce less overhead, thus increasing the net bandwidth.
|
|
|
+ the PCIe bus. Higher payloads introduce less overhead, thus increasing the net bandwidth.
|
|
|
Up until 2 MB transfer size, the performance is almost the
|
|
|
same; after that the GPU transfer shows a slightly better slope. Data
|
|
|
transfers larger than 1 GB saturate the PCIe bus.
|
|
@@ -244,7 +257,7 @@ compared to the previous setup, making it a more cost-effective solution.
|
|
|
\includegraphics[width=0.6\textwidth]{figures/latency}
|
|
|
\caption{%
|
|
|
The data transmission latency is decreased by a factor of 2 with respect to the traditional
|
|
|
- approach (a) by using DirectGMA (b). The latency has been measured by taking the round-trip time
|
|
|
+ approach (a) by using DirectGMA (b). The latency has been measured by taking the round-trip time
|
|
|
for a 4k packet.
|
|
|
}
|
|
|
\label{fig:latency}
|
|
@@ -258,7 +271,7 @@ between FPGA-based readout boards and GPU computing clusters. The net throughput
|
|
|
limited by the PCIe bus, reaching 6.7 GB/s for a 256 B payload. By writing directly into GPU
|
|
|
memory instead of routing data through system main memory, latency is reduced by a factor of 2.
|
|
|
The proposed solution allows high performance GPU computing thanks to the support of the
|
|
|
-framework. Integration with different DAQ systems and custom algorithms is therefore immediate.
|
|
|
+framework. Integration with different DAQ systems and custom algorithms is therefore immediate.
|
|
|
|
|
|
|
|
|
\subsection{Outlook}
|
|
@@ -270,18 +283,18 @@ A custom FPGA evaluation board is currently under development in order to
|
|
|
increase the total throughput. The board mounts a Virtex-7 chip and features 2
|
|
|
fully populated FMC connectors, a 119 Gb/s DDR memory interface and a PCIe Gen3
|
|
|
x16 connection. Two PCIe x8 cores, instantiated on the board, will be mapped as a
|
|
|
-single x16 device by using an external PCIe switch. With two cores operating in parallel,
|
|
|
-we foresee an increase in the data throughput by a factor of 2 (as demonstrated in~\cite{rota2015dma}).
|
|
|
+single x16 device by using an external PCIe switch. With two cores operating in parallel,
|
|
|
+we foresee an increase in the data throughput by a factor of 2 (as demonstrated in~\cite{rota2015dma}).
|
|
|
|
|
|
\textbf{LR: Instead of swapping PCIe-infinib, I would say include it in the architecture.
|
|
|
A big house for all these love-lacking protocols.}
|
|
|
|
|
|
-It is our intention to add Infiniband support. I NEED TO READ
|
|
|
+It is our intention to add InfiniBand support. I NEED TO READ
|
|
|
WHAT ARE THE ADVANTAGES VS PCIe.
|
|
|
|
|
|
\textbf{LR:Here comes the visionary Luigi...}
|
|
|
Our goal is to develop a unique hybrid solution, based
|
|
|
-on commercial standards, that includes fast data transmission protocols and a high performance
|
|
|
+on commercial standards, that includes fast data transmission protocols and a high performance
|
|
|
GPU computing framework.
|
|
|
|
|
|
\acknowledgments
|