\documentclass{JINST}
\usepackage[utf8]{inputenc}
\usepackage{lineno}
\usepackage{ifthen}
\newboolean{draft}
\setboolean{draft}{true}
\newcommand{\figref}[1]{Figure~\ref{#1}}
\title{A high-throughput readout architecture based on PCI-Express Gen3 and DirectGMA technology}
\author{N.~Zilio$^b$,
M.~Weber$^a$\\
\llap{$^a$}Institute for Data Processing and Electronics,\\
Karlsruhe Institute of Technology (KIT),\\
Hermann-von-Helmholtz-Platz 1, Karlsruhe, Germany\\
\llap{$^b$}Somewhere in France eating Pate
}
\abstract{%
Modern physics experiments have reached multi-GB/s data rates. Fast data
links and high performance computing stages are required for continuous
data acquisition and processing. Because of their intrinsic parallelism and
computational power, GPUs emerged as an ideal solution for high
performance computing applications. To connect a fast data acquisition stage
with a GPU's processing power, we developed an architecture consisting of an
FPGA that includes a Direct Memory Access (DMA) engine compatible with the
Xilinx PCI-Express core, a Linux driver for register access and high-level
software to manage direct memory transfers using AMD's DirectGMA technology.
Measurements with a Gen3 x8 link show a throughput of up to 6.x GB/s. Our
implementation is suitable for real-time DAQ applications ranging from
photon science and medical imaging to HEP experiment triggers.
}
\begin{document}
\ifdraft
\setpagewiselinenumbers
\linenumbers
\fi
\section{Motivation}
GPU computing has become the main driving force for high performance computing
due to its unprecedented parallelism and low cost-to-performance ratio. GPU
acceleration has found its way into numerous applications, ranging from
simulation to image processing. Recent years have also seen an increasing
interest in GPU-based systems for HEP applications, which require a combination
of high data rates, high computational power and low latency (\emph{e.g.}
ATLAS~\cite{atlas_gpu}, ALICE~\cite{alice_gpu}, Mu3e~\cite{mu3e_gpu},
PANDA~\cite{panda_gpu}). Moreover, the data volumes produced at recent photon
science facilities have become comparable to those traditionally associated with
HEP.
In HEP experiments, data is acquired by one or more read-out boards and then
transmitted to GPUs in short bursts or in a continuous streaming mode. With
expected data rates of several GB/s, the data transmission link between the
read-out boards and the host system may partially limit the overall system
performance. In particular, latency becomes the most stringent constraint if
time-deterministic feedback is required, \emph{e.g.} in low- and high-level triggers.
To address these problems, we propose a complete hardware/software stack
architecture based on our own Direct Memory Access (DMA) design and the integration
of AMD's DirectGMA technology into our processing pipeline. In our solution,
PCI-Express (PCIe) has been chosen as the data link between FPGA boards and the
host computer. Due to its high bandwidth and modularity, PCIe quickly became the
commercial standard for connecting high-throughput peripherals such as GPUs or
solid state disks. Optical PCIe networks have been demonstrated
for nearly a decade~\cite{optical_pcie}, opening the possibility of using PCIe
as a communication bus over long distances. In HEP DAQ systems in particular,
optical links are preferred over electrical ones because of their superior
radiation hardness, lower power consumption and higher density.
Lonardo et~al.\ addressed direct FPGA-GPU communication with their NaNet design,
an FPGA-based PCIe network interface card with NVIDIA's GPUDirect
integration~\cite{lonardo2015nanet}. Due to its design, the bandwidth saturates
at 120 MB/s for UDP datagrams of 1472 bytes. Moreover, the system is based on
a commercial PCIe engine. Other solutions achieve higher throughput based on
Xilinx (CITE TWEPP DMA WURTT??) or Altera devices (CITENICHOLASPAPER TNS), but
they do not provide support for direct FPGA-GPU communication.
\section{Architecture}
DMA data transfers are handled by dedicated hardware, which, compared with
Programmed Input/Output (PIO) access, offers lower latency and higher throughput
at the cost of higher system complexity.
\begin{figure}[t]
\centering
\includegraphics[width=1.0\textwidth]{figures/transf}
\caption{%
In a traditional DMA architecture (a), data is first written to the main
system memory and then sent to the GPUs for final processing. By using
GPUDirect/DirectGMA technology (b), the DMA engine has direct access to
the GPU's internal memory.
}
\label{fig:trad-vs-dgpu}
\end{figure}
As shown in \figref{fig:trad-vs-dgpu}~(a), traditional FPGA-GPU systems route
data through system main memory by copying data from the FPGA into intermediate
buffers and then finally into the GPU's main memory. Thus, the total throughput
of the system is limited by the main memory bandwidth. NVIDIA's GPUDirect and
AMD's DirectGMA technologies allow direct communication between GPUs and
auxiliary devices over the PCIe bus. By combining this technology with DMA
data transfers (see \figref{fig:trad-vs-dgpu}~(b)), the overall latency of the
system is reduced and the total throughput increased. Moreover, the CPU and main
system memory are relieved from processing because they are no longer directly
involved in the data transfer.
\subsection{DMA engine implementation on the FPGA}
We have developed a DMA architecture that minimizes resource utilization while
maintaining the flexibility of a scatter-gather memory
policy~\cite{rota2015dma}. The engine is compatible with the Xilinx PCIe Gen2/3
IP core~\cite{xilinxgen3} for Xilinx 6- and 7-series FPGA families. DMA transmissions to
both main system memory and GPU memory are supported. Two FIFOs, with a data
width of 256 bits and operating at 250 MHz, act as user-friendly interfaces to
the custom logic. The resulting input bandwidth of 7.8 GB/s is enough to saturate
a PCIe Gen3 x8 link\footnote{The theoretical net bandwidth of a PCIe 3.0 x8 link
with a payload of 1024 B is 7.6 GB/s}. The user logic and the DMA engine are
configured by the host through PIO registers.
The physical addresses of the host's memory buffers are stored in an internal
memory and are dynamically updated by the driver or the user, allowing highly
efficient zero-copy data transfers. The maximum size associated with each
address is 2 GB.
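For illustration, a minimal sketch of such a register-based configuration is
given below. The device node and the register layout are hypothetical
placeholders and do not reflect the actual register map of our engine:
\begin{verbatim}
/* Sketch: configure one DMA transfer through PIO registers.
 * Device node and register offsets are hypothetical. */
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define REG_ADDR_LOW  0x00  /* hypothetical: buffer address, lower 32 bit */
#define REG_ADDR_HIGH 0x04  /* hypothetical: buffer address, upper 32 bit */
#define REG_LENGTH    0x08  /* hypothetical: transfer size in bytes */
#define REG_CONTROL   0x0C  /* hypothetical: bit 0 starts the engine */

int main(void)
{
    int fd = open("/dev/fpga0", O_RDWR);  /* node created by the driver */
    volatile uint32_t *bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
    /* illustration only: real addresses are provided by the driver */
    uint64_t phys_addr = 0x80000000ULL;

    bar[REG_ADDR_LOW / 4]  = (uint32_t) phys_addr;
    bar[REG_ADDR_HIGH / 4] = (uint32_t) (phys_addr >> 32);
    bar[REG_LENGTH / 4]    = 32 << 20;    /* 32 MB transfer */
    bar[REG_CONTROL / 4]   = 1;           /* start the DMA engine */

    munmap((void *) bar, 4096);
    close(fd);
    return 0;
}
\end{verbatim}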
\subsection{OpenCL management on host side}
\label{sec:host}
On the host side, AMD's DirectGMA technology, an implementation of the
bus-addressable memory extension for OpenCL 1.1 and later, is used to write from
the FPGA to GPU memory and from the GPU to the FPGA's control registers.
\figref{fig:opencl-setup} illustrates the main mode of operation: to write into
the GPU, the physical bus addresses of the GPU buffers are determined with a call to
\texttt{clEnqueue\-Make\-Buffers\-Resident\-AMD} and set by the host CPU in a
control register of the FPGA (1). The FPGA then writes data blocks autonomously
in DMA fashion (2).
Due to hardware restrictions, the largest possible GPU buffer
size is about 95 MB, but larger transfers can be achieved with a double
buffering mechanism. Because the GPU provides a flat memory address space and
our DMA engine allows multiple destination addresses to be set in advance, we
can determine all addresses before the actual transfers, thus reducing the
CPU's involvement in the transfer loop.
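The following sketch outlines step (1) for a single buffer, assuming the AMD
extension header that declares the \texttt{cl\_amd\_bus\_addressable\_memory}
entry points. Error handling is omitted and the FPGA register write is
represented by a hypothetical helper function:
\begin{verbatim}
/* Sketch: make a GPU buffer bus-addressable and hand its physical
 * address to the FPGA (step 1). */
#include <CL/cl.h>
#include <CL/cl_ext.h>  /* cl_amd_bus_addressable_memory */

/* hypothetical helper writing a 64-bit value to an FPGA register */
extern void fpga_write_register(unsigned offset, cl_ulong value);

void setup_gpu_target(cl_platform_id platform, cl_context ctx,
                      cl_command_queue queue)
{
    clEnqueueMakeBuffersResidentAMD_fn make_resident =
        (clEnqueueMakeBuffersResidentAMD_fn)
        clGetExtensionFunctionAddressForPlatform(platform,
                "clEnqueueMakeBuffersResidentAMD");

    /* DirectGMA buffers are limited to about 95 MB; use one chunk */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_BUS_ADDRESSABLE_AMD,
                                64 << 20, NULL, NULL);

    /* pin the buffer and query its physical bus address */
    cl_bus_address_amd addr;
    make_resident(queue, 1, &buf, CL_TRUE, &addr, 0, NULL, NULL);

    /* write the address into the FPGA's destination register
     * (offset is hypothetical) */
    fpga_write_register(0x10, addr.surface_bus_address);
}
\end{verbatim}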
To signal events to the FPGA (4), the control registers can be mapped into the
GPU's address space by passing a special AMD-specific flag and the physical
BAR address of the FPGA configuration memory to the \texttt{cl\-Create\-Buffer}
function. From the GPU, this memory is seen transparently as regular GPU
memory and can be written accordingly (3). Individual write accesses are issued
as PIO commands; however, using the \texttt{cl\-Enqueue\-Copy\-Buffer} function
call, it is also possible to write entire memory regions in DMA fashion to the
FPGA. In this case, the GPU acts as bus master and pushes data to the FPGA.
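A sketch of this register mapping is given below, assuming the same extension
header; the BAR address is a placeholder that is normally obtained from the
driver:
\begin{verbatim}
/* Sketch: map the FPGA control registers into the GPU's address
 * space (steps 3 and 4). */
#include <CL/cl.h>
#include <CL/cl_ext.h>

cl_mem map_fpga_registers(cl_context ctx, cl_command_queue queue)
{
    cl_bus_address_amd fpga_bar = {
        .surface_bus_address = 0xf7e00000, /* hypothetical BAR address */
        .marker_bus_address  = 0xf7e00000
    };

    /* the returned buffer is seen as regular GPU memory by kernels */
    cl_mem regs = clCreateBuffer(ctx, CL_MEM_EXTERNAL_PHYSICAL_AMD,
                                 4096, &fpga_bar, NULL);

    /* associate the buffer with the device before first use */
    clEnqueueMigrateMemObjects(queue, 1, &regs, 0, 0, NULL, NULL);
    return regs;
}
\end{verbatim}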
\begin{figure}
\centering
\includegraphics[width=0.75\textwidth]{figures/opencl-setup}
\caption{The FPGA writes to GPU memory by mapping the physical address of a
GPU buffer and initiating DMA transfers. Signalling happens in reverse order by
mapping the FPGA control registers into the address space of the GPU.}
\label{fig:opencl-setup}
\end{figure}
To process the data, we encapsulated the DMA setup and memory mapping in a
plugin for our scalable GPU processing framework~\cite{vogelgesang2012ufo}. This
framework allows for an easy construction of streamed data processing on
heterogeneous multi-GPU systems. For example, to read data from the FPGA, decode
its specific format, run a Fourier transform on the GPU and write
the results back to disk, one can run on the command line:
\begin{verbatim}
ufo-launch direct-gma ! decode ! fft ! write filename=out.raw
\end{verbatim}
The framework takes care of scheduling the tasks and distributing the data items
accordingly. A complementary application programming interface allows users to
develop custom applications written in C or high-level languages such as Python.
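For illustration, the pipeline above could be expressed with the C interface
roughly as follows; the exact type and function names are a sketch and may
differ between versions of the framework:
\begin{verbatim}
/* Sketch: the ufo-launch pipeline expressed through the C API */
#include <ufo/ufo.h>

int main(void)
{
    GError *error = NULL;
    UfoPluginManager *pm = ufo_plugin_manager_new();

    UfoTaskNode *dma    = ufo_plugin_manager_get_task(pm, "direct-gma", &error);
    UfoTaskNode *decode = ufo_plugin_manager_get_task(pm, "decode", &error);
    UfoTaskNode *fft    = ufo_plugin_manager_get_task(pm, "fft", &error);
    UfoTaskNode *write  = ufo_plugin_manager_get_task(pm, "write", &error);

    g_object_set(write, "filename", "out.raw", NULL);

    /* connect the tasks into a linear processing graph */
    UfoTaskGraph *graph = UFO_TASK_GRAPH(ufo_task_graph_new());
    ufo_task_graph_connect_nodes(graph, dma, decode);
    ufo_task_graph_connect_nodes(graph, decode, fft);
    ufo_task_graph_connect_nodes(graph, fft, write);

    /* the scheduler distributes work across the available GPUs */
    UfoBaseScheduler *sched = ufo_scheduler_new();
    ufo_base_scheduler_run(sched, graph, &error);
    return error == NULL ? 0 : 1;
}
\end{verbatim}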
High throughput is achieved by the combination of fine- and coarse-grained data
parallelism, \emph{i.e.} processing a single data item on a GPU using thousands
of threads and splitting the data stream to feed individual data items to
separate GPUs. None of this requires any user intervention; it is solely
determined by the framework in an automated fashion.
\section{Results}
We measured the performance using a Xilinx VC709 evaluation board plugged into a
desktop PC with an Intel Xeon E5-1630 3.7 GHz processor and an Intel C612
chipset.
Due to the DMA buffer size limitation presented in Section~\ref{sec:host},
several sub-buffers have to be copied in order to transfer data
larger than the maximum transfer size of 95 MB. In \figref{fig:intra-copy}, the
throughput for a copy from a smaller buffer (representing the DMA buffer)
into a larger buffer is shown. At a block size of about 384 KB, the throughput
surpasses the maximum possible PCIe bandwidth, making a double buffering
strategy a viable solution for very large data transfers.
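A simplified sketch of such a double buffering loop is given below; the
handshake with the FPGA is omitted for brevity:
\begin{verbatim}
/* Sketch: while the FPGA fills one bus-addressable buffer, the
 * previous one is copied into the large destination buffer. */
#include <CL/cl.h>

void drain(cl_command_queue queue, cl_mem dma_buf[2], cl_mem dest,
           size_t chunk, size_t total)
{
    for (size_t offset = 0, i = 0; offset < total; offset += chunk, i ^= 1) {
        /* wait until the FPGA has filled dma_buf[i] (handshake omitted) */
        clEnqueueCopyBuffer(queue, dma_buf[i], dest,
                            0, offset, chunk, 0, NULL, NULL);
        /* re-arm the FPGA to write into dma_buf[i] again */
    }
    clFinish(queue);
}
\end{verbatim}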
\begin{figure}
\includegraphics[width=\textwidth]{figures/intra-copy}
\caption{%
Throughput in MB/s for an intra-GPU data transfer of smaller block sizes
(4 KB -- 24 MB) into a larger destination buffer (32 MB -- 128 MB). The lower
performance for smaller block sizes is caused by the larger number of
transfers required to fill the destination buffer. The throughput has been
estimated using the host side wall clock time. On-GPU data transfer is about
twice as fast.
}
\label{fig:intra-copy}
\end{figure}
\subsection{Throughput}
A high throughput is desired for applications in which the FPGA outputs large
amounts of data, such as the fast, high resolution photon detectors used at
synchrotron facilities.
\figref{fig:throughput} shows the memory write throughput for GPU memory and
main system memory. In both cases, the write performance is primarily limited by
the PCIe bus. Higher payloads introduce less overhead, thus increasing the net
bandwidth. Up to a transfer size of 2 MB, the performance is almost the same;
beyond that, the GPU transfer shows a slightly better slope. Data transfers larger
than 1 GB saturate the PCIe bus.
\begin{figure}
\centering
\includegraphics[width=1.0\textwidth]{figures/throughput}
\caption{
Throughput of regular CPU and our GPU DMA data transfer for up to 50 GB of
data.
}
\label{fig:throughput}
\end{figure}
\subsection{Latency}
\begin{figure}
\includegraphics[width=\textwidth]{figures/latency-michele}
\caption{%
Relative frequency of measured latencies for a single 4 KB packet transferred
from the GPU to the FPGA. The distribution has a mean of approximately
168~$\mu$s, a standard deviation of 2~$\mu$s and a maximum of 180~$\mu$s.
}
\label{fig:latency-distribution}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=0.6\textwidth]{figures/latency}
\caption{%
For data transfers larger than XX MB, latency is decreased by XXX percent with respect to the traditional approach (a) by using our implementation (b).
}
\label{fig:latency}
\end{figure}
\figref{fig:latency} shows the comparison between the traditional approach and
the GPU DMA data transfer: the total latency is decreased. The distribution of
measured latencies is shown in \figref{fig:latency-distribution}.
The round-trip time of a memory read request issued from the CPU to the FPGA is
less than 1 $\mu$s. Therefore, the current performance bottleneck lies in the
execution of the DirectGMA functions.
\section{Conclusion}
We developed a complete hardware and software solution that enables DMA
transfers between FPGA-based readout boards and GPU computing clusters. The net
throughput is primarily limited by the PCIe bus, reaching 6.x GB/s for a 256 B
payload. By writing directly into GPU memory instead of routing data through
system main memory, the overall latency is reduced by a factor of 2. Moreover,
the solution proposed here enables high performance GPU computing due to the
integration of the DMA transfer setup into our streamed computing framework.
Integration with different DAQ systems and custom algorithms is therefore
straightforward.
\subsection{Outlook}
An optimization of the OpenCL code is ongoing, with the help of AMD's technical support.
With a better understanding of the hardware and software aspects of DirectGMA, we expect
a significant improvement in latency performance.
Support for NVIDIA's GPUDirect technology is foreseen in the next months to
lift the restriction to one specific GPU vendor and to enable a direct performance comparison.
A custom FPGA evaluation board is currently under development in order to
increase the total throughput. The board mounts a Virtex-7 chip and features two
fully populated FMC connectors, a 119 Gb/s DDR memory interface and a PCIe Gen3
x16 connection. Two PCIe x8 cores, instantiated on the board, will be mapped as a
single x16 device by using an external PCIe switch. With two cores operating in parallel,
we foresee an increase in data throughput by a factor of 2 (as demonstrated in~\cite{rota2015dma}).
It is also our intention to integrate InfiniBand support into the architecture.
Our goal is to develop a unique hybrid solution, based
on commercial standards, that includes fast data transmission protocols and a high performance
GPU computing framework.
\acknowledgments
\bibliographystyle{JHEP}
\bibliography{literature}
\end{document}