\documentclass{JINST}
\usepackage[utf8]{inputenc}
\usepackage{lineno}
\usepackage{ifthen}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{textcomp}
\newboolean{draft}
\setboolean{draft}{true}
\newcommand{\figref}[1]{Figure~\ref{#1}}
\title{A high-throughput readout architecture based on PCI-Express Gen3 and DirectGMA technology}
\author{
L.~Rota$^a$,
M.~Vogelgesang$^a$,
N.~Zilio$^a$,
M.~Caselle$^a$,
S.~Chilingaryan$^a$,
L.E.~Ardila Perez$^a$,
M.~Balzer$^a$,
M.~Weber$^a$\\
\llap{$^a$}Institute for Data Processing and Electronics,\\
Karlsruhe Institute of Technology (KIT),\\
Hermann-von-Helmholtz-Platz 1, Karlsruhe, Germany \\
E-mail: \email{lorenzo.rota@kit.edu}, \email{matthias.vogelgesang@kit.edu}
}
\abstract{%
Modern physics experiments have reached multi-GB/s data rates. Fast data
links and high-performance computing stages are required for continuous
data acquisition and processing. Because of their intrinsic parallelism and
computational power, GPUs have emerged as an ideal solution for
high-performance computing applications. In this paper we present a high-throughput
platform based on direct FPGA-GPU communication and report preliminary
results.
The architecture consists of a Direct Memory Access (DMA) engine compatible with the
Xilinx PCI-Express core, a Linux driver for register access, and high-level
software to manage direct memory transfers using AMD's DirectGMA technology.
Measurements with a Gen3 x8 link show a throughput of up to 6.4 GB/s and a latency of
84.38 \textmu s.
Our implementation is suitable for real-time DAQ system applications ranging
from photon science and medical imaging to High Energy Physics (HEP)
trigger systems.
}
\keywords{FPGA; GPU; PCI-Express; OpenCL; GPUDirect; DirectGMA}
\begin{document}
\ifdraft
\setpagewiselinenumbers
\linenumbers
\fi
\section{Introduction}
GPU computing has become the main driving force for high-performance computing
due to its unprecedented parallelism and favorable cost-to-performance ratio. GPU
acceleration has found its way into numerous applications, ranging from
simulation to image processing. Recent years have also seen an increasing
interest in GPU-based systems for High Energy Physics (HEP) experiments (\emph{e.g.}
ATLAS~\cite{atlas_gpu}, ALICE~\cite{alice_gpu}, Mu3e~\cite{mu3e_gpu},
PANDA~\cite{panda_gpu}). In a typical HEP scenario,
data are acquired by back-end readout systems and then
transmitted in short bursts or in a continuous streaming mode to a computing stage.
With expected data rates of several GB/s, the data transmission link may become
a bottleneck for the overall system performance.
In particular, latency becomes the most stringent requirement for time-deterministic applications,
\emph{e.g.} in low- and high-level trigger systems.
Furthermore, the data volumes produced at recent photon
science facilities have become comparable to those traditionally associated with
HEP.
In order to achieve the best performance in terms of latency and bandwidth,
data transfers must be handled by a dedicated DMA controller, at the cost of higher
system complexity.
To address these problems we propose a complete hardware/software stack
architecture based on a high-performance DMA engine implemented on Xilinx FPGAs
and on the integration of AMD's DirectGMA technology into our processing pipeline.
In our solution, PCI Express (PCIe) was chosen as a direct data link between
FPGA boards and the host computer. Due to its high bandwidth and modularity,
PCIe quickly became the commercial standard for connecting high-throughput
peripherals such as GPUs or solid-state drives.
Moreover, optical PCIe networks were demonstrated a decade ago~\cite{optical_pcie},
opening the possibility of using PCIe as a communication bus over long distances.
\section{Background}
Several solutions for direct FPGA/GPU communication have been reported in the literature,
all of them based on NVIDIA's GPUDirect technology.
In the implementation of Bittner and Ruf~\cite{bittner}, the GPU acts as master
during an FPGA-to-GPU data transfer, reading data from the FPGA.
This solution limits the reported bandwidth and
latency to 514 MB/s and 40~\textmu s, respectively.
A higher throughput can be achieved when the FPGA is used as master.
An example of this approach is the FPGA\textsuperscript{2} framework
by Thoma et~al.~\cite{thoma}, which reaches 2454 MB/s over a Gen2 x8 data link.
Lonardo et~al.\ achieved low latencies with their NaNet design, an FPGA-based
PCIe network interface card~\cite{lonardo2015nanet}.
The GbE link, however, limits the latency performance of the system to a few tens of \textmu s.
If only the FPGA-to-GPU latency is considered, the measured values span between
1~\textmu s and 6~\textmu s, depending on the datagram size. Moreover,
the bandwidth saturates at 120 MB/s.
Nieto et~al.\ presented a system based on a PXI Express data link that makes use
of four PCIe 1.0 links~\cite{nieto2015high}.
Their system, limited by the interconnect, achieves an average throughput of
870 MB/s with 1 KB block transfers.
\section{Basic Concept}
\begin{figure}[t]
\centering
\includegraphics[width=1.0\textwidth]{figures/transf}
\caption{%
In a traditional DMA architecture (a), data are first written to the main
system memory and then sent to the GPUs for final processing. With
GPUDirect/DirectGMA technology (b), the DMA engine has direct access to
the GPU's internal memory.
}
\label{fig:trad-vs-dgpu}
\end{figure}
As shown in \figref{fig:trad-vs-dgpu}~(a), traditional FPGA-GPU systems route
data through the main system memory by copying it from the FPGA into intermediate
buffers and finally into the GPU's main memory.
The total throughput and latency of the system are thus limited by the main
memory bandwidth. NVIDIA's GPUDirect and AMD's DirectGMA technologies allow
direct communication between GPUs and auxiliary devices over the PCIe bus.
By combining this technology with DMA data transfers (see \figref{fig:trad-vs-dgpu}~(b)),
the overall latency of the system is reduced and the total throughput increased.
Moreover, the CPU and the main system memory are relieved of data handling, because
they are no longer directly involved in the transfer.
\subsection{DMA engine implementation on the FPGA}
We have developed a DMA architecture that minimizes resource utilization while
maintaining the flexibility of a scatter-gather memory
policy~\cite{rota2015dma}. The engine is compatible with the Xilinx PCIe Gen2/3
IP-Core~\cite{xilinxgen3} for the Xilinx 6- and 7-series FPGA families. DMA transmissions to
both main system memory and GPU memory are supported. Two FIFOs, with a data
width of 256 bits and operating at 250 MHz, act as user-friendly interfaces to
the custom logic. The resulting input bandwidth of 7.8 GB/s is enough to
saturate a PCIe Gen3 x8 link\footnote{The theoretical net bandwidth of a PCIe
3.0 x8 link with a payload of 1024 B is 7.6 GB/s.}. The user logic and the DMA
engine are configured by the host through PIO registers.
The physical addresses of the host's memory buffers are stored in an internal
memory and are dynamically updated by the driver or the user, allowing highly
efficient zero-copy data transfers. The maximum size associated with each
address is 2 GB, and the engine fully supports 64-bit addresses. The resource utilization
on a Virtex-7 device is reported in Table~\ref{table:utilization}.
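As an illustration of this PIO interface, the sketch below shows how a user-space
program might map the FPGA BAR and update the engine's internal address table.
It is only a sketch: the sysfs path, the register offset and the table layout are
hypothetical placeholders, since the actual interface is exposed by our Linux driver.
\begin{verbatim}
/* Sketch: updating the DMA engine's address table through PIO registers.
 * The BAR path, offsets and table layout are hypothetical placeholders. */
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define ADDR_TABLE_OFFSET 0x1000   /* hypothetical offset of the table */

volatile uint32_t *map_bar(const char *resource, size_t size)
{
    /* e.g. "/sys/bus/pci/devices/0000:03:00.0/resource0" */
    int fd = open(resource, O_RDWR | O_SYNC);
    return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

void set_buffer_address(volatile uint32_t *bar, int slot, uint64_t bus_addr)
{
    /* each entry holds the 64-bit bus address of one host or GPU buffer */
    bar[ADDR_TABLE_OFFSET / 4 + 2 * slot]     = (uint32_t) bus_addr;
    bar[ADDR_TABLE_OFFSET / 4 + 2 * slot + 1] = (uint32_t)(bus_addr >> 32);
}
\end{verbatim}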
\begin{table}[t]
\centering
\caption{Resource utilization of the DMA engine on a Xilinx Virtex-7 device.}
\label{table:utilization}
\begin{tabular}{@{}llll@{}}
Resource & Used  & Available & Utilization (\%) \\\hline
LUT      & 5331  & 433200    & 1.23 \\
LUTRAM   & 56    & 174200    & 0.03 \\
FF       & 5437  & 866400    & 0.63 \\
BRAM     & 20.50 & 1470      & 1.39 \\\hline
\end{tabular}
\end{table}
\subsection{OpenCL management on host side}
\label{sec:host}
On the host side, AMD's DirectGMA technology, an implementation of the
bus-addressable memory extension for OpenCL 1.1 and later, is used to write from
the FPGA to GPU memory and from the GPU to the FPGA's control registers.
\figref{fig:opencl-setup} illustrates the main mode of operation: to write into
the GPU, the physical bus addresses of the GPU buffers are determined with a call to
\texttt{clEnqueue\-Make\-Buffers\-Resident\-AMD} and set by the host CPU in a
control register of the FPGA (1). The FPGA then writes data blocks autonomously
in DMA fashion (2).
Due to hardware restrictions, the largest possible GPU buffer size is about 95
MB, but larger transfers can be achieved with a double-buffering mechanism.
Because the GPU provides a flat memory address space and our DMA engine allows
multiple destination addresses to be set in advance, we can determine all
addresses before the actual transfers, thus keeping the CPU out of the transfer
loop for data sizes of less than 95 MB.
To signal events to the FPGA (4), the control registers can be mapped into the
GPU's address space by passing a special AMD-specific flag and the physical
BAR address of the FPGA configuration memory to the \texttt{cl\-Create\-Buffer}
function. From the GPU, this memory is seen transparently as regular GPU memory
and can be written to accordingly (3). In our setup, trigger registers are used to
notify the FPGA of successful or failed evaluation of the data.
Using the \texttt{cl\-Enqueue\-Copy\-Buffer} function call, it is possible
to write entire memory regions in DMA fashion to the FPGA.
In this case, the GPU acts as bus master and pushes data to the FPGA.
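The following sketch outlines how this setup can be expressed on the host with
the \texttt{cl\_amd\_bus\_addressable\_memory} extension. The
\texttt{fpga\_write\_reg} helper and the register names are hypothetical
placeholders for our PIO interface, and the exact calling conventions of the
extension should be verified against AMD's documentation.
\begin{verbatim}
/* Sketch of the DirectGMA setup; fpga_write_reg() and the DMA_DST_*
 * register names are hypothetical placeholders. */
#include <CL/cl.h>
#include <CL/cl_ext.h>
#include <stdint.h>

extern void fpga_write_reg(uint32_t reg, uint32_t value);
enum { DMA_DST_ADDR_LO = 0x50, DMA_DST_ADDR_HI = 0x54 };

/* (1) create a bus-addressable GPU buffer, pin it and pass its physical
 * bus address to the FPGA; (2) the FPGA then writes into it via DMA.   */
cl_mem expose_gpu_buffer(cl_context ctx, cl_command_queue queue, size_t size)
{
    cl_int err;
    cl_bus_address_amd addr;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_BUS_ADDRESSABLE_AMD,
                                size, NULL, &err);
    clEnqueueMakeBuffersResidentAMD(queue, 1, &buf, CL_TRUE, &addr,
                                    0, NULL, NULL);
    fpga_write_reg(DMA_DST_ADDR_LO, (uint32_t) addr.surface_bus_address);
    fpga_write_reg(DMA_DST_ADDR_HI, (uint32_t)(addr.surface_bus_address >> 32));
    return buf;
}

/* (3)/(4) map the FPGA BAR into the GPU address space so that kernels
 * can write the trigger registers like regular GPU memory.             */
cl_mem map_fpga_registers(cl_context ctx, cl_bus_address_amd *bar_address,
                          size_t bar_size)
{
    cl_int err;
    return clCreateBuffer(ctx, CL_MEM_EXTERNAL_PHYSICAL_AMD, bar_size,
                          bar_address, &err);
}
\end{verbatim}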
\begin{figure}
\centering
\includegraphics[width=0.75\textwidth]{figures/opencl-setup}
\caption{The FPGA writes to GPU memory by mapping the physical address of a
GPU buffer and initiating DMA transfers. Signalling happens in reverse order by
mapping the FPGA control registers into the address space of the GPU.}
\label{fig:opencl-setup}
\end{figure}
To process the data, we encapsulated the DMA setup and memory mapping in a
plugin for our scalable GPU processing framework~\cite{vogelgesang2012ufo}. This
framework allows for an easy construction of streamed data processing on
heterogeneous multi-GPU systems. For example, to read data from the FPGA, decode
its specific data format, run a Fourier transform on the GPU and
write the results back to disk, one can run the following on the command line:
\begin{verbatim}
ufo-launch direct-gma ! decode ! fft ! write filename=out.raw
\end{verbatim}
The framework takes care of scheduling the tasks and distributing the data items
to one or more GPUs. High throughput is achieved by the combination of fine-
and coarse-grained data parallelism, \emph{i.e.} processing a single data item
on a GPU using thousands of threads and splitting the data stream to feed
individual data items to separate GPUs. None of this requires any user
intervention; it is determined solely by the framework in an automated
fashion. A complementary application programming interface allows users to
develop custom applications written in C or high-level languages such as Python.
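The same pipeline can also be constructed programmatically. The following C
sketch follows the GObject conventions of the framework, but the exact type
and function names are assumptions and should be checked against the
framework's API documentation.
\begin{verbatim}
/* Sketch of the equivalent pipeline built with the framework's C API.
 * Type and function names are assumptions based on its GObject style. */
#include <ufo/ufo.h>

void run_pipeline(void)
{
    GError *error = NULL;
    UfoPluginManager *pm = ufo_plugin_manager_new();
    UfoTaskGraph *graph = UFO_TASK_GRAPH(ufo_task_graph_new());

    UfoTaskNode *dma    = ufo_plugin_manager_get_task(pm, "direct-gma", &error);
    UfoTaskNode *decode = ufo_plugin_manager_get_task(pm, "decode", &error);
    UfoTaskNode *fft    = ufo_plugin_manager_get_task(pm, "fft", &error);
    UfoTaskNode *write  = ufo_plugin_manager_get_task(pm, "write", &error);
    g_object_set(write, "filename", "out.raw", NULL);

    ufo_task_graph_connect_nodes(graph, dma, decode);
    ufo_task_graph_connect_nodes(graph, decode, fft);
    ufo_task_graph_connect_nodes(graph, fft, write);

    /* the scheduler distributes work across the available GPUs */
    ufo_base_scheduler_run(ufo_scheduler_new(), graph, &error);
}
\end{verbatim}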
\section{Results}
We carried out performance measurements on a machine with an Intel Xeon E5-1630
CPU at 3.7 GHz and an Intel C612 chipset, running openSUSE 13.1 with Linux kernel 3.11.10. The
Xilinx VC709 evaluation board was plugged into one of the PCIe 3.0 x8 slots.
\begin{figure}
\centering
\begin{subfigure}[b]{.49\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/throughput}
\caption{%
DMA data transfer throughput.
}
\label{fig:throughput}
\end{subfigure}
\begin{subfigure}[b]{.49\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/latency}
\caption{%
Latency distribution.
% for a single 4 KB packet transferred
% from FPGA to CPU and FPGA to GPU.
}
\label{fig:latency}
\end{subfigure}
\caption{%
Measured results for data transfers from FPGA to main memory
(CPU) and from FPGA to the global GPU memory (GPU).
}
\end{figure}
The measured results for the pure data throughput are shown in
\figref{fig:throughput} for transfers from the FPGA to the system's main memory
as well as to the GPU's global memory, as explained in Section~\ref{sec:host}. As one can see,
in both cases the write performance is primarily limited by the PCIe bus. Larger
payloads amortize the constant overhead, thus increasing the net bandwidth. Up
to a transfer size of 2 MB, the throughput to the GPU slowly approaches
100 MB/s. From there on, the throughput increases up to 6.4 GB/s when PCIe bus
saturation sets in at a data size of about 1 GB.
The throughput to the CPU saturates earlier, at about 30 MB, and the maximum throughput
is limited to about 6 GB/s, \emph{i.e.} a loss of about 6\% in write performance.
We repeated the FPGA-to-GPU measurements on a low-end Supermicro X7SPA-HF-D525
system based on an Intel Atom CPU. The results showed no significant difference
compared to the previous setup. Depending on the application and computing
requirements, this result makes smaller acquisition systems a cost-effective
alternative to larger workstations.
\begin{figure}
\centering
\includegraphics[width=\textwidth]{figures/intra-copy}
\caption{%
Throughput in MB/s for intra-GPU data transfers of smaller block sizes
(4 KB -- 24 MB) into a larger destination buffer (32 MB -- 128 MB). The lower
performance for smaller block sizes is caused by the larger number of
transfers required to fill the destination buffer. The throughput has been
estimated using the host-side wall clock time. The raw GPU data transfer as
measured by event profiling is about twice as fast.
}
\label{fig:intra-copy}
\end{figure}
In order to write more than the maximum possible transfer size of 95 MB, we
repeatedly wrote into the same buffer, which would not be possible in a real-world
application. As a solution, we motivated the use of multiple copies in Section
~\ref{sec:host}. To verify that we can keep up with the incoming data throughput
using this strategy, we measured the data throughput within a GPU by copying
data from a smaller buffer representing the DMA buffer to a larger
destination buffer. \figref{fig:intra-copy} shows the measured throughput for
three destination buffer sizes and increasing block sizes. At a block size of about 384 KB, the
throughput surpasses the maximum possible PCIe bandwidth, thus making a double-buffering
strategy a viable solution for very large data transfers.
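A minimal sketch of this double-buffering scheme is shown below; only
\texttt{clEnqueueCopyBuffer} and \texttt{clFinish} are standard OpenCL calls,
while \texttt{fpga\_wait\_for\_block} and \texttt{fpga\_ack\_block} stand in
for the actual FPGA handshake and are hypothetical.
\begin{verbatim}
/* Sketch: draining the DMA buffer into a larger destination buffer.
 * fpga_wait_for_block() and fpga_ack_block() are hypothetical. */
#include <CL/cl.h>

extern void fpga_wait_for_block(void); /* block until a DMA write completed */
extern void fpga_ack_block(void);      /* re-arm the DMA buffer             */

void drain(cl_command_queue queue, cl_mem dma_buf, cl_mem dst,
           size_t block_size, size_t total_size)
{
    for (size_t offset = 0; offset < total_size; offset += block_size) {
        fpga_wait_for_block();
        /* intra-GPU copy: source and destination both reside in GPU memory */
        clEnqueueCopyBuffer(queue, dma_buf, dst, 0, offset, block_size,
                            0, NULL, NULL);
        clFinish(queue);
        fpga_ack_block();
    }
}
\end{verbatim}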
For HEP experiments, low latencies are necessary to react within a reasonable time
frame. In order to measure the latency caused by the communication overhead, we
used the following protocol: 1) the host issues continuous transfers
of a 4 KB buffer, initialized with a fixed value, to the FPGA using the
\texttt{cl\-Enqueue\-Copy\-Buffer} call. 2) When the FPGA receives data in its
input FIFO, it moves it directly to the output FIFO, which feeds the outgoing DMA
engine, thus pushing the data back to the GPU. 3) At some point, the host
enables the generation of data different from the initial value, which also starts an
internal FPGA counter with 4 ns resolution. 4) When the generated data is
received again at the FPGA, the counter is stopped. 5) The host program reads
out the counter values and computes the round-trip latency. The distribution of
10000 measurements of the one-way latency is shown in \figref{fig:latency}. The
GPU latency has a mean value of 84.38 \textmu s and a standard deviation of
6.34 \textmu s. This is 9.73\% higher than the CPU latency of 76.89 \textmu s,
measured using the same driver and measurement procedure. The
non-Gaussian distribution with two distinct peaks indicates a systematic influence
that we cannot control, most likely caused by the non-deterministic
run-time behaviour of the operating system scheduler.
\section{Conclusion and outlook}
We developed a complete hardware and software solution that enables DMA
transfers between FPGA-based readout boards and GPU computing clusters.
The net throughput is primarily limited
by the PCIe bus, reaching 6.4 GB/s with a 256 B payload and surpassing our
CPU-based data transfer.
By writing directly into GPU memory instead of routing data through the main system memory,
the overall latency of the system can be reduced by a factor of two, enabling massively
parallel computation on GPUs close to the data source.
Moreover, the proposed software solution allows seamless multi-GPU
processing of the incoming data, thanks to its integration in our streamed computing
framework. This allows straightforward integration with different DAQ systems
and the introduction of custom data processing algorithms.
Optimization of the GPU DMA interfacing code is ongoing, with technical
support from AMD. With a better understanding of the hardware and software aspects of
DirectGMA, we expect a significant improvement in latency performance. Support
for NVIDIA's GPUDirect technology is foreseen in the coming months, in order to lift the
dependency on one specific GPU vendor and to compare the performance of hardware from
different vendors.
In order to increase the total throughput, a custom FPGA evaluation board is
currently under development. The board hosts a Virtex-7 chip and features two
fully populated FMC connectors, a 119 Gb/s DDR memory interface and a PCIe Gen3
x16 connection. Two PCIe x8 cores, instantiated on the FPGA, will be mapped as
a single x16 device by means of an external PCIe switch. With two cores operating
in parallel, we foresee an increase in data throughput by a factor of two (as
demonstrated in~\cite{rota2015dma}).
Further improvements are expected from
generalizing the transfer mechanism to include InfiniBand support besides the
existing PCIe connection. This would allow link speeds of up to 290 Gbit/s and latencies as
low as 0.5 \textmu s.
Our goal is to develop a unique hybrid solution, based on commercial standards,
that includes fast data transmission protocols and a high-performance GPU
computing framework.
\acknowledgments
This work was partially supported by the German-Russian BMBF funding programme,
grant numbers 05K10CKB and 05K10VKE.
\bibliographystyle{JHEP}
\bibliography{literature}
\end{document}