- \documentclass{JINST}
- \usepackage[utf8]{inputenc}
- \usepackage{lineno}
- \usepackage{ifthen}
- \usepackage{caption}
- \usepackage{subcaption}
- \usepackage{textcomp}
- \usepackage{booktabs}
- \usepackage{floatrow}
- \newfloatcommand{capbtabbox}{table}[][\FBwidth]
- \newboolean{draft}
- \setboolean{draft}{true}
- \newcommand{\figref}[1]{Figure~\ref{#1}}
- \title{A high-throughput readout architecture based on PCI-Express Gen3 and DirectGMA technology}
- \author{
- L.~Rota$^a$,
- M.~Vogelgesang$^a$,
- L.E.~Ardila Perez$^a$,
- M.~Caselle$^a$,
- S.~Chilingaryan$^a$,
- T.~Dritschler$^a$,
- N.~Zilio$^a$,
- A.~Kopmann$^a$,
- M.~Balzer$^a$,
- M.~Weber$^a$\\
- \llap{$^a$}Institute for Data Processing and Electronics,\\
- Karlsruhe Institute of Technology (KIT),\\
- Hermann-von-Helmholtz-Platz 1, Karlsruhe, Germany \\
- E-mail: \email{lorenzo.rota@kit.edu}, \email{matthias.vogelgesang@kit.edu}
- }
- \abstract{%
- Modern physics experiments have reached multi-GB/s data rates. Fast data links
- and high performance computing stages are required for continuous data
- acquisition and processing. Because of their intrinsic parallelism and
- computational power, GPUs have emerged as an ideal solution to process this
- data in high performance computing applications. In this paper we present a
- high-throughput platform based on direct FPGA-GPU communication. The architecture
- consists of a Direct Memory Access (DMA) engine compatible with the Xilinx
- PCI-Express core, a Linux driver for register access, and high-level software
- to manage direct memory transfers using AMD's DirectGMA technology.
- Measurements with a Gen3\,x8 link show a throughput of 6.4~GB/s for transfers
- to GPU memory and 6.6~GB/s to system memory. We also assessed the possibility
- of using the architecture in low latency systems: preliminary measurements
- show a round-trip latency as low as 1 \textmu s for data transfers to system
- memory, while the additional latency introduced by OpenCL scheduling is the
- current limitation for GPU-based systems. Our implementation is suitable for
- real-time DAQ system applications ranging from photon science and medical
- imaging to High Energy Physics (HEP) systems.
- }
- \keywords{FPGA; GPU; PCI-Express; OpenCL; DirectGMA}
- \begin{document}
- \ifdraft
- \setpagewiselinenumbers
- \linenumbers
- \fi
- \section{Introduction}
- GPU computing has become the main driving force for high performance computing
- due to its unprecedented parallelism and favorable cost-benefit ratio. GPU
- acceleration has found its way into numerous applications, ranging from
- simulation to image processing.
- The data rates of bio-imaging or beam-monitoring experiments running in
- current-generation photon science facilities have reached tens of GB/s~\cite{ufo_camera,
- caselle}. In a typical scenario, data are acquired by back-end readout systems
- and then transmitted in short bursts or continuously streamed to a computing
- stage. In order to collect data over long observation times, the readout
- architecture and the computing stages must be able to sustain high data rates.
- Recent years have also seen an increasing interest in GPU-based systems for High
- Energy Physics (HEP) (\emph{e.g.} ATLAS~\cite{atlas_gpu},
- ALICE~\cite{alice_gpu}, Mu3e~\cite{mu3e_gpu}, PANDA~\cite{panda_gpu}) and photon
- science experiments. In time-deterministic applications, such as Low/High-level
- trigger systems, latency becomes the most stringent requirement.
- Due to its high bandwidth and modularity, PCIe is the commercial \emph{de facto}
- standard for connecting high-throughput peripherals such as GPUs or solid state
- disks. Moreover, optical PCIe networks were demonstrated a decade
- ago~\cite{optical_pcie}, opening the possibility of using PCIe as a
- communication link over long distances.
- Several solutions for direct FPGA-GPU communication based on PCIe and NVIDIA's
- proprietary GPUDirect technology are reported in the literature. In the
- implementation of Bittner and Ruf the GPU acts as master during an FPGA-to-GPU
- read data transfer \cite{bittner}. This solution limits the reported bandwidth
- and latency to 514 MB/s and 40~\textmu s, respectively. When the FPGA is used
- as a master, a higher throughput can be achieved. An example of this approach
- is the \emph{FPGA\textsuperscript{2}} framework by Thoma et~al.~\cite{thoma},
- which reaches 2454 MB/s using a PCIe 2.0 x8 data link. Lonardo et~al.\ achieved
- low latencies with their NaNet design, an FPGA-based PCIe network interface
- card~\cite{lonardo2015nanet}. The GbE link, however, limits the latency
- performance of the system to a few tens of \textmu s. If only the FPGA-to-GPU
- latency is considered, the measured values span between 1~\textmu s and
- 6~\textmu s, depending on the datagram size. Nieto et~al.\ presented a system
- based on a PXIexpress data link that makes use of four PCIe 1.0
- links~\cite{nieto2015high}. Their system, as limited by the interconnect,
- achieves an average throughput of 870 MB/s with 1 KB block transfers.
- In order to achieve the best performance in terms of latency and bandwidth, we
- developed a high-performance DMA engine based on Xilinx's PCIe 3.0 Core. To
- process the data, we encapsulated the DMA setup and memory mapping in a plugin
- for our scalable GPU processing framework~\cite{vogelgesang2012ufo}. This
- framework allows for an easy construction of streamed data processing on
- heterogeneous multi-GPU systems. However, the framework is based on OpenCL and
- cannot be used with GPUDirect technology, which targets CUDA only. To
- overcome this limitation we have evaluated AMD's DirectGMA technology which
- allows us to integrate direct PCIe data transfers with applications developed
- using the OpenCL language. In this paper we present the hardware/software
- interface and report the throughput performance of our architecture, together
- with preliminary measurements of DirectGMA's applicability in low-latency
- applications.
- %% LR: this part -> OK
- \section{Architecture}
- As shown in \figref{fig:trad-vs-dgpu} (a), traditional FPGA-GPU systems route
- data through system main memory by copying data from the FPGA into intermediate
- buffers and then finally into the GPU's main memory. Thus, the total throughput
- and latency of the system are limited by the main memory bandwidth. NVIDIA's
- GPUDirect and AMD's DirectGMA technologies allow direct communication between
- GPUs and auxiliary devices over PCIe. By combining these technologies with DMA
- data transfers as shown in \figref{fig:trad-vs-dgpu} (b), the overall latency of
- the system is reduced and the total throughput is increased. Moreover, the CPU
- and main system memory are relieved of data-handling work because they are no
- longer directly involved in the data transfer.
- \begin{figure}[t]
- \centering
- \includegraphics[width=1.0\textwidth]{figures/transf}
- \caption{%
- In a traditional DMA architecture (a), data are first written to the main
- system memory and then sent to the GPUs for final processing. By using
- GPUDirect/DirectGMA technology (b), the DMA engine has direct access to
- the GPU's internal memory.
- }
- \label{fig:trad-vs-dgpu}
- \end{figure}
- %% LR: this part -> Text:OK, Figure: must be updated
- \subsection{DMA engine implementation on the FPGA}
- We have developed a DMA engine that minimizes resource utilization while
- maintaining the flexibility of a Scatter-Gather memory
- policy~\cite{rota2015dma}. The main blocks are shown in \figref{fig:fpga-arch}.
- The engine is compatible with the Xilinx PCIe 2.0/3.0 IP-Core~\cite{xilinxgen3}
- for the 6 and 7 series FPGA families. DMA data transfers between main system
- memory and GPU memory are supported. Two FIFOs, operating at 250 MHz with a
- data width of 256 bits, act as user-friendly interfaces to the custom logic at
- an input bandwidth of 7.45 GB/s. The user logic and the DMA engine are configured by the
- host system through PIO registers. The resource utilization on a Virtex 7 device
- is reported in Table~\ref{table:utilization}.
- \begin{figure}[t]
- \begin{floatrow}
- \ffigbox{%
- \includegraphics[width=0.45\textwidth]{figures/fpga-arch}
- }{%
- \caption{Block diagram of the FPGA architecture}%
- \label{fig:fpga-arch}
- }
- \capbtabbox{%
- \begin{tabular}{@{}lll@{}}
- \toprule
- Resource & Utilization & (\%) \\
- \midrule
- LUT & 5331 & (1.23) \\
- LUTRAM & 56 & (0.03) \\
- FF & 5437 & (0.63) \\
- BRAM & 21 & (1.39) \\
- \bottomrule
- \end{tabular}
- }{%
- \caption{Resource utilization on a xc7vx690t-ffg1761 device.}%
- \label{table:utilization}
- }
- \end{floatrow}
- \end{figure}
- The physical addresses of the host's memory buffers are stored in an internal
- memory and are dynamically updated by the driver or user, allowing highly
- efficient zero-copy data transfers. The maximum size associated with each
- address is 2 GB.
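The listing below is a minimal C sketch of how the driver or a user-space process could fill one entry of this internal address table through memory-mapped PIO registers. The register names and offsets are hypothetical placeholders for illustration; the engine's actual register map is not part of this paper.
\begin{verbatim}
/* Hypothetical sketch: filling one entry of the DMA engine's internal
 * address table through memory-mapped PIO registers. Offsets and names
 * are placeholders, not the actual register map. */
#include <stdint.h>
#include <stddef.h>

#define DESC_ADDR_LO(i)  (0x100 + (i) * 0x10)
#define DESC_ADDR_HI(i)  (0x104 + (i) * 0x10)
#define DESC_LENGTH(i)   (0x108 + (i) * 0x10)
#define DESC_MAX_BYTES   (2ULL << 30)   /* at most 2 GB per address */

static void reg_write(volatile uint32_t *bar, size_t off, uint32_t val)
{
    bar[off / sizeof(uint32_t)] = val;
}

/* Store the physical bus address and size of one host buffer. */
int set_descriptor(volatile uint32_t *bar, unsigned i,
                   uint64_t bus_addr, uint64_t nbytes)
{
    if (nbytes > DESC_MAX_BYTES)
        return -1;
    reg_write(bar, DESC_ADDR_LO(i), (uint32_t) bus_addr);
    reg_write(bar, DESC_ADDR_HI(i), (uint32_t) (bus_addr >> 32));
    reg_write(bar, DESC_LENGTH(i),  (uint32_t) nbytes);
    return 0;
}
\end{verbatim}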
- %% LR: -----------------> OK
- \subsection{OpenCL management on host side}
- \label{sec:host}
- \begin{figure}[b]
- \centering
- \includegraphics[width=0.75\textwidth]{figures/opencl-setup}
- \caption{The FPGA writes to GPU memory by mapping the physical address of a
- GPU buffer and initiating DMA transfers. Signalling happens in reverse order by
- mapping the FPGA control registers into the address space of the GPU.}
- \label{fig:opencl-setup}
- \end{figure}
- %% Description of figure
- On the host side, AMD's DirectGMA technology, an implementation of the
- bus-addressable memory extension for OpenCL 1.1 and later, is used to write from
- the FPGA to GPU memory and from the GPU to the FPGA's control registers.
- \figref{fig:opencl-setup} illustrates the main mode of operation: to write into
- the GPU, the physical bus addresses of the GPU buffers are determined with a
- call to \texttt{clEnqueue\-Make\-Buffers\-Resident\-AMD} and set by the host CPU
- in a control register of the FPGA (1). The FPGA then writes data blocks
- autonomously in DMA fashion (2). To signal events to the FPGA (4), the control
- registers can be mapped into the GPU's address space by passing a special
- AMD-specific flag and the physical BAR address of the FPGA configuration
- memory to the \texttt{cl\-Create\-Buffer} function. From the GPU, this memory is
- seen transparently as regular GPU memory and can be written accordingly (3). In
- our setup, trigger registers are used to notify the FPGA on successful or failed
- evaluation of the data. Using the \texttt{cl\-Enqueue\-Copy\-Buffer} function
- call it is possible to write entire memory regions in DMA fashion to the FPGA.
- In this case, the GPU acts as bus master and pushes data to the FPGA.
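The following C listing sketches this sequence on the host, assuming the \texttt{cl\_amd\_bus\_addressable\_memory} extension. Error handling is omitted, the 4~kB BAR size is a placeholder, and the write of the returned bus address into the FPGA control register (step 1) is only indicated in a comment.
\begin{verbatim}
/* Sketch of the host-side DirectGMA setup (steps 1-3), assuming the
 * cl_amd_bus_addressable_memory extension. Error handling is omitted. */
#include <CL/cl.h>
#include <CL/cl_ext.h>

typedef cl_int (*make_resident_fn)(cl_command_queue, cl_uint, cl_mem *,
                                   cl_bool, cl_bus_address_amd *, cl_uint,
                                   const cl_event *, cl_event *);

void setup_directgma(cl_platform_id platform, cl_context ctx,
                     cl_command_queue queue, size_t size, cl_ulong fpga_bar)
{
    /* (1) Create a bus-addressable GPU buffer, pin it and obtain its
     * physical bus address. */
    cl_mem gpu_buf = clCreateBuffer(ctx, CL_MEM_BUS_ADDRESSABLE_AMD,
                                    size, NULL, NULL);
    cl_bus_address_amd gpu_addr;
    make_resident_fn make_resident = (make_resident_fn)
        clGetExtensionFunctionAddressForPlatform(platform,
            "clEnqueueMakeBuffersResidentAMD");
    make_resident(queue, 1, &gpu_buf, CL_TRUE, &gpu_addr, 0, NULL, NULL);

    /* The host now writes gpu_addr.surface_bus_address into an FPGA
     * control register (driver call not shown); the FPGA then pushes
     * data into the buffer by DMA (2). */

    /* (3) Wrap the physical BAR address of the FPGA registers in an
     * OpenCL buffer, so that GPU kernels can signal the FPGA directly. */
    cl_bus_address_amd fpga_addr = { fpga_bar, fpga_bar };
    cl_mem fpga_regs = clCreateBuffer(ctx, CL_MEM_EXTERNAL_PHYSICAL_AMD,
                                      4096, &fpga_addr, NULL);
    (void) fpga_regs;
}
\end{verbatim}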
- %% Double Buffering strategy.
- Due to hardware restrictions of AMD FirePro W9100 cards, the largest possible
- GPU buffer sizes are about 95 MB. However, larger transfers can be achieved by
- using a double buffering mechanism: data are copied from the buffer exposed to
- the FPGA into a different location in GPU memory. To verify that we can keep up
- with the incoming data throughput using this strategy, we measured the data
- throughput within a GPU by copying data from a smaller sized buffer representing
- the DMA buffer to a larger destination buffer. At a block size of about 384 KB
- the throughput surpasses the maximum possible PCIe bandwidth. Block transfers
- larger than 5 MB saturate the bandwidth at 40 GB/s. Double buffering is
- therefore a viable solution for very large data transfers, where throughput
- performance is favoured over latency. For data sizes less than 95 MB, we can
- determine all addresses before the actual transfers, thus keeping the CPU out of
- the transfer loop.
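As a sketch of this double-buffering scheme, the following C snippet drains the DMA-exposed buffer into a larger destination buffer with device-internal copies; the signalling that tells the FPGA a block has been consumed is application specific and therefore omitted.
\begin{verbatim}
/* Sketch of the double buffering scheme: blocks arriving in the (at most
 * ~95 MB) DMA-exposed buffer are copied on-device into a larger buffer.
 * Synchronization with the FPGA is omitted. */
#include <CL/cl.h>

void drain_dma_buffer(cl_command_queue queue, cl_mem dma_buf,
                      cl_mem dest_buf, size_t block_size, size_t num_blocks)
{
    for (size_t i = 0; i < num_blocks; i++) {
        /* Device-internal copy; for block sizes above a few MB this runs
         * at roughly 40 GB/s and easily keeps up with the PCIe link. */
        clEnqueueCopyBuffer(queue, dma_buf, dest_buf,
                            0,               /* source offset      */
                            i * block_size,  /* destination offset */
                            block_size, 0, NULL, NULL);
        clFinish(queue);
        /* Here the FPGA would be notified that dma_buf can be reused. */
    }
}
\end{verbatim}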
- %% Ufo Framework
- To process the data, we encapsulated the DMA setup and memory mapping in a
- plugin for our scalable GPU processing framework~\cite{vogelgesang2012ufo}.
- This framework allows for an easy construction of streamed data processing on
- heterogeneous multi-GPU systems. For example, to read data from the FPGA, decode
- from its specific data format and run a Fourier transform on the GPU as well as
- writing back the results to disk, one can run the following on the command line:
- \begin{verbatim}
- ufo-launch direct-gma ! decode ! fft ! write filename=out.raw
- \end{verbatim}
- The framework takes care of scheduling the tasks and distributing the data items
- to one or more GPUs. High throughput is achieved by the combination of fine- and
- coarse-grained data parallelism, \emph{i.e.} by processing a single data item
- on a GPU using thousands of threads and by splitting the data stream and feeding
- individual data items to separate GPUs. None of this requires any user
- intervention and is solely determined by the framework in an automated
- fashion. A complementary application programming interface allows users to
- develop custom applications in C or high-level languages such as Python.
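For illustration, the same pipeline as the \texttt{ufo-launch} example above could be assembled through the C interface roughly as follows. The function and task names reflect our recollection of the ufo-core GObject API and should be treated as assumptions; consult the framework documentation for the exact calls.
\begin{verbatim}
/* Illustrative sketch of building the direct-gma ! decode ! fft ! write
 * pipeline with the C API. API names are assumptions; see the framework
 * documentation for the exact interface. */
#include <ufo/ufo.h>

int main(void)
{
    GError *error = NULL;
    UfoPluginManager *pm = ufo_plugin_manager_new();
    UfoTaskGraph *graph = UFO_TASK_GRAPH(ufo_task_graph_new());

    UfoTaskNode *dma    = ufo_plugin_manager_get_task(pm, "direct-gma", &error);
    UfoTaskNode *decode = ufo_plugin_manager_get_task(pm, "decode", &error);
    UfoTaskNode *fft    = ufo_plugin_manager_get_task(pm, "fft", &error);
    UfoTaskNode *write  = ufo_plugin_manager_get_task(pm, "write", &error);

    g_object_set(write, "filename", "out.raw", NULL);

    ufo_task_graph_connect_nodes(graph, dma, decode);
    ufo_task_graph_connect_nodes(graph, decode, fft);
    ufo_task_graph_connect_nodes(graph, fft, write);

    UfoBaseScheduler *sched = ufo_scheduler_new();
    ufo_base_scheduler_run(sched, graph, &error);
    return error == NULL ? 0 : 1;
}
\end{verbatim}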
- %% --------------------------------------------------------------------------
- \section{Results}
- \begin{table}[b]
- \centering
- \small
- \caption{Setups used for throughput and latency measurements.}
- \label{table:setups}
- \tabcolsep=0.11cm
- \begin{tabular}{@{}lll@{}}
- \toprule
- & Setup 1 (workstation) & Setup 2 (embedded)\\
- \midrule
- CPU & Intel Xeon E5-1630 & Intel Atom D525 \\
- Chipset & Intel C612 & Intel ICH9R Express \\
- GPU & AMD FirePro W9100 & AMD FirePro W9100 \\
- PCIe slot: System memory & x8 Gen3 & x4 Gen1 \\
- PCIe slot: FPGA \& GPU & x8 Gen3 (different RC) & x8 Gen3 (same RC) \\
- \bottomrule
- \end{tabular}
- \end{table}
- We carried out performance measurements on two different setups in order to
- evaluate two different ways of connecting PCIe devices: same PCIe Root Complex
- (RC) and different RCs. In addition, we also wanted to verify that low-power
- embedded systems can drive DMA transfers if both the FPGA board and the GPU
- are connected through a fast PCIe switch. The setups are described in
- Table~\ref{table:setups}. In both setups, a Xilinx VC709 evaluation board was
- used. In Setup 1, the FPGA board and the GPU were each plugged into a PCIe 3.0
- slot, but they were connected to different RCs. In Setup 2, a low-end
- Supermicro X7SPA-HF-D525 system was connected to a Netstor NA255A external
- PCIe enclosure. As opposed to Setup 1, both the FPGA board and the GPU were
- connected to the same RC through an x16 Gen3 PCIe link available within the
- Netstor box. In the case of FPGA-to-CPU data transfers, the software
- implementation is the one described in~\cite{rota2015dma}.
- \subsection{Throughput}
- \begin{figure}[t]
- \includegraphics[width=0.85\textwidth]{figures/throughput}
- \caption{%
- Measured throughput for data transfers from FPGA to main memory
- (CPU) and from FPGA to the global GPU memory (GPU) using Setup 1.
- }
- \label{fig:throughput}
- \end{figure}
- In order to evaluate the maximum performance of the DMA engine, measurements
- of pure data throughput were carried out using Setup 1. The results are shown
- in \figref{fig:throughput} for transfers to the system's main memory as well
- as to the GPU's global memory. For FPGA-to-GPU data transfers larger than
- 95 MB, the double buffering mechanism was used. As one can see, in both cases
- the write performance is primarily limited by the PCIe bus. Up to a transfer
- size of 2 MB, the throughput to the GPU slowly approaches 100 MB/s. From there on,
- the throughput increases up to 6.4 GB/s at about 1 GB data size. The CPU
- throughput saturates earlier, at a maximum of 6.6 GB/s. The slope and maximum
- performance depend on the different implementations of the handshaking
- sequence between the DMA engine and the host. With Setup 2, the PCIe
- 1.0 link limits the throughput to system main memory to around 700 MB/s.
- However, transfers to GPU memory yielded the same results as Setup 1,
- demonstrating that a high throughput can be achieved even if a slow embedded
- PC is used to drive the communication.
- \subsection{Latency}
- \begin{figure}[t]
- \centering
- \begin{subfigure}[b]{.49\textwidth}
- \centering
- \includegraphics[width=\textwidth]{figures/latency-cpu}
-
- \label{fig:latency-cpu}
- \vspace{-0.4\baselineskip}
- \caption{System memory transfer latency}
- \end{subfigure}
- \begin{subfigure}[b]{.49\textwidth}
- \includegraphics[width=\textwidth]{figures/latency-gpu}
- \label{fig:latency-gpu}
- \vspace{-0.4\baselineskip}
- \caption{GPU memory transfer latency}
- \end{subfigure}
- \caption{%
- Measured round-trip latency for data transfers to system main memory (a) and GPU memory (b).
- }
- \label{fig:latency}
- \end{figure}
- We conducted the following test in order to measure the latency introduced by the DMA engine:
- 1) the host starts a DMA transfer by issuing the \emph{start\_dma} command,
- 2) the DMA engine transmits data into the system main memory,
- 3) when all the data has been transferred, the DMA engine notifies the host that
- new data is present by writing into a specific address in the system main
- memory,
- 4) the host acknowledges that data has been received by issuing the \emph{stop\_dma} command.
- A counter on the FPGA measures the time interval between the \emph{start\_dma}
- and \emph{stop\_dma} commands with a resolution of 4 ns, therefore measuring the
- round-trip latency of the system. The correct ordering of the packets is
- guaranteed by the PCIe protocol. The measured round-trip latencies for data
- transfers to system main memory and GPU memory are shown in
- \figref{fig:latency}.
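The listing below sketches the host side of this four-step handshake in C. The control register offsets and the notification word are placeholders (the register map is not part of this paper); the actual time interval is measured by the 4 ns counter on the FPGA.
\begin{verbatim}
/* Sketch of the host side of the round-trip latency test. Register
 * offsets and the notification mechanism are placeholders; the timing
 * itself is done by the FPGA counter (4 ns resolution). */
#include <stdint.h>

#define REG_START_DMA  0x00   /* hypothetical control registers */
#define REG_STOP_DMA   0x04
#define REG_LATENCY    0x08   /* counter value in 4 ns ticks */

volatile uint32_t *fpga_regs;   /* mapped FPGA BAR            */
volatile uint64_t *notify_word; /* written by the DMA engine  */

uint64_t measure_round_trip_ns(void)
{
    uint64_t before = *notify_word;

    fpga_regs[REG_START_DMA / 4] = 1;   /* 1) issue start_dma          */

    while (*notify_word == before)      /* 2)-3) wait for notification */
        ;

    fpga_regs[REG_STOP_DMA / 4] = 1;    /* 4) issue stop_dma           */

    return (uint64_t) fpga_regs[REG_LATENCY / 4] * 4;
}
\end{verbatim}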
- With Setup 1 and system memory, latencies as low as 1.1 \textmu s can be
- achieved for a packet size of 1024 B. The higher latencies and the dependency
- on packet size measured with Setup 2 are caused by the slower PCIe x4 1.0 link
- connecting the FPGA board to the system main memory.
- The same test was performed when transferring data into GPU memory. Like in
- the previous case, the notification was written into main memory. This
- approach was used because a latency of 100 to 200 \textmu s introduced by
- OpenCL scheduling did not allow a precise measurement based only on
- FPGA-to-GPU communication. When the devices are connected to the same RC, as
- in Setup 2, a latency of 2 \textmu s is achieved, limited by the latency to system main
- memory, as seen in \figref{fig:latency} (a). On the contrary, if the FPGA
- board and the GPU are connected to different RCs, as in Setup 1, the latency
- increases significantly with packet size. Connecting the devices to the same
- RC is therefore necessary to achieve the best performance, as stated in
- NVIDIA's GPUDirect documentation. It must be noted that, due to
- the measurement approach, data is transferred to GPU memory while the
- completion is determined by the packet written into system memory. This
- desynchronizes the transfer in case PCIe buffers are used on the receiver side:
- the packet destined to system memory may be written before all data has been
- processed in the PCIe receiving buffer and delivered to GPU memory. The low
- latencies measured for packet sizes below 1 kB may be attributed to this
- effect, which must be taken into account as it could potentially lead to data
- corruption.
-
- %%%%%%%%%%%
- \section{Conclusion and outlook}
- We developed a hardware and software solution that enables DMA transfers
- between FPGA-based readout systems and GPU computing clusters.
- The net throughput is primarily limited by the PCIe link, reaching 6.4 GB/s
- for FPGA-to-GPU data transfers and 6.6 GB/s for transfers to the CPU's main
- memory. The measurements on a low-end system based on an Intel Atom CPU
- showed no significant difference in throughput performance. Depending on the
- application and computing requirements, this result makes smaller acquisition
- systems a cost-effective alternative to larger workstations.
- We measured a round-trip latency of 1 \textmu s when transferring data between
- the DMA engine and system memory. We also assessed the applicability of
- DirectGMA in low-latency applications: preliminary results show that latencies
- as low as 2 \textmu s can be achieved during data transfers to GPU memory.
- However, at the time of writing this paper, the latency introduced by OpenCL
- scheduling is in the range of hundreds of \textmu s. In order to lift this
- limitation and make our implementation useful in low-latency applications, we
- are currently optimizing the OpenCL code that interfaces the GPU with the DMA
- engine, with technical support from AMD. Moreover, our measurements show that
- dedicated interconnect hardware, with the FPGA and the GPU sharing the same RC,
- must be employed in low-latency applications.
- In order to increase the total throughput, a custom FPGA evaluation board is
- currently under development. The board mounts a Virtex-7 chip and features two
- fully populated FMC connectors, a 119 Gb/s DDR memory interface and a PCIe x16
- 3.0 connection. Two PCIe x8 3.0 cores, instantiated on the board, will be mapped
- as a single x16 device by using an external PCIe switch. With two cores
- operating in parallel, we foresee an increase in the data throughput by a
- factor of two as demonstrated in~\cite{rota2015dma}.
- The proposed software solution allows seamless multi-GPU processing of the
- incoming data, thanks to its integration in our streamed computing framework.
- This enables straightforward integration with different DAQ systems and the
- introduction of custom data processing algorithms. Support for NVIDIA's
- GPUDirect technology is also foreseen in the coming months, to lift the
- restriction to one specific GPU vendor and to compare the performance of
- hardware from different vendors. Further improvements are expected from
- generalizing the transfer mechanism to include InfiniBand support besides the
- existing PCIe connection.
- Our goal is to develop a unique hybrid solution, based on commercial standards,
- that includes fast data transmission protocols and a high performance GPU
- computing framework.
- \acknowledgments
- This work was partially supported by the German-Russian BMBF funding programme,
- grant numbers 05K10CKB and 05K10VKE.
- \bibliographystyle{JHEP}
- \bibliography{literature}
- \end{document}
|