\documentclass{JINST}
\usepackage[utf8]{inputenc}
\usepackage{lineno}
\usepackage{ifthen}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{textcomp}
\usepackage{booktabs}
\newboolean{draft}
\setboolean{draft}{true}
\newcommand{\figref}[1]{Figure~\ref{#1}}
\title{A high-throughput readout architecture based on PCI-Express Gen3 and DirectGMA technology}
\author{
L.~Rota$^a$,
M.~Vogelgesang$^a$,
L.E.~Ardila Perez$^a$,
M.~Balzer$^a$,
M.~Caselle$^a$,
S.~Chilingaryan$^a$,
A.~Kopmann$^a$,
T.~Dritschler$^a$,
M.~Weber$^a$,
N.~Zilio$^a$\\
\llap{$^a$}Institute for Data Processing and Electronics,\\
Karlsruhe Institute of Technology (KIT),\\
Herrmann-von-Helmholtz-Platz 1, Karlsruhe, Germany \\
E-mail: \email{lorenzo.rota@kit.edu}, \email{matthias.vogelgesang@kit.edu}
}
\abstract{%
Modern physics experiments have reached multi-GB/s data rates. Fast data
links and high performance computing stages are required for continuous data
acquisition and processing. Because of their intrinsic parallelism and
computational power, GPUs have emerged as an ideal solution to process this data in
high performance computing applications. In this paper we present a
high-throughput platform based on direct FPGA-GPU communication.
The architecture consists of a
Direct Memory Access (DMA) engine compatible with the Xilinx PCI-Express core,
a Linux driver for register access, and high-level software to manage direct
memory transfers using AMD's DirectGMA technology. Preliminary measurements with a Gen3
x8 link show a throughput of up to 6.4 GB/s and a latency of 40 \textmu s.
Our implementation is suitable for real-time DAQ system applications ranging
from photon science and medical imaging to High Energy Physics (HEP) trigger
systems.
}
\keywords{FPGA; GPU; PCI-Express; OpenCL; DirectGMA}
\begin{document}
\ifdraft
\setpagewiselinenumbers
\linenumbers
\fi
\section{Introduction}
GPU computing has become the main driving force for high performance computing
due to its unprecedented parallelism and low cost-benefit ratio. GPU
acceleration has found its way into numerous applications, ranging from
simulation to image processing. Recent years have also seen an increasing
interest in GPU-based systems for High Energy Physics (HEP) experiments
(\emph{e.g.} ATLAS~\cite{atlas_gpu}, ALICE~\cite{alice_gpu},
Mu3e~\cite{mu3e_gpu}, PANDA~\cite{panda_gpu}). In a typical HEP scenario, data
are acquired by back-end readout systems and then transmitted in short bursts or
in a continuous streaming mode to a computing stage.
With expected data rates of several GB/s, the data transmission link may
partially limit the overall system performance. In particular, latency becomes
the most stringent requirement for time-deterministic applications, \emph{e.g.}
in Low/High-level trigger systems. Furthermore, the amount of data produced at
current-generation photon science facilities has become comparable to that
traditionally associated with HEP.
Due to its high bandwidth and modularity,
PCIe quickly became the commercial standard for connecting high-throughput
peripherals such as GPUs or solid state disks. Moreover, optical PCIe networks were demonstrated a decade ago~\cite{optical_pcie}, opening the possibility of using PCIe as a communication bus over long distances.
Several solutions for direct FPGA/GPU communication based on PCIe are reported
in the literature, and all of them are based on NVIDIA's GPUDirect technology.
In the implementation of Bittner and Ruf~\cite{bittner} the GPU acts as master
during an FPGA-to-GPU data transfer, reading data from the FPGA. This solution
limits the reported bandwidth and latency to 514 MB/s and 40~\textmu s,
respectively.
%LR: FPGA^2 it's the name of their thing...
When the FPGA is used as a master, a higher throughput can be achieved. An
example of this approach is the FPGA\textsuperscript{2}
framework by Thoma et~al.~\cite{thoma}, which reaches 2454 MB/s using an x8 Gen2.0
data link.
Lonardo et~al.\ achieved low latencies with their NaNet design, an FPGA-based
PCIe network interface card~\cite{lonardo2015nanet}. The GbE link, however,
limits the latency performance of the system to a few tens of \textmu s. If only
the FPGA-to-GPU latency is considered, the measured values span between
1~\textmu s and 6~\textmu s, depending on the datagram size. Moreover, the
bandwidth saturates at 120 MB/s.
Nieto et~al.\ presented a system based on a PXIexpress data link that makes use
of four PCIe 1.0 links~\cite{nieto2015high}.
Their system (as limited by the interconnect) achieves an average throughput of
870 MB/s with 1 KB block transfers.
In order to achieve the best performance in terms of latency and bandwidth,
we developed a high-performance DMA engine based on Xilinx's PCIe Gen3 Core.
To process the data, we encapsulated the DMA setup and memory mapping in a
plugin for our scalable GPU processing framework~\cite{vogelgesang2012ufo}. This
framework allows for an easy construction of streamed data processing on
heterogeneous multi-GPU systems. However, because the framework is based on OpenCL,
integration with NVIDIA's CUDA functions for GPUDirect technology
is not possible.
We therefore integrated direct FPGA-to-GPU communication into our processing pipeline
using AMD's DirectGMA technology. In this paper we report the performance of our
DMA engine for FPGA-to-CPU communication and the first preliminary results with
DirectGMA technology.
\section{Architecture}
As shown in \figref{fig:trad-vs-dgpu} (a), traditional FPGA-GPU systems route
data through system main memory by copying data from the FPGA into intermediate
buffers and then finally into the GPU's main memory.
Thus, the total throughput and latency of the system are limited by the main
memory bandwidth. NVIDIA's GPUDirect and AMD's DirectGMA technologies allow
direct communication between GPUs and auxiliary devices over the PCIe bus.
By combining these technologies with DMA data transfers (see \figref{fig:trad-vs-dgpu} (b)),
the overall latency of the system is reduced and the total throughput increased.
Moreover, the CPU and main system memory are relieved from processing because
they are not directly involved in the data transfer anymore.
\begin{figure}[t]
\centering
\includegraphics[width=1.0\textwidth]{figures/transf}
\caption{%
In a traditional DMA architecture (a), data are first written to the main
system memory and then sent to the GPUs for final processing. By using
GPUDirect/DirectGMA technology (b), the DMA engine has direct access to the
GPU's internal memory.
}
\label{fig:trad-vs-dgpu}
\end{figure}
\subsection{DMA engine implementation on the FPGA}
We have developed a DMA architecture that minimizes resource utilization while
maintaining the flexibility of a Scatter-Gather memory
policy~\cite{rota2015dma}. The engine is compatible with the Xilinx PCIe Gen2/3
IP-Core~\cite{xilinxgen3} for the Xilinx 6 and 7 series FPGA families. DMA transmissions to
main system memory and GPU memory are both supported. Two FIFOs, with a data
width of 256 bits and operating at 250 MHz, act as user-friendly interfaces to
the custom logic and provide an input bandwidth of 7.45 GB/s. The user logic and the DMA
engine are configured by the host through PIO registers.
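The quoted input bandwidth follows directly from the FIFO data width and clock
frequency; note that the figure corresponds to binary units (GiB/s):
\begin{equation*}
256~\mathrm{bit} \times 250~\mathrm{MHz} = 8 \times 10^{9}~\mathrm{B/s} \approx 7.45~\mathrm{GiB/s}.
\end{equation*}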
The physical addresses of the host's memory buffers are stored in an internal
memory and are dynamically updated by the driver or the user, allowing highly
efficient zero-copy data transfers. The maximum size associated with each
address is 2 GB. The resource utilization
on a Virtex-7 device is reported in Table~\ref{table:utilization}.
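Conceptually, each entry of this address table pairs a physical buffer address
with a transfer length. A hypothetical host-side representation of such an entry
is sketched below; the actual register layout of the engine is not shown here.
\begin{verbatim}
#include <stdint.h>

/* Hypothetical host-side view of one scatter-gather entry. */
struct dma_sg_entry {
    uint64_t phys_addr;   /* physical address of a host or GPU buffer */
    uint32_t length;      /* transfer size in bytes, up to 2 GB       */
};
\end{verbatim}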
\begin{table}[t]
\centering
\caption{Resource utilization of the DMA engine on a Virtex-7 device.}
\label{table:utilization}
\begin{tabular}{@{}llll@{}}
\toprule
Resource & Used & Available & Utilization (\%) \\
\midrule
LUT & 5331 & 433200 & 1.23 \\
LUTRAM & 56 & 174200 & 0.03 \\
FF & 5437 & 866400 & 0.63 \\
BRAM & 20.50 & 1470 & 1.39 \\
\bottomrule
\end{tabular}
\end{table}
\subsection{OpenCL management on host side}
\label{sec:host}
On the host side, AMD's DirectGMA technology, an implementation of the
bus-addressable memory extension for OpenCL 1.1 and later, is used to write from
the FPGA to GPU memory and from the GPU to the FPGA's control registers.
\figref{fig:opencl-setup} illustrates the main mode of operation: to write into
the GPU, the physical bus addresses of the GPU buffers are determined with a call to
\texttt{clEnqueue\-Make\-Buffers\-Resident\-AMD} and set by the host CPU in a
control register of the FPGA (1). The FPGA then writes data blocks autonomously
in DMA fashion (2).
Due to hardware restrictions, the largest possible GPU buffer sizes are about 95
MB, but larger transfers can be achieved by using a double buffering mechanism.
Because the GPU provides a flat memory address space and our DMA engine allows
multiple destination addresses to be set in advance, we can determine all
addresses before the actual transfers, thus keeping the CPU out of the transfer
loop for data sizes of less than 95 MB.
To signal events to the FPGA (4), the control registers can be mapped into the
GPU's address space by passing a special AMD-specific flag and the physical
BAR address of the FPGA configuration memory to the \texttt{cl\-Create\-Buffer}
function. From the GPU, this memory is seen transparently as regular GPU memory
and can be written to accordingly (3). In our setup, trigger registers are used to
notify the FPGA of successful or failed evaluation of the data.
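As an illustration, a minimal OpenCL kernel writing such a trigger word into the
mapped FPGA register region might look as follows; the register offset and the
verdict encoding are hypothetical and depend on the actual FPGA register map.
\begin{verbatim}
#define TRIGGER_REG_OFFSET 4   /* hypothetical word offset into the BAR */

/* fpga_regs is the cl_mem created with the AMD-specific flag and is
   passed to the kernel like any other global buffer. */
__kernel void notify_fpga(__global volatile uint *fpga_regs,
                          const uint verdict)
{
    /* A single work-item writes the trigger register. */
    if (get_global_id(0) == 0)
        fpga_regs[TRIGGER_REG_OFFSET] = verdict;
}
\end{verbatim}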
Using the \texttt{cl\-Enqueue\-Copy\-Buffer} function call it is possible
to write entire memory regions in DMA fashion to the FPGA.
In this case, the GPU acts as bus master and pushes data to the FPGA.
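The following host-side sketch summarizes this setup. It is a minimal example,
assuming the \texttt{cl\_amd\_bus\_addressable\_memory} extension and its types
from the vendor's \texttt{cl\_ext.h}; the FPGA register helper, the register
offsets and the BAR constants are hypothetical placeholders for the actual
driver interface, and error handling as well as buffer migration are omitted.
\begin{verbatim}
#include <CL/cl.h>
#include <CL/cl_ext.h>  /* cl_bus_address_amd, CL_MEM_*_AMD flags */

/* Hypothetical helper provided by our Linux driver: writes a 64-bit
   value into an FPGA PIO register. */
extern void fpga_write_reg64(unsigned int reg, cl_ulong value);

#define DMA_BUF_SIZE      (64 << 20)  /* 64 MB, below the ~95 MB limit  */
#define REG_GPU_BUS_ADDR  0x40        /* hypothetical FPGA register     */
#define FPGA_BAR_ADDR     0xf0000000  /* hypothetical physical BAR addr */
#define FPGA_BAR_SIZE     (4 << 10)

/* Prototype as specified by the bus-addressable memory extension. */
typedef cl_int (*MakeBuffersResidentAMD_fn)(cl_command_queue, cl_uint,
    cl_mem *, cl_bool, cl_bus_address_amd *, cl_uint, const cl_event *,
    cl_event *);

static void setup_direct_gma(cl_platform_id platform, cl_context ctx,
                             cl_command_queue queue)
{
    cl_int err;
    cl_bus_address_amd gpu_addr, fpga_bar;

    /* (1) Create a GPU buffer that can be exposed on the PCIe bus and
       pin it to obtain its physical bus address. */
    cl_mem gpu_buf = clCreateBuffer(ctx, CL_MEM_BUS_ADDRESSABLE_AMD,
                                    DMA_BUF_SIZE, NULL, &err);
    MakeBuffersResidentAMD_fn make_resident = (MakeBuffersResidentAMD_fn)
        clGetExtensionFunctionAddressForPlatform(platform,
            "clEnqueueMakeBuffersResidentAMD");
    make_resident(queue, 1, &gpu_buf, CL_TRUE, &gpu_addr, 0, NULL, NULL);

    /* The bus address is written into an FPGA control register; the FPGA
       then transfers data blocks autonomously in DMA fashion (2). */
    fpga_write_reg64(REG_GPU_BUS_ADDR, gpu_addr.surface_bus_address);

    /* (3, 4) Map the FPGA BAR into the GPU address space so that kernels
       can write to the FPGA control registers directly. */
    fpga_bar.surface_bus_address = FPGA_BAR_ADDR;
    fpga_bar.marker_bus_address  = FPGA_BAR_ADDR;
    cl_mem fpga_regs = clCreateBuffer(ctx, CL_MEM_EXTERNAL_PHYSICAL_AMD,
                                      FPGA_BAR_SIZE, &fpga_bar, &err);
    (void)fpga_regs;
}
\end{verbatim}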
\begin{figure}
\centering
\includegraphics[width=0.75\textwidth]{figures/opencl-setup}
\caption{The FPGA writes to GPU memory by mapping the physical address of a
GPU buffer and initiating DMA transfers. Signalling happens in reverse order by
mapping the FPGA control registers into the address space of the GPU.}
\label{fig:opencl-setup}
\end{figure}
To process the data, we encapsulated the DMA setup and memory mapping in a
plugin for our scalable GPU processing framework~\cite{vogelgesang2012ufo}. This
framework allows for an easy construction of streamed data processing on
heterogeneous multi-GPU systems. For example, to read data from the FPGA, decode
its specific data format, run a Fourier transform on the GPU and
write back the results to disk, one can run the following on the command line:
\begin{verbatim}
ufo-launch direct-gma ! decode ! fft ! write filename=out.raw
\end{verbatim}
The framework takes care of scheduling the tasks and distributing the data items
to one or more GPUs. High throughput is achieved by the combination of fine-
and coarse-grained data parallelism, \emph{i.e.} processing a single data item
on a GPU using thousands of threads and splitting the data stream to feed
individual data items to separate GPUs. None of this requires any user
intervention and is solely determined by the framework in an automated
fashion. A complementary application programming interface allows users to
develop custom applications written in C or high-level languages such as Python.
\section{Results}
We carried out performance measurements on a machine with an Intel Xeon E5-1630
CPU at 3.7 GHz and an Intel C612 chipset, running openSUSE 13.1 with Linux 3.11.10. The
Xilinx VC709 evaluation board was plugged into one of the PCIe 3.0 x8 slots.
For FPGA-to-CPU data transfers, the software implementation is the one
described in~\cite{rota2015dma}.
\begin{figure}
\centering
\begin{subfigure}[b]{.49\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/throughput}
\caption{%
DMA data transfer throughput.
}
\label{fig:throughput}
\end{subfigure}
\begin{subfigure}[b]{.49\textwidth}
\includegraphics[width=\textwidth]{figures/latency}
\caption{%
Latency distribution.
% for a single 4 KB packet transferred
% from FPGA-to-CPU and FPGA-to-GPU.
}
\label{fig:latency}
\end{subfigure}
\caption{%
Measured results for data transfers from FPGA to main memory
(CPU) and from FPGA to the global GPU memory (GPU).
}
\end{figure}
The measured results for the pure data throughput are shown in
\figref{fig:throughput} for transfers from the FPGA to the system's main memory
as well as to the global GPU memory, as explained in Section~\ref{sec:host}. As one can see,
in both cases the write performance is primarily limited by the PCIe bus. Larger
payloads amortize the constant overhead, thus increasing the net bandwidth. Up
to a transfer size of 2 MB, the throughput to the GPU slowly approaches
100 MB/s. From there on, the throughput increases up to 6.4 GB/s when PCIe bus
saturation sets in at about 1 GB data size.
The CPU throughput saturates earlier, at about 30 MB, but the maximum throughput
is limited to about 6 GB/s, a loss of about 6\% in write performance.
We repeated the FPGA-to-GPU measurements on a low-end Supermicro X7SPA-HF-D525
system based on an Intel Atom CPU. The results showed no significant difference
compared to the previous setup. Depending on the application and computing
requirements, this result makes smaller acquisition systems a cost-effective
alternative to larger workstations.
\begin{figure}
\includegraphics[width=\textwidth]{figures/intra-copy}
\caption{%
Throughput in MB/s for an intra-GPU data transfer of smaller block sizes
(4 KB -- 24 MB) into a larger destination buffer (32 MB -- 128 MB). The lower
performance for smaller block sizes is caused by the larger number of
transfers required to fill the destination buffer. The throughput has been
estimated using the host-side wall clock time. The raw GPU data transfer as
measured by event profiling is about twice as fast.
}
\label{fig:intra-copy}
\end{figure}
In order to write more than the maximum possible transfer size of 95 MB, we
repeatedly wrote to the same buffer, which would not be possible in a real-world
application. As a solution, we motivated the use of multiple copies in Section
\ref{sec:host}. To verify that we can keep up with the incoming data throughput
using this strategy, we measured the data throughput within a GPU by copying
data from a smaller buffer representing the DMA buffer to a larger
destination buffer. \figref{fig:intra-copy} shows the measured throughput for
three destination buffer sizes and an increasing block size. At a block size of about 384 KB, the
throughput surpasses the maximum possible PCIe bandwidth, thus making a double
buffering strategy a viable solution for very large data transfers.
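The following sketch illustrates this copy strategy; the block and destination
buffer sizes are illustrative, and the synchronization with the FPGA is only
indicated by a comment.
\begin{verbatim}
#include <CL/cl.h>

#define BLOCK_SIZE  (384 << 10)   /* 384 KB block, illustrative     */
#define DST_SIZE    (128 << 20)   /* 128 MB destination, illustrative */

/* Repeatedly copy the small, bus-addressable DMA buffer into successive
   offsets of a large destination buffer while the FPGA fills the next
   block. */
static void drain_dma_buffer(cl_command_queue queue, cl_mem dma_buf,
                             cl_mem dst_buf)
{
    size_t offset;

    for (offset = 0; offset + BLOCK_SIZE <= DST_SIZE; offset += BLOCK_SIZE) {
        /* ... wait until the FPGA signals that dma_buf holds new data ... */
        clEnqueueCopyBuffer(queue, dma_buf, dst_buf,
                            0, offset, BLOCK_SIZE, 0, NULL, NULL);
    }
    clFinish(queue);
}
\end{verbatim}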
For HEP experiments, low latencies are necessary to react in a reasonable time
frame. In order to measure the latency caused by the communication overhead we
employed the following protocol: 1) the host issues continuous data transfers
of a 4 KB buffer, initialized with a fixed value, to the FPGA using the
\texttt{cl\-Enqueue\-Copy\-Buffer} call. 2) When the FPGA receives data in its
input FIFO, it moves it directly to the output FIFO, which feeds the outgoing DMA
engine, thus pushing the data back to the GPU. 3) At some point, the host
enables generation of data different from the initial value, which also starts an
internal FPGA counter with 4 ns resolution. 4) When the generated data is
received again at the FPGA, the counter is stopped. 5) The host program reads
out the counter values and computes the round-trip latency.
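Assuming symmetric FPGA-to-GPU and GPU-to-FPGA paths, each one-way latency value
is derived from the counter reading $N$ as
\begin{equation*}
t_{\mathrm{one\text{-}way}} = \frac{N \times 4~\mathrm{ns}}{2}.
\end{equation*}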
The distribution of
10000 measurements of the one-way latency is shown in \figref{fig:latency}. The
GPU latency has a mean value of 84.38 \textmu s and a standard deviation of
6.34 \textmu s. This is 9.73\% higher than the CPU latency of 76.89 \textmu s
that was measured using the same driver and measuring procedure. The
non-Gaussian distribution with two distinct peaks indicates a systematic influence
that we cannot control and that is most likely caused by the non-deterministic
run-time behaviour of the operating system scheduler.
\section{Conclusion and outlook}
We developed a hardware and software solution that enables DMA
transfers between FPGA-based readout boards and GPU computing clusters.
The software solution that we proposed allows seamless multi-GPU
processing of the incoming data, thanks to its integration in our streamed computing
framework. This allows straightforward integration with different DAQ systems
and the introduction of custom data processing algorithms.
The net throughput is primarily limited by the PCIe bus, reaching 6.4 GB/s
for an FPGA-to-GPU data transfer and 6.6 GB/s for an FPGA-to-CPU data transfer.
By writing directly into GPU memory instead of routing data through system
main memory, the overall latency can be reduced, thus allowing closely coupled
massively parallel computation on GPUs.
Optimization of the GPU DMA interfacing code is ongoing with the help of
technical support by AMD. With a better understanding of the hardware and
software aspects of DirectGMA, we expect a significant improvement in the latency
performance.
In order to increase the total throughput, a custom FPGA evaluation board is
currently under development. The board hosts a Virtex-7 chip and features two
fully populated FMC connectors, a 119 Gb/s DDR memory interface and a PCIe Gen3
x16 connection. Two PCIe x8 cores, instantiated on the board, will be mapped as
a single x16 device by using an external PCIe switch. With two cores operating
in parallel, we foresee an increase in the data throughput by a factor of 2 (as
demonstrated in~\cite{rota2015dma}).
Support for NVIDIA's GPUDirect technology is also foreseen in the next months, to
lift the restriction to one specific GPU vendor and to compare the performance of
hardware from different vendors.
Further improvements are expected from generalizing the transfer mechanism and
including InfiniBand support besides the existing PCIe connection.
%% Where do we get these values? Any reference?
%This allows
%speeds of up to 290 Gb/s and latencies as low as 0.5 \textmu s.
Our goal is to develop a unique hybrid solution, based on commercial standards,
that includes fast data transmission protocols and a high performance GPU
computing framework.
\acknowledgments
This work was partially supported by the German-Russian BMBF funding programme,
grant numbers 05K10CKB and 05K10VKE.
\bibliographystyle{JHEP}
\bibliography{literature}
\end{document}