\documentclass{JINST}
\usepackage[utf8]{inputenc}
\usepackage{lineno}
\usepackage{ifthen}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{textcomp}
\usepackage{booktabs}
\newboolean{draft}
\setboolean{draft}{true}
\newcommand{\figref}[1]{Figure~\ref{#1}}
\title{A high-throughput readout architecture based on PCI-Express Gen3 and DirectGMA technology}
\author{
L.~Rota$^a$,
M.~Vogelgesang$^a$,
L.E.~Ardila Perez$^a$,
M.~Caselle$^a$,
S.~Chilingaryan$^a$,
T.~Dritschler$^a$,
N.~Zilio$^a$,
A.~Kopmann$^a$,
M.~Balzer$^a$,
M.~Weber$^a$\\
\llap{$^a$}Institute for Data Processing and Electronics,\\
Karlsruhe Institute of Technology (KIT),\\
Hermann-von-Helmholtz-Platz 1, Karlsruhe, Germany \\
E-mail: \email{lorenzo.rota@kit.edu}, \email{matthias.vogelgesang@kit.edu}
}
\abstract{% Modern physics experiments have reached multi-GB/s data rates.
Fast data links and high performance computing stages are required for
continuous data acquisition and processing. Because of their intrinsic
parallelism and computational power, GPUs have emerged as an ideal solution to
process this data in high performance computing applications. In this paper
we present a high-throughput platform based on direct FPGA-GPU
communication. The architecture consists of a Direct Memory Access (DMA)
engine compatible with the Xilinx PCI-Express core, a Linux driver for
register access, and high-level software to manage direct memory transfers
using AMD's DirectGMA technology. Measurements with a Gen\,3\,x8 link show a
throughput of up to 6.4 GB/s. We also evaluated DirectGMA performance for
low-latency applications: preliminary results show a round-trip latency of
2~\textmu s for data sizes up to 4 kB. Our implementation is suitable for
real-time DAQ applications ranging from photon science and medical imaging
to High Energy Physics (HEP) trigger systems. }
\keywords{FPGA; GPU; PCI-Express; OpenCL; DirectGMA}
\begin{document}
\ifdraft
\setpagewiselinenumbers
\linenumbers
\fi
\section{Introduction}
GPU computing has become the main driving force for high performance computing
due to its unprecedented parallelism and favorable cost-benefit ratio. GPU
acceleration has found its way into numerous applications, ranging from
simulation to image processing. Recent years have also seen an increasing
interest in GPU-based systems for High Energy Physics (HEP) (\emph{e.g.}
ATLAS~\cite{atlas_gpu}, ALICE~\cite{alice_gpu}, Mu3e~\cite{mu3e_gpu},
PANDA~\cite{panda_gpu}) and photon science experiments.
In a typical scenario, data are acquired by back-end readout systems and then
transmitted in short bursts or in a continuous streaming mode to a computing
stage.
The data rates of bio-imaging or beam-monitoring experiments running in
current generation photon science facilities have reached tens of
GB/s~\cite{panda_gpu, atlas_gpu}. In order to collect data over long
observation times, the readout architecture must be able to acquire and save
data continuously, and the throughput of the data transmission link may
partially limit the overall system performance.
Latency becomes the most stringent requirement for time-deterministic
applications, \emph{e.g.} in low/high-level trigger systems.
Due to its high bandwidth and modularity, PCIe quickly became the commercial
standard for connecting high-throughput peripherals such as GPUs or
solid-state disks. Moreover, optical PCIe networks were demonstrated a decade
ago~\cite{optical_pcie}, opening the possibility of using PCIe as a
communication link over long distances.
Several solutions for direct FPGA/GPU communication based on PCIe are reported
in the literature, and all of them are based on NVIDIA's GPUDirect technology.
In the implementation of Bittner and Ruf~\cite{bittner}, the GPU acts as
master during an FPGA-to-GPU data transfer, reading data from the FPGA. This
solution limits the reported bandwidth and latency to 514 MB/s and
40~\textmu s, respectively.
%LR: FPGA^2 it's the name of their thing...
%MV: best idea in the world :)
When the FPGA is used as a master, a higher throughput can be achieved. An
example of this approach is the \emph{FPGA\textsuperscript{2}} framework by
Thoma et~al.~\cite{thoma}, which reaches 2454 MB/s using an x8 Gen2 data link.
Lonardo et~al.\ achieved low latencies with their NaNet design, an FPGA-based
PCIe network interface card~\cite{lonardo2015nanet}. The GbE link, however,
limits the latency performance of the system to a few tens of \textmu s. If
only the FPGA-to-GPU latency is considered, the measured values span between
1~\textmu s and 6~\textmu s, depending on the datagram size. Moreover, the
bandwidth saturates at 120 MB/s. Nieto et~al.\ presented a system based on a
PXI Express data link that makes use of four PCIe 1.0
links~\cite{nieto2015high}. Their system (as limited by the interconnect)
achieves an average throughput of 870 MB/s with 1 KB block transfers.
In order to achieve the best performance in terms of latency and bandwidth, we
developed a high-performance DMA engine based on Xilinx's PCIe Gen3 Core. To
process the data, we encapsulated the DMA setup and memory mapping in a plugin
for our scalable GPU processing framework~\cite{vogelgesang2012ufo}, which
enables streamed data processing on heterogeneous multi-GPU systems. Because
the framework is based on OpenCL, integration with NVIDIA's CUDA functions
for GPUDirect technology is not possible. We therefore integrated direct
FPGA-to-GPU communication into our processing pipeline using AMD's DirectGMA
technology. In this paper we report the performance of our DMA engine for
FPGA-to-CPU communication and preliminary measurements of DirectGMA's
performance in low-latency applications.
\section{Architecture}
As shown in \figref{fig:trad-vs-dgpu} (a), traditional FPGA-GPU systems route
data through system main memory by copying data from the FPGA into
intermediate buffers and then finally into the GPU's main memory. Thus, the
total throughput and latency of the system are limited by the main memory
bandwidth. NVIDIA's GPUDirect and AMD's DirectGMA technologies allow direct
communication between GPUs and auxiliary devices over PCIe. By combining these
technologies with DMA data transfers (see \figref{fig:trad-vs-dgpu} (b)), the
overall latency of the system is reduced and the total throughput increased.
Moreover, the CPU and the main system memory are relieved, because they are no
longer directly involved in the data transfer.
\begin{figure}[t]
\centering
\includegraphics[width=1.0\textwidth]{figures/transf}
\caption{%
In a traditional DMA architecture (a), data are first written to the main
system memory and then sent to the GPUs for final processing. By using
GPUDirect/DirectGMA technology (b), the DMA engine has direct access to
the GPU's internal memory.
}
\label{fig:trad-vs-dgpu}
\end{figure}
\subsection{DMA engine implementation on the FPGA}
We have developed a DMA architecture that minimizes resource utilization while
maintaining the flexibility of a scatter-gather memory
policy~\cite{rota2015dma}. The engine is compatible with the Xilinx PCIe
Gen2/3 IP core~\cite{xilinxgen3} for the Xilinx 6 and 7 series FPGA families.
DMA transmissions to main system memory and to GPU memory are both supported.
Two FIFOs, 256 bits wide and operating at 250 MHz, act as user-friendly
interfaces to the custom logic and provide an input bandwidth of 7.45 GB/s.
The user logic and the DMA engine are configured by the host through PIO
registers.
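The input bandwidth quoted above follows directly from the FIFO data width and
clock frequency:
\[
256\,\mathrm{bit} \times 250\,\mathrm{MHz} = 64\,\mathrm{Gbit/s}
 = 8 \times 10^{9}\,\mathrm{B/s} \approx 7.45\,\mathrm{GiB/s},
\]
i.e.\ the figure of 7.45 GB/s corresponds to the binary interpretation of the
unit.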
The physical addresses of the host's memory buffers are stored in an internal
memory and are dynamically updated by the driver or the user, allowing highly
efficient zero-copy data transfers. The maximum size associated with each
address is 2 GB. The resource utilization on a Virtex-7 device is reported in
Table~\ref{table:utilization}.
\begin{table}[t]
\centering
\caption{Resource utilization on a Virtex-7 X240VT device}
\label{table:utilization}
\begin{tabular}{@{}llll@{}}
\toprule
Resource & Used & Available & Utilization (\%) \\
\midrule
LUT & 5331 & 433200 & 1.23 \\
LUTRAM & 56 & 174200 & 0.03 \\
FF & 5437 & 866400 & 0.63 \\
BRAM & 20.50 & 1470 & 1.39 \\
\bottomrule
\end{tabular}
\end{table}
\subsection{OpenCL management on the host side}
\label{sec:host}
On the host side, AMD's DirectGMA technology, an implementation of the
bus-addressable memory extension for OpenCL 1.1 and later, is used to write
from the FPGA to GPU memory and from the GPU to the FPGA's control registers.
\figref{fig:opencl-setup} illustrates the main mode of operation: to write
into the GPU, the physical bus addresses of the GPU buffers are determined
with a call to \texttt{clEnqueue\-Make\-Buffers\-Resident\-AMD} and set by the
host CPU in a control register of the FPGA (1). The FPGA then writes data
blocks autonomously in DMA fashion (2). Due to hardware restrictions, the
largest possible GPU buffer size is about 95 MB, but larger transfers can be
achieved by using a double-buffering mechanism.
Because the GPU provides a flat memory address space and our DMA engine allows
multiple destination addresses to be set in advance, we can determine all
addresses before the actual transfers, thus keeping the CPU out of the
transfer loop for data sizes of less than 95 MB.
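The following minimal sketch illustrates this setup in OpenCL host code. It
assumes the \texttt{cl\_amd\_bus\_addressable\_memory} extension declarations
from the AMD OpenCL headers; error handling is omitted, and
\texttt{fpga\_write\_register} together with its register offset is a
placeholder for the PIO register access provided by our Linux driver.
\begin{verbatim}
/* Sketch: allocate a GPU buffer that the FPGA can write to directly.
   Assumes cl_amd_bus_addressable_memory; error checks omitted. */
#include <CL/cl.h>
#include <CL/cl_ext.h>

#define DMA_DST_ADDR 0x10   /* hypothetical PIO register offset */
extern void fpga_write_register(unsigned reg, cl_ulong value);

typedef cl_int (*MakeResidentFn)(cl_command_queue, cl_uint, cl_mem *,
    cl_bool, cl_bus_address_amd *, cl_uint, const cl_event *, cl_event *);

void setup_gpu_buffer(cl_platform_id platform, cl_context context,
                      cl_command_queue queue, size_t size)
{
    /* (1) Create a bus-addressable buffer in GPU memory ... */
    cl_mem buf = clCreateBuffer(context, CL_MEM_BUS_ADDRESSABLE_AMD,
                                size, NULL, NULL);

    /* ... pin it and query its physical bus address ... */
    MakeResidentFn make_resident = (MakeResidentFn)
        clGetExtensionFunctionAddressForPlatform(platform,
            "clEnqueueMakeBuffersResidentAMD");
    cl_bus_address_amd addr;
    make_resident(queue, 1, &buf, CL_TRUE, &addr, 0, NULL, NULL);

    /* ... and hand the address to the DMA engine via a PIO register. */
    fpga_write_register(DMA_DST_ADDR, addr.surface_bus_address);
}
\end{verbatim}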
To signal events to the FPGA (4), the control registers can be mapped into the
GPU's address space by passing a special AMD-specific flag and the physical
BAR address of the FPGA configuration memory to the
\texttt{cl\-Create\-Buffer} function. From the GPU, this memory is seen
transparently as regular GPU memory and can be written accordingly (3). In our
setup, trigger registers are used to notify the FPGA of successful or failed
evaluation of the data.
Using the \texttt{cl\-Enqueue\-Copy\-Buffer} function call, it is possible to
write entire memory regions in DMA fashion to the FPGA. In this case, the GPU
acts as bus master and pushes data to the FPGA.
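A corresponding sketch for this reverse direction, continuing the previous
listing with the same headers and assumptions, is shown below. The BAR address
would be obtained from our Linux driver; the buffer size and offsets are
illustrative and error handling is again omitted.
\begin{verbatim}
/* Sketch: map the FPGA control region into the GPU's address space and
   let the GPU push data to it (steps 3 and 4). */
void signal_fpga(cl_context context, cl_command_queue queue,
                 cl_ulong fpga_bar_address, cl_mem results, size_t size)
{
    /* Wrap the FPGA registers, located at their physical BAR address,
       in an OpenCL buffer object. */
    cl_bus_address_amd fpga_addr = { 0 };
    fpga_addr.surface_bus_address = fpga_bar_address;
    cl_mem fpga_regs = clCreateBuffer(context, CL_MEM_EXTERNAL_PHYSICAL_AMD,
                                      size, &fpga_addr, NULL);

    /* The GPU acts as bus master: this copy issues PCIe writes that land
       directly in the FPGA registers. */
    clEnqueueCopyBuffer(queue, results, fpga_regs, 0, 0, size, 0, NULL, NULL);
    clFinish(queue);
}
\end{verbatim}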
\begin{figure}
\centering
\includegraphics[width=0.75\textwidth]{figures/opencl-setup}
\caption{The FPGA writes to GPU memory by mapping the physical address of a
GPU buffer and initiating DMA transfers. Signalling happens in reverse order by
mapping the FPGA control registers into the address space of the GPU.}
\label{fig:opencl-setup}
\end{figure}
To process the data, we encapsulated the DMA setup and memory mapping in a
plugin for our scalable GPU processing framework~\cite{vogelgesang2012ufo}.
This framework allows for an easy construction of streamed data processing on
heterogeneous multi-GPU systems. For example, to read data from the FPGA,
decode its specific data format, run a Fourier transform on the GPU and write
the results back to disk, one can run the following on the command line:
\begin{verbatim}
ufo-launch direct-gma ! decode ! fft ! write filename=out.raw
\end{verbatim}
The framework takes care of scheduling the tasks and distributing the data
items to one or more GPUs. High throughput is achieved by combining fine- and
coarse-grained data parallelism, \emph{i.e.} by processing a single data item
on a GPU using thousands of threads and by splitting the data stream to feed
individual data items to separate GPUs. None of this requires any user
intervention; it is determined solely by the framework in an automated
fashion. A complementary application programming interface allows users to
develop custom applications written in C or high-level languages such as
Python.
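As a rough illustration, a C program equivalent to the pipeline above might
look as follows. This is a minimal sketch based on the GObject-style API of
the framework; the task and property names mirror the command line and should
be taken as assumptions rather than as a definitive reference.
\begin{verbatim}
/* Sketch: build and run the direct-gma ! decode ! fft ! write pipeline
   programmatically. Error handling omitted. */
#include <ufo/ufo.h>

int main(void)
{
    UfoPluginManager *manager = ufo_plugin_manager_new();
    UfoTaskGraph *graph = UFO_TASK_GRAPH(ufo_task_graph_new());
    UfoBaseScheduler *scheduler = ufo_scheduler_new();

    UfoTaskNode *gma    = ufo_plugin_manager_get_task(manager, "direct-gma", NULL);
    UfoTaskNode *decode = ufo_plugin_manager_get_task(manager, "decode", NULL);
    UfoTaskNode *fft    = ufo_plugin_manager_get_task(manager, "fft", NULL);
    UfoTaskNode *write  = ufo_plugin_manager_get_task(manager, "write", NULL);

    g_object_set(G_OBJECT(write), "filename", "out.raw", NULL);

    /* Connect the nodes pairwise into a linear pipeline and run it. */
    ufo_task_graph_connect_nodes(graph, gma, decode);
    ufo_task_graph_connect_nodes(graph, decode, fft);
    ufo_task_graph_connect_nodes(graph, fft, write);
    ufo_base_scheduler_run(scheduler, graph, NULL);
    return 0;
}
\end{verbatim}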
\section{Results}
We carried out performance measurements on a machine with an Intel Xeon
E5-1630 at 3.7 GHz and an Intel C612 chipset, running openSUSE 13.1 with Linux
3.11.10. The Xilinx VC709 evaluation board was plugged into one of the PCIe
3.0 x8 slots. In the case of FPGA-to-CPU data transfers, the software
implementation is the one described in~\cite{rota2015dma}.
\subsection{Throughput}
\begin{figure}
\includegraphics[width=\textwidth]{figures/throughput}
\caption{%
Measured throughput for data transfers from the FPGA to main memory
(CPU) and from the FPGA to the global GPU memory (GPU).
}
\label{fig:throughput}
\end{figure}
% \begin{figure}
% \centering
% \begin{subfigure}[b]{.49\textwidth}
% \centering
% \includegraphics[width=\textwidth]{figures/throughput}
% \caption{%
% DMA data transfer throughput.
% }
% \label{fig:throughput}
% \end{subfigure}
% \begin{subfigure}[b]{.49\textwidth}
% \includegraphics[width=\textwidth]{figures/latency}
% \caption{%
% Latency distribution.
% % for a single 4 KB packet transferred
% % from FPGA-to-CPU and FPGA-to-GPU.
% }
% \label{fig:latency}
% \end{subfigure}
% \caption{%
% Measured throughput for data transfers from FPGA to main memory
% (CPU) and from FPGA to the global GPU memory (GPU).
% }
% \end{figure}
The measured results for the pure data throughput are shown in
\figref{fig:throughput} for transfers from the FPGA to the system's main
memory as well as to the global GPU memory, as explained in
Section~\ref{sec:host}.
% Must ask Suren about this
In the case of FPGA-to-GPU data transfers, the double-buffering solution was
used. As one can see, in both cases the write performance is primarily limited
by the PCIe bus. Up to a transfer size of 2 MB, the throughput to the GPU
slowly approaches 100 MB/s. From there on, the throughput increases up to
6.4 GB/s, where PCIe bus saturation sets in at a data size of about 1 GB. The
throughput to the CPU saturates earlier, reaching a maximum of 6.6 GB/s.
% We repeated the FPGA-to-GPU measurements on a low-end Supermicro X7SPA-HF-D525
% system based on an Intel Atom CPU. The results showed no significant difference
% compared to the previous setup. Depending on the application and computing
% requirements, this result makes smaller acquisition system a cost-effective
% alternative to larger workstations.
% \begin{figure}
% \includegraphics[width=\textwidth]{figures/intra-copy}
% \caption{%
% Throughput in MB/s for an intra-GPU data transfer of smaller block sizes
% (4KB -- 24 MB) into a larger destination buffer (32 MB -- 128 MB). The lower
% performance for smaller block sizes is caused by the larger amount of
% transfers required to fill the destination buffer. The throughput has been
% estimated using the host side wall clock time. The raw GPU data transfer as
% measured per event profiling is about twice as fast.
% }
% \label{fig:intra-copy}
% \end{figure}
In order to write more than the maximum possible transfer size of 95 MB, we
repeatedly wrote to the same buffer, which is not possible in a real-world
application. As a solution, we motivated the use of multiple copies in
Section~\ref{sec:host}. To verify that we can keep up with the incoming data
throughput using this strategy, we measured the data throughput within a GPU
by copying data from a smaller buffer, representing the DMA buffer, to a
larger destination buffer. At a block size of about 384 KB the throughput
surpasses the maximum possible PCIe bandwidth, and it reaches 40 GB/s for
blocks bigger than 5 MB. Double buffering is therefore a viable solution for
very large data transfers, where throughput performance is favoured over
latency.
% \figref{fig:intra-copy} shows the measured throughput for
% three sizes and an increasing block size.
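The copy step itself uses only standard OpenCL. A minimal sketch of this
double-buffering drain, with illustrative buffer handles and sizes and the
same headers as in the previous listings, is shown below.
\begin{verbatim}
/* Sketch: move a block received in the small bus-addressable DMA buffer
   into a large destination buffer while the FPGA fills the next block. */
void drain_dma_buffer(cl_command_queue queue, cl_mem dma_buffer,
                      cl_mem destination, size_t block_size, size_t offset)
{
    /* Device-to-device copy: it does not cross the PCIe bus and therefore
       runs much faster than the incoming DMA stream. */
    clEnqueueCopyBuffer(queue, dma_buffer, destination,
                        0, offset, block_size, 0, NULL, NULL);
}
\end{verbatim}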
\subsection{Latency}
\begin{figure}
\includegraphics[width=\textwidth]{figures/latency-hist}
\caption{%
Latency distribution for a single 1024 B packet transferred from FPGA to
GPU memory and to main memory.
}
\label{fig:latency-distribution}
\end{figure}
For HEP experiments, low latencies are necessary to react within a reasonable
time frame. In order to measure the latency caused by the communication
overhead, we used the following protocol: 1) The host issues continuous data
transfers of a 4 KB buffer, initialized with a fixed value, to the FPGA using
the \texttt{cl\-Enqueue\-Copy\-Buffer} call. 2) When the FPGA receives data in
its input FIFO, it moves it directly to the output FIFO, which feeds the
outgoing DMA engine, thus pushing the data back to the GPU. 3) At some point,
the host enables the generation of data different from the initial value,
which also starts an internal FPGA counter with 4 ns resolution. 4) When the
generated data is received again at the FPGA, the counter is stopped. 5) The
host program reads out the counter values and computes the round-trip latency.
The distribution of 10000 measurements of the one-way latency is shown in
\figref{fig:latency-distribution}.
The latency to the GPU has a mean value of 84.38 \textmu s and a standard
deviation of 6.34 \textmu s. This is 9.73 \% higher than the latency to the
CPU of 76.89 \textmu s, measured with the same driver and measurement
procedure. The non-Gaussian distribution with two distinct peaks indicates a
systematic influence that we cannot control; it is most likely caused by the
non-deterministic run-time behaviour of the operating system scheduler.
\section{Conclusion and outlook}
We developed a hardware and software solution that enables DMA transfers
between FPGA-based readout systems and GPU computing clusters. The proposed
software solution allows seamless multi-GPU processing of the incoming data,
thanks to its integration in our streamed computing framework. This allows
straightforward integration with different DAQ systems and the introduction
of custom data processing algorithms.
The net throughput is primarily limited by the PCIe link, reaching 6.4 GB/s
for FPGA-to-GPU data transfers and 6.6 GB/s for FPGA-to-CPU data transfers.
By writing directly into GPU memory instead of routing data through the main
system memory, the overall latency of the system can be reduced, allowing the
massively parallel computation on GPUs to be coupled closely to the readout.
Optimization of the GPU DMA interfacing code is ongoing with the help of
AMD's technical support. With a better understanding of the hardware and
software aspects of DirectGMA, we expect a significant improvement in the
latency performance.
In order to increase the total throughput, a custom FPGA evaluation board is
currently under development. The board mounts a Virtex-7 chip and features two
fully populated FMC connectors, a 119 Gb/s DDR memory interface and a PCIe x16
Gen3 connection. Two x8 Gen3 cores, instantiated on the board, will be mapped
as a single x16 device by using an external PCIe switch. With two cores
operating in parallel, we foresee an increase in the data throughput by a
factor of 2 (as demonstrated in~\cite{rota2015dma}).
Support for NVIDIA's GPUDirect technology is also foreseen in the coming
months to lift the restriction to a single GPU vendor and to compare the
performance of hardware from different vendors. Further improvements are
expected from generalizing the transfer mechanism to include InfiniBand
support besides the existing PCIe connection.
%% Where do we get this values? Any reference?
%This allows
%speeds of up to 290 Gb/s and latencies as low as 0.5 \textmu s.
Our goal is to develop a unique hybrid solution, based on commercial standards,
that includes fast data transmission protocols and a high performance GPU
computing framework.
\acknowledgments
This work was partially supported by the German-Russian BMBF funding programme,
grant numbers 05K10CKB and 05K10VKE.
\bibliographystyle{JHEP}
\bibliography{literature}
\end{document}