\documentclass{JINST}
\usepackage[utf8]{inputenc}
\usepackage{lineno}
\usepackage{ifthen}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{textcomp}
\usepackage{booktabs}
\usepackage{floatrow}
\newfloatcommand{capbtabbox}{table}[][\FBwidth]
\newboolean{draft}
\setboolean{draft}{true}
\newcommand{\figref}[1]{Figure~\ref{#1}}
\title{A high-throughput readout architecture based on PCI-Express Gen3 and DirectGMA technology}
\author{
L.~Rota$^a$,
M.~Vogelgesang$^a$,
L.E.~Ardila Perez$^a$,
M.~Caselle$^a$,
S.~Chilingaryan$^a$,
T.~Dritschler$^a$,
N.~Zilio$^a$,
A.~Kopmann$^a$,
M.~Balzer$^a$,
M.~Weber$^a$\\
\llap{$^a$}Institute for Data Processing and Electronics,\\
Karlsruhe Institute of Technology (KIT),\\
Herrmann-von-Helmholtz-Platz 1, Karlsruhe, Germany \\
E-mail: \email{lorenzo.rota@kit.edu}, \email{matthias.vogelgesang@kit.edu}
}
\abstract{Modern physics experiments have reached multi-GB/s data rates. Fast
data links and high-performance computing stages are required for continuous
data acquisition and processing. Because of their intrinsic parallelism and
computational power, GPUs emerged as an ideal solution to process this data in
high-performance computing applications. In this paper we present a
high-throughput platform based on direct FPGA-GPU communication. The
architecture consists of a Direct Memory Access (DMA) engine compatible with
the Xilinx PCI-Express core, a Linux driver for register access, and
high-level software to manage direct memory transfers using AMD's DirectGMA
technology. Measurements with a Gen3\,x8 link show a throughput of 6.4~GB/s
for transfers to GPU memory and 6.6~GB/s to system memory. We
also evaluated DirectGMA performance for low-latency applications: preliminary
results show a round-trip latency of 2~\textmu s for data sizes up to 4~kB.
Our implementation is suitable for real-time DAQ applications ranging
from photon science and medical imaging to High Energy Physics (HEP) trigger
systems.}
\keywords{FPGA; GPU; PCI-Express; OpenCL; DirectGMA}
\begin{document}
\ifdraft
\setpagewiselinenumbers
\linenumbers
\fi
\section{Introduction}
GPU computing has become the main driving force for high-performance computing
due to its unprecedented parallelism and favourable cost-benefit ratio. GPU
acceleration has found its way into numerous applications, ranging from
simulation to image processing.
The data rates of bio-imaging or beam-monitoring experiments running in
current-generation photon science facilities have reached tens of
GB/s~\cite{ufo_camera, caselle}. In a typical scenario, data are acquired by
back-end readout systems and then transmitted in short bursts or in a
continuous streaming mode to a computing stage. In order to collect data over
long observation times, the readout architecture and the computing stages must
be able to sustain high data rates.
Recent years have also seen an increasing
interest in GPU-based systems for High Energy Physics (HEP) (\emph{e.g.}
ATLAS~\cite{atlas_gpu}, ALICE~\cite{alice_gpu}, Mu3e~\cite{mu3e_gpu},
PANDA~\cite{panda_gpu}) and photon science experiments. In time-deterministic
applications, latency becomes the most stringent requirement, \emph{e.g.} in low- and high-level trigger systems.
Due to its high bandwidth and modularity, PCIe quickly became the commercial
standard for connecting high-throughput peripherals such as GPUs or solid
state disks. Moreover, optical PCIe networks were demonstrated a decade
ago~\cite{optical_pcie}, opening the possibility of using PCIe as a
communication link over long distances.
Several solutions for direct FPGA/GPU communication based on PCIe are reported
in the literature, and all of them are based on NVIDIA's GPUDirect technology. In
the implementation of Bittner and Ruf~\cite{bittner} the GPU acts as
master during an FPGA-to-GPU data transfer, reading data from the FPGA. This
solution limits the reported bandwidth and latency to 514~MB/s and 40~\textmu s, respectively.
%LR: FPGA^2 it's the name of their thing...
%MV: best idea in the world :)
When the FPGA is used as a master, a higher throughput can be achieved. An
example of this approach is the \emph{FPGA\textsuperscript{2}} framework by Thoma
et~al.~\cite{thoma}, which reaches 2454~MB/s over an x8 Gen2 data link.
Lonardo et~al.\ achieved low latencies with their NaNet design, an FPGA-based
PCIe network interface card~\cite{lonardo2015nanet}. The GbE link, however,
limits the latency performance of the system to a few tens of \textmu s. If
only the FPGA-to-GPU latency is considered, the measured values span between
1~\textmu s and 6~\textmu s, depending on the datagram size. Moreover, the
bandwidth saturates at 120~MB/s. Nieto et~al.\ presented a system based on a
PXI Express data link that makes use of four PCIe Gen1
links~\cite{nieto2015high}. Their system (as limited by the interconnect)
achieves an average throughput of 870~MB/s with 1~kB block transfers.
In order to achieve the best performance in terms of latency and bandwidth, we
developed a high-performance DMA engine based on Xilinx's PCIe Gen3 core. To
process the data, we encapsulated the DMA setup and memory mapping in a plugin
for our scalable GPU processing framework~\cite{vogelgesang2012ufo}. This
framework allows for an easy construction of streamed data processing on
heterogeneous multi-GPU systems. Because the framework is based on OpenCL,
integration with NVIDIA's CUDA functions for GPUDirect technology is not
possible at the moment. Thus, we used AMD's DirectGMA technology to integrate
direct FPGA-to-GPU communication into our processing pipeline. In this paper
we report the performance of our DMA engine for FPGA-to-CPU communication and
some preliminary measurements of DirectGMA's performance in low-latency
applications.
%% LR: this part -> OK
\section{Architecture}
As shown in \figref{fig:trad-vs-dgpu} (a), traditional FPGA-GPU systems route
data through system main memory by copying data from the FPGA into
intermediate buffers and then finally into the GPU's main memory. Thus, the
total throughput and latency of the system are limited by the main memory
bandwidth. NVIDIA's GPUDirect and AMD's DirectGMA technologies allow direct
communication between GPUs and auxiliary devices over PCIe. By combining this
technology with DMA data transfers (see \figref{fig:trad-vs-dgpu} (b)), the
overall latency of the system is reduced and the total throughput is increased.
Moreover, the CPU and main system memory are relieved from processing because
they are not directly involved in the data transfer anymore.
\begin{figure}[t]
\centering
\includegraphics[width=1.0\textwidth]{figures/transf}
\caption{%
In a traditional DMA architecture (a), data are first written to the main
system memory and then sent to the GPUs for final processing. By using
GPUDirect/DirectGMA technology (b), the DMA engine has direct access to
the GPU's internal memory.
}
\label{fig:trad-vs-dgpu}
\end{figure}
%% LR: this part -> Text:OK, Figure: must be updated
\subsection{DMA engine implementation on the FPGA}
We have developed a DMA engine that minimizes resource utilization while
maintaining the flexibility of a scatter-gather memory
policy~\cite{rota2015dma}. The main blocks are shown in \figref{fig:fpga-arch}. The engine is compatible with the Xilinx PCIe
Gen2/3 IP core~\cite{xilinxgen3} for Xilinx 6 and 7 series FPGA families. DMA data
transfers to/from main system memory and GPU memory are supported. Two FIFOs,
with a data width of 256 bits and operating at 250 MHz (an input bandwidth of
7.45 GB/s), act as user-friendly interfaces to the custom logic. The
user logic and the DMA engine are configured by the host through PIO
registers. The resource
utilization on a Virtex-7 device is reported in Table~\ref{table:utilization}.
\begin{figure}[t]
\small
\begin{floatrow}
\ffigbox{%
\includegraphics[width=0.4\textwidth]{figures/fpga-arch}
}{%
\caption{Architecture of the DMA engine.}%
\label{fig:fpga-arch}
}
\capbtabbox{%
\begin{tabular}{@{}lll@{}}
\toprule
Resource & Utilization & (\%) \\
\midrule
LUT & 5331 & (1.23) \\
LUTRAM & 56 & (0.03) \\
FF & 5437 & (0.63) \\
BRAM & 21 & (1.39) \\
% Resource & Utilization & Available & Utilization \% \\
% \midrule
% LUT & 5331 & 433200 & 1.23 \\
% LUTRAM & 56 & 174200 & 0.03 \\
% FF & 5437 & 866400 & 0.63 \\
% BRAM & 20.50 & 1470 & 1.39 \\
\bottomrule
\end{tabular}
}{%
\caption{Resource utilization on an xc7vx690t-ffg1761 device.}%
\label{table:utilization}
}
\end{floatrow}
\end{figure}
% \begin{figure}[tb]
% \centering
% \includegraphics[width=0.6\textwidth]{figures/fpga-arch}
% \caption{%
% Architecture of the DMA engine.
% }
% \label{fig:fpga-arch}
% \end{figure}
The physical addresses of the host's memory buffers are stored in an internal
memory and are dynamically updated by the driver or the user, allowing highly
efficient zero-copy data transfers. The maximum size associated with each
address is 2 GB.
%% LR: -----------------> OK
\subsection{OpenCL management on host side}
\label{sec:host}
\begin{figure}[b]
\centering
\includegraphics[width=0.75\textwidth]{figures/opencl-setup}
\caption{The FPGA writes to GPU memory by mapping the physical address of a
GPU buffer and initiating DMA transfers. Signalling happens in reverse order by
mapping the FPGA control registers into the address space of the GPU.}
\label{fig:opencl-setup}
\end{figure}
%% Description of figure
On the host side, AMD's DirectGMA technology, an implementation of the
bus-addressable memory extension for OpenCL 1.1 and later, is used to write from
the FPGA to GPU memory and from the GPU to the FPGA's control registers.
\figref{fig:opencl-setup} illustrates the main mode of operation: to write
into the GPU, the physical bus addresses of the GPU buffers are determined
with a call to \texttt{clEnqueue\-Make\-Buffers\-Resident\-AMD} and set by the
host CPU in a control register of the FPGA (1). The FPGA then writes data
blocks autonomously in DMA fashion (2). To signal events to the FPGA (4), the
control registers can be mapped into the GPU's address space by passing a special
AMD-specific flag and the physical BAR address of the FPGA
configuration memory to the \texttt{cl\-Create\-Buffer} function. From the
GPU, this memory is seen transparently as regular GPU memory and can be
written accordingly (3). In our setup, trigger registers are used to notify
the FPGA of successful or failed evaluation of the data. Using the
\texttt{cl\-Enqueue\-Copy\-Buffer} function call it is possible to write
entire memory regions in DMA fashion to the FPGA. In this case, the GPU acts
as bus master and pushes data to the FPGA.
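The listing below is a minimal host-side sketch of this setup. It assumes the
declarations of AMD's \texttt{cl\_amd\_bus\_addressable\_memory} extension from
\texttt{CL/cl\_ext.h}; the register offset and the helper
\texttt{fpga\_write\_register} are hypothetical placeholders for our driver
interface.
\begin{verbatim}
#include <CL/cl.h>
#include <CL/cl_ext.h>  /* cl_amd_bus_addressable_memory declarations */

/* Hypothetical helper exposed by our driver: write a 64-bit value
   to an FPGA PIO register. The offset is a placeholder. */
extern void fpga_write_register(unsigned offset, cl_ulong value);
#define FPGA_REG_GPU_DST_ADDR 0x10

/* (1) Allocate a bus-addressable GPU buffer, pin it, and hand its
   physical bus address to the FPGA, which then writes to it in DMA
   fashion (2). Depending on the OpenCL implementation, the extension
   entry point may have to be queried at run time via
   clGetExtensionFunctionAddressForPlatform. */
static cl_mem setup_gpu_target(cl_context ctx, cl_command_queue queue,
                               size_t size)
{
    cl_int err;
    cl_bus_address_amd addr;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_BUS_ADDRESSABLE_AMD,
                                size, NULL, &err);
    err = clEnqueueMakeBuffersResidentAMD(queue, 1, &buf, CL_TRUE,
                                          &addr, 0, NULL, NULL);
    fpga_write_register(FPGA_REG_GPU_DST_ADDR,
                        addr.surface_bus_address);
    return buf;
}

/* (3)/(4) Map the FPGA BAR into the GPU address space so that GPU
   kernels can write to the control registers directly. */
static cl_mem map_fpga_registers(cl_context ctx, cl_ulong bar_address,
                                 size_t bar_size)
{
    cl_int err;
    cl_bus_address_amd bar;
    bar.surface_bus_address = bar_address;
    bar.marker_bus_address  = bar_address;  /* marker unused in this sketch */
    return clCreateBuffer(ctx, CL_MEM_EXTERNAL_PHYSICAL_AMD,
                          bar_size, &bar, &err);
}
\end{verbatim}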
%% Double Buffering strategy. Removed figure.
Due to hardware restrictions, the largest possible GPU buffer size is about
95 MB, but larger transfers can be achieved by using a double buffering
mechanism, as sketched below. Because the GPU provides a flat memory address space and our DMA
engine allows multiple destination addresses to be set in advance, we can
determine all addresses before the actual transfers, thus keeping the CPU out
of the transfer loop for data sizes less than 95 MB.
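The following sketch shows such a double-buffering loop, assuming two
bus-addressable buffers prepared as above; \texttt{fpga\_start\_transfer} and
the completion check are hypothetical placeholders for the driver and
signalling interface.
\begin{verbatim}
/* Hypothetical driver call: point the DMA engine at one of the two
   exposed GPU buffers. */
extern void fpga_start_transfer(int buffer_index);

/* While the FPGA fills one exposed buffer, copy the previously filled
   one into a large destination buffer inside the GPU. */
static void receive(cl_command_queue queue, cl_mem dma_buf[2],
                    cl_mem dst, size_t chunk, size_t total)
{
    size_t offset = 0;
    int active = 0;

    fpga_start_transfer(active);
    while (offset < total) {
        int filled = active;
        active = 1 - active;
        fpga_start_transfer(active);
        /* ... wait until the FPGA signals completion of 'filled' ... */
        clEnqueueCopyBuffer(queue, dma_buf[filled], dst,
                            0, offset, chunk, 0, NULL, NULL);
        offset += chunk;
    }
    clFinish(queue);
}
\end{verbatim}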
%% Ufo Framework
To process the data, we encapsulated the DMA setup and memory mapping in a
plugin for our scalable GPU processing framework~\cite{vogelgesang2012ufo}.
This framework allows for an easy construction of streamed data processing on
heterogeneous multi-GPU systems. For example, to read data from the FPGA,
decode it from its specific data format, run a Fourier transform on the GPU and
write the results back to disk, one can run the following on the
command line:
\begin{verbatim}
ufo-launch direct-gma ! decode ! fft ! write filename=out.raw
\end{verbatim}
The framework takes care of scheduling the tasks and distributing the data
items to one or more GPUs. High throughput is achieved by the combination of
fine- and coarse-grained data parallelism, \emph{i.e.} processing a single
data item on a GPU using thousands of threads and splitting the data stream
to feed individual data items to separate GPUs. None of this requires any
user intervention and is solely determined by the framework in an automated
fashion. A complementary application programming interface allows users to
develop custom applications written in C or high-level languages such as
Python.
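As an illustration, the sketch below assembles the same pipeline as the
command line above through the framework's GObject-based C API; it is shown
schematically, error handling is omitted, and the exact interface is
documented with the framework.
\begin{verbatim}
#include <ufo/ufo.h>

int main (void)
{
    /* Build the same pipeline as the ufo-launch example above. */
    UfoPluginManager *pm    = ufo_plugin_manager_new ();
    UfoTaskGraph     *graph = UFO_TASK_GRAPH (ufo_task_graph_new ());
    UfoBaseScheduler *sched = ufo_scheduler_new ();

    UfoTaskNode *dma    = ufo_plugin_manager_get_task (pm, "direct-gma", NULL);
    UfoTaskNode *decode = ufo_plugin_manager_get_task (pm, "decode", NULL);
    UfoTaskNode *fft    = ufo_plugin_manager_get_task (pm, "fft", NULL);
    UfoTaskNode *writer = ufo_plugin_manager_get_task (pm, "write", NULL);

    g_object_set (writer, "filename", "out.raw", NULL);

    ufo_task_graph_connect_nodes (graph, dma, decode);
    ufo_task_graph_connect_nodes (graph, decode, fft);
    ufo_task_graph_connect_nodes (graph, fft, writer);

    ufo_base_scheduler_run (sched, graph, NULL);
    return 0;
}
\end{verbatim}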
\section{Results}
We carried out performance measurements on two different setups, which are
described in Table~\ref{table:setups}. A Xilinx VC709 evaluation board was
used in both setups. In Setup 1, the FPGA and the GPU were plugged into a PCIe
3.0 slot.
%LR: explain this root-complex shit here
In Setup 2, a low-end Supermicro X7SPA-HF-D525 system was connected
to a Netstor NA255A external PCIe enclosure. In the case of FPGA-to-CPU data
transfers, the software implementation is the one described
in~\cite{rota2015dma}.
% \begin{table}[]
% \centering
% \caption{Resource utilization on a Virtex7 device X240VT}
% \label{table:utilization}
% \tabcolsep=0.11cm
% \small
% \begin{tabular}{@{}llll@{}}
% \toprule
% Resource & Utilization & Utilization \% \\
% \midrule
% LUT & 5331 & 1.23 \\
% LUTRAM & 56 & 0.03 \\
% FF & 5437 & 0.63 \\
% BRAM & 20.50 & 1.39 \\
% % Resource & Utilization & Available & Utilization \% \\
% % \midrule
% % LUT & 5331 & 433200 & 1.23 \\
% % LUTRAM & 56 & 174200 & 0.03 \\
% % FF & 5437 & 866400 & 0.63 \\
% % BRAM & 20.50 & 1470 & 1.39 \\
% \bottomrule
% \end{tabular}
% \end{table}
\begin{table}[]
\centering
\small
\caption{Description of the measurement setups. RC: root complex.}
\label{table:setups}
\tabcolsep=0.11cm
\begin{tabular}{@{}lll@{}}
\toprule
 & Setup 1 & Setup 2 \\
\midrule
CPU & Intel Xeon E5-1630 & Intel Atom D525 \\
Chipset & Intel C612 & Intel ICH9R Express \\
GPU & AMD FirePro W9100 & AMD FirePro W9100 \\
PCIe slot: system memory & x8 Gen3 (same RC) & x4 Gen1 (different RC) \\
PCIe slot: FPGA \& GPU & x8 Gen3 (different RC) & x8 Gen3 (same RC) \\
\bottomrule
\end{tabular}
\end{table}
\subsection{Throughput}
\begin{figure}[t]
\includegraphics[width=0.85\textwidth]{figures/throughput}
\caption{%
Measured throughput for data transfers from the FPGA to main memory
(CPU) and from the FPGA to the global GPU memory (GPU).
}
\label{fig:throughput}
\end{figure}
The measured results for the pure data throughput are shown in
\figref{fig:throughput} for transfers from the FPGA to the system's main
memory as well as to the global GPU memory, as explained in Section~\ref{sec:host}. In the
case of FPGA-to-GPU data transfers, the double buffering solution was used:
data are copied from the buffer exposed to the FPGA into a different buffer.
As one can see, in both cases the write performance is primarily limited by
the PCIe bus. Up to a transfer size of 2 MB, the throughput to the GPU slowly
approaches 100 MB/s. From there on, the throughput increases up to 6.4 GB/s,
when PCIe bus saturation sets in at about 1 GB transfer size. The CPU
throughput saturates earlier, but the maximum throughput is 6.6 GB/s.
% \begin{figure}
% \includegraphics[width=\textwidth]{figures/intra-copy}
% \caption{%
% Throughput in MB/s for an intra-GPU data transfer of smaller block sizes
% (4KB -- 24 MB) into a larger destination buffer (32 MB -- 128 MB). The lower
% performance for smaller block sizes is caused by the larger amount of
% transfers required to fill the destination buffer. The throughput has been
% estimated using the host side wall clock time. The raw GPU data transfer as
% measured per event profiling is about twice as fast.
% }
% \label{fig:intra-copy}
% \end{figure}
In order to write more than the maximum possible transfer size of 95 MB, we
repeatedly wrote to the same buffer, which is not possible in a real-world
application. As a solution, we motivated the use of multiple copies in
Section \ref{sec:host}. To verify that we can keep up with the incoming data
throughput using this strategy, we measured the data throughput within a GPU
by copying data from a smaller buffer representing the DMA buffer to a
larger destination buffer. At a block size of about 384 kB the throughput
surpasses the maximum possible PCIe bandwidth, and it reaches 40 GB/s for
blocks bigger than 5 MB. Double buffering is therefore a viable solution for
very large data transfers, where throughput performance is favoured over
latency.
% \figref{fig:intra-copy} shows the measured throughput for
% three sizes and an increasing block size.
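The intra-GPU copy throughput can be estimated with standard OpenCL event
profiling, as in the sketch below; the command queue is assumed to be created
with profiling enabled and at least two copies are issued.
\begin{verbatim}
/* Copy a small buffer repeatedly into a large destination buffer and
   derive the throughput from the event timestamps of the first and
   last copy. Timestamps are reported in nanoseconds, so bytes per
   nanosecond correspond to GB/s. */
static double copy_throughput(cl_command_queue queue, cl_mem src,
                              cl_mem dst, size_t block, size_t dst_size)
{
    cl_ulong start, end;
    cl_event first, last;
    size_t n = dst_size / block;   /* assumed >= 2 */

    for (size_t i = 0; i < n; i++)
        clEnqueueCopyBuffer(queue, src, dst, 0, i * block, block,
                            0, NULL,
                            i == 0 ? &first : i == n - 1 ? &last : NULL);
    clFinish(queue);

    clGetEventProfilingInfo(first, CL_PROFILING_COMMAND_START,
                            sizeof start, &start, NULL);
    clGetEventProfilingInfo(last, CL_PROFILING_COMMAND_END,
                            sizeof end, &end, NULL);
    clReleaseEvent(first);
    clReleaseEvent(last);
    return (double) (n * block) / (double) (end - start);
}
\end{verbatim}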
\subsection{Latency}
\begin{figure}[t]
\centering
\begin{subfigure}[b]{.45\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/latency}
\caption{Latency vs.\ data size.}
\label{fig:latency_vs_size}
\end{subfigure}
\begin{subfigure}[b]{.45\textwidth}
\includegraphics[width=\textwidth]{figures/latency-hist}
\caption{Latency distribution.}
\label{fig:latency_hist}
\end{subfigure}
\caption{Latency measurements for FPGA-GPU and FPGA-CPU data transfers.}
\label{fig:latency}
\end{figure}
For HEP experiments, low latencies are necessary to react in a reasonable time
frame. In order to measure the latency caused by the communication overhead we
conducted the following protocol: 1) the host continuously transfers a 4 kB
buffer, initialized with a fixed value, to the FPGA using the
\texttt{cl\-Enqueue\-Copy\-Buffer} call. 2) When the FPGA receives data in its
input FIFO, it moves them directly to the output FIFO, which feeds the outgoing DMA
engine, thus pushing the data back to the GPU. 3) At some point, the host enables the
generation of data different from the initial value, which also starts an internal
FPGA counter with 4 ns resolution. 4) When the generated data is received again
at the FPGA, the counter is stopped. 5) The host program reads out the counter
values and computes the round-trip latency. The distribution of 10000
measurements of the one-way latency is shown in \figref{fig:latency_hist}.
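On the host side, steps 1), 3) and 5) reduce to a loop similar to the
following sketch; the register offsets and the
\texttt{fpga\_read\_register}/\texttt{fpga\_write\_register} helpers are
hypothetical placeholders for the driver interface, and the iteration counts
are examples only.
\begin{verbatim}
#define FPGA_REG_GEN_ENABLE 0x20  /* hypothetical: start generator+counter */
#define FPGA_REG_LATENCY    0x24  /* hypothetical: counter, 4 ns per tick  */

extern void     fpga_write_register(unsigned offset, cl_ulong value);
extern cl_ulong fpga_read_register(unsigned offset);

static double round_trip_latency_us(cl_command_queue queue,
                                    cl_mem gpu_buf, cl_mem fpga_buf)
{
    /* 1) keep pushing the 4 kB buffer to the FPGA-mapped memory */
    for (int i = 0; i < 1000; i++) {
        if (i == 500)
            /* 3) switch to a distinct data pattern; this also starts
                  the FPGA-internal counter */
            fpga_write_register(FPGA_REG_GEN_ENABLE, 1);
        clEnqueueCopyBuffer(queue, gpu_buf, fpga_buf, 0, 0, 4096,
                            0, NULL, NULL);
    }
    clFinish(queue);

    /* 4) the counter stops when the pattern returns to the FPGA;
       5) read it out and convert ticks (4 ns) to microseconds */
    return fpga_read_register(FPGA_REG_LATENCY) * 4e-3;
}
\end{verbatim}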
[\textbf{REWRITE THIS PART}] The GPU latency has a mean value of 84.38 \textmu s
and a standard deviation of 6.34 \textmu s. This is 9.73 \% higher than the CPU
latency of 76.89 \textmu s that was measured using the same driver and measurement
procedure. The non-Gaussian distribution with two distinct peaks indicates a
systematic influence that we cannot control and that is most likely caused by the
non-deterministic run-time behaviour of the operating system scheduler.
\section{Conclusion and outlook}
We developed a hardware and software solution that enables DMA transfers
between FPGA-based readout systems and GPU computing clusters.
The net throughput is primarily limited by the PCIe link, reaching 6.4 GB/s
for FPGA-to-GPU data transfers and 6.6 GB/s for transfers to the CPU's main
memory. The measurements on a low-end system based on an Intel Atom CPU
showed no significant difference in throughput performance. Depending on the
application and computing requirements, this result makes smaller acquisition
systems a cost-effective alternative to larger workstations.
We also evaluated the performance of DirectGMA technology for low-latency
applications. Preliminary results indicate that latencies as low as 2~\textmu s
can be achieved for data transfers to GPU memory. In contrast to the throughput
case, the latency measurements show that dedicated hardware is
required in order to achieve the best performance. Optimization of the GPU-DMA
interfacing code is ongoing with the help of technical support by AMD. With a
better understanding of the hardware and software aspects of DirectGMA, we
expect a significant improvement in the latency performance.
In order to increase the total throughput, a custom FPGA evaluation board is
currently under development. The board mounts a Virtex-7 chip and features two
fully populated FMC connectors, a 119 Gb/s DDR memory interface and a PCIe x16
Gen3 connection. Two x8 Gen3 cores, instantiated on the board, will be mapped
as a single x16 device by using an external PCIe switch. With two cores
operating in parallel, we foresee an increase in the data throughput by a
factor of 2 (as demonstrated in~\cite{rota2015dma}).
The software solution that we proposed allows seamless multi-GPU processing of
the incoming data, thanks to the integration with our streamed computing framework.
This allows straightforward integration with different DAQ systems and the
introduction of custom data processing algorithms.
Support for NVIDIA's GPUDirect technology is also foreseen in the next months
to lift the restriction to one specific GPU vendor and to compare the performance
of hardware from different vendors. Further improvements are expected from
generalizing the transfer mechanism to include InfiniBand support besides the
existing PCIe connection.
%% Where do we get this values? Any reference?
%This allows
%speeds of up to 290 Gb/s and latencies as low as 0.5 \textmu s.
Our goal is to develop a unique hybrid solution, based on commercial standards,
that includes fast data transmission protocols and a high-performance GPU
computing framework.
\acknowledgments
This work was partially supported by the German-Russian BMBF funding programme,
grant numbers 05K10CKB and 05K10VKE.
\bibliographystyle{JHEP}
\bibliography{literature}
\end{document}