
\documentclass{JINST}
\usepackage[utf8]{inputenc}
\usepackage{lineno}
\usepackage{ifthen}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{textcomp}
\usepackage{booktabs}
\usepackage{floatrow}
\newfloatcommand{capbtabbox}{table}[][\FBwidth]

\newboolean{draft}
\setboolean{draft}{true}

\newcommand{\figref}[1]{Figure~\ref{#1}}

\title{A high-throughput readout architecture based on PCI-Express Gen3 and DirectGMA technology}

\author{
L.~Rota$^a$,
M.~Vogelgesang$^a$,
L.E.~Ardila Perez$^a$,
M.~Caselle$^a$,
S.~Chilingaryan$^a$,
T.~Dritschler$^a$,
N.~Zilio$^a$,
A.~Kopmann$^a$,
M.~Balzer$^a$,
M.~Weber$^a$\\
\llap{$^a$}Institute for Data Processing and Electronics,\\
Karlsruhe Institute of Technology (KIT),\\
Hermann-von-Helmholtz-Platz 1, Karlsruhe, Germany \\
E-mail: \email{lorenzo.rota@kit.edu}, \email{matthias.vogelgesang@kit.edu}
}
\abstract{%
Modern physics experiments have reached multi-GB/s data rates. Fast data links
and high performance computing stages are required for continuous data
acquisition and processing. Because of their intrinsic parallelism and
computational power, GPUs emerged as an ideal solution to process this data in
high performance computing applications. In this paper we present a
high-throughput platform based on direct FPGA-GPU communication. The
architecture consists of a Direct Memory Access (DMA) engine compatible with
the Xilinx PCI-Express core, a Linux driver for register access, and high-level
software to manage direct memory transfers using AMD's DirectGMA technology.
Measurements with a Gen3\,x8 link show a throughput of 6.4~GB/s for transfers
to GPU memory and 6.6~GB/s to system memory. We also assessed the possibility
of using the architecture in low latency systems: preliminary measurements
show a round-trip latency as low as 1~\textmu s for data transfers to system
memory, while the additional latency introduced by OpenCL scheduling is the
current limitation for GPU-based systems. Our implementation is suitable for
real-time DAQ system applications ranging from photon science and medical
imaging to High Energy Physics (HEP) systems.
}
\keywords{FPGA; GPU; PCI-Express; OpenCL; DirectGMA}

\begin{document}

\ifdraft
\setpagewiselinenumbers
\linenumbers
\fi
\section{Introduction}

GPU computing has become the main driving force for high performance computing
due to its unprecedented degree of parallelism and favourable cost-benefit
ratio. GPU acceleration has found its way into numerous applications, ranging
from simulation to image processing.
The data rates of bio-imaging or beam-monitoring experiments running in current
generation photon science facilities have reached tens of
GB/s~\cite{ufo_camera, caselle}. In a typical scenario, data are acquired by
back-end readout systems and then transmitted in short bursts or continuously
streamed to a computing stage. In order to collect data over long observation
times, the readout architecture and the computing stages must be able to
sustain high data rates.

Recent years have also seen an increasing interest in GPU-based systems for
High Energy Physics (HEP) (\emph{e.g.} ATLAS~\cite{atlas_gpu},
ALICE~\cite{alice_gpu}, Mu3e~\cite{mu3e_gpu}, PANDA~\cite{panda_gpu}) and
photon science experiments. In time-deterministic applications, such as low-
and high-level trigger systems, latency becomes the most stringent requirement.

Due to its high bandwidth and modularity, PCIe is the commercial \emph{de facto}
standard for connecting high-throughput peripherals such as GPUs or solid state
disks. Moreover, optical PCIe networks were demonstrated a decade
ago~\cite{optical_pcie}, opening the possibility of using PCIe as a
communication link over long distances.
Several solutions for direct FPGA-GPU communication based on PCIe and NVIDIA's
proprietary GPUDirect technology are reported in the literature. In the
implementation of Bittner and Ruf, the GPU acts as master during an FPGA-to-GPU
read data transfer~\cite{bittner}. This solution limits the reported bandwidth
and latency to 514 MB/s and 40~\textmu s, respectively. When the FPGA is used
as a master, a higher throughput can be achieved. An example of this approach
is the \emph{FPGA\textsuperscript{2}} framework by Thoma
et~al.~\cite{thoma}, which reaches 2454 MB/s using a PCIe 2.0 x8 data link.
Lonardo et~al.\ achieved low latencies with their NaNet design, an FPGA-based
PCIe network interface card~\cite{lonardo2015nanet}. The GbE link, however,
limits the latency performance of the system to a few tens of \textmu s. If
only the FPGA-to-GPU latency is considered, the measured values span between
1~\textmu s and 6~\textmu s, depending on the datagram size. Nieto et~al.\
presented a system based on a PXIexpress data link that makes use of four PCIe
1.0 links~\cite{nieto2015high}. Their system, limited by the interconnect,
achieves an average throughput of 870 MB/s with 1 KB block transfers.
In order to achieve the best performance in terms of latency and bandwidth, we
developed a high-performance DMA engine based on Xilinx's PCIe 3.0 Core. To
process the data, we encapsulated the DMA setup and memory mapping in a plugin
for our scalable GPU processing framework~\cite{vogelgesang2012ufo}. This
framework allows for an easy construction of streamed data processing on
heterogeneous multi-GPU systems. Because the framework is based on OpenCL,
integration with NVIDIA's CUDA functions for GPUDirect technology is not
possible at the moment. We therefore used AMD's DirectGMA technology to
integrate direct FPGA-to-GPU communication into our processing pipeline. In
this paper we present the hardware/software interface and report the throughput
performance of our architecture together with preliminary measurements of
DirectGMA's applicability in low-latency applications.
%% LR: this part -> OK
\section{Architecture}

As shown in \figref{fig:trad-vs-dgpu} (a), traditional FPGA-GPU systems route
data through system main memory by copying data from the FPGA into intermediate
buffers and then finally into the GPU's main memory. Thus, the total throughput
and latency of the system are limited by the main memory bandwidth. NVIDIA's
GPUDirect and AMD's DirectGMA technologies allow direct communication between
GPUs and auxiliary devices over PCIe. By combining this technology with DMA data
transfers as shown in \figref{fig:trad-vs-dgpu} (b), the overall latency of the
system is reduced and the total throughput increased. Moreover, the CPU and main
system memory are relieved from processing because they are not directly
involved in the data transfer anymore.
\begin{figure}[t]
\centering
\includegraphics[width=1.0\textwidth]{figures/transf}
\caption{%
In a traditional DMA architecture (a), data are first written to the main
system memory and then sent to the GPUs for final processing. By using
GPUDirect/DirectGMA technology (b), the DMA engine has direct access to
the GPU's internal memory.
}
\label{fig:trad-vs-dgpu}
\end{figure}
%% LR: this part -> Text:OK, Figure: must be updated
\subsection{DMA engine implementation on the FPGA}

We have developed a DMA engine that minimizes resource utilization while
maintaining the flexibility of a Scatter-Gather memory
policy~\cite{rota2015dma}. The main blocks are shown in \figref{fig:fpga-arch}.
The engine is compatible with the Xilinx PCIe 2.0/3.0 IP-Core~\cite{xilinxgen3}
for Xilinx FPGA families 6 and 7. DMA data transfers between main system memory
and GPU memory are supported. Two FIFOs, operating at 250 MHz with a data width
of 256 bits, act as user-friendly interfaces to the custom logic and provide an
input bandwidth of 7.45 GB/s.
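This value corresponds to the raw bandwidth of the FIFO interface,
$250~\mathrm{MHz} \times 256~\mathrm{bit} = 8 \times 10^{9}$~B/s,
\emph{i.e.}~7.45 GB/s when expressed in binary (GiB) units.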
The user logic and the DMA engine are configured by the host system through PIO
registers. The resource utilization on a Virtex-7 device is reported in
Table~\ref{table:utilization}.
\begin{figure}[t]
\small
\begin{floatrow}
\ffigbox{%
\includegraphics[width=0.4\textwidth]{figures/fpga-arch}
}{%
\caption{Simplified block diagram of the DMA engine implemented on the FPGA.}%
\label{fig:fpga-arch}
}
\capbtabbox{%
\begin{tabular}{@{}lll@{}}
\toprule
Resource & Utilization & (\%) \\
\midrule
LUT & 5331 & (1.23) \\
LUTRAM & 56 & (0.03) \\
FF & 5437 & (0.63) \\
BRAM & 21 & (1.39) \\
\bottomrule
\end{tabular}
}{%
\caption{Resource utilization on a xc7vx690t-ffg1761 device.}%
\label{table:utilization}
}
\end{floatrow}
\end{figure}
The physical addresses of the host's memory buffers are stored in an internal
memory and are dynamically updated by the driver or by the user, allowing
highly efficient zero-copy data transfers. The maximum size associated with
each address is 2 GB.
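As an illustration, the following sketch shows how a driver could fill this
internal address table through the PIO register interface. The register offsets
and the accessor function are placeholders rather than the actual
implementation; only the 2 GB limit per entry is taken from the specification
above.
\begin{verbatim}
#include <stdint.h>

#define MAX_ENTRY_SIZE  (2ULL << 30)  /* 2 GB per stored address      */
#define REG_DESC_ADDR   0x50          /* placeholder register offsets */
#define REG_DESC_LEN    0x58

/* Hypothetical 64-bit PIO write to an FPGA control register. */
extern void fpga_write_reg64(uint32_t offset, uint64_t value);

/* Push a list of physical buffer addresses into the DMA engine's
 * internal memory. Entries larger than 2 GB must be split beforehand. */
static void write_address_table(const uint64_t *addr,
                                const uint64_t *len, unsigned count)
{
    for (unsigned i = 0; i < count; i++) {
        if (len[i] > MAX_ENTRY_SIZE)
            continue;  /* skip oversized entries in this sketch */
        fpga_write_reg64(REG_DESC_ADDR, addr[i]);
        fpga_write_reg64(REG_DESC_LEN, len[i]);
    }
}
\end{verbatim}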
%% LR: -----------------> OK
\subsection{OpenCL management on host side}
\label{sec:host}

\begin{figure}[b]
\centering
\includegraphics[width=0.75\textwidth]{figures/opencl-setup}
\caption{The FPGA writes to GPU memory by mapping the physical address of a
GPU buffer and initiating DMA transfers. Signalling happens in reverse order by
mapping the FPGA control registers into the address space of the GPU.}
\label{fig:opencl-setup}
\end{figure}
%% Description of figure
On the host side, AMD's DirectGMA technology, an implementation of the
bus-addressable memory extension for OpenCL 1.1 and later, is used to write from
the FPGA to GPU memory and from the GPU to the FPGA's control registers.
\figref{fig:opencl-setup} illustrates the main mode of operation: to write into
the GPU, the physical bus addresses of the GPU buffers are determined with a
call to \texttt{clEnqueue\-Make\-Buffers\-Resident\-AMD} and set by the host CPU
in a control register of the FPGA (1). The FPGA then writes data blocks
autonomously in DMA fashion (2). To signal events to the FPGA (4), the control
registers can be mapped into the GPU's address space by passing a special
AMD-specific flag and the physical BAR address of the FPGA configuration
memory to the \texttt{cl\-Create\-Buffer} function. From the GPU, this memory is
seen transparently as regular GPU memory and can be written accordingly (3). In
our setup, trigger registers are used to notify the FPGA on successful or failed
evaluation of the data. Using the \texttt{cl\-Enqueue\-Copy\-Buffer} function
call, it is possible to write entire memory regions in DMA fashion to the FPGA.
In this case, the GPU acts as bus master and pushes data to the FPGA.
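The listing below sketches this setup in host code, assuming the headers of the
\texttt{cl\_amd\_bus\_addressable\_memory} extension are available. Error
handling is omitted, and the call that programs the FPGA control register
(\texttt{fpga\_write\_reg64}) is a placeholder for the PIO write described
above.
\begin{verbatim}
#include <CL/cl.h>
#include <CL/cl_ext.h>   /* CL_MEM_BUS_ADDRESSABLE_AMD, ...       */
#include <stdint.h>

#define REG_DMA_TARGET_ADDR  0x60       /* placeholder offset     */
extern void fpga_write_reg64(uint32_t offset, uint64_t value);

typedef cl_int (*MakeResidentAMD_fn)(cl_command_queue, cl_uint,
    cl_mem *, cl_bool, cl_bus_address_amd *, cl_uint,
    const cl_event *, cl_event *);

/* Returns a GPU buffer the FPGA can write to and maps the FPGA BAR
 * into GPU address space (sketch, steps (1)-(4) as in the text). */
static cl_mem setup_directgma(cl_platform_id platform, cl_context ctx,
                              cl_command_queue queue, size_t size,
                              uint64_t bar_addr, size_t bar_size,
                              cl_mem *fpga_regs)
{
    cl_int err;

    /* (1) bus-addressable GPU buffer and its physical address */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_BUS_ADDRESSABLE_AMD,
                                size, NULL, &err);
    MakeResidentAMD_fn make_resident = (MakeResidentAMD_fn)
        clGetExtensionFunctionAddressForPlatform(platform,
            "clEnqueueMakeBuffersResidentAMD");
    cl_bus_address_amd addr;
    make_resident(queue, 1, &buf, CL_TRUE, &addr, 0, NULL, NULL);

    /* tell the DMA engine where to write (2) */
    fpga_write_reg64(REG_DMA_TARGET_ADDR,
                     (uint64_t) addr.surface_bus_address);

    /* (3)/(4) expose the FPGA BAR as a regular cl_mem object */
    cl_bus_address_amd bar;
    bar.surface_bus_address = bar_addr;
    bar.marker_bus_address  = bar_addr;
    *fpga_regs = clCreateBuffer(ctx, CL_MEM_EXTERNAL_PHYSICAL_AMD,
                                bar_size, &bar, &err);
    return buf;
}
\end{verbatim}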
%% Double Buffering strategy.
Due to hardware restrictions with AMD FirePro W9100 cards, the largest possible
GPU buffer sizes are about 95 MB. However, larger transfers can be achieved by
using a double buffering mechanism: data are copied from the buffer exposed to
the FPGA into a different location in GPU memory. To verify that we can keep up
with the incoming data throughput using this strategy, we measured the data
throughput within a GPU by copying data from a smaller sized buffer representing
the DMA buffer to a larger destination buffer. At a block size of about 384 KB,
the throughput surpasses the maximum possible PCIe bandwidth. Block transfers
larger than 5 MB saturate the bandwidth at 40 GB/s. Double buffering is
therefore a viable solution for very large data transfers, where throughput
performance is favoured over latency. For data sizes less than 95 MB, we can
determine all addresses before the actual transfers, thus keeping the CPU out of
the transfer loop.
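A minimal sketch of this double-buffering step is shown below: the
bus-addressable DMA buffer is drained into a larger destination buffer with an
on-device copy, so the CPU only enqueues the command. Buffer names and the
signalling back to the FPGA are illustrative.
\begin{verbatim}
#include <CL/cl.h>

/* Copy one block from the (small) DMA buffer exposed to the FPGA
 * into a large destination buffer at the given offset. */
static void drain_dma_buffer(cl_command_queue queue, cl_mem dma_buf,
                             cl_mem destination, size_t block_size,
                             size_t offset)
{
    clEnqueueCopyBuffer(queue, dma_buf, destination,
                        0, offset, block_size, 0, NULL, NULL);
    clFlush(queue);
    /* once the copy has finished, the DMA buffer can be handed back
     * to the FPGA, e.g. via a trigger register (not shown) */
}
\end{verbatim}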
%% Ufo Framework
To process the data, we encapsulated the DMA setup and memory mapping in a
plugin for our scalable GPU processing framework~\cite{vogelgesang2012ufo}.
This framework allows for an easy construction of streamed data processing on
heterogeneous multi-GPU systems. For example, to read data from the FPGA,
decode its specific data format, run a Fourier transform on the GPU and write
the results back to disk, one can run the following on the command line:
\begin{verbatim}
ufo-launch direct-gma ! decode ! fft ! write filename=out.raw
\end{verbatim}
The framework takes care of scheduling the tasks and distributing the data items
to one or more GPUs. High throughput is achieved by the combination of fine- and
coarse-grained data parallelism, \emph{i.e.} processing a single data item on a
GPU using thousands of threads and splitting the data stream to feed individual
data items to separate GPUs. None of this requires any user intervention and is
determined solely by the framework in an automated fashion. A complementary
application programming interface allows users to develop custom applications
in C or high-level languages such as Python.
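The same pipeline can also be assembled programmatically. The following
condensed sketch mirrors the command line above using the framework's C
interface; it omits error handling and object cleanup, and the calls shown are
meant as an illustration rather than a normative example.
\begin{verbatim}
#include <ufo/ufo.h>

/* Assemble direct-gma ! decode ! fft ! write from C (sketch;
 * exact API calls may differ between framework versions). */
static void run_pipeline(void)
{
    GError *error = NULL;
    UfoPluginManager *pm = ufo_plugin_manager_new();
    UfoTaskGraph *graph = UFO_TASK_GRAPH(ufo_task_graph_new());
    UfoBaseScheduler *sched = ufo_scheduler_new();

    UfoTaskNode *dma = ufo_plugin_manager_get_task(pm, "direct-gma",
                                                   &error);
    UfoTaskNode *dec = ufo_plugin_manager_get_task(pm, "decode", &error);
    UfoTaskNode *fft = ufo_plugin_manager_get_task(pm, "fft", &error);
    UfoTaskNode *out = ufo_plugin_manager_get_task(pm, "write", &error);

    g_object_set(out, "filename", "out.raw", NULL);

    ufo_task_graph_connect_nodes(graph, dma, dec);
    ufo_task_graph_connect_nodes(graph, dec, fft);
    ufo_task_graph_connect_nodes(graph, fft, out);

    ufo_base_scheduler_run(sched, graph, &error);
}
\end{verbatim}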
%% --------------------------------------------------------------------------
\section{Results}

\begin{table}[b]
\centering
\small
\caption{Setups used for throughput and latency measurements.}
\label{table:setups}
\tabcolsep=0.11cm
\begin{tabular}{@{}lll@{}}
\toprule
 & Setup 1 & Setup 2 \\
\midrule
CPU & Intel Xeon E5-1630 & Intel Atom D525 \\
Chipset & Intel C612 & Intel ICH9R Express \\
GPU & AMD FirePro W9100 & AMD FirePro W9100 \\
PCIe slot: System memory & x8 Gen3 & x4 Gen1 \\
PCIe slot: FPGA \& GPU & x8 Gen3 (different RC) & x8 Gen3 (same RC) \\
\bottomrule
\end{tabular}
\end{table}
We carried out performance measurements on two different setups, which are
described in Table~\ref{table:setups}. In both setups, a Xilinx VC709 evaluation
board was used. In Setup 1, the FPGA board and the GPU were plugged into a PCIe
3.0 slot, but they were connected to different PCIe Root Complexes (RC). In
Setup 2, a low-end Supermicro X7SPA-HF-D525 system was connected to a Netstor
NA255A external PCIe enclosure. As opposed to Setup 1, both the FPGA board and
the GPU were connected to the same RC. As stated in NVIDIA's GPUDirect
documentation, the devices must share the same RC to achieve the best
performance~\cite{cuda_doc}. In case of FPGA-to-CPU data transfers, the software
implementation is the one described in~\cite{rota2015dma}.
\subsection{Throughput}

\begin{figure}[t]
\centering
\includegraphics[width=0.85\textwidth]{figures/throughput}
\caption{%
Measured throughput for data transfers from FPGA to main memory
(CPU) and from FPGA to the global GPU memory (GPU) using Setup 1.
}
\label{fig:throughput}
\end{figure}
In order to evaluate the maximum performance of the DMA engine, measurements of
pure data throughput were carried out using Setup 1. The results are shown in
\figref{fig:throughput} for transfers to the system's main memory as well as to
the GPU's global memory. For FPGA-to-GPU data transfers larger than 95 MB, the
double buffering mechanism was used. In both cases the write performance is
primarily limited by the PCIe bus. Up to a transfer size of 2 MB, the
throughput to the GPU slowly approaches 100 MB/s. From there on, the throughput
increases up to 6.4 GB/s at a data size of about 1 GB. The throughput to the
CPU saturates earlier, at a maximum of 6.6 GB/s. The slope and maximum
performance differ because the handshaking sequence between the DMA engine and
the host is implemented differently for the two targets. With Setup 2, the PCIe
1.0 link limits the throughput to system main memory to around 700 MB/s.
However, transfers to GPU memory yielded the same results as with Setup 1.
\subsection{Latency}

\begin{figure}[t]
\centering
\begin{subfigure}[b]{.49\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/latency-cpu}
\vspace{-0.4\baselineskip}
\caption{System memory transfer latency}
\label{fig:latency-cpu}
\end{subfigure}
\begin{subfigure}[b]{.49\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/latency-gpu}
\vspace{-0.4\baselineskip}
\caption{GPU memory transfer latency}
\label{fig:latency-gpu}
\end{subfigure}
\caption{%
Measured round-trip latency for data transfers to system main memory (a) and GPU memory (b).
}
\label{fig:latency}
\end{figure}
We conducted the following test in order to measure the latency introduced by
the DMA engine:
\begin{enumerate}
\item the host starts a DMA transfer by issuing the \emph{start\_dma} command,
\item the DMA engine transmits data into the system main memory,
\item when all the data has been transferred, the DMA engine notifies the host
that new data is present by writing into a specific address in the system main
memory,
\item the host acknowledges that data has been received by issuing the
\emph{stop\_dma} command.
\end{enumerate}
A counter on the FPGA measures the time interval between the \emph{start\_dma}
and \emph{stop\_dma} commands with a resolution of 4 ns, therefore measuring the
round-trip latency of the system. The correct ordering of the packets is
guaranteed by the PCIe protocol. The measured round-trip latencies for data
transfers to system main memory and GPU memory are shown in
\figref{fig:latency}.
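The host side of this procedure can be sketched as follows; the register
offsets, command codes and the polled completion word are placeholders for the
actual PIO interface, while the 4 ns tick corresponds to the FPGA counter
described above.
\begin{verbatim}
#include <stdint.h>

#define REG_DMA_CTRL     0x00   /* placeholder register offsets */
#define REG_LATENCY_CNT  0x08
#define CMD_START_DMA    0x1
#define CMD_STOP_DMA     0x2

extern void     fpga_write_reg(uint32_t offset, uint32_t value);
extern uint64_t fpga_read_reg(uint32_t offset);

/* Returns the round-trip latency in nanoseconds for one transfer. */
static double measure_round_trip(volatile uint64_t *completion_word)
{
    *completion_word = 0;
    fpga_write_reg(REG_DMA_CTRL, CMD_START_DMA);   /* step 1 */
    while (*completion_word == 0)                  /* steps 2-3 */
        ;
    fpga_write_reg(REG_DMA_CTRL, CMD_STOP_DMA);    /* step 4 */

    /* the FPGA counter has a resolution of 4 ns per tick */
    return 4.0 * (double) fpga_read_reg(REG_LATENCY_CNT);
}
\end{verbatim}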
With Setup 1 and system memory, latencies as low as 1.1 \textmu s can be
achieved for a packet size of 1024 B. The higher latencies and the dependency
on packet size measured with Setup 2 are caused by the slower PCIe x4 Gen1 link
connecting the FPGA board to the system main memory.

The same test was performed when transferring data to GPU memory. As in the
previous case, the notification was written into main memory. This approach
was used because a latency of 100 to 200 \textmu s introduced by OpenCL
scheduling did not allow a precise measurement based only on FPGA-to-GPU
communication. When connecting the devices to the same RC, as in Setup 2, a
latency of 2 \textmu s is achieved, limited by the latency of the transfer to
system main memory, as seen in \figref{fig:latency} (a). On the contrary, if
the FPGA board and the GPU are connected to different RCs, as in Setup 1, the
latency increases significantly with packet size. It must be noted that the low
latencies measured with Setup 1 for packet sizes below 1 kB seem to be due to a
caching mechanism inside the PCIe switch, and it is not clear whether data has
been successfully written into GPU memory when the notification is delivered to
the CPU. This effect must be taken into account in future implementations, as
it could potentially lead to data corruption.
\section{Conclusion and outlook}

We developed a hardware and software solution that enables DMA transfers
between FPGA-based readout systems and GPU computing clusters.
The net throughput is primarily limited by the PCIe link, reaching 6.4 GB/s
for FPGA-to-GPU data transfers and 6.6 GB/s for transfers to the CPU's main
memory. The measurements on a low-end system based on an Intel Atom CPU
showed no significant difference in throughput performance. Depending on the
application and computing requirements, this result makes smaller acquisition
systems a cost-effective alternative to larger workstations.
We measured a round-trip latency of 1 \textmu s when transferring data between
the DMA engine and system memory. We also assessed the applicability of
DirectGMA in low latency applications: preliminary results show that latencies
as low as 2 \textmu s can be achieved during data transfers to GPU memory.
However, at the time of writing, the latency introduced by OpenCL scheduling
is in the range of hundreds of \textmu s. In order to lift this limitation and
make our implementation useful in low-latency applications, we are currently
optimizing the GPU-DMA interfacing OpenCL code with the help of technical
support by AMD. Moreover, the measurements show that dedicated connecting
hardware must be employed in low latency applications.
In order to increase the total throughput, a custom FPGA evaluation board is
currently under development. The board mounts a Virtex-7 chip and features two
fully populated FMC connectors, a 119 Gb/s DDR memory interface and a PCIe x16
Gen3 connection. Two PCIe x8 Gen3 cores, instantiated on the board, will be
mapped as a single x16 device by using an external PCIe switch. With two cores
operating in parallel, we foresee an increase in the data throughput by a
factor of two, as demonstrated in~\cite{rota2015dma}.
The proposed software solution allows seamless multi-GPU processing of the
incoming data, thanks to its integration in our streamed computing framework.
This allows straightforward integration with different DAQ systems and the
introduction of custom data processing algorithms. Support for NVIDIA's
GPUDirect technology is foreseen in the coming months, in order to lift the
restriction to a single GPU vendor and to compare the performance of hardware
from different vendors. Further improvements are expected from generalizing the
transfer mechanism and including InfiniBand support in addition to the existing
PCIe connection.

Our goal is to develop a unique hybrid solution, based on commercial standards,
that includes fast data transmission protocols and a high performance GPU
computing framework.
\acknowledgments

This work was partially supported by the German-Russian BMBF funding programme,
grant numbers 05K10CKB and 05K10VKE.

\bibliographystyle{JHEP}
\bibliography{literature}

\end{document}