
\documentclass{JINST}
\usepackage[utf8]{inputenc}
\usepackage{lineno}
\usepackage{ifthen}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{textcomp}
\usepackage{booktabs}
\usepackage{floatrow}
\newfloatcommand{capbtabbox}{table}[][\FBwidth]
\newboolean{draft}
\setboolean{draft}{true}
\newcommand{\figref}[1]{Figure~\ref{#1}}
\title{A high-throughput readout architecture based on PCI-Express Gen3 and DirectGMA technology}
\author{
L.~Rota$^a$,
M.~Vogelgesang$^a$,
L.E.~Ardila Perez$^a$,
M.~Caselle$^a$,
S.~Chilingaryan$^a$,
T.~Dritschler$^a$,
N.~Zilio$^a$,
A.~Kopmann$^a$,
M.~Balzer$^a$,
M.~Weber$^a$\\
\llap{$^a$}Institute for Data Processing and Electronics,\\
Karlsruhe Institute of Technology (KIT),\\
Hermann-von-Helmholtz-Platz 1, Karlsruhe, Germany \\
E-mail: \email{lorenzo.rota@kit.edu}, \email{matthias.vogelgesang@kit.edu}
}
\abstract{Modern physics experiments have reached multi-GB/s data rates. Fast
data links and high-performance computing stages are required for continuous
data acquisition and processing. Because of their intrinsic parallelism and
computational power, GPUs emerged as an ideal solution to process this data in
high-performance computing applications. In this paper we present a
high-throughput platform based on direct FPGA-GPU communication. The
architecture consists of a Direct Memory Access (DMA) engine compatible with
the Xilinx PCI-Express core, a Linux driver for register access, and
high-level software to manage direct memory transfers using AMD's DirectGMA
technology. Measurements with a Gen3\,x8 link show a throughput of 6.4~GB/s
for transfers to GPU memory and 6.6~GB/s to system memory. We also assessed
the suitability of our architecture for low-latency systems: preliminary
measurements show a round-trip latency as low as 1~\textmu s for data
transfers to system memory, while the additional latency introduced by OpenCL
scheduling is the current limitation for GPU-based systems. Our
implementation is suitable for real-time DAQ applications ranging from
photon science and medical imaging to High Energy Physics (HEP) systems.}
\keywords{FPGA; GPU; PCI-Express; OpenCL; DirectGMA}
\begin{document}
\ifdraft
\setpagewiselinenumbers
\linenumbers
\fi
\section{Introduction}
GPU computing has become one of the main driving forces for high-performance
computing due to its unprecedented parallelism and low cost-benefit ratio. GPU
acceleration has found its way into numerous applications, ranging from
simulation to image processing.
The data rates of bio-imaging or beam-monitoring experiments running in
current-generation photon science facilities have reached tens of
GB/s~\cite{ufo_camera, caselle}. In a typical scenario, data are acquired by
back-end readout systems and then transmitted in short bursts or in a
continuous streaming mode to a computing stage. In order to collect data over
long observation times, the readout architecture and the computing stages must
be able to sustain high data rates.
Recent years have also seen an increasing interest in GPU-based systems for
High Energy Physics (HEP) (\emph{e.g.} ATLAS~\cite{atlas_gpu},
ALICE~\cite{alice_gpu}, Mu3e~\cite{mu3e_gpu}, PANDA~\cite{panda_gpu}) and
photon science experiments. In time-deterministic applications, \emph{e.g.} in
low- and high-level trigger systems, latency becomes the most stringent
requirement.
Due to its high bandwidth and modularity, PCIe quickly became the commercial
standard for connecting high-throughput peripherals such as GPUs or solid
state disks. Moreover, optical PCIe networks were demonstrated a decade
ago~\cite{optical_pcie}, opening up the possibility of using PCIe as a
communication link over long distances.
Several solutions for direct FPGA/GPU communication based on PCIe are reported
in the literature, all of them based on NVIDIA's GPUDirect technology. In
the implementation of Bittner and Ruf~\cite{bittner}, the GPU acts as
master during an FPGA-to-GPU data transfer, reading data from the FPGA. This
solution limits the reported bandwidth and latency to 514~MB/s and
40~\textmu s, respectively.
When the FPGA is used as a master, a higher throughput can be achieved. An
example of this approach is the \emph{FPGA\textsuperscript{2}} framework by Thoma
et~al.~\cite{thoma}, which reaches 2454~MB/s over an x8 Gen2 data link.
Lonardo et~al.\ achieved low latencies with their NaNet design, an FPGA-based
PCIe network interface card~\cite{lonardo2015nanet}. The GbE link, however,
limits the latency performance of the system to a few tens of \textmu s. If
only the FPGA-to-GPU latency is considered, the measured values span between
1~\textmu s and 6~\textmu s, depending on the datagram size. Moreover, the
bandwidth saturates at 120~MB/s. Nieto et~al.\ presented a system based on a
PXIexpress data link that makes use of four PCIe 1.0
links~\cite{nieto2015high}. Their system (as limited by the interconnect)
achieves an average throughput of 870~MB/s with 1~kB block transfers.
In order to achieve the best performance in terms of latency and bandwidth, we
developed a high-performance DMA engine based on Xilinx's PCIe Gen3 Core. To
process the data, we encapsulated the DMA setup and memory mapping in a plugin
for our scalable GPU processing framework~\cite{vogelgesang2012ufo}. This
framework allows for an easy construction of streamed data processing on
heterogeneous multi-GPU systems. Because the framework is based on OpenCL,
integration with NVIDIA's CUDA functions for GPUDirect technology is not
possible at the moment. Thus, we used AMD's DirectGMA technology to integrate
direct FPGA-to-GPU communication into our processing pipeline. In this paper
we report the performance of our DMA engine for FPGA-to-CPU communication and
preliminary measurements of DirectGMA's performance in low-latency
applications.
\section{Architecture}
As shown in \figref{fig:trad-vs-dgpu} (a), traditional FPGA-GPU systems route
data through system main memory by copying data from the FPGA into
intermediate buffers and then finally into the GPU's main memory. Thus, the
total throughput and latency of the system are limited by the main memory
bandwidth. NVIDIA's GPUDirect and AMD's DirectGMA technologies allow direct
communication between GPUs and auxiliary devices over PCIe. By combining this
technology with DMA data transfers (see \figref{fig:trad-vs-dgpu} (b)), the
overall latency of the system is reduced and the total throughput is increased.
Moreover, the CPU and the main system memory are relieved from processing
because they are not directly involved in the data transfer anymore.
\begin{figure}[t]
\centering
\includegraphics[width=1.0\textwidth]{figures/transf}
\caption{%
In a traditional DMA architecture (a), data are first written to the main
system memory and then sent to the GPUs for final processing. By using
GPUDirect/DirectGMA technology (b), the DMA engine has direct access to
the GPU's internal memory.
}
\label{fig:trad-vs-dgpu}
\end{figure}
\subsection{DMA engine implementation on the FPGA}
We have developed a DMA engine that minimizes resource utilization while
maintaining the flexibility of a scatter-gather memory
policy~\cite{rota2015dma}. The main blocks are shown in \figref{fig:fpga-arch}.
The engine is compatible with the Xilinx PCIe Gen2/3 IP core~\cite{xilinxgen3}
for the Xilinx 6 and 7 series FPGA families. DMA data transfers to/from main
system memory and GPU memory are supported. Two FIFOs, with a data width of
256 bits and operating at 250 MHz, act as user-friendly interfaces to the
custom logic, providing an input bandwidth of 7.45 GB/s. The user logic and
the DMA engine are configured by the host through PIO registers. The resource
utilization on a Virtex-7 device is reported in Table~\ref{table:utilization}.
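The quoted input bandwidth follows directly from the FIFO geometry, assuming
the figure is expressed in binary gigabytes:
\[
256\,\mathrm{bit} \times 250\,\mathrm{MHz}
  = 32\,\mathrm{B} \times 250 \times 10^{6}\,\mathrm{s}^{-1}
  = 8 \times 10^{9}\,\mathrm{B/s}
  \approx 7.45\,\mathrm{GiB/s}.
\]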
\begin{figure}[t]
\small
\begin{floatrow}
\ffigbox{%
\includegraphics[width=0.4\textwidth]{figures/fpga-arch}
}{%
\caption{Block diagram of the DMA engine implemented on the FPGA}%
\label{fig:fpga-arch}
}
\capbtabbox{%
\begin{tabular}{@{}lll@{}}
\toprule
Resource & Utilization & (\%) \\
\midrule
LUT & 5331 & (1.23) \\
LUTRAM & 56 & (0.03) \\
FF & 5437 & (0.63) \\
BRAM & 21 & (1.39) \\
\bottomrule
\end{tabular}
}{%
\caption{Resource utilization on an xc7vx690t-ffg1761 device}%
\label{table:utilization}
}
\end{floatrow}
\end{figure}
The physical addresses of the host's memory buffers are stored in an internal
memory and are dynamically updated by the driver or the user, allowing highly
efficient zero-copy data transfers. The maximum size associated with each
address is 2 GB.
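As an illustration, an entry of this internal address table could be modelled
as in the following C sketch; the field names and layout are ours and do not
reflect the actual register map of the engine.
\begin{verbatim}
#include <stdint.h>

/* Hypothetical scatter-gather descriptor as it could be stored in the
   FPGA-internal address table (illustrative only). */
struct sg_descriptor {
    uint64_t phys_addr; /* physical address of a host or GPU buffer       */
    uint32_t length;    /* transfer length in bytes, up to 2 GB per entry */
    uint32_t flags;     /* e.g. valid bit, interrupt-on-completion        */
};
\end{verbatim}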
\subsection{OpenCL management on host side}
\label{sec:host}
\begin{figure}[b]
\centering
\includegraphics[width=0.75\textwidth]{figures/opencl-setup}
\caption{The FPGA writes to GPU memory by mapping the physical address of a
GPU buffer and initiating DMA transfers. Signalling happens in reverse order by
mapping the FPGA control registers into the address space of the GPU.}
\label{fig:opencl-setup}
\end{figure}
%% Description of figure
On the host side, AMD's DirectGMA technology, an implementation of the
bus-addressable memory extension for OpenCL 1.1 and later, is used to write
from the FPGA to GPU memory and from the GPU to the FPGA's control registers.
\figref{fig:opencl-setup} illustrates the main mode of operation: to write
into the GPU, the physical bus addresses of the GPU buffers are determined
with a call to \texttt{clEnqueue\-Make\-Buffers\-Resident\-AMD} and set by the
host CPU in a control register of the FPGA (1). The FPGA then writes data
blocks autonomously in DMA fashion (2). To signal events to the FPGA (4), the
control registers can be mapped into the GPU's address space by passing a
special AMD-specific flag and the physical BAR address of the FPGA
configuration memory to the \texttt{cl\-Create\-Buffer} function. From the
GPU, this memory is seen transparently as regular GPU memory and can be
written to accordingly (3). In our setup, trigger registers are used to notify
the FPGA of successful or failed evaluation of the data. Using the
\texttt{cl\-Enqueue\-Copy\-Buffer} function call, it is possible to write
entire memory regions in DMA fashion to the FPGA. In this case, the GPU acts
as bus master and pushes data to the FPGA.
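The listing below sketches the corresponding host-side setup. It is a minimal
illustration only: error handling is omitted, the extension symbols
(\texttt{CL\_MEM\_BUS\_ADDRESSABLE\_AMD}, \texttt{CL\_MEM\_EXTERNAL\_PHYSICAL\_AMD},
\texttt{cl\_bus\_address\_amd}) follow the \texttt{cl\_amd\_bus\_addressable\_memory}
extension and may differ between driver versions, and
\texttt{write\_fpga\_register} together with the register offsets stands in
for our driver interface.
\begin{verbatim}
#include <stdint.h>
#include <CL/cl.h>
#include <CL/cl_ext.h>  /* cl_amd_bus_addressable_memory definitions */

/* Placeholder for the PIO register access provided by the Linux driver. */
extern void write_fpga_register(uint32_t offset, uint64_t value);

void setup_directgma(cl_context ctx, cl_command_queue queue,
                     size_t buf_size, uint64_t fpga_bar_addr)
{
    cl_int err;

    /* (1) GPU buffer the FPGA will write into, pinned on the PCIe bus. */
    cl_mem gpu_buf = clCreateBuffer(ctx,
            CL_MEM_BUS_ADDRESSABLE_AMD | CL_MEM_WRITE_ONLY,
            buf_size, NULL, &err);

    cl_bus_address_amd bus_addr;
    clEnqueueMakeBuffersResidentAMD(queue, 1, &gpu_buf, CL_TRUE,
                                    &bus_addr, 0, NULL, NULL);

    /* Hand the physical bus address to the FPGA (offset 0x0 is a
       placeholder for the actual DMA target address register). */
    write_fpga_register(0x0, bus_addr.surface_bus_address);

    /* (3)/(4) Map the FPGA control registers into the GPU's address
       space so that a kernel can signal the FPGA directly. */
    cl_bus_address_amd fpga_bar = {
        .surface_bus_address = fpga_bar_addr,
        .marker_bus_address  = fpga_bar_addr
    };
    cl_mem fpga_regs = clCreateBuffer(ctx, CL_MEM_EXTERNAL_PHYSICAL_AMD,
                                      4096, &fpga_bar, &err);
    (void)fpga_regs;
}
\end{verbatim}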
%% Double Buffering strategy.
Due to hardware restrictions, the largest possible GPU buffer size is about
95 MB, but larger transfers can be achieved by using a double buffering
mechanism: data are copied from the DirectGMA buffer exposed to the FPGA into
a different GPU buffer. To verify that we can keep up with the incoming data
throughput using this strategy, we measured the data throughput within a GPU
by copying data from a smaller buffer representing the DMA buffer to a larger
destination buffer. At a block size of about 384 KB the throughput surpasses
the maximum possible PCIe bandwidth, and it reaches 40 GB/s for blocks bigger
than 5 MB. Double buffering is therefore a viable solution for very large data
transfers, where throughput performance is favoured over latency. For data
sizes less than 95 MB, we can determine all addresses before the actual
transfers, thus keeping the CPU out of the transfer loop.
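A minimal sketch of this intra-GPU copy is given below. Buffer handles, block
sizes and the ping-pong layout are illustrative assumptions, and the
synchronization with the FPGA between blocks is omitted; it is not the exact
code used for the measurement.
\begin{verbatim}
#include <CL/cl.h>

/* Drain the DirectGMA buffer block by block into a larger destination
   buffer; the source alternates between the two halves of the DMA buffer. */
void drain_dma_buffer(cl_command_queue queue, cl_mem dma_buf,
                      cl_mem dst_buf, size_t block_size, size_t num_blocks)
{
    for (size_t i = 0; i < num_blocks; i++) {
        clEnqueueCopyBuffer(queue, dma_buf, dst_buf,
                            (i % 2) * block_size,  /* ping-pong source   */
                            i * block_size,        /* linear destination */
                            block_size, 0, NULL, NULL);
    }
    clFinish(queue);
}
\end{verbatim}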
%% Ufo Framework
To process the data, we encapsulated the DMA setup and memory mapping in a
plugin for our scalable GPU processing framework~\cite{vogelgesang2012ufo}.
This framework allows for an easy construction of streamed data processing on
heterogeneous multi-GPU systems. For example, to read data from the FPGA,
decode it from its specific data format, run a Fourier transform on the GPU
and write the results back to disk, one can run the following on the
command line:
\begin{verbatim}
ufo-launch direct-gma ! decode ! fft ! write filename=out.raw
\end{verbatim}
The framework takes care of scheduling the tasks and distributing the data
items to one or more GPUs. High throughput is achieved by the combination of
fine- and coarse-grained data parallelism, \emph{i.e.} processing a single
data item on a GPU using thousands of threads and splitting the data stream
to feed individual data items to separate GPUs. None of this requires any
user intervention and is solely determined by the framework in an automated
fashion. A complementary application programming interface allows users to
develop custom applications written in C or high-level languages such as
Python.
%% --------------------------------------------------------------------------
\section{Results}
\begin{table}[b]
\centering
\small
\caption{Setups used for throughput and latency measurements}
\label{table:setups}
\tabcolsep=0.11cm
\begin{tabular}{@{}lll@{}}
\toprule
& Setup 1 & Setup 2 \\
\midrule
CPU & Intel Xeon E5-1630 & Intel Atom D525 \\
Chipset & Intel C612 & Intel ICH9R Express \\
GPU & AMD FirePro W9100 & AMD FirePro W9100 \\
PCIe slot: System memory & x8 Gen3 & x4 Gen1 \\
PCIe slot: FPGA \& GPU & x8 Gen3 (different RC) & x8 Gen3 (same RC) \\
\bottomrule
\end{tabular}
\end{table}
We carried out performance measurements on two different setups, which are
described in Table~\ref{table:setups}. In both setups, a Xilinx VC709
evaluation board was used. In Setup 1, the FPGA board and the GPU were plugged
into PCIe 3.0 slots, but they were connected to different PCIe Root Complexes
(RC). In Setup 2, a low-end Supermicro X7SPA-HF-D525 system was connected to a
Netstor NA255A external PCIe enclosure, where both the FPGA board and the GPU
were connected to the same RC, as opposed to Setup 1. As stated in NVIDIA's
GPUDirect documentation, the devices must share the same RC to achieve the
best performance~\cite{cuda_doc}. For FPGA-to-CPU data transfers, the
software implementation is the one described in~\cite{rota2015dma}.
%% --------------------------------------------------------------------------
\subsection{Throughput}
\begin{figure}[t]
\centering
\includegraphics[width=0.85\textwidth]{figures/throughput}
\caption{%
Measured throughput for data transfers from FPGA to main memory
(CPU) and from FPGA to the global GPU memory (GPU).
}
\label{fig:throughput}
\end{figure}
In order to evaluate the maximum performance of the DMA engine, measurements
of pure data throughput were carried out using Setup 1. The results are shown
in \figref{fig:throughput} for transfers to the system's main memory as well
as to the GPU's global memory. For FPGA-to-GPU data transfers bigger than
95 MB, the double buffering mechanism was used. As one can see, in both cases
the write performance is primarily limited by the PCIe bus. Up to a transfer
size of 2 MB, the throughput to the GPU slowly approaches 100 MB/s. From there
on, the throughput increases up to 6.4 GB/s at about 1 GB data size. The CPU
throughput saturates earlier, and the maximum throughput is 6.6 GB/s. The slope
and the maximum performance depend on the different implementations of the
handshaking sequence between the DMA engine and the host.
%% --------------------------------------------------------------------------
\subsection{Latency}
\begin{figure}[t]
\centering
\begin{subfigure}[b]{.49\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/latency-cpu}
\label{fig:latency-cpu}
\vspace{-0.4\baselineskip}
\caption{}
\end{subfigure}
\begin{subfigure}[b]{.49\textwidth}
\includegraphics[width=\textwidth]{figures/latency-gpu}
\label{fig:latency-gpu}
\vspace{-0.4\baselineskip}
\caption{}
\end{subfigure}
\caption{Measured round-trip latency for data transfers to system main memory (a) and GPU memory (b).}
\label{fig:latency}
\end{figure}
We conducted the following test in order to measure the latency introduced by
the DMA engine:
\begin{enumerate}
\item the host starts a DMA transfer by issuing the \emph{start\_dma} command;
\item the DMA engine transmits data into the system main memory;
\item when all the data have been transferred, the DMA engine notifies the
host that new data are present by writing into a specific address in the
system main memory;
\item the host acknowledges that the data have been received by issuing the
\emph{stop\_dma} command.
\end{enumerate}
The correct ordering of the packets is assured by the PCIe protocol.
A counter on the FPGA measures the time interval between the \emph{start\_dma}
and \emph{stop\_dma} commands with a resolution of 4 ns, thereby measuring
the round-trip latency of the system. The round-trip latencies for data
transfers to system main memory and GPU memory are shown in
\figref{fig:latency}.
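The following sketch shows the host's side of this test. It is purely
illustrative: the register offsets, the \texttt{write\_fpga\_register} helper
and the polled notification word are placeholders, and the actual timing is
taken from the FPGA counter rather than from the host.
\begin{verbatim}
#include <stdint.h>

/* Placeholder register offsets and driver helper. */
#define REG_START_DMA 0x00
#define REG_STOP_DMA  0x04
extern void write_fpga_register(uint32_t offset, uint32_t value);

/* Host-side view of the round-trip test; the interval between start_dma
   and stop_dma is measured on the FPGA with a 4 ns resolution counter. */
void latency_roundtrip(volatile uint32_t *notify_word)
{
    *notify_word = 0;
    write_fpga_register(REG_START_DMA, 1);   /* 1) start_dma           */
                                             /* 2) DMA writes the data */
    while (*notify_word == 0)                /* 3) wait for the        */
        ;                                    /*    notification write  */
    write_fpga_register(REG_STOP_DMA, 1);    /* 4) stop_dma            */
}
\end{verbatim}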
When system main memory is used,
latencies as low as 1.1 \textmu s are achieved with Setup 1 for a packet size
of 1024 B. The higher latency and the dependence on the packet size measured
with Setup 2 are caused by the slower PCIe x4 Gen1 link connecting the FPGA
board to the system main memory.
The same test was performed when transferring data into GPU memory, but also
in this case the notification is written to system main memory. This approach
was used because the latency introduced by OpenCL scheduling
($\sim$100 \textmu s) does not allow for a direct measurement based only on
DirectGMA communication. When the devices are connected to the same RC, as in
Setup 2, a latency of 2 \textmu s is achieved (limited by the latency to
system main memory, as seen in \figref{fig:latency} (a)). On the contrary, if
the FPGA board and the GPU are connected to different RCs, as in Setup 1, the
latency increases significantly. It must be noted that the low latencies
measured with Setup 1 for packet sizes below 1 kB seem to be due to a caching
mechanism inside the PCIe switch, and it is not clear whether the data have
been successfully written into GPU memory when the notification is delivered
to the CPU. This effect must be taken into account for future implementations
as it could potentially lead to data corruption.
\section{Conclusion and outlook}
We developed a hardware and software solution that enables DMA transfers
between FPGA-based readout systems and GPU computing clusters.
The net throughput is primarily limited by the PCIe link, reaching 6.4 GB/s
for FPGA-to-GPU data transfers and 6.6 GB/s for FPGA-to-CPU main memory
data transfers. The measurements on a low-end system based on an Intel Atom
CPU showed no significant difference in throughput performance. Depending on
the application and computing requirements, this result makes smaller
acquisition systems a cost-effective alternative to larger workstations.
We measured a round-trip latency of 1 \textmu s when transferring data between
the DMA engine and the system main memory. We also assessed the applicability
of DirectGMA in low-latency applications: preliminary results show that
latencies as low as 2 \textmu s can be achieved during data transfers to GPU
memory. However, at the time of writing this paper, the latency introduced by
OpenCL scheduling is in the range of hundreds of \textmu s. Optimization of
the GPU-DMA interfacing OpenCL code is ongoing, with the help of AMD's
technical support, in order to lift this limitation and enable the use of
our implementation in low-latency applications. Moreover, the measurements
show that dedicated hardware must be employed in low-latency applications.
In order to increase the total throughput, a custom FPGA evaluation board is
currently under development. The board mounts a Virtex-7 chip and features two
fully populated FMC connectors, a 119 Gb/s DDR memory interface and a PCIe x16
Gen3 connection. Two x8 Gen3 cores, instantiated on the board, will be mapped
as a single x16 device by using an external PCIe switch. With two cores
operating in parallel, we foresee an increase in the data throughput by a
factor of two (as demonstrated in~\cite{rota2015dma}).
The proposed software solution allows seamless multi-GPU processing of
the incoming data, thanks to its integration into our streamed computing
framework. This allows straightforward integration with different DAQ systems
and the introduction of custom data processing algorithms.
Support for NVIDIA's GPUDirect technology is also foreseen in the coming
months to remove the dependence on one specific GPU vendor and to compare the
performance of hardware from different vendors. Further improvements are
expected from generalizing the transfer mechanism and including InfiniBand
support besides the existing PCIe connection.
Our goal is to develop a unique hybrid solution,
based on commercial standards, that includes fast data transmission protocols
and a high-performance GPU computing framework.
\acknowledgments
This work was partially supported by the German-Russian BMBF funding programme,
grant numbers 05K10CKB and 05K10VKE.
\bibliographystyle{JHEP}
\bibliography{literature}
\end{document}