\documentclass{JINST}
\usepackage{lineno}
\usepackage{ifthen}
\newboolean{draft}
\setboolean{draft}{true}
\newcommand{\figref}[1]{Figure~\ref{#1}}
\title{A high-throughput readout architecture based on PCI-Express Gen3 and DirectGMA technology}
\author{N.~Zilio$^b$,
M.~Weber$^a$\\
\llap{$^a$}Institute for Data Processing and Electronics,\\
Karlsruhe Institute of Technology (KIT),\\
Herrmann-von-Helmholtz-Platz 1, Karlsruhe, Germany\\
\llap{$^b$}Somewhere in France eating Pate
}
\abstract{%
%% Old
% \emph{A growing number of physics experiments requires DAQ systems with multi-GB/s
% data links.}
%proposal for new abstract, including why do we need GPUs
Modern physics experiments have reached multi-GB/s data rates. Fast data
links and high-performance computing stages are required for continuous
acquisition and processing. Because of their intrinsic parallelism and
computational power, GPUs have emerged as an ideal solution for high-performance
computing applications. To connect a fast data acquisition stage with a GPU's
processing power, we developed an architecture consisting of an FPGA that
includes a Direct Memory Access (DMA) engine compatible with the
Xilinx PCI-Express core, a Linux driver for register access and high-level
software to manage direct memory transfers using AMD's DirectGMA technology.
Measurements with a Gen3 x8 link show a throughput of up to 6.x GB/s. Our
implementation is suitable for real-time DAQ applications ranging from
photon science and medical imaging to HEP experiment triggers.
}
\begin{document}
\ifdraft
\setpagewiselinenumbers
\linenumbers
\fi
\section{Motivation}
GPU computing has become one of the main driving forces for high-performance
computing due to its unprecedented parallelism and favorable cost-benefit ratio.
GPU acceleration has found its way into numerous applications, ranging from
simulation to image processing. Recent years have also seen an increasing
interest in GPU-based systems for HEP applications, which require a combination
of high data rates, high computational power and low latency (\emph{e.g.}
ATLAS~\cite{atlas_gpu}, ALICE~\cite{alice_gpu}, Mu3e~\cite{mu3e_gpu},
PANDA~\cite{panda_gpu}). Moreover, the volumes of data produced at modern photon
science facilities have become comparable to those traditionally associated with
HEP.
In HEP experiments, data is acquired by one or more read-out boards and then
transmitted to GPUs in short bursts or in a continuous streaming mode. With
expected data rates of several GB/s, the data transmission link between the
read-out boards and the host system may partially limit the overall system
performance. In particular, latency becomes the most stringent requirement if
time-deterministic feedback is needed, \emph{e.g.} in low- and high-level triggers.
To address these problems, we propose a complete hardware/software stack
based on our own Direct Memory Access (DMA) design and the integration
of AMD's DirectGMA technology into our processing pipeline. In our solution,
PCI-Express (PCIe) has been chosen as the data link between FPGA boards and the
host computer. Due to its high bandwidth and modularity, PCIe quickly became the
commercial standard for connecting high-throughput peripherals such as GPUs or
solid-state disks. Optical PCIe networks have been demonstrated
for nearly a decade~\cite{optical_pcie}, opening the possibility of using PCIe
as a communication bus over long distances. In HEP DAQ systems in particular,
optical links are preferred over electrical ones because of their superior
radiation hardness, lower power consumption and higher density.
%% Added some more here, I need better internet to find the correct references
Lonardo et~al.\ combined PCIe-based readout with direct GPU communication in
their NaNet design, an FPGA-based PCIe network interface card with NVIDIA
GPUDirect integration~\cite{lonardo2015nanet}. Due to its design, the bandwidth
saturates at 120 MB/s for 1472-byte UDP datagrams. Moreover, the system is based
on a commercial PCIe engine. Other solutions achieve higher throughput based on
Xilinx (CITE TWEPP DMA WURTT??) or Altera devices (CITENICHOLASPAPER TNS), but
they do not provide support for direct FPGA-GPU communication.
\section{Architecture}
DMA data transfers are handled by dedicated hardware which, compared with
Programmed Input/Output (PIO) access, offers lower latency and higher throughput
at the cost of higher system complexity.
\begin{figure}[t]
\centering
\includegraphics[width=1.0\textwidth]{figures/transf}
\caption{%
In a traditional DMA architecture (a), data is first written to the main
system memory and then sent to the GPUs for final processing. By using
GPUDirect/DirectGMA technology (b), the DMA engine has direct access to
the GPU's internal memory.
}
\label{fig:trad-vs-dgpu}
\end{figure}
As shown in \figref{fig:trad-vs-dgpu} (a), traditional FPGA-GPU systems route
data through system main memory by copying data from the FPGA into intermediate
buffers and then finally into the GPU's main memory. Thus, the total throughput
of the system is limited by the main memory bandwidth. NVIDIA's GPUDirect and
AMD's DirectGMA technologies allow direct communication between GPUs and
auxiliary devices over the PCIe bus. By combining this technology with DMA
data transfers (see \figref{fig:trad-vs-dgpu} (b)), the overall latency of the
system is reduced and the total throughput is increased. Moreover, the CPU and
main system memory are relieved of processing load because they are no longer
directly involved in the data transfer.
\subsection{DMA engine implementation on the FPGA}
We have developed a DMA architecture that minimizes resource utilization while
maintaining the flexibility of a scatter-gather memory
policy~\cite{rota2015dma}. The engine is compatible with the Xilinx PCIe Gen2/3
IP core~\cite{xilinxgen3} for the Xilinx 6 and 7 series FPGA families. DMA
transmissions to main system memory and GPU memory are both supported. Two
FIFOs, with a data width of 256 bits and operating at 250 MHz, act as
user-friendly interfaces with the custom logic. The resulting input bandwidth of
7.8 GB/s is enough to saturate a PCIe Gen3 x8 link\footnote{The theoretical net
bandwidth of a PCIe 3.0 x8 link with a payload of 1024 B is 7.6 GB/s.}. The user
logic and the DMA engine are configured by the host through PIO registers.
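As an illustrative cross-check (an estimate with assumed protocol overhead, not
an additional measurement), the quoted link bandwidth follows from the Gen3 line
rate with 128b/130b encoding, reduced by roughly 40 B of TLP framing and DLLP
overhead per 1024 B payload:
\[
  8\,\mathrm{GT/s} \times 8\ \mathrm{lanes} \times 128/130
  \approx 63\,\mathrm{Gbit/s} \approx 7.9\,\mathrm{GB/s},
  \qquad
  7.9\,\mathrm{GB/s} \times 1024/(1024+40) \approx 7.6\,\mathrm{GB/s}.
\]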
The physical addresses of the host's memory buffers are stored in an internal
memory and are dynamically updated by the driver or the user, allowing highly
efficient zero-copy data transfers. The maximum size associated with each
address is 2 GB.
\subsection{OpenCL management on host side}
\label{sec:host}
On the host side, AMD's DirectGMA technology, an implementation of the
bus-addressable memory extension for OpenCL 1.1 and later, is used to write from
the FPGA to GPU memory and from the GPU to the FPGA's control registers.
\figref{fig:opencl-setup} illustrates the main mode of operation: to write into
GPU memory, the physical bus address of a GPU buffer is determined with a call to
\texttt{clEnqueue\-Make\-Buffers\-Resident\-AMD} and written by the host CPU into a
control register of the FPGA (1). The FPGA then writes data blocks autonomously
in DMA fashion (2). Due to hardware restrictions, the largest possible GPU buffer
size is about 95 MB, but larger transfers can be achieved using a double-buffering
mechanism. Because the GPU provides a flat memory address space and
our DMA engine allows multiple destination addresses to be set in advance, we
can determine all addresses before the actual transfers, thus keeping the
CPU out of the transfer loop.
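The following minimal sketch illustrates step (1). It assumes that the
\texttt{cl\_amd\_bus\_addressable\_memory} declarations are available from
\texttt{CL/cl\_ext.h} and that the extension entry point can be called directly;
on some platforms it must first be obtained via
\texttt{clGetExtensionFunctionAddressForPlatform}. The register write itself is
specific to our driver and only indicated by a comment; error handling is
omitted.
\begin{verbatim}
#include <CL/cl.h>
#include <CL/cl_ext.h>  /* cl_amd_bus_addressable_memory declarations */

/* Allocate a bus-addressable GPU buffer, pin it and return its physical
 * bus address, which is then written into an FPGA control register
 * through our PIO interface (not shown). */
static cl_mem setup_gpu_buffer(cl_context ctx, cl_command_queue queue,
                               size_t size, cl_ulong *bus_address)
{
    cl_bus_address_amd addr;
    cl_mem buf;

    /* Buffer that the GPU exposes on the PCIe bus for peer writes */
    buf = clCreateBuffer(ctx, CL_MEM_BUS_ADDRESSABLE_AMD, size, NULL, NULL);

    /* Make the buffer resident and query its physical bus address */
    clEnqueueMakeBuffersResidentAMD(queue, 1, &buf, CL_TRUE, &addr,
                                    0, NULL, NULL);

    *bus_address = addr.surface_bus_address;
    return buf;
}
\end{verbatim}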
To signal events to the FPGA (4), the control registers can be mapped into the
GPU's address space by passing a special AMD-specific flag and the physical
BAR address of the FPGA configuration memory to the \texttt{cl\-Create\-Buffer}
function. From the GPU, this memory appears transparently as regular GPU
memory and can be written to accordingly (3).
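A corresponding sketch of the signalling path, under the same assumptions as
above: \texttt{CL\_MEM\_EXTERNAL\_PHYSICAL\_AMD} belongs to the same extension,
and the migration call is included because some driver versions require it
before the mapping can be used by kernels.
\begin{verbatim}
/* Expose the FPGA control registers, located at a known physical BAR
 * address, as a cl_mem object that GPU kernels can write to. */
static cl_mem map_fpga_registers(cl_context ctx, cl_command_queue queue,
                                 cl_ulong bar_address, size_t size)
{
    cl_bus_address_amd addr = { 0 };
    cl_mem regs;

    /* host_ptr carries the physical bus address instead of host memory */
    addr.surface_bus_address = bar_address;
    regs = clCreateBuffer(ctx, CL_MEM_EXTERNAL_PHYSICAL_AMD,
                          size, &addr, NULL);

    /* Associate the mapping with the device behind this queue */
    clEnqueueMigrateMemObjects(queue, 1, &regs, 0, 0, NULL, NULL);
    return regs;
}
\end{verbatim}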
\begin{figure}
\centering
\includegraphics[width=0.75\textwidth]{figures/opencl-setup}
\caption{The FPGA writes to GPU memory by mapping the physical address of a
GPU buffer and initiating DMA transfers. Signalling happens in reverse order by
mapping the FPGA control registers into the address space of the GPU.}
\label{fig:opencl-setup}
\end{figure}
To process the data, we encapsulated the DMA setup and memory mapping in a
plugin for our scalable GPU processing framework~\cite{vogelgesang2012ufo}. This
framework allows for easy construction of streamed data processing pipelines on
heterogeneous multi-GPU systems. For example, to read data from the FPGA, decode
its specific format, run a Fourier transform on the GPU and write the results
back to disk, one can run \texttt{ufo-launch direct-gma ! decode !
fft ! write filename=out.raw} on the command line. The framework takes care
of scheduling the tasks and distributing the data items accordingly. A
complementary application programming interface allows users to develop custom
applications written in C or in high-level languages such as Python. High
throughput is achieved by combining fine- and coarse-grained data
parallelism, \emph{i.e.} by processing a single data item on a GPU using thousands
of threads and by splitting the data stream and feeding individual data items to
separate GPUs. None of this requires user intervention; it is determined by the
framework in an automated fashion.
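For illustration, the same pipeline could be expressed through the C interface
roughly as follows. This is only a sketch: the GObject-based constructors and
the function names \texttt{ufo\_plugin\_manager\_get\_task},
\texttt{ufo\_task\_graph\_connect\_nodes} and \texttt{ufo\_base\_scheduler\_run}
are assumed from the framework's public API, and error handling is omitted.
\begin{verbatim}
#include <ufo/ufo.h>

int main(void)
{
    UfoPluginManager *pm = ufo_plugin_manager_new();
    UfoTaskGraph *graph = UFO_TASK_GRAPH(ufo_task_graph_new());
    UfoBaseScheduler *sched = ufo_scheduler_new();

    /* Instantiate the same tasks as in the ufo-launch example */
    UfoTaskNode *dma    = ufo_plugin_manager_get_task(pm, "direct-gma", NULL);
    UfoTaskNode *decode = ufo_plugin_manager_get_task(pm, "decode", NULL);
    UfoTaskNode *fft    = ufo_plugin_manager_get_task(pm, "fft", NULL);
    UfoTaskNode *write  = ufo_plugin_manager_get_task(pm, "write", NULL);

    g_object_set(write, "filename", "out.raw", NULL);

    /* direct-gma -> decode -> fft -> write */
    ufo_task_graph_connect_nodes(graph, dma, decode);
    ufo_task_graph_connect_nodes(graph, decode, fft);
    ufo_task_graph_connect_nodes(graph, fft, write);

    ufo_base_scheduler_run(sched, graph, NULL);
    return 0;
}
\end{verbatim}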
\section{Results}
We measured the performance using a Xilinx VC709 evaluation board plugged into a
desktop PC with an Intel Xeon E5-1630 3.7 GHz processor and an Intel C612
chipset.
Due to the GPU buffer size limitation presented in Section~\ref{sec:host},
several sub-buffers have to be copied in order to transfer data larger than the
maximum transfer size of 95 MB. \figref{fig:intra-copy} shows the throughput for
a copy from a smaller buffer (representing the DMA buffer) into a larger buffer.
At a block size of about 384 KB, the throughput surpasses the maximum possible
PCIe bandwidth, thus making a double-buffering strategy a viable solution for
very large data transfers.
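A minimal sketch of the double-buffering idea: two small DMA buffers are filled
alternately by the FPGA, and each filled buffer is copied on the GPU into a
large destination buffer at an increasing offset. The \texttt{wait\_for\_fpga()}
call is a hypothetical placeholder for the signalling mechanism described in
Section~\ref{sec:host}.
\begin{verbatim}
#include <CL/cl.h>

extern void wait_for_fpga(int idx);   /* hypothetical signalling hook */

/* Assemble a large GPU buffer from two small DMA sub-buffers that the
 * FPGA fills alternately. */
static void assemble(cl_command_queue queue, cl_mem dma_buf[2],
                     size_t block_size, cl_mem dest, size_t total)
{
    size_t offset = 0;
    int cur = 0;

    while (offset < total) {
        size_t n = total - offset < block_size ? total - offset : block_size;

        wait_for_fpga(cur);     /* sub-buffer `cur' has been filled */
        clEnqueueCopyBuffer(queue, dma_buf[cur], dest, 0, offset, n,
                            0, NULL, NULL);
        offset += n;
        cur = 1 - cur;          /* FPGA refills the other sub-buffer */
    }
    clFinish(queue);
}
\end{verbatim}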
\begin{figure}
\includegraphics[width=\textwidth]{figures/intra-copy}
\caption{%
Throughput in MB/s for an intra-GPU data transfer of smaller block sizes
(4 KB -- 24 MB) into a larger destination buffer (32 MB -- 128 MB). The lower
performance for smaller block sizes is caused by the larger number of
transfers required to fill the destination buffer. The throughput has been
estimated using the host-side wall clock time. On-GPU data transfer is about
twice as fast.
}
\label{fig:intra-copy}
\end{figure}
\subsection{Throughput}
\subsection{Latency}
%% Change the specs for the small crate
For FPGA-to-GPU transfers, we also repeated the measurements using a low-end system
based on XXX and an Intel Nano XXXX. The results do not show any significant difference
compared to the previous setup, making the low-end system a more cost-effective solution.
\begin{figure}
\includegraphics[width=\textwidth]{figures/latency-michele}
\caption{%
FILL ME
}
\label{fig:latency-hist}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=0.6\textwidth]{figures/through_plot}
\caption{
Writing from the FPGA to either system or GPU memory is primarily limited by
the PCIe bus. Higher payloads introduce less overhead, thus increasing the net bandwidth.
Up to a transfer size of 2 MB, the performance is almost the same; beyond that,
the GPU transfer shows a slightly better slope. Data transfers larger than 1 GB
saturate the PCIe bus.
\textbf{LR: We should measure the slope for different page sizes, I expect the saturation point
to change for different page sizes.}
}
\label{fig:throughput}
\end{figure}
%% Latency here? What do we do?
%% We should add an histogram with 1000+ measurements of the latency to see if it's time-deterministic
%% Also: add a plot of latency vs different data sizes transmitted (from FPGA)
\begin{figure}
\centering
\includegraphics[width=0.6\textwidth]{figures/latency}
\caption{%
The data transmission latency is decreased by XXX percent with respect to the traditional
approach (a) by using DirectGMA (b). The latency has been measured by taking the round-trip time
of a 4 kB packet.
}
\label{fig:latency}
\end{figure}
\section{Conclusion}
%% Added here
We developed a complete hardware and software solution that enables direct DMA transfers
between FPGA-based readout boards and GPU computing clusters. The net throughput is mainly
limited by the PCIe bus, reaching 6.7 GB/s for a 256 B payload. By writing directly into GPU
memory instead of routing data through system main memory, the latency is reduced by a factor of 2.
The proposed solution enables high-performance GPU computing thanks to the support of our
processing framework; integration with different DAQ systems and custom algorithms is therefore
straightforward.
\subsection{Outlook}
Support for NVIDIA's GPUDirect technology is foreseen in the coming months; this will
lift the restriction to a single GPU vendor and make a direct performance comparison possible.
A custom FPGA evaluation board is currently under development in order to
increase the total throughput. The board mounts a Virtex-7 chip and features two
fully populated FMC connectors, a 119 Gb/s DDR memory interface and a PCIe Gen3
x16 connection. Two PCIe x8 cores, instantiated on the board, will be mapped as a
single x16 device by using an external PCIe switch. With two cores operating in parallel,
we foresee an increase in the data throughput by a factor of 2 (as demonstrated in~\cite{rota2015dma}).
\textbf{LR: Instead of swapping PCIe-InfiniBand, I would say include it in the architecture.
A big house for all these love-lacking protocols.}
We also intend to add InfiniBand support.
% TODO: summarize the advantages of InfiniBand with respect to PCIe.
\textbf{LR: Here comes the visionary Luigi...}
Our goal is to develop a unique hybrid solution, based
on commercial standards, that combines fast data transmission protocols with a high-performance
GPU computing framework.
\acknowledgments
% TODO: credit funding (UFO? KSETA?).
\bibliographystyle{JHEP}
\bibliography{literature}
\end{document}