\documentclass{JINST}
\usepackage[utf8]{inputenc}
\usepackage{lineno}
\usepackage{ifthen}
\newboolean{draft}
\setboolean{draft}{true}
\newcommand{\figref}[1]{Figure~\ref{#1}}
\title{A high-throughput readout architecture based on PCI-Express Gen3 and DirectGMA technology}
\author{N.~Zilio$^b$,
M.~Weber$^a$\\
\llap{$^a$}Institute for Data Processing and Electronics,\\
Karlsruhe Institute of Technology (KIT),\\
Herrmann-von-Helmholtz-Platz 1, Karlsruhe, Germany\\
%\llap{$^b$} TODO: add affiliation for N. Zilio
}
\abstract{%
Modern physics experiments have reached multi-GB/s data rates. Fast data
links and high performance computing stages are required for continuous
acquisition and processing. Because of their intrinsic parallelism and
computational power, GPUs have emerged as an ideal solution for high
performance computing applications. To connect a fast data acquisition stage
with a GPU's processing power, we developed an architecture consisting of an
FPGA that includes a Direct Memory Access (DMA) engine compatible with the
Xilinx PCI-Express core, a Linux driver for register access and high-level
software to manage direct memory transfers using AMD's DirectGMA technology.
Measurements with a Gen3 x8 link show a throughput of up to 6.x GB/s. Our
implementation is suitable for real-time DAQ applications ranging from
photon science and medical imaging to HEP experiment triggers.
}
\begin{document}
\ifthenelse{\boolean{draft}}{%
  \setpagewiselinenumbers
  \linenumbers
}{}
\section{Motivation}
GPU computing has become one of the main driving forces for high performance
computing due to its unprecedented degree of parallelism and favorable
cost-benefit ratio. GPU acceleration has found its way into numerous
applications, ranging from simulation to image processing. Recent years have
also seen an increasing interest in GPU-based systems for HEP applications,
which require a combination of high data rates, high computational power and
low latency (\emph{e.g.} ATLAS~\cite{atlas_gpu}, ALICE~\cite{alice_gpu},
Mu3e~\cite{mu3e_gpu}, PANDA~\cite{panda_gpu}). Moreover, the volumes of data
produced at recent photon science facilities have become comparable to those
traditionally associated with HEP.
In HEP experiments, data is acquired by one or more read-out boards and then
transmitted to GPUs in short bursts or in a continuous streaming mode. With
expected data rates of several GB/s, the data transmission link between the
read-out boards and the host system may partially limit the overall system
performance. In particular, latency becomes the most stringent requirement if
time-deterministic feedback is required, \emph{e.g.} for low- and high-level
triggers.
To address these problems we propose a complete hardware/software stack
architecture based on our own Direct Memory Access (DMA) design and integration
of AMD's DirectGMA technology into our processing pipeline. In our solution,
PCI Express (PCIe) has been chosen as a data link between FPGA boards and the
host computer. Due to its high bandwidth and modularity, PCIe quickly became the
commercial standard for connecting high-throughput peripherals such as GPUs or
solid state disks. Optical PCIe networks have been demonstrated
for nearly a decade~\cite{optical_pcie}, opening the possibility of using PCIe
as a communication bus over long distances. In particular, in HEP DAQ systems,
optical links are preferred over electrical ones because of their superior
radiation hardness, lower power consumption and higher density.
Lonardo et~al.\ achieved direct FPGA-GPU communication with their NaNet design,
an FPGA-based PCIe network interface card with NVIDIA GPUDirect
integration~\cite{lonardo2015nanet}. Due to its design, however, the bandwidth
saturates at 120 MB/s for UDP datagrams of 1472 bytes. Moreover, the system is
based on a commercial PCIe engine. Other solutions based on Xilinx
(CITE TWEPP DMA WURTT??) or Altera devices (CITENICHOLASPAPER TNS) achieve
higher throughput, but they do not provide support for direct FPGA-GPU
communication.
\section{Architecture}
DMA data transfers are handled by dedicated hardware, which, compared with
Programmed Input/Output (PIO) access, offers lower latency and higher throughput
at the cost of higher system complexity.
\begin{figure}[t]
\centering
\includegraphics[width=1.0\textwidth]{figures/transf}
\caption{%
In a traditional DMA architecture (a), data is first written to the main
system memory and then sent to the GPUs for final processing. By using
GPUDirect/DirectGMA technology (b), the DMA engine has direct access to
the GPU's internal memory.
}
\label{fig:trad-vs-dgpu}
\end{figure}
As shown in \figref{fig:trad-vs-dgpu} (a), traditional FPGA-GPU systems route
data through system main memory by copying data from the FPGA into intermediate
buffers and then finally into the GPU's main memory. Thus, the total throughput
of the system is limited by the main memory bandwidth. NVIDIA's GPUDirect and
AMD's DirectGMA technologies allow direct communication between GPUs and
auxiliary devices over the PCIe bus. By combining this technology with a DMA
data transfer (see \figref{fig:trad-vs-dgpu} (b)), the overall latency of the
system is reduced and the total throughput increased. Moreover, the CPU and
system main memory are relieved because they are no longer directly involved in
the data transfer.
\subsection{DMA engine implementation on the FPGA}
We have developed a DMA architecture that minimizes resource utilization while
maintaining the flexibility of a scatter-gather memory
policy~\cite{rota2015dma}. The engine is compatible with the Xilinx PCIe Gen2/3
IP-Core~\cite{xilinxgen3} for the Xilinx 6 and 7 series FPGA families. DMA
transmissions to main system memory and to GPU memory are both supported. Two
FIFOs, with a data width of 256 bits and operating at 250 MHz, act as
user-friendly interfaces to the custom logic. The resulting input bandwidth of
7.8 GB/s is enough to saturate a PCIe Gen3 x8 link\footnote{The theoretical net
bandwidth of a PCIe 3.0 x8 link with a payload of 1024 B is 7.6 GB/s.}. The user
logic and the DMA engine are configured by the host through PIO registers.
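The figure quoted in the footnote can be reproduced with a rough estimate,
assuming 128b/130b encoding and about 28 bytes of TLP header and framing
overhead per 1024 B payload:
\[
\frac{8 \times 8\,\mathrm{Gb/s}}{8\,\mathrm{bit/byte}} \times \frac{128}{130}
\times \frac{1024\,\mathrm{B}}{1024\,\mathrm{B} + 28\,\mathrm{B}}
\approx 7.7\,\mathrm{GB/s},
\]
with flow-control and acknowledgement packets accounting for the remaining
difference to the quoted 7.6 GB/s.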
The physical addresses of the host's memory buffers are stored in an internal
memory and are dynamically updated by the driver or the user, allowing highly
efficient zero-copy data transfers. The maximum size associated with each
address is 2 GB.
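As an illustration of the scatter-gather policy, the following sketch shows how
such an address table could be populated through PIO registers. The register
offsets and the \texttt{fpga\_write\_register} helper are hypothetical
placeholders and do not reflect the actual register map of our engine:
\begin{verbatim}
/* Hypothetical sketch: fill the engine's internal address table with the
 * physical addresses and lengths of the host (or GPU) buffers.  Register
 * offsets and the PIO helper are illustrative only. */
#include <stdint.h>

#define REG_DESC_BASE    0x100          /* start of the address table */
#define REG_DESC_STRIDE  0x10           /* one slot: 64-bit address + length */
#define MAX_DESC_LENGTH  (2ULL << 30)   /* 2 GB per address, as stated above */

extern void fpga_write_register(uint32_t offset, uint32_t value);  /* PIO */

int write_descriptor(unsigned slot, uint64_t phys_addr, uint64_t length)
{
    uint32_t base = REG_DESC_BASE + slot * REG_DESC_STRIDE;

    if (length > MAX_DESC_LENGTH)       /* respect the per-descriptor limit */
        return -1;

    fpga_write_register(base + 0x0, (uint32_t) phys_addr);         /* low  */
    fpga_write_register(base + 0x4, (uint32_t) (phys_addr >> 32)); /* high */
    fpga_write_register(base + 0x8, (uint32_t) length);            /* size */
    return 0;
}
\end{verbatim}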
\subsection{OpenCL management on host side}
\label{sec:host}
On the host side, AMD's DirectGMA technology, an implementation of the
bus-addressable memory extension for OpenCL 1.1 and later, is used to write from
the FPGA to GPU memory and from the GPU to the FPGA's control registers.
\figref{fig:opencl-setup} illustrates the main mode of operation: to write into
the GPU, the physical bus addresses of the GPU buffers are determined with a call to
\texttt{clEnqueue\-Make\-Buffers\-Resident\-AMD} and set by the host CPU in a
control register of the FPGA (1). The FPGA then writes data blocks autonomously
in DMA fashion (2).
Due to hardware restrictions, the largest possible GPU buffer size is about
95 MB, but larger transfers can be achieved using a double buffering mechanism.
Because the GPU provides a flat memory address space and our DMA engine allows
multiple destination addresses to be set in advance, we can determine all
addresses before the actual transfers, thus minimizing CPU involvement in the
transfer loop.
To signal events to the FPGA (4), the control registers can be mapped into the
GPU's address space by passing a special AMD-specific flag and the physical
BAR address of the FPGA configuration memory to the \texttt{cl\-Create\-Buffer}
function. From the GPU, this memory is seen transparently as regular GPU
memory and can be written accordingly (3). Individual write accesses are issued
as PIO commands; however, using the \texttt{cl\-Enqueue\-Copy\-Buffer} function,
it is also possible to write entire memory regions to the FPGA in a DMA
fashion. In this case, the GPU acts as bus master to push data to the FPGA.
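The host-side setup of steps (1)--(4) can be sketched as follows. The sketch
follows the \texttt{cl\_amd\_bus\_addressable\_memory} extension; the flags and
the exact prototype of \texttt{clEnqueueMakeBuffersResidentAMD} should be taken
from \texttt{CL/cl\_ext.h}, while \texttt{fpga\_write\_register} and the
register offsets are hypothetical placeholders for the PIO access provided by
our Linux driver:
\begin{verbatim}
/* Sketch of the DirectGMA setup (steps 1-4 above).  Flag names and the
 * prototype of clEnqueueMakeBuffersResidentAMD follow the
 * cl_amd_bus_addressable_memory extension (CL/cl_ext.h); error handling is
 * omitted.  fpga_write_register() and the register offsets are hypothetical. */
#include <CL/cl.h>
#include <CL/cl_ext.h>

#define REG_GPU_ADDR_LO 0x40               /* hypothetical control registers */
#define REG_GPU_ADDR_HI 0x44

extern void fpga_write_register(unsigned offset, unsigned value);  /* driver */

/* extension entry point, resolved at run time */
typedef cl_int (*MakeResidentFn)(cl_command_queue, cl_uint, cl_mem *, cl_bool,
                                 cl_bus_address_amd *, cl_uint,
                                 const cl_event *, cl_event *);

void setup_directgma(cl_platform_id platform, cl_context ctx,
                     cl_command_queue queue, cl_ulong fpga_bar_address)
{
    cl_int err;

    /* (1) create a bus-addressable GPU buffer, pin it and query its address */
    cl_mem gpu_buf = clCreateBuffer(ctx, CL_MEM_BUS_ADDRESSABLE_AMD,
                                    64 << 20, NULL, &err);
    MakeResidentFn make_resident = (MakeResidentFn)
        clGetExtensionFunctionAddressForPlatform(platform,
            "clEnqueueMakeBuffersResidentAMD");
    cl_bus_address_amd gpu_addr;
    make_resident(queue, 1, &gpu_buf, CL_TRUE, &gpu_addr, 0, NULL, NULL);

    /* write the bus address into the FPGA control registers; the FPGA then
     * performs the DMA writes autonomously (2) */
    fpga_write_register(REG_GPU_ADDR_LO,
                        (unsigned) gpu_addr.surface_bus_address);
    fpga_write_register(REG_GPU_ADDR_HI,
                        (unsigned) (gpu_addr.surface_bus_address >> 32));

    /* (3, 4) expose the FPGA BAR to the GPU: the buffer wraps the physical
     * BAR address and can be written from kernels or via clEnqueueCopyBuffer */
    cl_bus_address_amd bar = { fpga_bar_address, fpga_bar_address };
    cl_mem fpga_regs = clCreateBuffer(ctx, CL_MEM_EXTERNAL_PHYSICAL_AMD,
                                      4096, &bar, &err);
    (void) fpga_regs;
}
\end{verbatim}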
\begin{figure}
\centering
\includegraphics[width=0.75\textwidth]{figures/opencl-setup}
\caption{The FPGA writes to GPU memory by mapping the physical address of a
GPU buffer and initiating DMA transfers. Signalling happens in reverse order by
mapping the FPGA control registers into the address space of the GPU.}
\label{fig:opencl-setup}
\end{figure}
To process the data, we encapsulated the DMA setup and memory mapping in a
plugin for our scalable GPU processing framework~\cite{vogelgesang2012ufo}. This
framework allows for an easy construction of streamed data processing pipelines
on heterogeneous multi-GPU systems. For example, to read data from the FPGA,
decode its specific data format, run a Fourier transform on the GPU and write
the results back to disk, one can execute the following on the command line:
\begin{verbatim}
ufo-launch direct-gma ! decode ! fft ! write filename=out.raw
\end{verbatim}
The framework takes care of scheduling the tasks and distributing the data items
accordingly. A complementary application programming interface allows users to
develop custom applications written in C or in high-level languages such as
Python. High throughput is achieved by the combination of fine- and
coarse-grained data parallelism, \emph{i.e.} processing a single data item on a
GPU using thousands of threads and splitting the data stream to feed individual
data items to separate GPUs. None of this requires any user intervention; it is
determined solely by the framework in an automated fashion.
\section{Results}
We measured the performance using a Xilinx VC709 evaluation board plugged into a
desktop PC with an Intel Xeon E5-1630 3.7 GHz processor and an Intel C612
chipset.
Due to the size limitation of the DMA buffer presented in
Section~\ref{sec:host}, several sub-buffers have to be copied in order to
transfer data larger than the maximum transfer size of 95 MB.
\figref{fig:intra-copy} shows the throughput for a copy from a smaller buffer
(representing the DMA buffer) into a larger destination buffer. At a block size
of about 384 KB, the throughput surpasses the maximum possible PCIe bandwidth,
thus making a double buffering strategy a viable solution for very large data
transfers.
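A minimal sketch of such a double buffering scheme is shown below: while the
FPGA fills one staging buffer, the previously filled one is copied on the GPU
into the large destination buffer with \texttt{clEnqueueCopyBuffer}. The
\texttt{wait\_for\_fpga} and \texttt{arm\_fpga} helpers stand for the FPGA
completion handshake and are hypothetical placeholders:
\begin{verbatim}
/* Double buffering sketch: stage[0] and stage[1] are bus-addressable staging
 * buffers filled alternately by the FPGA; the previously filled one is
 * appended to the large destination buffer with an on-GPU copy.
 * wait_for_fpga() and arm_fpga() are hypothetical placeholders. */
#include <CL/cl.h>

extern void wait_for_fpga(int buffer_index);  /* block until buffer is full */
extern void arm_fpga(int buffer_index);       /* tell the FPGA where to write */

void drain_to_destination(cl_command_queue queue, cl_mem stage[2], cl_mem dst,
                          size_t stage_size, size_t total_size)
{
    size_t offset = 0;
    int cur = 0;

    arm_fpga(cur);                            /* start filling the first one */

    while (offset < total_size) {
        size_t chunk = (total_size - offset < stage_size) ?
                       (total_size - offset) : stage_size;

        wait_for_fpga(cur);                   /* stage[cur] holds fresh data */
        arm_fpga(1 - cur);                    /* FPGA fills the other buffer */

        /* on-GPU copy from the staging buffer into the destination buffer */
        clEnqueueCopyBuffer(queue, stage[cur], dst, 0, offset, chunk,
                            0, NULL, NULL);

        offset += chunk;
        cur = 1 - cur;
    }
    clFinish(queue);                          /* wait for outstanding copies */
}
\end{verbatim}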
\begin{figure}
\includegraphics[width=\textwidth]{figures/intra-copy}
\caption{%
Throughput in MB/s for an intra-GPU data transfer of smaller block sizes
(4 KB -- 24 MB) into a larger destination buffer (32 MB -- 128 MB). The lower
performance for smaller block sizes is caused by the larger number of
transfers required to fill the destination buffer. The throughput has been
estimated using the host side wall clock time. The on-GPU data transfer itself
is about twice as fast.
}
\label{fig:intra-copy}
\end{figure}
\subsection{Throughput}
A high throughput is desired for applications in which the FPGA outputs large
amounts of data and latency is not the main concern. This includes fast,
high-resolution photon detectors as used in synchrotron facilities.
\figref{fig:throughput} shows the write throughput into GPU memory and into
system main memory. In both cases, the write performance is primarily limited
by the PCIe bus. Higher payloads introduce less overhead, thus increasing the
net bandwidth. Up to a transfer size of 2 MB, the performance is almost the
same; beyond that, the GPU transfer shows a slightly better slope. Data
transfers larger than 1 GB saturate the PCIe bus.
\begin{figure}
\centering
\includegraphics[width=1.0\textwidth]{figures/throughput}
\caption{
Throughput of a regular DMA transfer to system memory and of our GPU DMA data
transfer for up to 50 GB of data.
}
\label{fig:throughput}
\end{figure}
\subsection{Latency}
\begin{figure}
\includegraphics[width=\textwidth]{figures/latency-michele}
\caption{%
Relative frequency of measured latencies for a single 4 KB packet transferred
from the GPU to the FPGA.
}
\label{fig:latency-distribution}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=0.6\textwidth]{figures/latency}
\caption{%
For data transfers larger than XX MB, latency is decreased by XXX percent with
respect to the traditional approach (a) by using our implementation (b).
}
\label{fig:latency}
\end{figure}
\figref{fig:latency} shows the comparison between the traditional approach and
the direct GPU DMA data transfer; the total latency is decreased because the
intermediate copy into system main memory is avoided. The distribution of the
measured latencies is shown in \figref{fig:latency-distribution}.
The round-trip time of a memory read request issued from the CPU to the FPGA is
less than 1 $\mu$s. Therefore, the current performance bottleneck lies in the
execution of the DirectGMA functions.
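As a reference for this figure, the round-trip time can be estimated with a
simple read-back loop over the driver-mapped BAR; the device node
\texttt{/dev/fpga0} and the register offset are hypothetical placeholders for
our driver interface:
\begin{verbatim}
/* Time the CPU -> FPGA -> CPU round trip of a PIO register read through the
 * driver-mapped BAR.  /dev/fpga0 and REG_SCRATCH are hypothetical names. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define REG_SCRATCH 0x0               /* any readable register works */

int main(void)
{
    int fd = open("/dev/fpga0", O_RDWR);
    volatile uint32_t *bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < 1000; i++)
        (void) bar[REG_SCRATCH / 4];  /* non-posted read waits for completion */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("average round trip: %.0f ns\n", ns / 1000.0);

    munmap((void *) bar, 4096);
    close(fd);
    return 0;
}
\end{verbatim}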
\section{Conclusion}
We developed a complete hardware and software solution that enables DMA
transfers between FPGA-based readout boards and GPU computing clusters. The net
throughput is primarily limited by the PCIe bus, reaching 6.x GB/s for a 256 B
payload. By writing directly into GPU memory instead of routing data through
system main memory, the overall latency is reduced by a factor of 2. Moreover,
the proposed solution enables high performance GPU computing due to the
integration of the DMA transfer setup into our streamed computing framework.
Integration with different DAQ systems and custom algorithms is therefore
straightforward.
\subsection{Outlook}
An optimization of the OpenCL code is ongoing, with the help of AMD technical
support. With a better understanding of the hardware and software aspects of
DirectGMA, we expect a significant improvement in latency.
Support for NVIDIA's GPUDirect technology is foreseen in the coming months; it
will lift the restriction to a single GPU vendor and enable a direct performance
comparison.
A custom FPGA evaluation board is currently under development in order to
increase the total throughput. The board mounts a Virtex-7 chip and features two
fully populated FMC connectors, a 119 Gb/s DDR memory interface and a PCIe Gen3
x16 connection. Two PCIe x8 cores, instantiated on the board, will be mapped as
a single x16 device by using an external PCIe switch. With two cores operating
in parallel, we foresee an increase in data throughput by a factor of 2 (as
demonstrated in~\cite{rota2015dma}).
We also intend to include InfiniBand support in the architecture.
Our goal is to develop a unique hybrid solution, based
on commercial standards, that includes fast data transmission protocols and a
high performance GPU computing framework.
\acknowledgments
% TODO: add funding acknowledgments (UFO project, KSETA).
\bibliographystyle{JHEP}
\bibliography{literature}
\end{document}