\documentclass[journal]{IEEEtran}
\usepackage{pgfplots}
\usepackage{xspace}
\usepackage{todonotes}
\usepackage{minted}
\usepackage{gotham}
\usetikzlibrary{arrows}
\usetikzlibrary{calc}
\usetikzlibrary{fit}
\usetikzlibrary{positioning}
\usepgfplotslibrary{statistics}

%{{{ Meta data
\title{General Purpose FPGA-GPU Platform for High-Throughput DAQ and Processing}
\author{%
  Matthias Vogelgesang\\
}
%}}}

\newcommand{\figref}[1]{Figure~\ref{#1}}

\begin{document}
\maketitle
%{{{ Abstract
\begin{abstract}
% Motivation
% Problem
Current-generation GPUs are capable of processing several TFLOP/s, which in
turn makes applications with large bandwidth requirements and simple
algorithms I/O-bound. Applications that receive data from external sources are
hit twice because the data first has to be transferred into system main memory
before being moved to the GPU in a second transfer.
% Solution
To remedy this problem, we designed and implemented a system architecture
comprising a custom FPGA board with a flexible DMA transfer policy and a
heterogeneous compute framework that receives data using AMD's DirectGMA
OpenCL extension.
% Conclusion
With the proposed system architecture we are able to sustain the bandwidth
requirements of various applications such as real-time tomographic image
reconstruction and signal analysis with a peak FPGA-GPU throughput of XXX GB/s.
\end{abstract}
%}}}

\begin{IEEEkeywords}
GPGPU
\end{IEEEkeywords}
\section{Introduction}

GPU computing has become a cornerstone for manifold applications that pose
large computational demands and exhibit algorithmic patterns with a high
degree of parallelism. This includes signal reconstruction~\cite{ref},
recognition~\cite{ref} and analysis~\cite{emilyanov2012gpu} as well as
simulation~\cite{bussmann2013radiative} and deep
learning~\cite{krizhevsky2012imagenet}. With low purchase costs and a
relatively straightforward SIMD programming model, GPUs have become mainstream
tools in industry and academia to solve the computational problems associated
with these fields.

Although GPUs harness a memory bandwidth that is far beyond a CPU's access to
system memory, the data transfer between host and GPU can quickly become the
main bottleneck for streaming systems and impede peak computation performance
by not delivering data fast enough. This becomes even worse for systems where
data does not originate from system memory but from an external device.
Typical examples delivering high data rates include front-end Field
Programmable Gate Arrays (FPGAs) for the digitization of analog signals. In
this case, the data crosses the PCI Express (PCIe) bus twice to reach the GPU:
once from the FPGA to system memory and a second time from system memory to
GPU device memory. For feedback-driven experiments, this data path causes high
latencies that prevent the use of GPUs for certain applications. Moreover,
copying data twice effectively halves the total attainable system bandwidth.

In the remainder of this paper, we introduce a hardware-software platform that
remedies these issues by decoupling data transfers between FPGA and GPU from
the host machine, which is solely used to set up appropriate memory buffers
and to orchestrate the initiation of data transfers and kernel execution. The
system is composed of a custom FPGA design with a high-performance DMA engine
presented in Section~\ref{sec:fpga} and a high-level software layer that
manages the OpenCL runtime and gives users different means of accessing the
system as shown in Section~\ref{sec:opencl}. In Section~\ref{sec:use cases},
we outline two example use cases for our system, both requiring a high data
throughput, and present benchmark results. We discuss and conclude this paper
in Sections~\ref{sec:discussion} and \ref{sec:conclusion}, respectively.
\section{Streamed data architecture}
\label{sec:architecture}

Besides providing high performance at low power as co-processors for heavily
parallel and pipelined algorithms, FPGAs are also suited for custom data
acquisition (DAQ) applications because of lower costs and faster development
times compared to application-specific integrated circuits (ASICs). Data is
streamed from the FPGA to the host machine using a variety of interconnects;
however, PCIe is the only viable option for a standardized and high-throughput
interconnect~\cite{pci2009specification}.

GPUs typically provide better performance for problems that can be solved
using Single-Instruction-Multiple-Data (SIMD) operations in a highly parallel
but non-pipelined fashion. Compared to FPGAs they also exhibit a simpler
programming model, i.e.\ algorithm development is much faster. Nevertheless,
all data that is processed on a GPU must be transferred through the PCIe bus.

Combining these two platforms allows for fast digitization and quick data
assessment. In the following, we present a hardware/software stack that
encompasses an FPGA DMA engine as well as DirectGMA-based data transfers,
allowing us to stream data at peak PCIe bandwidth.
\begin{figure*}
  \centering
  \begin{tikzpicture}[
      box/.style={
        draw,
        minimum height=6mm,
        minimum width=16mm,
        text height=1.5ex,
        text depth=.25ex,
      },
      connection/.style={
        ->,
        >=stealth',
      },
    ]
    \node[box] (adc) {ADC};
    \node[box, right=3mm of adc] (logic) {Logic};
    \node[box, right=3mm of logic] (fifo) {FIFOs};
    \node[box, below=3mm of fifo] (regs) {Registers};
    \node[box, right=7cm of fifo] (gpu) {GPU};
    \node[box, right=2.7cm of regs] (cpu) {Host CPU};
    \node[draw, inner sep=5pt, dotted, fit=(adc) (logic) (fifo) (regs)] {};
    \draw[connection] (adc) -- (logic);
    \draw[connection] (logic) -- (fifo);
    \draw[connection] (cpu) -- node[below] {Set address} (regs);
    \draw[connection, <->] (fifo) -- node[above] {Transfer via DMA} (gpu);
    \draw[connection, <->] (logic) |- (regs);
    \draw[connection] (cpu.355) -| node[below, xshift=-15mm] {Prepare buffers} (gpu.310);
    \draw[connection] (gpu.230) |- (cpu.5) node[above, xshift=15mm] {Result};
  \end{tikzpicture}
  \caption{%
    Our streaming architecture, consisting of a PCIe-based FPGA design with
    custom application logic and subsequent data processing on the GPU.
  }
  \label{fig:architecture}
\end{figure*}
\subsection{FPGA DMA engine}
\label{sec:fpga}

We have developed a DMA engine that provides a flexible scatter-gather memory
policy and minimizes resource utilization to around 3\% of the resources of a
Virtex-6 device~\cite{rota2015dma}. The engine is compatible with the Xilinx
PCIe 2.0/3.0 IP cores for the Xilinx 6 and 7 series FPGA families. DMA data
transfers to both main system memory and GPU memory are supported. Two FIFOs,
each 256 bits wide, operate at 250 MHz and exchange data with the custom
application logic shown on the left of \figref{fig:architecture}. With this
configuration, the engine is capable of an input bandwidth of 7.45 GB/s. The
user logic and the DMA engine are configured by the host system through 32-bit
wide PIO registers.
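For reference, the quoted input bandwidth follows directly from the FIFO width
and clock rate:
\[
  256\,\mathrm{bit} \times 250\,\mathrm{MHz}
    = 64\,\mathrm{Gbit/s}
    = 8 \times 10^{9}\,\mathrm{B/s}
    \approx 7.45\,\mathrm{GiB/s}.
\]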
Regardless of the actual source of data, DMA transfers are started by writing
one or more physical addresses of the destination memory to a specific
register. The addresses are stored in an internal memory with a size of 4 KB,
i.e.\ spanning 1024 32-bit or 512 64-bit addresses. Each address may cover a
range of up to 2 GB of linear address space. However, due to the virtual
addressing of current CPU architectures, transfers to main memory are limited
to pages of 4 KB or 4 MB size. Unlike CPU memory, GPU buffers are
flat-addressed and can be filled at once. Updating the addresses dynamically,
by either the driver or the host application, allows for efficient zero-copy
data transfers without relying on fixed addresses.
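To illustrate the mechanism, the following sketch maps the FPGA's register BAR
into user space and writes a destination bus address into the address table.
The device node and register offsets are hypothetical placeholders and do not
reflect the actual register layout of the engine.
\begin{minted}{c}
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical register offsets; the real layout is defined by the engine. */
#define REG_DMA_ADDR_LO  0x50   /* next entry of the internal address table */
#define REG_DMA_ADDR_HI  0x54
#define REG_DMA_START    0x58   /* writing here triggers the transfer */

int main(void)
{
    int fd = open("/dev/fpga0", O_RDWR);    /* hypothetical device node */
    volatile uint32_t *bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);

    /* Physical/bus address of the destination buffer, e.g. obtained from
       the kernel driver or from clEnqueueMakeBuffersResidentAMD. */
    uint64_t bus_addr = 0;

    /* Write the destination address as two 32-bit PIO accesses ... */
    bar[REG_DMA_ADDR_LO / 4] = (uint32_t) (bus_addr & 0xffffffffu);
    bar[REG_DMA_ADDR_HI / 4] = (uint32_t) (bus_addr >> 32);
    /* ... and start the transfer. */
    bar[REG_DMA_START / 4] = 1;

    munmap((void *) bar, 4096);
    close(fd);
    return 0;
}
\end{minted}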
\subsection{OpenCL host management}
\label{sec:opencl}

On the host side, AMD's DirectGMA technology, an implementation of the
bus-addressable memory extension~\cite{amdbusaddressablememory} for OpenCL 1.1
and later, is used to write from the FPGA to GPU memory and from the GPU to
the FPGA's control registers. \figref{fig:architecture} illustrates the main
mode of operation: to write into the GPU, the physical bus addresses of the
GPU buffers are determined with a call to
\texttt{clEnqueue\-Make\-Buffers\-Resident\-AMD} and set by the host CPU in a
control register of the FPGA (1). The FPGA then writes data blocks
autonomously in DMA fashion (2). To signal events to the FPGA (4), the control
registers can be mapped into the GPU's address space by passing a special
AMD-specific flag and the physical BAR address of the FPGA configuration
memory to the \texttt{cl\-Create\-Buffer} function. From the GPU, this memory
is seen transparently as regular GPU memory and can be written accordingly
(3). In our setup, trigger registers are used to notify the FPGA of successful
or failed evaluation of the data. Using the \texttt{cl\-Enqueue\-Copy\-Buffer}
function call, it is also possible to write entire memory regions in DMA
fashion to the FPGA. In this case, the GPU acts as bus master and pushes data
to the FPGA.
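A condensed, non-authoritative sketch of this setup is shown below. It assumes
the \texttt{cl\_amd\_bus\_addressable\_memory} extension; error handling and
the usual platform and context boilerplate are omitted, and how the returned
bus address reaches the FPGA's address register is application specific.
\begin{minted}{c}
#include <CL/cl.h>
#include <CL/cl_ext.h>  /* cl_bus_address_amd, CL_MEM_BUS_ADDRESSABLE_AMD, ... */

/* Extension entry point; in a real program it is obtained through
   clGetExtensionFunctionAddressForPlatform(). */
typedef cl_int (*MakeBuffersResidentAMD_fn)(cl_command_queue, cl_uint, cl_mem *,
                                            cl_bool, cl_bus_address_amd *,
                                            cl_uint, const cl_event *, cl_event *);

static cl_ulong setup_directgma(cl_context context, cl_command_queue queue,
                                MakeBuffersResidentAMD_fn make_resident,
                                size_t buffer_size, cl_ulong fpga_bar_address,
                                size_t register_size,
                                cl_mem *dma_buf, cl_mem *fpga_regs)
{
    cl_int err;

    /* (1) Create a GPU buffer the FPGA can write to and determine its
       physical bus address. */
    *dma_buf = clCreateBuffer(context, CL_MEM_BUS_ADDRESSABLE_AMD,
                              buffer_size, NULL, &err);

    cl_bus_address_amd bus_addr;
    make_resident(queue, 1, dma_buf, CL_TRUE, &bus_addr, 0, NULL, NULL);

    /* (3) Expose the FPGA's register BAR as a GPU-writable buffer by passing
       its physical address through a cl_bus_address_amd structure. */
    cl_bus_address_amd bar = { .surface_bus_address = fpga_bar_address,
                               .marker_bus_address  = fpga_bar_address };
    *fpga_regs = clCreateBuffer(context, CL_MEM_EXTERNAL_PHYSICAL_AMD,
                                register_size, &bar, &err);
    clEnqueueMigrateMemObjects(queue, 1, fpga_regs, 0, 0, NULL, NULL);

    /* This address is what the host writes into the FPGA's destination
       address register before starting a transfer. */
    return bus_addr.surface_bus_address;
}
\end{minted}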
Due to hardware limitations, GPU buffers that are made resident are restricted
to a hardware-dependent size. For example, on AMD's FirePro W9100 the total
amount of GPU memory that can be allocated that way is about 95 MB. However,
larger transfers can be achieved by using a double buffering mechanism: data
is copied from the buffer exposed to the FPGA into a different location in GPU
memory. To verify that we can keep up with the incoming data throughput using
this strategy, we measured the data throughput within a GPU by copying data
from a smaller buffer representing the DMA buffer to a larger destination
buffer. At a block size of about 384 KB the throughput surpasses the maximum
possible PCIe bandwidth. Block transfers larger than 5 MB saturate the
bandwidth at 40 GB/s. Double buffering is therefore a viable solution for very
large data transfers, where throughput performance is favored over latency.
For data sizes of less than 95 MB, we can determine all addresses before the
actual transfers, thus keeping the CPU out of the transfer loop.
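The double buffering scheme itself reduces to repeated on-device copies. A
minimal sketch, assuming the small resident DMA buffer and the large
destination buffer have already been created, could look as follows; in the
real system each iteration would additionally be synchronized with the FPGA
via the control registers described above.
\begin{minted}{c}
#include <CL/cl.h>

/* Drain the small, FPGA-visible DMA buffer into a large destination buffer,
   one block at a time, without any host round trip. */
static void drain_dma_buffer(cl_command_queue queue, cl_mem dma_buf,
                             cl_mem dest_buf, size_t block_size,
                             size_t num_blocks)
{
    for (size_t i = 0; i < num_blocks; i++) {
        clEnqueueCopyBuffer(queue, dma_buf, dest_buf,
                            0,               /* source offset */
                            i * block_size,  /* destination offset */
                            block_size, 0, NULL, NULL);
    }

    clFinish(queue);
}
\end{minted}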
\subsection{Heterogeneous data processing}

To process the data, we encapsulated the DMA setup and memory mapping in a
plugin for our scalable GPU processing framework~\cite{vogelgesang2012ufo}.
This framework allows for an easy construction of streamed data processing on
heterogeneous multi-GPU systems. For example, to read data from the FPGA,
decode its specific data format, run a Fourier transform on the GPU and write
the results back to disk, one can run the following on the command line:
\begin{verbatim}
ufo-launch direct-gma ! decode ! fft ! \
    write filename=out.raw
\end{verbatim}
The framework takes care of scheduling the tasks and distributing the data
items to one or more GPUs. High throughput is achieved by the combination of
fine- and coarse-grained data parallelism, \emph{i.e.} processing a single
data item on a GPU using thousands of threads and splitting the data stream to
feed individual data items to separate GPUs. None of this requires any user
intervention and is solely determined by the framework in an automated
fashion. A complementary application programming interface allows users to
develop custom applications in C or high-level languages such as Python. For
example, with a high-level wrapper module users can express the use case
presented in Section~\ref{sec:beam monitoring} like this:
\begin{minted}{python}
from ufo import DirectGma, Write

dgma = DirectGma(device='/dev/fpga0')
write = Write(filename='out.raw')

# Execute and wait to finish
write(dgma()).run().join()
\end{minted}
\section{Use cases}
\label{sec:use cases}

% \subsection{Hardware setups}
Based on the architecture covered in Section~\ref{sec:architecture}, we
present two example use cases motivating a setup involving FPGA-based DAQ and
GPU-based processing. Section~\ref{sec:image acquisition} outlines a camera
system that combines frame acquisition with real-time reconstruction of volume
data, while Section~\ref{sec:beam monitoring} uses the GPU to determine bunch
parameters in synchrotron beam diagnostics. In both examples, we describe the
setup in place and subsequently quantify improvements.

We tested the proposed use cases on two different systems representing
high-powered workstations and low-power, embedded systems. In both cases, we
used a front-end FPGA board based on a Xilinx VC709 (Virtex-7 FPGA and PCIe
x8 3.0) and an AMD FirePro W9100. System A is based on a Xeon E5-1630 CPU with
an Intel C612 chipset and 128 GB of main memory. Due to the mainboard layout,
the two PCIe devices are connected through different root complexes (RC).
System B is a low-end Supermicro X7SPA-HF-D525 board with an Intel Atom D525
dual-core CPU that is connected to an external Netstor NA255A PCIe enclosure.
Unlike in System A, the FPGA board and GPU share a common RC located inside
the Netstor box.
\subsection{Image acquisition and reconstruction}
\label{sec:image acquisition}

Custom FPGA logic allows for quick integration of image sensors for
application requirements ranging from high throughput to high resolution as
well as initial pre-processing of the image data. For example, we integrated
CMOS image sensors such as the CMOSIS CMV2000, CMV4000 and CMV20000 on top of
the FPGA hardware platform presented in
Section~\ref{sec:fpga}~\cite{caselle2013ultrafast}. These custom cameras are
employed in synchrotron X-ray imaging experiments such as absorption-based as
well as grating-based phase contrast
tomography~\cite{lytaev2014characterization}. Besides merely transmitting the
final frames through PCIe to the host, the FPGA logic is concerned with sensor
configuration, readout and digitization of the analog photon counts.
User-oriented sensor configuration parameters (e.g.\ exposure time and readout
window) are mapped to 32-bit registers that are read and written by the host.

% Acquiring and processing 2D image data on the fly is a necessary task for many
% control applications.
Before the data can be analyzed, the hardware-specific data format needs to be
decoded. In our case, the sensors deliver a 10 to 12-bit packed format along
with meta information about the entire frame and each scan line. As shown in
\figref{fig:decoding}, an OpenCL kernel that shifts the pixel information and
discards the meta data is able to decode the frame format efficiently, with a
throughput X times larger than running SSE-optimized code on a Xeon XXX CPU.
Thus, decoding a frame before further computation does not introduce a
bottleneck and in fact allows us to process at a lower latency.
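The core of such a decoding kernel is plain bit shifting. The following OpenCL
C sketch assumes a simplified layout in which two 12-bit pixels are packed
into three bytes and omits the per-frame and per-line meta data of the actual
camera format.
\begin{minted}{c}
/* Decode two 12-bit pixels packed into three bytes; one work item per
   two-pixel group. */
kernel void decode_12bit(global const uchar *packed, global ushort *pixels)
{
    size_t i = get_global_id(0);
    size_t src = i * 3;

    uchar b0 = packed[src];
    uchar b1 = packed[src + 1];
    uchar b2 = packed[src + 2];

    /* Shift the packed bits into two 16-bit output pixels. */
    pixels[2 * i]     = ((ushort) b0 << 4) | (b1 >> 4);
    pixels[2 * i + 1] = (((ushort) (b1 & 0x0f)) << 8) | b2;
}
\end{minted}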
\begin{figure}
  \centering
  \begin{tikzpicture}
    \begin{axis}[
        gotham/histogram,
        width=0.49\textwidth,
        height=5cm,
        xlabel={Decoding time in ms},
        ylabel={Occurrence},
        bar width=1.2pt,
      ]
      \addplot file {data/decode/ipecam2.decode.gpu.hist.txt};
      \addplot file {data/decode/ipecam2.decode.cpu.hist.txt};
    \end{axis}
  \end{tikzpicture}
  \caption{%
    Decoding a range of frames on an AMD FirePro W9100 and a Xeon
    XXX.
  }
  \label{fig:decoding}
\end{figure}
The decoded data is then passed to the next stages, which filter the rows in
frequency space and backproject the data into the final volume in real space.

\subsection{Beam monitoring}
\label{sec:beam monitoring}

% Extend motivation
The characterization of an electron beam in synchrotrons is [...] ... We have
a system in place that consists of a 1D spectrum analyzer that outputs 256
values per acquisition at a frequency of XXX Hz. The main pipeline consists of
subtracting a previously averaged background from the modulated signal and ...
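The background subtraction step itself maps to a trivial element-wise kernel.
A minimal OpenCL C sketch, assuming the averaged background spectrum already
resides in GPU memory, is shown below.
\begin{minted}{c}
/* Subtract a previously averaged background spectrum from each incoming
   acquisition; one work item per spectral bin. */
kernel void subtract_background(global const float *spectra,     /* n x bins */
                                global const float *background,  /* bins */
                                global float *corrected,
                                const int num_bins)
{
    size_t i = get_global_id(0);
    corrected[i] = spectra[i] - background[i % num_bins];
}
\end{minted}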
\subsection{Results}

\begin{figure}
  \centering
  \begin{tikzpicture}
    \begin{axis}[
        height=6cm,
        width=\columnwidth,
        gotham/line plot,
        bar width=5pt,
        xtick=data,
        x tick label style={
          rotate=55,
        },
        ylabel={Throughput (MB/s)},
        symbolic x coords={
          4KB, 16KB, 32KB, 64KB, 128KB, 256KB, 512KB,
          1MB, 2MB, 4MB, 8MB, 16MB, 32MB
        },
        legend style={
          at={(0.25, 0.95)},
          cells={
            anchor=west
          },
        },
      ]
      \addplot coordinates {
        (16KB, 106.178754056)
        (32KB, 211.084895305)
        (64KB, 415.703896443)
        (128KB, 810.339674944)
        (256KB, 1547.57365213)
        (512KB, 2776.37262474)
        (1MB, 5137.62674525)
        (2MB, 5915.08598317)
        (4MB, 6233.33653831)
        (8MB, 6276.50844112)
        (16MB, 6305.9174769)
        (32MB, 6307.81059127)
      };
      \addplot coordinates {
        (16KB, 112.769066994)
        (32KB, 223.614235747)
        (64KB, 415.094840869)
        (128KB, 758.692184621)
        (256KB, 1301.14745592)
        (512KB, 2000.44858544)
        (1MB, 2726.52144668)
        (2MB, 4446.83980882)
        (4MB, 4908.10674445)
        (8MB, 5155.21548317)
        (16MB, 5858.33741922)
        (32MB, 5945.28752544)
      };
      \legend{MT, ST}
    \end{axis}
  \end{tikzpicture}
  \caption{Data throughput from FPGA to GPU on setup xyz.}
  \label{fig:throughput}
\end{figure}
\section{Discussion}
\label{sec:discussion}

\section{Related work}

\section{Conclusions}
\label{sec:conclusion}

In this paper, we presented a complete data acquisition and processing
pipeline that focuses on low latencies and high throughput. It is based on an
FPGA design for data readout and DMA transmission to host or GPU memory. On
the GPU side, we use AMD's DirectGMA OpenCL extension to provide the necessary
physical memory addresses and [we'll see] signaling of finished data
transfers. With this system, we are able to achieve data rates that match the
PCIe specification of up to 6.x GB/s for a PCIe 3.0 x8 connection.

\section*{Acknowledgments}

\bibliographystyle{abbrv}
\bibliography{refs}
\end{document}