\documentclass[journal]{IEEEtran}

\usepackage{pgfplots}
\usepackage{xspace}
\usepackage{todonotes}
\usepackage{minted}
\usepackage{gotham}

\usetikzlibrary{arrows}
\usetikzlibrary{calc}
\usetikzlibrary{fit}

\usepgfplotslibrary{statistics}

%{{{ Meta data
\title{General Purpose FPGA-GPU Platform for High-Throughput DAQ and Processing}

\author{
  Matthias Vogelgesang\\
}
%}}}

\newcommand{\figref}[1]{Figure~\ref{#1}}

\begin{document}

\maketitle

%{{{ Abstract
\begin{abstract}
  % Motivation
  % Problem
  Current generation GPUs provide compute performance in the range of several
  TFLOP/s, which in turn makes applications with large bandwidth requirements
  and simple algorithms I/O-bound. Applications that receive data from
  external sources are hit twice because the data first has to be transferred
  into system main memory before it is moved to the GPU in a second transfer.
  % Solution
  To remedy this problem, we designed and implemented a system architecture
  comprising a custom FPGA board with a flexible DMA transfer policy and a
  heterogeneous compute framework that receives data using AMD's DirectGMA
  OpenCL extension.
  % Conclusion
  With our proposed system architecture we are able to sustain the bandwidth
  requirements of various applications such as real-time tomographic image
  reconstruction and signal analysis with a peak FPGA-GPU throughput of XXX
  GB/s.
\end{abstract}
%}}}

\begin{IEEEkeywords}
  GPGPU
\end{IEEEkeywords}

\section{Introduction}

GPU computing has become a cornerstone for manifold applications that have
large computational demands and exhibit an algorithmic pattern with a high
degree of parallelism. This includes signal reconstruction~\cite{ref},
recognition~\cite{ref} and analysis~\cite{emilyanov2012gpu} as well as
simulation~\cite{bussmann2013radiative} and deep
learning~\cite{krizhevsky2012imagenet}. With low acquisition costs and a
relatively straightforward SIMD programming model, GPUs have become mainstream
tools in industry and academia to solve the computational problems associated
with these fields.

Although GPUs harness a memory bandwidth that is far beyond a CPU's access to
system memory, the data transfer between host and GPU can quickly become the
main bottleneck for streaming systems and impede peak computation performance
by not delivering data fast enough. This becomes even worse for systems where
the data does not originate from system memory but from an external device.
Typical examples delivering high data rates include front-end Field
Programmable Gate Arrays (FPGAs) for the digitization of analog signals. In
this case, the data crosses the PCI Express (PCIe) bus twice to reach the GPU:
once from the FPGA to system memory and a second time from system memory to
GPU device memory. In feedback-driven experiments, this data path causes high
latencies that prevent the use of GPUs for certain applications. Moreover,
copying the data twice effectively halves the total system bandwidth.

In the remainder of this paper, we introduce a hardware-software platform that
remedies these issues by decoupling the data transfers between FPGA and GPU
from the host machine, which is solely used to set up appropriate memory
buffers and to orchestrate data transfers and kernel execution. The system is
composed of a custom FPGA design with a high-performance DMA engine presented
in Section~\ref{sec:fpga} and a high-level software layer that manages the
OpenCL runtime and gives users different means of accessing the system as
shown in Section~\ref{sec:opencl}.
In Section~\ref{sec:use cases}, we outline two example use cases for our
system, both requiring high data throughput, and present benchmark results. We
discuss the results and conclude this paper in Sections~\ref{sec:discussion}
and~\ref{sec:conclusion}, respectively.

\section{Streamed data architecture}
\label{sec:architecture}

Besides providing high performance at low power as co-processors for heavily
parallel and pipelined algorithms, FPGAs are also well suited for custom data
acquisition (DAQ) applications because of lower costs and shorter development
times compared to application-specific integrated circuits (ASICs). Data can
be streamed from the FPGA to the host machine over a variety of interconnects;
however, PCIe is the only viable option that is both standardized and offers
high throughput~\cite{pci2009specification}. GPUs typically provide better
performance for problems that can be solved using
Single-Instruction-Multiple-Data (SIMD) operations in a highly parallel but
non-pipelined fashion. Compared to FPGAs they also exhibit a simpler
programming model, i.e.\ algorithm development is much faster. Combining these
two platforms allows for fast digitization and quick data assessment. In the
following, we present a hardware/software stack that encompasses an FPGA DMA
engine as well as DirectGMA-based data transfers and allows us to stream data
at peak PCIe bandwidth.

\begin{figure*}
  \centering
  \begin{tikzpicture}[
    box/.style={
      draw,
      minimum height=6mm,
      minimum width=16mm,
      text height=1.5ex,
      text depth=.25ex,
    },
    connection/.style={
      ->,
      >=stealth',
    },
  ]
    \node[box] (adc) {ADC};
    \node[box, right=3mm of adc] (logic) {Logic};
    \node[box, right=3mm of logic] (fifo) {FIFOs};
    \node[box, below=3mm of fifo] (regs) {Registers};
    \node[box, right=7cm of fifo] (gpu) {GPU};
    \node[box, right=2.7cm of regs] (cpu) {Host CPU};

    \node[draw, inner sep=5pt, dotted, fit=(adc) (logic) (fifo) (regs)] {};

    \draw[connection] (adc) -- (logic);
    \draw[connection] (logic) -- (fifo);
    \draw[connection] (cpu) -- node[below] {Set address} (regs);
    \draw[connection, <->] (fifo) -- node[above] {Transfer via DMA} (gpu);
    \draw[connection, <->] (logic) |- (regs);
    \draw[connection] (cpu.355) -| node[below, xshift=-15mm] {Prepare buffers} (gpu.310);
    \draw[connection] (gpu.230) |- (cpu.5) node[above, xshift=15mm] {Result};
  \end{tikzpicture}
  \caption{%
    Our streaming architecture consisting of a PCIe-based FPGA design with
    custom application logic and subsequent data processing on the GPU.
  }
  \label{fig:architecture}
\end{figure*}

\subsection{FPGA DMA engine}
\label{sec:fpga}

We have developed a DMA engine that provides a flexible scatter-gather memory
policy and keeps resource utilization at around 3\% of the resources of a
Virtex-6 device~\cite{rota2015dma}. The engine is compatible with the Xilinx
PCIe 2.0/3.0 IP cores for the Xilinx 6 and 7 series FPGA families. DMA
transfers to both main system memory and GPU memory are supported. Two FIFOs,
each 256 bits wide and operating at 250 MHz, exchange data with the custom
application logic shown on the left of \figref{fig:architecture}. With this
configuration, the engine is capable of an input bandwidth of 7.45 GB/s. The
user logic and the DMA engine are configured by the host system through
32-bit-wide PIO registers.

Regardless of the actual source of data, DMA transfers are started by writing
one or more physical addresses of the destination memory to a specific
register. The addresses are stored in an internal memory with a size of 4 KB,
i.e.\ spanning 1024 32-bit or 512 64-bit addresses. Each address may cover a
range of up to 2 GB of linear address space. However, due to the virtual
addressing of current CPU architectures, transfers to main memory are limited
to pages of 4 KB or 4 MB size. Unlike CPU memory, GPU buffers are
flat-addressed and can be filled in a single transfer. Updating the addresses
dynamically, either by the driver or by the host application, allows for
efficient zero-copy data transfers.
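To make this register-level interface more concrete, the following minimal C
sketch shows how a host application could program one destination address and
start a transfer. The register offsets and the way the BAR is mapped are
hypothetical placeholders and stand in for the actual register map of the DMA
engine~\cite{rota2015dma}.

\begin{minted}{c}
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical register offsets within the FPGA's PIO register BAR. */
#define REG_DESC_ADDR_LO  0x50
#define REG_DESC_ADDR_HI  0x54
#define REG_DESC_COUNT    0x58

static void start_dma(uint64_t dest_bus_addr)
{
    /* Map the FPGA register space (assumed to be exposed via /dev/fpga0). */
    int fd = open("/dev/fpga0", O_RDWR);
    volatile uint32_t *bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);

    /* Write the destination address into the descriptor memory ... */
    bar[REG_DESC_ADDR_LO / 4] = (uint32_t) dest_bus_addr;
    bar[REG_DESC_ADDR_HI / 4] = (uint32_t) (dest_bus_addr >> 32);

    /* ... and start the transfer by announcing one descriptor. */
    bar[REG_DESC_COUNT / 4] = 1;
}
\end{minted}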
\subsection{OpenCL host management}
\label{sec:opencl}

On the host side, AMD's DirectGMA technology, an implementation of the
bus-addressable memory extension~\cite{amdbusaddressablememory} for OpenCL 1.1
and later, is used to write from the FPGA to GPU memory and from the GPU to
the FPGA's control registers. \figref{fig:architecture} illustrates the main
mode of operation: to write into the GPU, the physical bus addresses of the
GPU buffers are determined with a call to
\texttt{clEnqueue\-Make\-Buffers\-Resident\-AMD} and set by the host CPU in a
control register of the FPGA (1). The FPGA then writes data blocks
autonomously in DMA fashion (2). To signal events to the FPGA (4), the control
registers can be mapped into the GPU's address space by passing a special
AMD-specific flag and the physical BAR address of the FPGA configuration
memory to the \texttt{cl\-Create\-Buffer} function. From the GPU, this memory
is seen transparently as regular GPU memory and can be written accordingly
(3). In our setup, trigger registers are used to notify the FPGA of successful
or failed evaluation of the data. Using the \texttt{cl\-Enqueue\-Copy\-Buffer}
function call, it is also possible to write entire memory regions in DMA
fashion to the FPGA. In this case, the GPU acts as bus master and pushes data
to the FPGA.

Due to hardware limitations, GPU buffers that are made resident are restricted
to a hardware-dependent size. On AMD's FirePro W9100, for example, the total
amount of GPU memory that can be allocated that way is about 95 MB. However,
larger transfers can be achieved by using a double buffering mechanism: data
are copied from the buffer exposed to the FPGA into a different location in
GPU memory. To verify that we can keep up with the incoming data throughput
using this strategy, we measured the data throughput within a GPU by copying
data from a smaller buffer representing the DMA buffer to a larger destination
buffer. At a block size of about 384 KB, the throughput surpasses the maximum
possible PCIe bandwidth, and block transfers larger than 5 MB saturate the
bandwidth at 40 GB/s. Double buffering is therefore a viable solution for very
large data transfers, where throughput performance is favored over latency.
For data sizes of less than 95 MB, we can determine all addresses before the
actual transfers, thus keeping the CPU out of the transfer loop.

\subsection{Heterogeneous data processing}

To process the data, we encapsulated the DMA setup and memory mapping in a
plugin for our scalable GPU processing framework~\cite{vogelgesang2012ufo}.
This framework allows for easy construction of streamed data processing
pipelines on heterogeneous multi-GPU systems. For example, to read data from
the FPGA, decode it from its specific data format, run a Fourier transform on
the GPU and write the results back to disk, one can run the following on the
command line:

\begin{verbatim}
ufo-launch direct-gma ! decode ! fft ! \
    write filename=out.raw
\end{verbatim}
The framework takes care of scheduling the tasks and distributing the data
items to one or more GPUs. High throughput is achieved by combining fine- and
coarse-grained data parallelism, \emph{i.e.} by processing a single data item
on a GPU using thousands of threads and by splitting the data stream and
feeding individual data items to separate GPUs. None of this requires any user
intervention and is determined solely by the framework in an automated
fashion. A complementary application programming interface allows users to
develop custom applications in C or in high-level languages such as Python.
For example, with a high-level wrapper module users can express the use case
presented in Section~\ref{sec:beam monitoring} like this:

\begin{minted}{python}
from ufo import DirectGma, Write

dgma = DirectGma(device='/dev/fpga0')
write = Write(filename='out.raw')

# Execute and wait to finish
write(dgma()).run().join()
\end{minted}

\section{Use cases}
\label{sec:use cases}

% \subsection{Hardware setups}

Based on the architecture covered in Section~\ref{sec:architecture}, we
present two example use cases that motivate a setup combining FPGA-based DAQ
and GPU-based processing. Section~\ref{sec:image acquisition} outlines a
camera system that combines frame acquisition with real-time reconstruction
of volume data, while Section~\ref{sec:beam monitoring} uses the GPU to
determine bunch parameters in synchrotron beam diagnostics. In both examples,
we describe the setup in place and subsequently quantify the improvements.

We tested the proposed use cases on two different systems representing
high-powered workstations and low-power, embedded systems. In both cases, we
used a front-end FPGA board based on a Xilinx VC709 (Virtex-7 FPGA and PCIe
3.0 x8) and an AMD FirePro W9100. System A is based on a Xeon E5-1630 CPU with
an Intel C612 chipset and 128 GB of main memory. Due to the mainboard layout,
the FPGA and the GPU are connected through different root complexes (RCs).
System B is a low-end Supermicro X7SPA-HF-D525 board with an Intel Atom D525
dual-core CPU that is connected to an external Netstor NA255A PCIe enclosure.
Unlike in System A, the FPGA board and the GPU share a common RC located
inside the Netstor enclosure.

\subsection{Image acquisition and reconstruction}
\label{sec:image acquisition}

Custom FPGA logic allows for quick integration of image sensors for
application requirements ranging from high throughput to high resolution, as
well as for initial pre-processing of the image data. For example, we
integrated CMOS image sensors such as the CMOSIS CMV2000, CMV4000 and
CMV20000 on top of the FPGA hardware platform presented in
Section~\ref{sec:fpga}~\cite{caselle2013ultrafast}. These custom cameras are
employed in synchrotron X-ray imaging experiments such as absorption-based as
well as grating-based phase contrast
tomography~\cite{lytaev2014characterization}. Besides merely transmitting the
final frames through PCIe to the host, the FPGA logic is concerned with
sensor configuration, readout and digitization of the analog photon counts.
User-oriented sensor settings (e.g.\ exposure time and readout window) are
mapped to 32-bit registers that are read and written from the host.

% Acquiring and processing 2D image data on the fly is a necessary task for many
% control applications.

Before the data can be analyzed, the hardware-specific data format needs to
be decoded. In our case, the sensor delivers a 10 to 12 bit packed format
along with meta information about the entire frame and each scan line. As
shown in \figref{fig:decoding}, an OpenCL kernel that shifts the pixel
information and discards the meta data is able to decode the frame format
efficiently and with a throughput X times larger than running SSE-optimized
code on a Xeon XXX CPU. Thus, decoding a frame before any further computation
does not introduce a bottleneck and in fact allows us to process at lower
latency.
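To illustrate the nature of this decoding step, the following OpenCL kernel
sketch unpacks pairs of 12-bit pixels stored in three consecutive bytes into
16-bit output values. The actual sensor format additionally carries frame and
scan-line meta data and uses its own bit layout, so this is an illustration
of the shift-based approach rather than the production kernel.

\begin{minted}{c}
// Illustrative only: the real frame format differs in its bit layout
// and interleaves meta data that has to be skipped.
kernel void decode_12bit(global const uchar *packed,
                         global ushort *pixels,
                         const uint num_pixel_pairs)
{
    uint i = get_global_id(0);

    if (i >= num_pixel_pairs)
        return;

    // Three consecutive bytes hold two 12-bit pixel values.
    uchar b0 = packed[3 * i];
    uchar b1 = packed[3 * i + 1];
    uchar b2 = packed[3 * i + 2];

    pixels[2 * i]     = ((ushort) b0 << 4) | (b1 >> 4);
    pixels[2 * i + 1] = (((ushort) (b1 & 0x0f)) << 8) | b2;
}
\end{minted}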
\begin{figure}
  \centering
  \begin{tikzpicture}
    \begin{axis}[
        gotham/histogram,
        width=0.49\textwidth,
        height=5cm,
        xlabel={Decoding time in ms},
        ylabel={Occurrence},
        bar width=1.2pt,
      ]
      \addplot file {data/decode/ipecam2.decode.gpu.hist.txt};
      \addplot file {data/decode/ipecam2.decode.cpu.hist.txt};
    \end{axis}
  \end{tikzpicture}
  \caption{%
    Decoding a range of frames on an AMD FirePro W9100 and a Xeon XXX.
  }
  \label{fig:decoding}
\end{figure}

The decoded data is then passed to the next stages, which filter the rows in
frequency space and backproject the data into the final volume in real space.

\subsection{Beam monitoring}
\label{sec:beam monitoring}

% Extend motivation
The characterization of an electron beam in synchrotrons is [...] We have a
system in place that consists of a 1D spectrum analyzer that outputs 256
values per acquisition at a frequency of XXX Hz. The main pipeline subtracts
a previously averaged background from the modulated signals [...]

\subsection{Results}

\begin{figure}
  \centering
  \begin{tikzpicture}
    \begin{axis}[
        height=6cm,
        width=\columnwidth,
        gotham/line plot,
        bar width=5pt,
        xtick=data,
        x tick label style={
          rotate=55,
        },
        xlabel={Block size},
        ylabel={Throughput (MB/s)},
        symbolic x coords={
          4KB, 16KB, 32KB, 64KB, 128KB, 256KB,
          512KB, 1MB, 2MB, 4MB, 8MB, 16MB, 32MB
        },
        legend style={
          at={(0.25, 0.95)},
          cells={
            anchor=west
          },
        },
      ]
      \addplot coordinates {
        (16KB, 106.178754056) (32KB, 211.084895305) (64KB, 415.703896443)
        (128KB, 810.339674944) (256KB, 1547.57365213) (512KB, 2776.37262474)
        (1MB, 5137.62674525) (2MB, 5915.08598317) (4MB, 6233.33653831)
        (8MB, 6276.50844112) (16MB, 6305.9174769) (32MB, 6307.81059127)
      };
      \addplot coordinates {
        (16KB, 112.769066994) (32KB, 223.614235747) (64KB, 415.094840869)
        (128KB, 758.692184621) (256KB, 1301.14745592) (512KB, 2000.44858544)
        (1MB, 2726.52144668) (2MB, 4446.83980882) (4MB, 4908.10674445)
        (8MB, 5155.21548317) (16MB, 5858.33741922) (32MB, 5945.28752544)
      };
      \legend{MT, ST}
    \end{axis}
  \end{tikzpicture}
  \caption{Data throughput from FPGA to GPU on Setup xyz.}
  \label{fig:throughput}
\end{figure}

\section{Discussion}
\label{sec:discussion}

\section{Related work}

\section{Conclusions}
\label{sec:conclusion}

In this paper, we presented a complete data acquisition and processing
pipeline that focuses on low latency and high throughput. It is based on an
FPGA design for data readout and DMA transmission to host or GPU memory. On
the GPU side, we use AMD's DirectGMA OpenCL extension to provide the
necessary physical memory addresses and to signal the completion of data
transfers. With this system, we are able to achieve data rates that match the
PCIe specification of up to 6.x GB/s for a PCIe 3.0 x8 connection.

\section*{Acknowledgments}

\bibliographystyle{abbrv}
\bibliography{refs}

\end{document}