@@ -0,0 +1,437 @@
+\documentclass[journal]{IEEEtran}
+
+\usepackage{pgfplots}
+\usepackage{xspace}
+\usepackage{todonotes}
+\usepackage{minted}
+
+\usepackage{gotham}
+
+\usetikzlibrary{arrows}
+\usetikzlibrary{calc}
+\usetikzlibrary{positioning}
+\usepgfplotslibrary{statistics}
+
+%{{{ Meta data
+\title{General Purpose FPGA-GPU Platform for High-Throughput DAQ and Processing}
+
+\author{
+Matthias Vogelgesang
+}
+%}}}
+
+\newcommand{\figref}[1]{Figure~\ref{#1}}
+
+
+\begin{document}
+
+\maketitle
+
+%{{{ Abstract
+\begin{abstract}
+  % Motivation
+
+  % Problem
+  Current-generation GPUs deliver several TFLOP/s of compute performance,
+  which in turn makes applications with large bandwidth requirements and
+  simple algorithms I/O-bound. Applications that receive data from external
+  data sources are penalized twice, because data first has to be transferred
+  into system main memory before being moved to the GPU in a second transfer.
+  % Solution
+  To remedy this problem, we designed and implemented a system architecture
+  comprising a custom FPGA board with a flexible DMA transfer policy and a
+  heterogeneous compute framework that receives data using AMD's DirectGMA
+  OpenCL extension.
+  % Conclusion
+  With our proposed system architecture, we are able to sustain the bandwidth
+  requirements of various applications such as real-time tomographic image
+  reconstruction and signal analysis with a peak FPGA-GPU throughput of XXX GB/s.
+\end{abstract}
+%}}}
+
+\begin{IEEEkeywords}
+  GPGPU
+\end{IEEEkeywords}
+
+
+\section{Introduction}
+
+GPU computing has become a cornerstone for a multitude of applications that
+have large computational demands and exhibit algorithmic patterns with a high
+degree of parallelism. This includes signal reconstruction~\cite{ref},
+recognition~\cite{ref} and analysis~\cite{emilyanov2012gpu} as well as
+simulation~\cite{bussmann2013radiative} and deep
+learning~\cite{krizhevsky2012imagenet}. With low acquisition costs and a
+relatively straightforward SIMD programming model, GPUs have become
+mainstream tools in industry and academia to solve the computational problems
+associated with these fields.
+
+Although GPUs harness a memory bandwidth that is far beyond a CPU's access to
+system memory, the data transfer between host and GPU can quickly become the
+main bottleneck for streaming systems and impede peak computational
+performance by not delivering data fast enough. This becomes even worse for
+systems in which data does not originate from system memory but from an
+external device. Typical examples delivering high data rates include
+front-end Field Programmable Gate Arrays (FPGAs) for the digitization of
+analog signals. In this case, the data crosses the PCI Express (PCIe) bus
+twice to reach the GPU: once from the FPGA to system memory and a second time
+from system memory to GPU device memory. For feedback-driven experiments,
+this data path causes high latencies that prevent the use of GPUs for certain
+applications. Moreover, copying data twice effectively halves the total
+system bandwidth available to a single application.
+
+In the remainder of this paper, we introduce a hardware-software platform
+that remedies these issues by decoupling data transfers between FPGA and GPU
+from the host machine, which is solely used to set up appropriate memory
+buffers and to orchestrate DMA transfers and kernel execution. The system is
+composed of a custom FPGA board with a high-performance DMA engine presented
+in Section~\ref{sec:fpga} and a high-level software layer that manages the
+OpenCL runtime and gives users different means of accessing the system, as
+shown in Section~\ref{sec:opencl}. In Section~\ref{sec:use cases}, we outline
+two example use cases for our system, both requiring high data throughput,
+and present benchmark results. We discuss and conclude this paper in
+Sections~\ref{sec:discussion} and \ref{sec:conclusion}, respectively.
+
+
+\section{Streamed data architecture}
+\label{sec:architecture}
+
+Besides providing high performance at low power as co-processors for heavily
+parallel and pipelined algorithms, FPGAs are also suited for custom data
+acquisition (DAQ) applications because of lower costs and faster development
+time compared to application-specific integrated circuits (ASICs). Data can
+be streamed from the FPGA to the host machine using a variety of
+interconnects; however, PCIe is the only viable option for a standardized,
+high-throughput interconnect~\cite{pci2009specification}.
+
+GPUs typically provide better performance for problems that can be solved
+using Single-Instruction-Multiple-Data (SIMD) operations in a highly parallel
+but non-pipelined fashion. Compared to FPGAs, they also exhibit a simpler
+programming model, i.e., algorithm development is much faster. Nevertheless,
+all data that is processed on a GPU must be transferred through the PCIe bus.
+Combining these two platforms allows for fast digitization and quick data
+assessment. In the following, we present a hardware/software stack that
+encompasses an FPGA DMA engine as well as DirectGMA-based data transfers and
+allows us to stream data at peak PCIe bandwidth.
+
+\begin{figure*}
+  \centering
+  \begin{tikzpicture}[
+      box/.style={
+        draw,
+        minimum height=6mm,
+        minimum width=16mm,
+        text height=1.5ex,
+        text depth=.25ex,
+      },
+      connection/.style={
+        ->,
+        >=stealth',
+      },
+    ]
+    \node[box] (adc) {ADC};
+    \node[box, right=3mm of adc] (logic) {Logic};
+    \node[box, right=3mm of logic] (fifo) {FIFOs};
+    \node[box, below=3mm of fifo] (regs) {Registers};
+
+    \node[box, right=7cm of fifo] (gpu) {GPU};
+    \node[box, right=2.7cm of regs] (cpu) {Host CPU};
+
+    \draw[connection] (adc) -- (logic);
+    \draw[connection] (logic) -- (fifo);
+    \draw[connection] (cpu) -- node[below] {Set address} (regs);
+    \draw[connection] (cpu) -| node[below] {Prepare buffers} (gpu);
+    \draw[connection, <->] (fifo) -- node[above] {DMA transfers} (gpu);
+    \draw[connection, <->] (logic) |- (regs);
+  \end{tikzpicture}
+  \caption{%
+    Streaming architecture consisting of a PCIe-based FPGA design with
+    custom application logic and subsequent data processing on the GPU.
+  }
+  \label{fig:architecture}
+\end{figure*}
+
+
+\subsection{FPGA DMA engine}
+\label{sec:fpga}
+
+We have developed a DMA engine that provides a flexible scatter-gather memory
+policy and minimizes resource utilization to around 3\% of the resources of a
+Virtex-6 device~\cite{rota2015dma}. The engine is compatible with the Xilinx
+PCIe 2.0/3.0 IP cores for the Xilinx 6 and 7 series FPGA families. DMA data
+transfers to both main system memory and GPU memory are supported. Two FIFOs,
+each 256 bits wide, operate at 250 MHz and exchange data with the custom
+application logic shown on the left of \figref{fig:architecture}. With this
+configuration, the engine is capable of an input bandwidth of 7.45 GB/s. The
+user logic and the DMA engine are configured by the host system through
+32-bit-wide PIO registers.
+
+Regardless of the actual data source, DMA transfers are started by writing
+one or more physical addresses of the destination memory to a specific
+register. The addresses are stored in an internal memory with a size of
+4 KB, i.e., spanning 1024 32-bit or 512 64-bit addresses. Each address may
+cover a range of up to 2 GB of linear address space. However, due to the
+virtual addressing of current CPU architectures, transfers to main memory
+are limited to pages of 4 KB or 4 MB size. Unlike CPU memory, GPU buffers
+are flat-addressed and can be filled at once. Updating the addresses
+dynamically, by either the driver or the host application, allows for
+efficient zero-copy data transfers.
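+
+To make the addressing scheme concrete, the following is a minimal sketch of
+how a host application could hand destination addresses to the engine. The
+register offsets and the \texttt{/dev/fpga0} device node are illustrative
+assumptions and do not reflect the engine's actual register map:
+
+\begin{minted}{c}
+/* Illustrative sketch only: offsets and device node are assumptions. */
+#include <fcntl.h>
+#include <stdint.h>
+#include <sys/mman.h>
+
+#define ADDR_MEM   0x1000  /* assumed offset of the 4 KB address memory */
+#define NUM_PAGES  0x0040  /* assumed "number of pages" control register */
+
+void setup_transfer(const uint64_t *page_addr, uint32_t num_pages)
+{
+    int fd = open("/dev/fpga0", O_RDWR);
+    volatile uint32_t *bar = mmap(NULL, 0x2000, PROT_READ | PROT_WRITE,
+                                  MAP_SHARED, fd, 0);
+
+    /* write one 64-bit destination address per 4 KB page ... */
+    for (uint32_t i = 0; i < num_pages; i++)
+        ((volatile uint64_t *) &bar[ADDR_MEM / 4])[i] = page_addr[i];
+
+    /* ... then writing the page count starts the engine */
+    bar[NUM_PAGES / 4] = num_pages;
+}
+\end{minted}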
+
+
+\subsection{OpenCL host management}
+\label{sec:opencl}
+
+On the host side, AMD's DirectGMA technology, an implementation of the
+bus-addressable memory extension~\cite{amdbusaddressablememory} for OpenCL
+1.1 and later, is used to write from the FPGA to GPU memory and from the GPU
+to the FPGA's control registers. \figref{fig:architecture} illustrates the
+main mode of operation: to write into the GPU, the physical bus addresses of
+the GPU buffers are determined with a call to
+\texttt{clEnqueue\-Make\-Buffers\-Resident\-AMD} and set by the host CPU in
+a control register of the FPGA (1). The FPGA then writes data blocks
+autonomously in DMA fashion (2). To signal events to the FPGA (4), the
+control registers can be mapped into the GPU's address space by passing a
+special AMD-specific flag and the physical BAR address of the FPGA
+configuration memory to the \texttt{cl\-Create\-Buffer} function. From the
+GPU, this memory is seen transparently as regular GPU memory and can be
+written accordingly (3). In our setup, trigger registers are used to notify
+the FPGA of successful or failed evaluation of the data. Using the
+\texttt{cl\-Enqueue\-Copy\-Buffer} function call, it is possible to write
+entire memory regions in DMA fashion to the FPGA. In this case, the GPU acts
+as bus master and pushes data to the FPGA.
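+
+The following condensed sketch shows this host-side setup. It assumes the
+\texttt{cl\_amd\_bus\_addressable\_memory} extension as declared in
+\texttt{CL/cl\_ext.h}; \texttt{context}, \texttt{queue} and \texttt{platform}
+are set up as usual, and \texttt{write\_fpga\_register} stands in for our PIO
+register access:
+
+\begin{minted}{c}
+#include <CL/cl_ext.h>  /* cl_amd_bus_addressable_memory declarations */
+
+cl_bus_address_amd addr;
+cl_int err;
+
+/* buffer the FPGA will write into (1) */
+cl_mem buf = clCreateBuffer(context, CL_MEM_BUS_ADDRESSABLE_AMD,
+                            size, NULL, &err);
+
+/* pin the buffer and query its physical bus address */
+clEnqueueMakeBuffersResidentAMD_fn make_resident =
+    (clEnqueueMakeBuffersResidentAMD_fn)
+    clGetExtensionFunctionAddressForPlatform(platform,
+        "clEnqueueMakeBuffersResidentAMD");
+make_resident(queue, 1, &buf, CL_TRUE, &addr, 0, NULL, NULL);
+
+/* hand the address to the DMA engine (stand-in for our PIO access) */
+write_fpga_register(REG_DMA_ADDR, addr.surface_bus_address);
+
+/* FPGA BAR mapped as a GPU-writable buffer (3); fpga_bar holds the
+ * physical BAR address in a cl_bus_address_amd struct */
+cl_mem fpga_regs = clCreateBuffer(context, CL_MEM_EXTERNAL_PHYSICAL_AMD,
+                                  reg_size, &fpga_bar, &err);
+\end{minted}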
+
+Due to hardware limitations, GPU buffers that are made resident are
+restricted to a hardware-dependent size. For example, on AMD's FirePro
+W9100, the total amount of GPU memory that can be allocated that way is
+about 95 MB. However, larger transfers can be achieved by using a double
+buffering mechanism: data is copied from the buffer exposed to the FPGA into
+a different location in GPU memory. To verify that we can keep up with the
+incoming data throughput using this strategy, we measured the data
+throughput within a GPU by copying data from a smaller buffer representing
+the DMA buffer to a larger destination buffer. At a block size of about
+384 KB, the throughput surpasses the maximum possible PCIe bandwidth. Block
+transfers larger than 5 MB saturate the bandwidth at 40 GB/s. Double
+buffering is therefore a viable solution for very large data transfers,
+where throughput performance is favored over latency. For data sizes less
+than 95 MB, we can determine all addresses before the actual transfers,
+thus keeping the CPU out of the transfer loop.
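+
+A sketch of this double buffering strategy follows, assuming
+\texttt{dma\_buf} is the resident buffer filled by the FPGA and the
+hypothetical \texttt{wait\_for\_block} synchronizes with the engine:
+
+\begin{minted}{c}
+/* copy each arriving block out of the small resident buffer into a
+ * large destination buffer, advancing the offset block by block */
+size_t offset = 0;
+
+for (size_t i = 0; i < num_blocks; i++) {
+    wait_for_block(i);  /* hypothetical FPGA synchronization */
+    clEnqueueCopyBuffer(queue, dma_buf, dst_buf,
+                        0, offset, block_size, 0, NULL, NULL);
+    offset += block_size;
+}
+
+clFinish(queue);
+\end{minted}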
+
+
+\subsection{Heterogeneous data processing}
+
+To process the data, we encapsulated the DMA setup and memory mapping in a
+plugin for our scalable GPU processing framework~\cite{vogelgesang2012ufo}.
+This framework allows for easy construction of streamed data processing
+pipelines on heterogeneous multi-GPU systems. For example, to read data from
+the FPGA, decode it from its specific data format, run a Fourier transform
+on the GPU and write the results back to disk, one can run the following on
+the command line:
+
+\begin{verbatim}
+ufo-launch direct-gma ! decode ! fft ! \
+    write filename=out.raw
+\end{verbatim}
+
+The framework takes care of scheduling the tasks and distributing the data
+items to one or more GPUs. High throughput is achieved by the combination of
+fine- and coarse-grained data parallelism, \emph{i.e.} processing a single
+data item on a GPU using thousands of threads and splitting the data stream
+to feed individual data items to separate GPUs. None of this requires any
+user intervention and is solely determined by the framework in an automated
+fashion. A complementary application programming interface allows users to
+develop custom applications in C or high-level languages such as Python. For
+example, with a high-level wrapper module, users can express the use case
+presented in Section~\ref{sec:beam monitoring} as follows:
+
+\begin{minted}{python}
+from ufo import DirectGma, Write
+
+dgma = DirectGma(device='/dev/fpga0')
+write = Write(filename='out.raw')
+
+# Execute and wait to finish
+write(dgma()).run().join()
+\end{minted}
+
+
+\section{Use cases}
+\label{sec:use cases}
+
+% \subsection{Hardware setups}
+
+Based on the architecture covered in Section~\ref{sec:architecture}, we
+present two example use cases motivating a setup involving FPGA-based DAQ
+and GPU-based processing. Section~\ref{sec:image acquisition} outlines a
+camera system that combines frame acquisition with real-time reconstruction
+of volume data, while Section~\ref{sec:beam monitoring} uses the GPU to
+determine bunch parameters in synchrotron beam diagnostics. In both
+examples, we describe the setup in place and subsequently quantify the
+improvements.
+
+We tested the proposed use cases on two different systems representing
+high-powered workstations and low-power, embedded systems. In both cases,
+we used a front-end FPGA board based on a Xilinx VC709 (Virtex-7 FPGA and
+PCIe x8 3.0) and an AMD FirePro W9100. System A is based on an Intel Xeon
+E5-1630 CPU with an Intel C612 chipset and 128 GB of main memory. Due to
+the mainboard layout, the FPGA board and the GPU are connected through
+different root complexes (RCs). System B is a low-end Supermicro
+X7SPA-HF-D525 board with an Intel Atom D525 dual-core CPU that is connected
+to an external Netstor NA255A PCIe enclosure. Unlike System A, the FPGA
+board and GPU share a common RC located inside the Netstor box.
+
+
+\subsection{Image acquisition and reconstruction}
+\label{sec:image acquisition}
+
+Custom FPGA logic allows for quick integration of image sensors for
+application requirements ranging from high throughput to high resolution as
+well as initial pre-processing of the image data. For example, we integrated
+CMOS image sensors such as the CMOSIS CMV2000, CMV4000 and CMV20000 on top
+of the FPGA hardware platform presented in
+Section~\ref{sec:fpga}~\cite{caselle2013ultrafast}. These custom cameras are
+employed in synchrotron X-ray imaging experiments such as absorption-based
+as well as grating-based phase contrast
+tomography~\cite{lytaev2014characterization}. Besides merely transmitting
+the final frames through PCIe to the host, the FPGA logic is concerned with
+sensor configuration, readout and digitization of the analog photon counts.
+User-oriented sensor configuration parameters (e.g., exposure time and
+readout window) are mapped to 32-bit registers that are read and written
+from the host.
+% Acquiring and processing 2D image data on the fly is a necessary task for
+% many control applications.
+
+Before the data can be analyzed, the hardware-specific data format needs to
+be decoded. In our case, we have a 10 to 12 bit packed format along with
+meta information about the entire frame and per scan line. As shown in
+\figref{fig:decoding}, an OpenCL kernel that shifts the pixel information
+and discards the meta data is able to decode the frame format efficiently
+and with a throughput X times larger than running SSE-optimized code on a
+Xeon XXX CPU. Thus, decoding a frame before any computation does not
+introduce a bottleneck and in fact allows us to process at a lower latency.
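+
+As an illustration, a decoding kernel for a plain 10-bit packed format
+(omitting the per-line meta data, whose layout is sensor-specific) could
+look as follows:
+
+\begin{minted}{c}
+/* unpack four 10-bit pixels from every five input bytes */
+kernel void decode_10bit(global const uchar *in, global ushort *out)
+{
+    size_t i = get_global_id(0);    /* one work item per pixel quad */
+    global const uchar *p = &in[i * 5];
+
+    out[i * 4 + 0] = (p[0] << 2) | (p[1] >> 6);
+    out[i * 4 + 1] = ((p[1] & 0x3f) << 4) | (p[2] >> 4);
+    out[i * 4 + 2] = ((p[2] & 0x0f) << 6) | (p[3] >> 2);
+    out[i * 4 + 3] = ((p[3] & 0x03) << 8) | p[4];
+}
+\end{minted}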
+
+\begin{figure}
+  \centering
+  \begin{tikzpicture}
+    \begin{axis}[
+        gotham/histogram,
+        width=0.49\textwidth,
+        height=5cm,
+        xlabel={Decoding time in ms},
+        ylabel={Occurrence},
+        bar width=1.2pt,
+      ]
+      \addplot file {data/decode/ipecam2.decode.gpu.hist.txt};
+      \addplot file {data/decode/ipecam2.decode.cpu.hist.txt};
+    \end{axis}
+  \end{tikzpicture}
+  \caption{%
+    Decoding a range of frames on an AMD FirePro W9100 and a Xeon
+    XXX.
+  }
+  \label{fig:decoding}
+\end{figure}
+
+The decoded data is then passed to the next stages that filter the rows in
+frequency space and backproject the data into the final volume in real space.
+
+
+\subsection{Beam monitoring}
+\label{sec:beam monitoring}
+
+% Extend motivation
+The characterization of an electron beam in synchrotrons is [...]. We have a
+system in place that consists of a 1D spectrum analyzer that outputs 256
+values per acquisition at a frequency of XXX Hz.
+
+The main pipeline consists of the subtraction of a previously averaged
+background from the modulated signals and ...
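+
+A minimal sketch of the background subtraction step, assuming one work item
+per bin of the 256-value spectrum:
+
+\begin{minted}{c}
+/* subtract a previously averaged background, one spectrum bin per item */
+kernel void subtract_background(global const float *spectrum,
+                                global const float *background,
+                                global float *result)
+{
+    size_t i = get_global_id(0);    /* 0 .. 255 */
+    result[i] = spectrum[i] - background[i];
+}
+\end{minted}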
+
+
+\subsection{Results}
+
+\begin{figure}
+  \centering
+  \begin{tikzpicture}
+    \begin{axis}[
+        height=6cm,
+        width=\columnwidth,
+        gotham/line plot,
+        bar width=5pt,
+        xtick=data,
+        x tick label style={
+          rotate=55,
+        },
+        ylabel={Throughput (MB/s)},
+        symbolic x coords={
+          4KB, 16KB, 32KB, 64KB, 128KB, 256KB, 512KB,
+          1MB, 2MB, 4MB, 8MB, 16MB, 32MB
+        },
+        legend style={
+          at={(0.25, 0.95)},
+          cells={
+            anchor=west
+          },
+        },
+      ]
+      \addplot coordinates {
+        (16KB, 106.178754056)
+        (32KB, 211.084895305)
+        (64KB, 415.703896443)
+        (128KB, 810.339674944)
+        (256KB, 1547.57365213)
+        (512KB, 2776.37262474)
+        (1MB, 5137.62674525)
+        (2MB, 5915.08598317)
+        (4MB, 6233.33653831)
+        (8MB, 6276.50844112)
+        (16MB, 6305.9174769)
+        (32MB, 6307.81059127)
+      };
+
+      \addplot coordinates {
+        (16KB, 112.769066994)
+        (32KB, 223.614235747)
+        (64KB, 415.094840869)
+        (128KB, 758.692184621)
+        (256KB, 1301.14745592)
+        (512KB, 2000.44858544)
+        (1MB, 2726.52144668)
+        (2MB, 4446.83980882)
+        (4MB, 4908.10674445)
+        (8MB, 5155.21548317)
+        (16MB, 5858.33741922)
+        (32MB, 5945.28752544)
+      };
+
+      \legend{MT, ST}
+    \end{axis}
+  \end{tikzpicture}
+  \caption{Data throughput from FPGA to GPU on Setup xyz.}
+  \label{fig:throughput}
+\end{figure}
+
+\section{Discussion}
+\label{sec:discussion}
+
+
+\section{Related work}
+
+
+\section{Conclusions}
+\label{sec:conclusion}
+
+In this paper, we presented a complete data acquisition and processing
+pipeline that focuses on low latencies and high throughput. It is based on
+an FPGA design for data readout and DMA transmission to host or GPU memory.
+On the GPU side, we use AMD's DirectGMA OpenCL extension to provide the
+necessary physical memory addresses and [we'll see] the signaling of
+finished data transfers. With this system, we are able to achieve data rates
+that match the PCIe specifications of up to 6.x GB/s for a PCIe 3.0 x8
+connection.
+
+
+\section*{Acknowledgments}
+
+
+\bibliographystyle{abbrv}
+\bibliography{refs}
+
+
+\end{document}