\documentclass[journal]{IEEEtran}

\usepackage{pgfplots}
\usepackage{xspace}
\usepackage{todonotes}
\usepackage{minted}
\usepackage{gotham}

\usetikzlibrary{arrows}
\usetikzlibrary{calc}
\usetikzlibrary{fit}

\usepgfplotslibrary{statistics}

%{{{ Meta data
\title{General Purpose FPGA-GPU Platform for High-Throughput DAQ and Processing}

\author{
  Matthias Vogelgesang\\
}
%}}}

\newcommand{\figref}[1]{Figure~\ref{#1}}

\begin{document}

\maketitle

%{{{ Abstract
\begin{abstract}
  % Motivation
  % Problem
  Current generation GPUs provide compute performance in the range of several
  TFLOP/s, which in turn makes applications with large bandwidth requirements
  and simple algorithms I/O-bound. Applications that receive data from
  external sources are hit twice because the data first has to be transferred
  into system main memory before it is moved to the GPU in a second transfer.
  % Solution
  To remedy this problem, we designed and implemented a system architecture
  comprising a custom FPGA board with a flexible DMA transfer policy and a
  heterogeneous compute framework that receives data using AMD's DirectGMA
  OpenCL extension.
  % Conclusion
  With our proposed system architecture we are able to sustain the bandwidth
  requirements of various applications such as real-time tomographic image
  reconstruction and signal analysis with a peak FPGA-GPU throughput of XXX
  GB/s.
\end{abstract}
%}}}

\begin{IEEEkeywords}
  GPGPU
\end{IEEEkeywords}

\section{Introduction}

GPU computing has become a cornerstone for manifold applications that have
large computational demands and exhibit an algorithmic pattern with a high
degree of parallelism. This includes signal reconstruction~\cite{ref},
recognition~\cite{ref} and analysis~\cite{emilyanov2012gpu} as well as
simulation~\cite{bussmann2013radiative} and deep
learning~\cite{krizhevsky2012imagenet}. With low acquisition costs and a
relatively straightforward SIMD programming model, GPUs have become mainstream
tools in industry and academia to solve the computational problems associated
with these fields.

Although GPUs harness a memory bandwidth that is far beyond a CPU's access to
system memory, the data transfer between host and GPU can quickly become the
main bottleneck for streaming systems and impede peak computation performance
by not delivering data fast enough. This becomes even worse for systems where
the data does not originate from system memory but from an external device.
Typical examples delivering high data rates include front-end Field
Programmable Gate Arrays (FPGAs) for the digitization of analog signals. In
this case, the data crosses the PCI Express (PCIe) bus twice to reach the GPU:
once from the FPGA to system memory and a second time from system memory to
GPU device memory. In feedback-driven experiments, this data path causes high
latencies that prevent the use of GPUs for certain applications. Moreover,
copying the data twice effectively halves the total system bandwidth.

In the remainder of this paper, we introduce a hardware-software platform that
remedies these issues by decoupling the data transfers between FPGA and GPU
from the host machine, which is solely used to set up appropriate memory
buffers and to orchestrate data transfers and kernel execution. The system is
composed of a custom FPGA design with a high-performance DMA engine presented
in Section~\ref{sec:fpga} and a high-level software layer that manages the
OpenCL runtime and gives users different means of accessing the system as
shown in Section~\ref{sec:opencl}.
In Section~\ref{sec:use cases}, we outline two example use cases for our
system, both requiring high data throughput, and present benchmark results. We
discuss the results and conclude this paper in Sections~\ref{sec:discussion}
and~\ref{sec:conclusion}, respectively.

\section{Streamed data architecture}
\label{sec:architecture}

Besides providing high performance at low power as co-processors for heavily
parallel and pipelined algorithms, FPGAs are also well suited for custom data
acquisition (DAQ) applications because of lower costs and shorter development
times compared to application-specific integrated circuits (ASICs). Data can
be streamed from the FPGA to the host machine over a variety of interconnects;
however, PCIe is the only viable option that is both standardized and offers
high throughput~\cite{pci2009specification}. GPUs typically provide better
performance for problems that can be solved using
Single-Instruction-Multiple-Data (SIMD) operations in a highly parallel but
non-pipelined fashion. Compared to FPGAs they also exhibit a simpler
programming model, i.e.\ algorithm development is much faster. Combining these
two platforms allows for fast digitization and quick data assessment. In the
following, we present a hardware/software stack that encompasses an FPGA DMA
engine as well as DirectGMA-based data transfers and allows us to stream data
at peak PCIe bandwidth.

\begin{figure*}
  \centering
  \begin{tikzpicture}[
    box/.style={
      draw,
      minimum height=6mm,
      minimum width=16mm,
      text height=1.5ex,
      text depth=.25ex,
    },
    connection/.style={
      ->,
      >=stealth',
    },
  ]
    \node[box] (adc) {ADC};
    \node[box, right=3mm of adc] (logic) {Logic};
    \node[box, right=3mm of logic] (fifo) {FIFOs};
    \node[box, below=3mm of fifo] (regs) {Registers};
    \node[box, right=7cm of fifo] (gpu) {GPU};
    \node[box, right=2.7cm of regs] (cpu) {Host CPU};

    \node[draw, inner sep=5pt, dotted, fit=(adc) (logic) (fifo) (regs)] {};

    \draw[connection] (adc) -- (logic);
    \draw[connection] (logic) -- (fifo);
    \draw[connection] (cpu) -- node[below] {Set address} (regs);
    \draw[connection, <->] (fifo) -- node[above] {Transfer via DMA} (gpu);
    \draw[connection, <->] (logic) |- (regs);
    \draw[connection] (cpu.355) -| node[below, xshift=-15mm] {Prepare buffers} (gpu.310);
    \draw[connection] (gpu.230) |- (cpu.5) node[above, xshift=15mm] {Result};
  \end{tikzpicture}
  \caption{%
    Our streaming architecture consisting of a PCIe-based FPGA design with
    custom application logic and subsequent data processing on the GPU.
  }
  \label{fig:architecture}
\end{figure*}

\subsection{FPGA DMA engine}
\label{sec:fpga}

We have developed a DMA engine that provides a flexible scatter-gather memory
policy and keeps resource utilization at around 3\% of the resources of a
Virtex-6 device~\cite{rota2015dma}. The engine is compatible with the Xilinx
PCIe 2.0/3.0 IP cores for the Xilinx 6 and 7 series FPGA families. DMA
transfers to both main system memory and GPU memory are supported. Two FIFOs,
each 256 bits wide and operating at 250 MHz, exchange data with the custom
application logic shown on the left of \figref{fig:architecture}. With this
configuration, the engine is capable of an input bandwidth of 7.45 GB/s. The
user logic and the DMA engine are configured by the host system through
32-bit-wide PIO registers.

Regardless of the actual source of data, DMA transfers are started by writing
one or more physical addresses of the destination memory to a specific
register. The addresses are stored in an internal memory with a size of 4 KB,
i.e.\ spanning 1024 32-bit or 512 64-bit addresses. Each address may cover a
range of up to 2 GB of linear address space. However, due to the virtual
addressing of current CPU architectures, transfers to main memory are limited
to pages of 4 KB or 4 MB size. Unlike CPU memory, GPU buffers are
flat-addressed and can be filled in a single transfer. Updating the addresses
dynamically, either by the driver or by the host application, allows for
efficient zero-copy data transfers.
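To make this register-level interface more concrete, the following minimal C
sketch shows how a host application could program one destination address and
start a transfer. The register offsets and the way the BAR is mapped are
hypothetical placeholders and stand in for the actual register map of the DMA
engine~\cite{rota2015dma}.

\begin{minted}{c}
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical register offsets within the FPGA's PIO register BAR. */
#define REG_DESC_ADDR_LO  0x50
#define REG_DESC_ADDR_HI  0x54
#define REG_DESC_COUNT    0x58

static void start_dma(uint64_t dest_bus_addr)
{
    /* Map the FPGA register space (assumed to be exposed via /dev/fpga0). */
    int fd = open("/dev/fpga0", O_RDWR);
    volatile uint32_t *bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);

    /* Write the destination address into the descriptor memory ... */
    bar[REG_DESC_ADDR_LO / 4] = (uint32_t) dest_bus_addr;
    bar[REG_DESC_ADDR_HI / 4] = (uint32_t) (dest_bus_addr >> 32);

    /* ... and start the transfer by announcing one descriptor. */
    bar[REG_DESC_COUNT / 4] = 1;
}
\end{minted}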
\subsection{OpenCL host management}
\label{sec:opencl}

On the host side, AMD's DirectGMA technology, an implementation of the
bus-addressable memory extension~\cite{amdbusaddressablememory} for OpenCL 1.1
and later, is used to write from the FPGA to GPU memory and from the GPU to
the FPGA's control registers. \figref{fig:architecture} illustrates the main
mode of operation: to write into the GPU, the physical bus addresses of the
GPU buffers are determined with a call to
\texttt{clEnqueue\-Make\-Buffers\-Resident\-AMD} and set by the host CPU in a
control register of the FPGA (1). The FPGA then writes data blocks
autonomously in DMA fashion (2). To signal events to the FPGA (4), the control
registers can be mapped into the GPU's address space by passing a special
AMD-specific flag and the physical BAR address of the FPGA configuration
memory to the \texttt{cl\-Create\-Buffer} function. From the GPU, this memory
is seen transparently as regular GPU memory and can be written accordingly
(3). In our setup, trigger registers are used to notify the FPGA of successful
or failed evaluation of the data. Using the \texttt{cl\-Enqueue\-Copy\-Buffer}
function call, it is also possible to write entire memory regions in DMA
fashion to the FPGA. In this case, the GPU acts as bus master and pushes data
to the FPGA.

Due to hardware limitations, GPU buffers that are made resident are restricted
to a hardware-dependent size. On AMD's FirePro W9100, for example, the total
amount of GPU memory that can be allocated that way is about 95 MB. However,
larger transfers can be achieved by using a double buffering mechanism: data
are copied from the buffer exposed to the FPGA into a different location in
GPU memory. To verify that we can keep up with the incoming data throughput
using this strategy, we measured the data throughput within a GPU by copying
data from a smaller buffer representing the DMA buffer to a larger destination
buffer. At a block size of about 384 KB, the throughput surpasses the maximum
possible PCIe bandwidth, and block transfers larger than 5 MB saturate the
bandwidth at 40 GB/s. Double buffering is therefore a viable solution for very
large data transfers, where throughput performance is favored over latency.
For data sizes of less than 95 MB, we can determine all addresses before the
actual transfers, thus keeping the CPU out of the transfer loop.

\subsection{Heterogeneous data processing}

To process the data, we encapsulated the DMA setup and memory mapping in a
plugin for our scalable GPU processing framework~\cite{vogelgesang2012ufo}.
This framework allows for easy construction of streamed data processing
pipelines on heterogeneous multi-GPU systems. For example, to read data from
the FPGA, decode it from its specific data format, run a Fourier transform on
the GPU and write the results back to disk, one can run the following on the
command line:

\begin{verbatim}
ufo-launch direct-gma ! decode ! fft ! \
    write filename=out.raw
\end{verbatim}
The framework takes care of scheduling the tasks and distributing the data
items to one or more GPUs. High throughput is achieved by combining fine- and
coarse-grained data parallelism, \emph{i.e.} by processing a single data item
on a GPU using thousands of threads and by splitting the data stream and
feeding individual data items to separate GPUs. None of this requires any user
intervention and is determined solely by the framework in an automated
fashion. A complementary application programming interface allows users to
develop custom applications in C or in high-level languages such as Python.
For example, with a high-level wrapper module users can express the use case
presented in Section~\ref{sec:beam monitoring} like this:

\begin{minted}{python}
from ufo import DirectGma, Write

dgma = DirectGma(device='/dev/fpga0')
write = Write(filename='out.raw')

# Execute and wait to finish
write(dgma()).run().join()
\end{minted}

\section{Use cases}
\label{sec:use cases}

% \subsection{Hardware setups}

Based on the architecture covered in Section~\ref{sec:architecture}, we
present two example use cases that motivate a setup combining FPGA-based DAQ
and GPU-based processing. Section~\ref{sec:image acquisition} outlines a
camera system that combines frame acquisition with real-time reconstruction
of volume data, while Section~\ref{sec:beam monitoring} uses the GPU to
determine bunch parameters in synchrotron beam diagnostics. In both examples,
we describe the setup in place and subsequently quantify the improvements.

We tested the proposed use cases on two different systems representing
high-powered workstations and low-power, embedded systems. In both cases, we
used a front-end FPGA board based on a Xilinx VC709 (Virtex-7 FPGA and PCIe
3.0 x8) and an AMD FirePro W9100. System A is based on a Xeon E5-1630 CPU with
an Intel C612 chipset and 128 GB of main memory. Due to the mainboard layout,
the FPGA and the GPU are connected through different root complexes (RCs).
System B is a low-end Supermicro X7SPA-HF-D525 board with an Intel Atom D525
dual-core CPU that is connected to an external Netstor NA255A PCIe enclosure.
Unlike in System A, the FPGA board and the GPU share a common RC located
inside the Netstor enclosure.

\subsection{Image acquisition and reconstruction}
\label{sec:image acquisition}

Custom FPGA logic allows for quick integration of image sensors for
application requirements ranging from high throughput to high resolution, as
well as for initial pre-processing of the image data. For example, we
integrated CMOS image sensors such as the CMOSIS CMV2000, CMV4000 and
CMV20000 on top of the FPGA hardware platform presented in
Section~\ref{sec:fpga}~\cite{caselle2013ultrafast}. These custom cameras are
employed in synchrotron X-ray imaging experiments such as absorption-based as
well as grating-based phase contrast
tomography~\cite{lytaev2014characterization}. Besides merely transmitting the
final frames through PCIe to the host, the FPGA logic is concerned with
sensor configuration, readout and digitization of the analog photon counts.
User-oriented sensor settings (e.g.\ exposure time and readout window) are
mapped to 32-bit registers that are read and written from the host.

% Acquiring and processing 2D image data on the fly is a necessary task for many
% control applications.

Before the data can be analyzed, the hardware-specific data format needs to
be decoded. In our case, the sensor delivers a 10 to 12 bit packed format
along with meta information about the entire frame and each scan line. As
shown in \figref{fig:decoding}, an OpenCL kernel that shifts the pixel
information and discards the meta data is able to decode the frame format
efficiently and with a throughput X times larger than running SSE-optimized
code on a Xeon XXX CPU. Thus, decoding a frame before any further computation
does not introduce a bottleneck and in fact allows us to process at lower
latency.
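To illustrate the nature of this decoding step, the following OpenCL kernel
sketch unpacks pairs of 12-bit pixels stored in three consecutive bytes into
16-bit output values. The actual sensor format additionally carries frame and
scan-line meta data and uses its own bit layout, so this is an illustration
of the shift-based approach rather than the production kernel.

\begin{minted}{c}
// Illustrative only: the real frame format differs in its bit layout
// and interleaves meta data that has to be skipped.
kernel void decode_12bit(global const uchar *packed,
                         global ushort *pixels,
                         const uint num_pixel_pairs)
{
    uint i = get_global_id(0);

    if (i >= num_pixel_pairs)
        return;

    // Three consecutive bytes hold two 12-bit pixel values.
    uchar b0 = packed[3 * i];
    uchar b1 = packed[3 * i + 1];
    uchar b2 = packed[3 * i + 2];

    pixels[2 * i]     = ((ushort) b0 << 4) | (b1 >> 4);
    pixels[2 * i + 1] = (((ushort) (b1 & 0x0f)) << 8) | b2;
}
\end{minted}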
\begin{figure}
  \centering
  \begin{tikzpicture}
    \begin{axis}[
        gotham/histogram,
        width=0.49\textwidth,
        height=5cm,
        xlabel={Decoding time in ms},
        ylabel={Occurrence},
        bar width=1.2pt,
      ]
      \addplot file {data/decode/ipecam2.decode.gpu.hist.txt};
      \addplot file {data/decode/ipecam2.decode.cpu.hist.txt};
    \end{axis}
  \end{tikzpicture}
  \caption{%
    Decoding a range of frames on an AMD FirePro W9100 and a Xeon XXX.
  }
  \label{fig:decoding}
\end{figure}

The decoded data is then passed to the next stages, which filter the rows in
frequency space and backproject the data into the final volume in real space.

\subsection{Beam monitoring}
\label{sec:beam monitoring}

% Extend motivation
The characterization of an electron beam in synchrotrons is [...] We have a
system in place that consists of a 1D spectrum analyzer that outputs 256
values per acquisition at a frequency of XXX Hz. The main pipeline subtracts
a previously averaged background from the modulated signals [...]

\subsection{Results}

\begin{figure}
  \centering
  \begin{tikzpicture}
    \begin{axis}[
        height=6cm,
        width=\columnwidth,
        gotham/line plot,
        bar width=5pt,
        xtick=data,
        x tick label style={
          rotate=55,
        },
        xlabel={Block size},
        ylabel={Throughput (MB/s)},
        symbolic x coords={
          4KB, 16KB, 32KB, 64KB, 128KB, 256KB,
          512KB, 1MB, 2MB, 4MB, 8MB, 16MB, 32MB
        },
        legend style={
          at={(0.25, 0.95)},
          cells={
            anchor=west
          },
        },
      ]
      \addplot coordinates {
        (16KB, 106.178754056) (32KB, 211.084895305) (64KB, 415.703896443)
        (128KB, 810.339674944) (256KB, 1547.57365213) (512KB, 2776.37262474)
        (1MB, 5137.62674525) (2MB, 5915.08598317) (4MB, 6233.33653831)
        (8MB, 6276.50844112) (16MB, 6305.9174769) (32MB, 6307.81059127)
      };
      \addplot coordinates {
        (16KB, 112.769066994) (32KB, 223.614235747) (64KB, 415.094840869)
        (128KB, 758.692184621) (256KB, 1301.14745592) (512KB, 2000.44858544)
        (1MB, 2726.52144668) (2MB, 4446.83980882) (4MB, 4908.10674445)
        (8MB, 5155.21548317) (16MB, 5858.33741922) (32MB, 5945.28752544)
      };
      \legend{MT, ST}
    \end{axis}
  \end{tikzpicture}
  \caption{Data throughput from FPGA to GPU on Setup xyz.}
  \label{fig:throughput}
\end{figure}

\section{Discussion}
\label{sec:discussion}

\section{Related work}

\section{Conclusions}
\label{sec:conclusion}

In this paper, we presented a complete data acquisition and processing
pipeline that focuses on low latency and high throughput. It is based on an
FPGA design for data readout and DMA transmission to host or GPU memory. On
the GPU side, we use AMD's DirectGMA OpenCL extension to provide the
necessary physical memory addresses and to signal the completion of data
transfers. With this system, we are able to achieve data rates that match the
PCIe specification of up to 6.x GB/s for a PCIe 3.0 x8 connection.

\section*{Acknowledgments}

\bibliographystyle{abbrv}
\bibliography{refs}

\end{document}