\documentclass{JINST}

\usepackage[utf8]{inputenc}
\usepackage{lineno}
\usepackage{ifthen}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{textcomp}
\usepackage{booktabs}
\usepackage{floatrow}
\newfloatcommand{capbtabbox}{table}[][\FBwidth]

\newboolean{draft}
\setboolean{draft}{true}

\newcommand{\figref}[1]{Figure~\ref{#1}}

\title{A high-throughput readout architecture based on PCI-Express Gen3 and DirectGMA technology}

\author{
  L.~Rota$^a$,
  M.~Vogelgesang$^a$,
  L.E.~Ardila Perez$^a$,
  M.~Caselle$^a$,
  S.~Chilingaryan$^a$,
  T.~Dritschler$^a$,
  N.~Zilio$^a$,
  A.~Kopmann$^a$,
  M.~Balzer$^a$,
  M.~Weber$^a$\\
  \llap{$^a$}Institute for Data Processing and Electronics,\\
    Karlsruhe Institute of Technology (KIT),\\
    Herrmann-von-Helmholtz-Platz 1, Karlsruhe, Germany \\
    E-mail: \email{lorenzo.rota@kit.edu}, \email{matthias.vogelgesang@kit.edu}
}

\abstract{Modern physics experiments have reached multi-GB/s data rates. Fast
data links and high-performance computing stages are required for continuous
data acquisition and processing. Because of their intrinsic parallelism and
computational power, GPUs have emerged as an ideal solution for processing
these data in high-performance computing applications. In this paper we
present a high-throughput platform based on direct FPGA-GPU communication. The
architecture consists of a Direct Memory Access (DMA) engine compatible with
the Xilinx PCI-Express core, a Linux driver for register access, and
high-level software to manage direct memory transfers using AMD's DirectGMA
technology. Measurements with a Gen3\,x8 link show a throughput of 6.4~GB/s
for transfers to GPU memory and 6.6~GB/s to system memory. We also assessed
the possibility of using our architecture in low-latency systems: preliminary
measurements show a round-trip latency as low as 1~\textmu s for data
transfers to system memory, while the additional latency introduced by OpenCL
scheduling is the current limitation for GPU-based systems. Our
implementation is suitable for real-time DAQ system applications ranging from
photon science and medical imaging to High Energy Physics (HEP) systems.}

\keywords{FPGA; GPU; PCI-Express; OpenCL; DirectGMA}

\begin{document}

\ifdraft
\setpagewiselinenumbers
\linenumbers
\fi


\section{Introduction}

GPU computing has become the main driving force for high-performance computing
owing to its unprecedented parallelism and favourable cost-benefit ratio. GPU
acceleration has found its way into numerous applications, ranging from
simulation to image processing.

The data rates of bio-imaging or beam-monitoring experiments running in
current generation photon science facilities have reached tens of
GB/s~\cite{ufo_camera, caselle}. In a typical scenario, data are acquired by
back-end readout systems and then transmitted in short bursts or in a
continuous streaming mode to a computing stage. In order to collect data over
long observation times, the readout architecture and the computing stages must
be able to sustain high data rates. Recent years have also seen an increasing
interest in GPU-based systems for High Energy Physics (HEP) (\emph{e.g.}
ATLAS~\cite{atlas_gpu}, ALICE~\cite{alice_gpu}, Mu3e~\cite{mu3e_gpu},
PANDA~\cite{panda_gpu}) and photon science experiments. In time-deterministic
applications, such as low-level and high-level trigger systems, latency becomes
the most stringent requirement.

Due to its high bandwidth and modularity, PCIe quickly became the commercial
standard for connecting high-throughput peripherals such as GPUs or solid
state disks. Moreover, optical PCIe networks were demonstrated a decade
ago~\cite{optical_pcie}, opening the possibility of using PCIe as a
communication link over long distances.

Several solutions for direct FPGA-GPU communication based on PCIe are reported
in the literature, and all of them are based on NVIDIA's GPUDirect technology.
In the implementation of Bittner and Ruf~\cite{bittner}, the GPU acts as
master during an FPGA-to-GPU data transfer, reading data from the FPGA. This
solution limits the reported bandwidth and latency to 514~MB/s and
40~\textmu s, respectively. When the FPGA is used as a master, a higher
throughput can be achieved. An example of this approach is the
\emph{FPGA\textsuperscript{2}} framework by Thoma et~al.~\cite{thoma}, which
reaches 2454~MB/s using an x8 Gen2 data link. Lonardo et~al.\ achieved low
latencies with their NaNet design, an FPGA-based PCIe network interface
card~\cite{lonardo2015nanet}. The GbE link, however, limits the latency
performance of the system to a few tens of \textmu s. If only the FPGA-to-GPU
latency is considered, the measured values range between 1~\textmu s and
6~\textmu s, depending on the datagram size.
Nieto et~al.\ presented a system based on a PXIexpress data link that makes
use of four PCIe 1.0 links~\cite{nieto2015high}. Their system (as limited by
the interconnect) achieves an average throughput of 870~MB/s with 1~KB block
transfers.

In order to achieve the best performance in terms of latency and bandwidth, we
developed a high-performance DMA engine based on Xilinx's PCIe Gen3 Core. To
process the data, we encapsulated the DMA setup and memory mapping in a plugin
for our scalable GPU processing framework~\cite{vogelgesang2012ufo}. This
framework allows for an easy construction of streamed data processing on
heterogeneous multi-GPU systems. Because the framework is based on OpenCL,
integration with NVIDIA's CUDA functions for GPUDirect technology is not
possible at the moment. We therefore used AMD's DirectGMA technology to
integrate direct FPGA-to-GPU communication into our processing pipeline. In
this paper we report the throughput performance of our architecture together
with some preliminary measurements of DirectGMA's applicability in
low-latency applications.

%% LR: this part -> OK
\section{Architecture}

As shown in \figref{fig:trad-vs-dgpu} (a), traditional FPGA-GPU systems route
data through system main memory by copying data from the FPGA into
intermediate buffers and then finally into the GPU's main memory. Thus, the
total throughput and latency of the system are limited by the main memory
bandwidth. NVIDIA's GPUDirect and AMD's DirectGMA technologies allow direct
communication between GPUs and auxiliary devices over PCIe. By combining these
technologies with DMA data transfers (see \figref{fig:trad-vs-dgpu} (b)), the
overall latency of the system is reduced and the total throughput increased.
Moreover, the CPU and main system memory are relieved from processing because
they are not directly involved in the data transfer anymore.

\begin{figure}[t]
  \centering
  \includegraphics[width=1.0\textwidth]{figures/transf}
  \caption{%
    In a traditional DMA architecture (a), data are first written to the main
    system memory and then sent to the GPUs for final processing.  By using
    GPUDirect/DirectGMA technology (b), the DMA engine has direct access to
    the GPU's internal memory.
  }
  \label{fig:trad-vs-dgpu}
\end{figure}

%% LR: this part -> Text:OK, Figure: must be updated
\subsection{DMA engine implementation on the FPGA}

We have developed a DMA engine that minimizes resource utilization while
maintaining the flexibility of a Scatter-Gather memory
policy~\cite{rota2015dma}. The main blocks are shown in
\figref{fig:fpga-arch}. The engine is compatible with the Xilinx PCIe
Gen2/3 IP-Core~\cite{xilinxgen3} for Xilinx FPGA families 6 and 7. DMA data
transfers to/from main system memory and GPU memory are supported. Two FIFOs,
with a data width of 256~bits and operating at 250~MHz, act as user-friendly
interfaces to the custom logic, with an input bandwidth of 7.45~GB/s. The
user logic and the DMA engine are configured by the host through PIO
registers. The resource
utilization on a Virtex-7 device is reported in Table~\ref{table:utilization}.
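
As a quick consistency check of the quoted FIFO bandwidth, the raw input
bandwidth of one 256-bit interface clocked at 250~MHz can be written out
explicitly. This is only a sketch: the symbol $B_\mathrm{FIFO}$ is introduced
here for illustration, and the last step assumes that the quoted figure uses
binary prefixes ($1~\mathrm{GB} = 2^{30}$~B), which is not stated explicitly:
\[
  B_\mathrm{FIFO}
  = \frac{256~\mathrm{bit} \times 250 \times 10^{6}~\mathrm{s^{-1}}}{8~\mathrm{bit/B}}
  = 8 \times 10^{9}~\mathrm{B/s}
  \approx 7.45 \times 2^{30}~\mathrm{B/s}.
\]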


\begin{figure}[t]
\small
\begin{floatrow}
\ffigbox{%
    \includegraphics[width=0.4\textwidth]{figures/fpga-arch}
}{%
  \caption{Main blocks of the DMA engine.}%
  \label{fig:fpga-arch}
}
\capbtabbox{%
\begin{tabular}{@{}llll@{}}
  \toprule
  Resource & Utilization & (\%) \\
    \midrule
  LUT      & 5331  & (1.23)           \\
  LUTRAM   & 56    & (0.03)           \\
  FF       & 5437  & (0.63)           \\
  BRAM     & 21    & (1.39)           \\
    \bottomrule
  \end{tabular}
}{%
  \caption{Resource utilization on a xc7vx690t-ffg1761 device}%
  \label{table:utilization}
}
\end{floatrow}
\end{figure}

The physical addresses of the host's memory buffers are stored in an internal
memory and are dynamically updated by the driver or user, allowing highly
efficient zero-copy data transfers. The maximum size associated with each
address is 2~GB.


%% LR: -----------------> OK

\subsection{OpenCL management on host side}
\label{sec:host}

\begin{figure}[b]
  \centering
  \includegraphics[width=0.75\textwidth]{figures/opencl-setup}
  \caption{The FPGA writes to GPU memory by mapping the physical address of a
  GPU buffer and initiating DMA transfers. Signalling happens in reverse order by
  mapping the FPGA control registers into the address space of the GPU.}
  \label{fig:opencl-setup}
\end{figure}

%% Description of figure
On the host side, AMD's DirectGMA technology, an implementation of the
bus-addressable memory extension for OpenCL 1.1 and later, is used to write from
the FPGA to GPU memory and from the GPU to the FPGA's control registers.
\figref{fig:opencl-setup} illustrates the main mode of operation: to write
into the GPU, the physical bus addresses of the GPU buffers are determined
with a call to \texttt{clEnqueue\-Make\-Buffers\-Resident\-AMD} and set by the
host CPU in a control register of the FPGA (1). The FPGA then writes data
blocks autonomously in DMA fashion (2). To signal events to the FPGA (4), the
control registers can be mapped into the GPU's address space by passing an
AMD-specific flag and the physical BAR address of the FPGA
configuration memory to the \texttt{cl\-Create\-Buffer} function. From the
GPU, this memory is seen transparently as regular GPU memory and can be
written accordingly (3). In our setup, trigger registers are used to notify
the FPGA of the successful or failed evaluation of the data. Using the
\texttt{cl\-Enqueue\-Copy\-Buffer} function call, it is possible to write
entire memory regions in DMA fashion to the FPGA. In this case, the GPU acts
as bus master and pushes data to the FPGA.

%% Double Buffering strategy. 

Due to hardware restrictions, the largest possible GPU buffer size is about
95~MB, but larger transfers can be achieved by using a double buffering
mechanism: data are copied from the buffer exposed to the FPGA into a
different location in GPU memory. To verify that we can keep up with the
incoming data throughput using this strategy, we measured the data throughput
within a GPU by copying data from a smaller buffer, representing the DMA
buffer, to a larger destination buffer. At a block size of about 384~KB the
throughput surpasses the maximum possible PCIe bandwidth, and it reaches
40~GB/s for blocks bigger than 5~MB. Double buffering is therefore a viable
solution for very large data transfers, where throughput performance is
favoured over latency. For data sizes below 95~MB, we can determine all
addresses before the actual transfers, thus keeping the CPU out of the transfer
loop.
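
A rough estimate illustrates why the additional copy is affordable. The
symbols below are introduced only for this sketch: $B_\mathrm{copy} \approx
40$~GB/s is the on-GPU copy throughput quoted above for blocks bigger than
5~MB, and $B_\mathrm{DMA} \approx 6.4$~GB/s is the measured FPGA-to-GPU
throughput. For a block of size $S$, the time spent copying relative to the
time spent receiving it is then
\[
  \frac{t_\mathrm{copy}}{t_\mathrm{DMA}}
  = \frac{S / B_\mathrm{copy}}{S / B_\mathrm{DMA}}
  = \frac{B_\mathrm{DMA}}{B_\mathrm{copy}}
  \approx \frac{6.4}{40} = 0.16 ,
\]
so under these assumptions a block can be moved out of the DMA buffer in
roughly one sixth of the time needed to receive the next block of the same
size.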

%% Ufo Framework
As mentioned in the introduction, the DMA setup and memory mapping are
encapsulated in a plugin for our scalable GPU processing
framework~\cite{vogelgesang2012ufo}, which allows for an easy construction of
streamed data processing on heterogeneous multi-GPU systems. For example, to
read data from the FPGA, decode it from its specific data format, run a
Fourier transform on the GPU and write the results back to disk, one can run
the following on the command line:


\begin{verbatim}
ufo-launch direct-gma ! decode ! fft ! write filename=out.raw
\end{verbatim}

The framework takes care of scheduling the tasks and distributing the data
items to one or more GPUs. High throughput is achieved by the combination of
fine- and coarse-grained data parallelism, \emph{i.e.} processing a single
data item on a GPU using thousands of threads and by splitting the data stream
and feeding individual data items to separate GPUs. None of this requires any
user intervention; it is determined solely by the framework in an automated
fashion. A complementary application programming interface allows users to
develop custom applications written in C or high-level languages such as
Python.


%% --------------------------------------------------------------------------
\section{Results}


\begin{table}[b]
\centering
\small
\caption{Setups used for throughput and latency measurements}
\label{table:setups}
\tabcolsep=0.11cm
\begin{tabular}{@{}llll@{}}
  \toprule
 & Setup 1 & Setup 2 \\
  \midrule
CPU           & Intel Xeon E5-1630             & Intel Atom D525   \\
Chipset       & Intel C612                     & Intel ICH9R Express   \\
GPU           & AMD FirePro W9100              & AMD FirePro W9100   \\
PCIe slot: System memory    & x8 Gen3 & x4 Gen1    \\
PCIe slot: FPGA \& GPU    & x8 Gen3 (different RC) & x8 Gen3 (same RC)    \\
  \bottomrule
\end{tabular}
\end{table}

We carried out performance measurements on two different setups, which are
described in Table~\ref{table:setups}. In both setups, a Xilinx VC709
evaluation board was used. In Setup 1, the FPGA board and the GPU were plugged
into PCIe 3.0 slots connected to different PCIe Root Complexes
(RC). In Setup 2, a low-end Supermicro X7SPA-HF-D525 system was connected to a
Netstor NA255A external PCIe enclosure, where both the FPGA board and the GPU
were connected to the same RC, as opposed to Setup 1. As stated in
NVIDIA's GPUDirect documentation, the devices must share the same RC to
achieve the best performance. In the case of FPGA-to-CPU data
transfers, the software implementation is the one described
in~\cite{rota2015dma}.

%% --------------------------------------------------------------------------
\subsection{Throughput}

\begin{figure}[t]
  \includegraphics[width=0.85\textwidth]{figures/throughput}
  \caption{%
    Measured throughput for data transfers from FPGA to main memory
    (CPU) and from FPGA to the global GPU memory (GPU) using Setup 1.
}
\label{fig:throughput}
\end{figure}

In order to evaluate the maximum performance of the DMA engine, measurements
of pure data throughput were carried out using Setup 1. The results are shown
in \figref{fig:throughput} for transfers to the system's main memory as well
as to the GPU's global memory. For FPGA-to-GPU data transfers bigger than
95~MB, the double buffering mechanism was used. In both cases the write
performance is primarily limited by the PCIe bus. Up to a data transfer size
of 2~MB, the throughput to the GPU slowly approaches 100~MB/s. From there on,
the throughput increases up to 6.4~GB/s at about 1~GB data size. The CPU
throughput saturates earlier and the maximum throughput is 6.6~GB/s. The slope
and maximum performance depend on the different implementations of the
handshaking sequence between the DMA engine and the hosts. With Setup 2, the
PCIe Gen1 link limits the throughput to system main memory to around 700~MB/s.
However, transfers to GPU memory yielded the same results as Setup 1.
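
For orientation, the measured values can be compared with the raw capacity of
the links. The following estimate is a sketch rather than a measurement: it
uses the line rates and encodings defined in the PCIe specifications (8~GT/s
with 128b/130b encoding for Gen3, 2.5~GT/s with 8b/10b encoding for Gen1),
neglects all packet-level protocol overhead, and assumes $1~\mathrm{GB} =
10^{9}$~B. The symbols are introduced here only for illustration:
\[
  B_\mathrm{Gen3\,x8}
  = 8 \times 8~\mathrm{GT/s} \times \frac{128}{130} \times \frac{1~\mathrm{B}}{8~\mathrm{bit}}
  \approx 7.88~\mathrm{GB/s},
  \qquad
  B_\mathrm{Gen1\,x4}
  = 4 \times 2.5~\mathrm{GT/s} \times \frac{8}{10} \times \frac{1~\mathrm{B}}{8~\mathrm{bit}}
  = 1.0~\mathrm{GB/s}.
\]
Under these assumptions, 6.4~GB/s and 6.6~GB/s correspond to about 81\% and
84\% of the raw Gen3~x8 capacity, and 700~MB/s to about 70\% of the raw
Gen1~x4 capacity. If the measured values are expressed in binary units
instead, the Gen3 fractions rise to about 87\% and 90\%.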

%% --------------------------------------------------------------------------
\subsection{Latency}


\begin{figure}[t]
  \centering
  \begin{subfigure}[b]{.49\textwidth}
    \centering
    \includegraphics[width=\textwidth]{figures/latency-cpu}

    \vspace{-0.4\baselineskip}
    \caption{}
    \label{fig:latency-cpu}
  \end{subfigure}
  \begin{subfigure}[b]{.49\textwidth}
    \includegraphics[width=\textwidth]{figures/latency-gpu}

    \vspace{-0.4\baselineskip}
    \caption{}
    \label{fig:latency-gpu}
    \end{subfigure}
  \caption{Measured round-trip latency for data transfers to system main memory (a) and GPU memory (b).}
  \label{fig:latency}
\end{figure}


We conducted the following test in order to measure the latency introduced by the DMA engine:
\begin{enumerate}
  \item The host starts a DMA transfer by issuing the \emph{start\_dma} command.
  \item The DMA engine transmits data into the system main memory.
  \item When all the data have been transferred, the DMA engine notifies the host that new data are present by writing to a specific address in the system main memory.
  \item The host acknowledges that the data have been received by issuing the \emph{stop\_dma} command.
\end{enumerate}

A counter on the FPGA measures the time interval between the \emph{start\_dma}
and \emph{stop\_dma} commands with a resolution of 4 ns, therefore measuring
the round-trip latency of the system. The correct ordering of the packets is
assured by the PCIe protocol. The measured round-trip latencies for data transfers to
system main memory and GPU memory are reported in \figref{fig:latency}.
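
The quoted resolution is consistent with a counter driven by a 250~MHz clock,
the frequency given for the FIFO interfaces of the DMA engine; that the
counter actually runs on this clock is an assumption. With $N$ denoting the
counter value and $T_\mathrm{clk}$ the clock period (both symbols are
introduced here for illustration), the round-trip latency is
\[
  t_\mathrm{RT} = N \cdot T_\mathrm{clk},
  \qquad
  T_\mathrm{clk} = \frac{1}{250~\mathrm{MHz}} = 4~\mathrm{ns},
\]
so that, for example, a latency of 1.1~\textmu s corresponds to $N = 275$
counts.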

When system main memory is used, latencies as low as 1.1~\textmu s are
achieved with Setup 1 for a packet size of 1024~B. The higher latency and the
dependence on size measured with Setup 2 are caused by the slower PCIe x4 Gen1
link connecting the FPGA board to the system main memory.

The same test was performed when transferring data into GPU memory. As in
the previous case, the notification was written into system main memory. This
approach was used because the latency introduced by OpenCL scheduling in our
implementation ($\sim$100--200~\textmu s) did not allow a precise measurement
based only on FPGA-GPU communication. When connecting the devices to the same
RC, as in Setup 2, a latency of 2~\textmu s is achieved (limited by the latency
to system main memory, as seen in \figref{fig:latency}~(a)). In contrast, if
the FPGA board and the GPU are connected to different RCs, as in Setup 1, the
latency increases significantly with packet size. It must be noted that the
low latencies measured with Setup 1 for packet sizes below 1~kB seem to be due
to a caching mechanism inside the PCIe switch, and it is not clear whether
data have been successfully written into GPU memory when the notification is
delivered to the CPU. This effect must be taken into account in future
implementations as it could potentially lead to data corruption.
 
\section{Conclusion and outlook}

We developed a hardware and software solution that enables DMA transfers
between FPGA-based readout systems and GPU computing clusters.

The net throughput is primarily limited by the PCIe link, reaching 6.4~GB/s
for FPGA-to-GPU data transfers and 6.6~GB/s for transfers from the FPGA to the
CPU's main memory. The measurements on a low-end system based on an Intel Atom
CPU showed no significant difference in FPGA-to-GPU throughput. Depending on
the application and computing requirements, this result makes smaller
acquisition systems a cost-effective alternative to larger workstations.

We measured a round-trip latency of 1~\textmu s when transferring data between
the DMA engine and system main memory. We also assessed the applicability of
DirectGMA in low-latency applications: preliminary results show that
latencies as low as 2~\textmu s can be achieved during data transfers to GPU
memory. However, at the time of writing, the latency introduced by
OpenCL scheduling is in the range of hundreds of \textmu s. Optimization of
the GPU-DMA interfacing OpenCL code is ongoing with the help of technical
support by AMD, in order to lift the current limitation and enable the use of
our implementation in low-latency applications. Moreover, measurements show
that dedicated hardware must be employed in low-latency applications.

In order to increase the total throughput, a custom FPGA evaluation board is
currently under development. The board mounts a Virtex-7 chip and features two
fully populated FMC connectors, a 119 Gb/s DDR memory interface and a PCIe x16
Gen3 connection. Two x8 Gen3 cores, instantiated on the board, will be mapped
as a single x16 device by using an external PCIe switch. With two cores
operating in parallel, we foresee an increase in the data throughput by a
factor of 2 (as demonstrated in~\cite{rota2015dma}).
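
As a rough projection, and assuming ideal scaling with the number of cores
(an idealisation, not a measured result), doubling the 6.4~GB/s measured for
FPGA-to-GPU transfers gives
\[
  2 \times 6.4~\mathrm{GB/s} = 12.8~\mathrm{GB/s},
\]
which remains below the raw capacity of an x16 Gen3 link,
\[
  16 \times 8~\mathrm{GT/s} \times \frac{128}{130} \times \frac{1~\mathrm{B}}{8~\mathrm{bit}}
  \approx 15.75~\mathrm{GB/s}.
\]
Here the line rate and the 128b/130b encoding are taken from the PCIe
specification, packet-level overhead is neglected and $1~\mathrm{GB} =
10^{9}$~B is assumed.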

The proposed software solution allows seamless multi-GPU processing of
the incoming data, due to the integration in our streamed computing framework.
This allows straightforward integration with different DAQ systems and
introduction of custom data processing algorithms.

Support for NVIDIA's GPUDirect technology is also foreseen in the coming months
to lift the restriction to one specific GPU vendor and to compare the
performance of hardware from different vendors. Further improvements are
expected from generalizing the transfer mechanism and including InfiniBand
support besides the existing PCIe connection.

Our goal is to develop a unique hybrid solution,
based on commercial standards, that includes fast data transmission protocols
and a high performance GPU computing framework.


\acknowledgments

This work was partially supported by the German-Russian BMBF funding programme,
grant numbers 05K10CKB and 05K10VKE.


\bibliographystyle{JHEP}
\bibliography{literature}

\end{document}