\documentclass{JINST}
\usepackage{lineno}
\usepackage{ifthen}
\newboolean{draft}
\setboolean{draft}{true}
\newcommand{\figref}[1]{Figure~\ref{#1}}
\title{A high-throughput readout architecture based on PCI-Express Gen3 and DirectGMA technology}
\author{N.~Zilio$^b$,
M.~Weber$^a$\\
\llap{$^a$}Institute for Data Processing and Electronics,\\
Karlsruhe Institute of Technology (KIT),\\
Hermann-von-Helmholtz-Platz 1, Karlsruhe, Germany\\
\llap{$^b$}Somewhere in France eating Pate
}
\abstract{%
Modern physics experiments have reached multi-GB/s data rates. Fast data
links and high-performance computing stages are required for continuous
data acquisition and processing. Because of their intrinsic parallelism and
computational power, GPUs emerged as an ideal solution for
high-performance computing applications. To connect a fast data acquisition
stage with a GPU's processing power, we developed an architecture consisting of an
FPGA that includes a Direct Memory Access (DMA) engine compatible with the
Xilinx PCI-Express core, a Linux driver for register access and high-level
software to manage direct memory transfers using AMD's DirectGMA technology.
Measurements with a Gen3 x8 link show a throughput of up to 6.7 GB/s. Our
implementation is suitable for real-time DAQ applications ranging
from photon science and medical imaging to HEP experiment triggers.
}
\begin{document}
\ifdraft
\setpagewiselinenumbers
\linenumbers
\fi
\section{Motivation}
GPU computing has become a main driving force for high-performance computing
due to its unprecedented parallelism and attractive cost-benefit ratio. GPU
acceleration has found its way into numerous applications, ranging from
simulation to image processing. Recent years have also seen an increasing
interest in GPU-based systems for HEP applications, which require a combination
of high data rates, high computational power and low latency (\emph{e.g.}
ATLAS~\cite{atlas_gpu}, ALICE~\cite{alice_gpu}, Mu3e~\cite{mu3e_gpu},
PANDA~\cite{panda_gpu}). Moreover, the data volumes produced at recent photon
science facilities have become comparable to those traditionally associated with
HEP.
In HEP experiments, data is acquired by one or more read-out boards and then
transmitted to GPUs in short bursts or in a continuous streaming mode. With
expected data rates of several GB/s, the data transmission link between the
read-out boards and the host system may partially limit the overall system
performance. In particular, latency becomes the most stringent requirement if
time-deterministic feedback is needed, \emph{e.g.} in low- and high-level triggers.
To address these problems, we propose a complete hardware/software stack
based on our own Direct Memory Access (DMA) design and the integration
of AMD's DirectGMA technology into our processing pipeline. In our solution,
PCI Express (PCIe) was chosen as the data link between FPGA boards and the
host computer. Due to its high bandwidth and modularity, PCIe quickly became the
commercial standard for connecting high-throughput peripherals such as GPUs or
solid-state drives. Optical PCIe networks have been demonstrated
for nearly a decade~\cite{optical_pcie}, opening the possibility of using PCIe
as a communication bus over long distances. In HEP DAQ systems in particular,
optical links are preferred over electrical ones because of their superior
radiation hardness, lower power consumption and higher density.
Lonardo et~al.\ combined an FPGA-based PCIe network interface card with
NVIDIA's GPUDirect technology in their NaNet
design~\cite{lonardo2015nanet}. Due to its design, the bandwidth saturates
at 120 MB/s for 1472-byte UDP datagrams. Moreover, the system is based on
a commercial PCIe engine. Other solutions achieve higher throughput based on
Xilinx (CITE TWEPP DMA WURTT??) or Altera devices (CITENICHOLASPAPER TNS), but
they do not provide support for direct FPGA-GPU communication.
\section{Architecture}
DMA data transfers are handled by dedicated hardware, which, compared with
Programmed Input/Output (PIO) access, offers lower latency and higher throughput
at the cost of increased system complexity.
\begin{figure}[t]
\centering
\includegraphics[width=1.0\textwidth]{figures/transf}
\caption{%
In a traditional DMA architecture (a), data is first written to the main
system memory and then sent to the GPUs for final processing. By using
GPUDirect/DirectGMA technology (b), the DMA engine has direct access to
the GPU's internal memory.
}
\label{fig:trad-vs-dgpu}
\end{figure}
As shown in \figref{fig:trad-vs-dgpu}~(a), traditional FPGA-GPU systems route
data through the main system memory by copying data from the FPGA into
intermediate buffers and finally into the GPU's main memory. Thus, the total
throughput of the system is limited by the main memory bandwidth. NVIDIA's
GPUDirect and AMD's DirectGMA technologies allow direct communication between
GPUs and auxiliary devices over the PCIe bus. By combining these technologies
with DMA data transfers (see \figref{fig:trad-vs-dgpu}~(b)), the overall
latency of the system is reduced and the total throughput increased. Moreover,
the CPU and main system memory are relieved, because they are no longer
involved in the data transfer.
\subsection{DMA engine implementation on the FPGA}
We have developed a DMA architecture that minimizes resource utilization while
maintaining the flexibility of a scatter-gather memory
policy~\cite{rota2015dma}. The engine is compatible with the Xilinx PCIe Gen2/3
IP core~\cite{xilinxgen3} for the Xilinx 6 and 7 FPGA families. DMA transmissions
to both main system memory and GPU memory are supported. Two FIFOs, with a data
width of 256 bits and operating at 250 MHz, act as user-friendly interfaces to
the custom logic. The resulting input bandwidth of 7.8 GB/s is enough to saturate
a PCIe Gen3 x8 link\footnote{The theoretical net bandwidth of a PCIe 3.0 x8 link
with a payload of 1024 B is 7.6 GB/s.}. The user logic and the DMA engine are
configured by the host through PIO registers.
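The quoted link bandwidth can be cross-checked with a short calculation. The
following Python sketch (an illustration, not part of the DAQ software)
estimates the net bandwidth of a PCIe Gen3 x8 link; the 24 B per-TLP protocol
overhead is our own assumption, and the exact value depends on the platform
configuration, which is why the result lands slightly above the 7.6 GB/s
figure quoted in the footnote.

```python
def pcie_gen3_net_bandwidth(lanes=8, payload=1024, overhead=24):
    """Estimate the net bandwidth (B/s) of a PCIe Gen3 link.

    Assumes `overhead` bytes of TLP/DLL/framing overhead per packet
    (a typical but platform-dependent value).
    """
    gt_per_s = 8e9                                # Gen3: 8 GT/s per lane
    encoding = 128 / 130                          # 128b/130b line encoding
    raw_bytes = lanes * gt_per_s * encoding / 8   # raw bandwidth in B/s
    efficiency = payload / (payload + overhead)   # protocol efficiency
    return raw_bytes * efficiency

print(pcie_gen3_net_bandwidth() / 1e9)  # about 7.7 GB/s under these assumptions
```

Smaller payloads reduce the protocol efficiency, which is why the measured
throughput depends on the payload size.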
The physical addresses of the host's memory buffers are stored in an internal
memory and dynamically updated by the driver or the user, allowing highly
efficient zero-copy data transfers. The maximum size associated with each
address is 2 GB.
\subsection{OpenCL management on host side}
\label{sec:host}
On the host side, AMD's DirectGMA technology, an implementation of the
bus-addressable memory extension for OpenCL 1.1 and later, is used to write from
the FPGA to GPU memory and from the GPU to the FPGA's control registers.
\figref{fig:opencl-setup} illustrates the main mode of operation: to write into
the GPU, the physical bus address of the GPU buffer is determined with a call to
\texttt{clEnqueue\-Make\-Buffers\-Resident\-AMD} and set by the host CPU in a
control register of the FPGA (1). The FPGA then writes data blocks autonomously
in DMA fashion (2). Due to hardware restrictions, the largest possible GPU
buffer size is about 95 MB, but larger transfers can be achieved with a
double-buffering mechanism. Because the GPU provides a flat memory address space
and our DMA engine allows multiple destination addresses to be set in advance,
we can determine all addresses before the actual transfers, thus keeping the
CPU out of the transfer loop.
To signal events to the FPGA (4), the control registers can be mapped into the
GPU's address space by passing a special AMD-specific flag and the physical
BAR address of the FPGA configuration memory to the \texttt{cl\-Create\-Buffer}
function. From the GPU, this memory is transparently seen as regular GPU
memory and can be written accordingly (3).
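The setup steps above can be sketched with the
\texttt{cl\_amd\_bus\_addressable\_memory} extension. The snippet below is a
simplified illustration, not our actual implementation: error handling is
omitted, and \texttt{write\_fpga\_register} together with the register name is
a hypothetical placeholder for the driver's PIO interface.

```c
#include <CL/cl.h>
#include <CL/cl_ext.h>   /* cl_amd_bus_addressable_memory definitions */

/* Hypothetical driver helper: writes a 64-bit value to an FPGA PIO register. */
extern void write_fpga_register(int reg, cl_ulong value);
#define REG_DMA_DEST_ADDR 0x10   /* placeholder register offset */

void setup_directgma(cl_context ctx, cl_command_queue queue,
                     size_t buf_size, cl_ulong fpga_bar_addr, size_t bar_size)
{
    cl_int err;

    /* (1) Create a GPU buffer that is visible on the PCIe bus, pin it and
     * obtain its physical bus address. */
    cl_mem gpu_buf = clCreateBuffer(ctx, CL_MEM_BUS_ADDRESSABLE_AMD,
                                    buf_size, NULL, &err);
    cl_bus_address_amd addr;
    clEnqueueMakeBuffersResidentAMD(queue, 1, &gpu_buf, CL_TRUE,
                                    &addr, 0, NULL, NULL);

    /* Hand the bus address to the FPGA, which then writes in DMA fashion (2). */
    write_fpga_register(REG_DMA_DEST_ADDR, addr.surface_bus_address);

    /* (4) Map the FPGA control registers (BAR) into the GPU address space.
     * Kernels can then signal the FPGA by writing to this buffer (3). */
    cl_bus_address_amd bar = { .surface_bus_address = fpga_bar_addr };
    cl_mem fpga_regs = clCreateBuffer(ctx, CL_MEM_EXTERNAL_PHYSICAL_AMD,
                                      bar_size, &bar, &err);
    (void) fpga_regs;
}
```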
\begin{figure}
\centering
\includegraphics[width=0.75\textwidth]{figures/opencl-setup}
\caption{The FPGA writes to GPU memory by mapping the physical address of a
GPU buffer and initiating DMA transfers. Signalling happens in reverse order by
mapping the FPGA control registers into the address space of the GPU.}
\label{fig:opencl-setup}
\end{figure}
To process the data, we encapsulated the DMA setup and memory mapping in a
plugin for our scalable GPU processing framework~\cite{vogelgesang2012ufo}. This
framework allows for an easy construction of streamed data processing on
heterogeneous multi-GPU systems. For example, to read data from the FPGA, decode
its specific format, run a Fourier transform on the GPU and write
the results back to disk, one can run \texttt{ufo-launch direct-gma ! decode !
fft ! write filename=out.raw} on the command line. The framework takes care
of scheduling the tasks and distributing the data items accordingly. A
complementary application programming interface allows users to develop custom
applications written in C or in high-level languages such as Python. High
throughput is achieved by combining fine- and coarse-grained data
parallelism, \emph{i.e.} processing a single data item on a GPU with thousands
of threads and splitting the data stream to feed individual data items to
separate GPUs. None of this requires user intervention; it is solely
determined by the framework in an automated fashion.
\section{Results}
We measured the performance using a Xilinx VC709 evaluation board plugged into a
desktop PC with an Intel Xeon E5-1630 3.7 GHz processor and an Intel C612
chipset.
Due to the size limitation of the DMA buffer presented in Section~\ref{sec:host},
several sub-buffers have to be copied in order to transfer data
larger than the maximum transfer size of 95 MB. \figref{fig:intra-copy} shows the
throughput for a copy from a smaller buffer (representing the DMA buffer)
to a larger one. At a block size of about 384 KB, the throughput
surpasses the maximum possible PCIe bandwidth, making a double-buffering
strategy a viable solution for very large data transfers.
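The double-buffering scheme can be made concrete with a small scheduling
sketch. Assuming the 95 MB per-buffer limit mentioned above and two buffers
used alternately, the following Python function (an illustration, not part of
our software) yields the sequence of (buffer index, offset, size) chunks needed
to stream an arbitrarily large transfer:

```python
MAX_CHUNK = 95 * 1024 * 1024  # per-buffer limit imposed by the hardware

def double_buffer_schedule(total_size, chunk=MAX_CHUNK):
    """Yield (buffer_index, offset, size) tuples for a chunked transfer.

    While the FPGA fills one buffer, the previously filled buffer is copied
    out on the GPU, so the intra-GPU copy time is hidden.
    """
    offset, index = 0, 0
    while offset < total_size:
        size = min(chunk, total_size - offset)
        yield index, offset, size
        offset += size
        index ^= 1  # alternate between the two buffers

# Example: a 200 MiB transfer splits into three chunks on buffers 0, 1, 0.
schedule = list(double_buffer_schedule(200 * 1024 * 1024))
```

Because the intra-GPU copy runs faster than the PCIe link above the crossover
block size, the pipelined copy does not limit the overall throughput.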
\begin{figure}
\includegraphics[width=\textwidth]{figures/intra-copy}
\caption{%
Throughput in MB/s for an intra-GPU data transfer of smaller block sizes
(4 KB -- 24 MB) into a larger destination buffer (32 MB -- 128 MB). The lower
performance for smaller block sizes is caused by the larger number of
transfers required to fill the destination buffer. The throughput has been
estimated using the host-side wall clock time. On-GPU data transfer is about
twice as fast.
}
\label{fig:intra-copy}
\end{figure}
\subsection{Throughput}
\subsection{Latency}
For FPGA-to-GPU transfers, we also repeated the measurements using a low-end
system based on XXX and an Intel Nano XXXX. The results do not show any
significant difference compared to the previous setup, making it a more
cost-effective solution.
\begin{figure}
\includegraphics[width=\textwidth]{figures/latency-michele}
\caption{%
FILL ME
}
\label{fig:latency-michele}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=0.6\textwidth]{figures/through_plot}
\caption{
Writing from the FPGA to either system or GPU memory is primarily limited by
the PCIe bus. Higher payloads introduce less overhead, thus increasing the net
bandwidth. Up to a transfer size of 2 MB, the performance is almost the
same; beyond that, the GPU transfer shows a slightly better slope. Data
transfers larger than 1 GB saturate the PCIe bus.
}
\label{fig:throughput}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=0.6\textwidth]{figures/latency}
\caption{%
The data transmission latency is decreased by XXX percent with respect to the
traditional approach (a) by using DirectGMA (b). The latency has been measured
by taking the round-trip time of a 4 KB packet.
}
\label{fig:latency}
\end{figure}
\section{Conclusion}
We developed a complete hardware and software solution that enables direct DMA
transfers between FPGA-based readout boards and GPU computing clusters. The net
throughput is mainly limited by the PCIe bus, reaching 6.7 GB/s for a 256 B
payload. By writing directly into GPU memory instead of routing data through
the main system memory, the latency is reduced by a factor of 2.
Thanks to the support of our processing framework, the proposed solution
enables high-performance GPU computing; integration with different DAQ systems
and custom algorithms is therefore immediate.
\subsection{Outlook}
Support for NVIDIA's GPUDirect technology is foreseen in the coming months;
this will lift the restriction to a single GPU vendor and enable a direct
performance comparison.
A custom FPGA evaluation board is currently under development in order to
increase the total throughput. The board mounts a Virtex-7 chip and features two
fully populated FMC connectors, a 119 Gb/s DDR memory interface and a PCIe Gen3
x16 connection. Two PCIe x8 cores instantiated on the board will be mapped as a
single x16 device by using an external PCIe switch. With two cores operating in
parallel, we foresee an increase in data throughput by a factor of 2 (as
demonstrated in~\cite{rota2015dma}).
We also intend to add InfiniBand support.
Our goal is to develop a unique hybrid solution, based
on commercial standards, that includes fast data transmission protocols and a
high-performance GPU computing framework.
\acknowledgments
\bibliographystyle{JHEP}
\bibliography{literature}
\end{document}