\documentclass{JINST}
\usepackage[utf8]{inputenc}
\usepackage{lineno}
\usepackage{ifthen}
\newboolean{draft}
\setboolean{draft}{true}
\newcommand{\figref}[1]{Figure~\ref{#1}}
\title{A high-throughput readout architecture based on PCI-Express Gen3 and DirectGMA technology}
\author{N.~Zilio$^b$,
M.~Weber$^a$\\
\llap{$^a$}Institute for Data Processing and Electronics,\\
Karlsruhe Institute of Technology (KIT),\\
Herrmann-von-Helmholtz-Platz 1, Karlsruhe, Germany\\
%\llap{$^b$} TODO: add affiliation for N. Zilio
}
\abstract{%
Modern physics experiments have reached multi-GB/s data rates. Fast data
links and high performance computing stages are required for continuous
acquisition and processing. Because of their intrinsic parallelism and
computational power, GPUs have emerged as an ideal solution for high
performance computing applications. To connect a fast data acquisition stage
with a GPU's processing power, we developed an architecture consisting of an
FPGA that includes a Direct Memory Access (DMA) engine compatible with the
Xilinx PCI-Express core, a Linux driver for register access and high-level
software to manage direct memory transfers using AMD's DirectGMA technology.
Measurements with a Gen3 x8 link show a throughput of up to 6.x GB/s. Our
implementation is suitable for real-time DAQ applications ranging from
photon science and medical imaging to HEP experiment triggers.
}
\begin{document}
\ifthenelse{\boolean{draft}}{%
  \setpagewiselinenumbers
  \linenumbers
}{}
\section{Motivation}
GPU computing has become one of the main driving forces for high performance
computing due to its unprecedented degree of parallelism and favorable
cost-benefit ratio. GPU acceleration has found its way into numerous
applications, ranging from simulation to image processing. Recent years have
also seen an increasing interest in GPU-based systems for HEP applications,
which require a combination of high data rates, high computational power and
low latency (\emph{e.g.} ATLAS~\cite{atlas_gpu}, ALICE~\cite{alice_gpu},
Mu3e~\cite{mu3e_gpu}, PANDA~\cite{panda_gpu}). Moreover, the volumes of data
produced at recent photon science facilities have become comparable to those
traditionally associated with HEP.
In HEP experiments, data is acquired by one or more read-out boards and then
transmitted to GPUs in short bursts or in a continuous streaming mode. With
expected data rates of several GB/s, the data transmission link between the
read-out boards and the host system may partially limit the overall system
performance. In particular, latency becomes the most stringent requirement if
time-deterministic feedback is required, \emph{e.g.} for low- and high-level
triggers.
To address these problems we propose a complete hardware/software stack
architecture based on our own Direct Memory Access (DMA) design and integration
of AMD's DirectGMA technology into our processing pipeline. In our solution,
PCI Express (PCIe) has been chosen as a data link between FPGA boards and the
host computer. Due to its high bandwidth and modularity, PCIe quickly became the
commercial standard for connecting high-throughput peripherals such as GPUs or
solid state disks. Optical PCIe networks have been demonstrated
for nearly a decade~\cite{optical_pcie}, opening the possibility of using PCIe
as a communication bus over long distances. In particular, in HEP DAQ systems,
optical links are preferred over electrical ones because of their superior
radiation hardness, lower power consumption and higher density.
Lonardo et~al.\ achieved direct FPGA-GPU communication with their NaNet design,
an FPGA-based PCIe network interface card with NVIDIA GPUDirect
integration~\cite{lonardo2015nanet}. Due to its design, however, the bandwidth
saturates at 120 MB/s for UDP datagrams of 1472 bytes. Moreover, the system is
based on a commercial PCIe engine. Other solutions based on Xilinx
(CITE TWEPP DMA WURTT??) or Altera devices (CITENICHOLASPAPER TNS) achieve
higher throughput, but they do not provide support for direct FPGA-GPU
communication.
\section{Architecture}
DMA data transfers are handled by dedicated hardware, which, compared with
Programmed Input/Output (PIO) access, offers lower latency and higher throughput
at the cost of higher system complexity.
\begin{figure}[t]
\centering
\includegraphics[width=1.0\textwidth]{figures/transf}
\caption{%
In a traditional DMA architecture (a), data is first written to the main
system memory and then sent to the GPUs for final processing. By using
GPUDirect/DirectGMA technology (b), the DMA engine has direct access to
the GPU's internal memory.
}
\label{fig:trad-vs-dgpu}
\end{figure}
As shown in \figref{fig:trad-vs-dgpu} (a), traditional FPGA-GPU systems route
data through system main memory by copying data from the FPGA into intermediate
buffers and then finally into the GPU's main memory. Thus, the total throughput
of the system is limited by the main memory bandwidth. NVIDIA's GPUDirect and
AMD's DirectGMA technologies allow direct communication between GPUs and
auxiliary devices over the PCIe bus. By combining this technology with a DMA
data transfer (see \figref{fig:trad-vs-dgpu} (b)), the overall latency of the
system is reduced and the total throughput increased. Moreover, the CPU and
system main memory are relieved because they are no longer directly involved in
the data transfer.
\subsection{DMA engine implementation on the FPGA}
We have developed a DMA architecture that minimizes resource utilization while
maintaining the flexibility of a scatter-gather memory
policy~\cite{rota2015dma}. The engine is compatible with the Xilinx PCIe Gen2/3
IP-Core~\cite{xilinxgen3} for the Xilinx 6 and 7 series FPGA families. DMA
transmissions to main system memory and to GPU memory are both supported. Two
FIFOs, with a data width of 256 bits and operating at 250 MHz, act as
user-friendly interfaces to the custom logic. The resulting input bandwidth of
7.8 GB/s is enough to saturate a PCIe Gen3 x8 link\footnote{The theoretical net
bandwidth of a PCIe 3.0 x8 link with a payload of 1024 B is 7.6 GB/s.}. The user
logic and the DMA engine are configured by the host through PIO registers.
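The figure quoted in the footnote can be reproduced with a rough estimate,
assuming 128b/130b encoding and about 28 bytes of TLP header and framing
overhead per 1024 B payload:
\[
\frac{8 \times 8\,\mathrm{Gb/s}}{8\,\mathrm{bit/byte}} \times \frac{128}{130}
\times \frac{1024\,\mathrm{B}}{1024\,\mathrm{B} + 28\,\mathrm{B}}
\approx 7.7\,\mathrm{GB/s},
\]
with flow-control and acknowledgement packets accounting for the remaining
difference to the quoted 7.6 GB/s.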
The physical addresses of the host's memory buffers are stored in an internal
memory and are dynamically updated by the driver or the user, allowing highly
efficient zero-copy data transfers. The maximum size associated with each
address is 2 GB.
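As an illustration of the scatter-gather policy, the following sketch shows how
such an address table could be populated through PIO registers. The register
offsets and the \texttt{fpga\_write\_register} helper are hypothetical
placeholders and do not reflect the actual register map of our engine:
\begin{verbatim}
/* Hypothetical sketch: fill the engine's internal address table with the
 * physical addresses and lengths of the host (or GPU) buffers.  Register
 * offsets and the PIO helper are illustrative only. */
#include <stdint.h>

#define REG_DESC_BASE    0x100          /* start of the address table */
#define REG_DESC_STRIDE  0x10           /* one slot: 64-bit address + length */
#define MAX_DESC_LENGTH  (2ULL << 30)   /* 2 GB per address, as stated above */

extern void fpga_write_register(uint32_t offset, uint32_t value);  /* PIO */

int write_descriptor(unsigned slot, uint64_t phys_addr, uint64_t length)
{
    uint32_t base = REG_DESC_BASE + slot * REG_DESC_STRIDE;

    if (length > MAX_DESC_LENGTH)       /* respect the per-descriptor limit */
        return -1;

    fpga_write_register(base + 0x0, (uint32_t) phys_addr);         /* low  */
    fpga_write_register(base + 0x4, (uint32_t) (phys_addr >> 32)); /* high */
    fpga_write_register(base + 0x8, (uint32_t) length);            /* size */
    return 0;
}
\end{verbatim}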
\subsection{OpenCL management on host side}
\label{sec:host}
On the host side, AMD's DirectGMA technology, an implementation of the
bus-addressable memory extension for OpenCL 1.1 and later, is used to write from
the FPGA to GPU memory and from the GPU to the FPGA's control registers.
\figref{fig:opencl-setup} illustrates the main mode of operation: to write into
the GPU, the physical bus addresses of the GPU buffers are determined with a call to
\texttt{clEnqueue\-Make\-Buffers\-Resident\-AMD} and set by the host CPU in a
control register of the FPGA (1). The FPGA then writes data blocks autonomously
in DMA fashion (2).
Due to hardware restrictions, the largest possible GPU buffer size is about
95 MB, but larger transfers can be achieved using a double buffering mechanism.
Because the GPU provides a flat memory address space and our DMA engine allows
multiple destination addresses to be set in advance, we can determine all
addresses before the actual transfers, thus minimizing CPU involvement in the
transfer loop.
To signal events to the FPGA (4), the control registers can be mapped into the
GPU's address space by passing a special AMD-specific flag and the physical
BAR address of the FPGA configuration memory to the \texttt{cl\-Create\-Buffer}
function. From the GPU, this memory is seen transparently as regular GPU
memory and can be written accordingly (3). Individual write accesses are issued
as PIO commands; however, using the \texttt{cl\-Enqueue\-Copy\-Buffer} function,
it is also possible to write entire memory regions to the FPGA in a DMA
fashion. In this case, the GPU acts as bus master to push data to the FPGA.
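The host-side setup of steps (1)--(4) can be sketched as follows. The sketch
follows the \texttt{cl\_amd\_bus\_addressable\_memory} extension; the flags and
the exact prototype of \texttt{clEnqueueMakeBuffersResidentAMD} should be taken
from \texttt{CL/cl\_ext.h}, while \texttt{fpga\_write\_register} and the
register offsets are hypothetical placeholders for the PIO access provided by
our Linux driver:
\begin{verbatim}
/* Sketch of the DirectGMA setup (steps 1-4 above).  Flag names and the
 * prototype of clEnqueueMakeBuffersResidentAMD follow the
 * cl_amd_bus_addressable_memory extension (CL/cl_ext.h); error handling is
 * omitted.  fpga_write_register() and the register offsets are hypothetical. */
#include <CL/cl.h>
#include <CL/cl_ext.h>

#define REG_GPU_ADDR_LO 0x40               /* hypothetical control registers */
#define REG_GPU_ADDR_HI 0x44

extern void fpga_write_register(unsigned offset, unsigned value);  /* driver */

/* extension entry point, resolved at run time */
typedef cl_int (*MakeResidentFn)(cl_command_queue, cl_uint, cl_mem *, cl_bool,
                                 cl_bus_address_amd *, cl_uint,
                                 const cl_event *, cl_event *);

void setup_directgma(cl_platform_id platform, cl_context ctx,
                     cl_command_queue queue, cl_ulong fpga_bar_address)
{
    cl_int err;

    /* (1) create a bus-addressable GPU buffer, pin it and query its address */
    cl_mem gpu_buf = clCreateBuffer(ctx, CL_MEM_BUS_ADDRESSABLE_AMD,
                                    64 << 20, NULL, &err);
    MakeResidentFn make_resident = (MakeResidentFn)
        clGetExtensionFunctionAddressForPlatform(platform,
            "clEnqueueMakeBuffersResidentAMD");
    cl_bus_address_amd gpu_addr;
    make_resident(queue, 1, &gpu_buf, CL_TRUE, &gpu_addr, 0, NULL, NULL);

    /* write the bus address into the FPGA control registers; the FPGA then
     * performs the DMA writes autonomously (2) */
    fpga_write_register(REG_GPU_ADDR_LO,
                        (unsigned) gpu_addr.surface_bus_address);
    fpga_write_register(REG_GPU_ADDR_HI,
                        (unsigned) (gpu_addr.surface_bus_address >> 32));

    /* (3, 4) expose the FPGA BAR to the GPU: the buffer wraps the physical
     * BAR address and can be written from kernels or via clEnqueueCopyBuffer */
    cl_bus_address_amd bar = { fpga_bar_address, fpga_bar_address };
    cl_mem fpga_regs = clCreateBuffer(ctx, CL_MEM_EXTERNAL_PHYSICAL_AMD,
                                      4096, &bar, &err);
    (void) fpga_regs;
}
\end{verbatim}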
\begin{figure}
\centering
\includegraphics[width=0.75\textwidth]{figures/opencl-setup}
\caption{The FPGA writes to GPU memory by mapping the physical address of a
GPU buffer and initiating DMA transfers. Signalling happens in reverse order by
mapping the FPGA control registers into the address space of the GPU.}
\label{fig:opencl-setup}
\end{figure}
To process the data, we encapsulated the DMA setup and memory mapping in a
plugin for our scalable GPU processing framework~\cite{vogelgesang2012ufo}. This
framework allows for an easy construction of streamed data processing pipelines
on heterogeneous multi-GPU systems. For example, to read data from the FPGA,
decode its specific data format, run a Fourier transform on the GPU and write
the results back to disk, one can execute the following on the command line:
\begin{verbatim}
ufo-launch direct-gma ! decode ! fft ! write filename=out.raw
\end{verbatim}
The framework takes care of scheduling the tasks and distributing the data items
accordingly. A complementary application programming interface allows users to
develop custom applications written in C or in high-level languages such as
Python. High throughput is achieved by the combination of fine- and
coarse-grained data parallelism, \emph{i.e.} processing a single data item on a
GPU using thousands of threads and splitting the data stream to feed individual
data items to separate GPUs. None of this requires any user intervention; it is
determined solely by the framework in an automated fashion.
\section{Results}
We measured the performance using a Xilinx VC709 evaluation board plugged into a
desktop PC with an Intel Xeon E5-1630 3.7 GHz processor and an Intel C612
chipset.
Due to the size limitation of the DMA buffer presented in
Section~\ref{sec:host}, several sub-buffers have to be copied in order to
transfer data larger than the maximum transfer size of 95 MB.
\figref{fig:intra-copy} shows the throughput for a copy from a smaller buffer
(representing the DMA buffer) into a larger destination buffer. At a block size
of about 384 KB, the throughput surpasses the maximum possible PCIe bandwidth,
thus making a double buffering strategy a viable solution for very large data
transfers.
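A minimal sketch of such a double buffering scheme is shown below: while the
FPGA fills one staging buffer, the previously filled one is copied on the GPU
into the large destination buffer with \texttt{clEnqueueCopyBuffer}. The
\texttt{wait\_for\_fpga} and \texttt{arm\_fpga} helpers stand for the FPGA
completion handshake and are hypothetical placeholders:
\begin{verbatim}
/* Double buffering sketch: stage[0] and stage[1] are bus-addressable staging
 * buffers filled alternately by the FPGA; the previously filled one is
 * appended to the large destination buffer with an on-GPU copy.
 * wait_for_fpga() and arm_fpga() are hypothetical placeholders. */
#include <CL/cl.h>

extern void wait_for_fpga(int buffer_index);  /* block until buffer is full */
extern void arm_fpga(int buffer_index);       /* tell the FPGA where to write */

void drain_to_destination(cl_command_queue queue, cl_mem stage[2], cl_mem dst,
                          size_t stage_size, size_t total_size)
{
    size_t offset = 0;
    int cur = 0;

    arm_fpga(cur);                            /* start filling the first one */

    while (offset < total_size) {
        size_t chunk = (total_size - offset < stage_size) ?
                       (total_size - offset) : stage_size;

        wait_for_fpga(cur);                   /* stage[cur] holds fresh data */
        arm_fpga(1 - cur);                    /* FPGA fills the other buffer */

        /* on-GPU copy from the staging buffer into the destination buffer */
        clEnqueueCopyBuffer(queue, stage[cur], dst, 0, offset, chunk,
                            0, NULL, NULL);

        offset += chunk;
        cur = 1 - cur;
    }
    clFinish(queue);                          /* wait for outstanding copies */
}
\end{verbatim}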
\begin{figure}
\includegraphics[width=\textwidth]{figures/intra-copy}
\caption{%
Throughput in MB/s for an intra-GPU data transfer of smaller block sizes
(4 KB -- 24 MB) into a larger destination buffer (32 MB -- 128 MB). The lower
performance for smaller block sizes is caused by the larger number of
transfers required to fill the destination buffer. The throughput has been
estimated using the host side wall clock time. The on-GPU data transfer itself
is about twice as fast.
}
\label{fig:intra-copy}
\end{figure}
\subsection{Throughput}
A high throughput is desired for applications in which the FPGA outputs large
amounts of data and latency is not the main concern. This includes fast,
high-resolution photon detectors as used in synchrotron facilities.
\figref{fig:throughput} shows the write throughput into GPU memory and into
system main memory. In both cases, the write performance is primarily limited
by the PCIe bus. Higher payloads introduce less overhead, thus increasing the
net bandwidth. Up to a transfer size of 2 MB, the performance is almost the
same; beyond that, the GPU transfer shows a slightly better slope. Data
transfers larger than 1 GB saturate the PCIe bus.
\begin{figure}
\centering
\includegraphics[width=1.0\textwidth]{figures/throughput}
\caption{
Throughput of a regular DMA transfer to system memory and of our GPU DMA data
transfer for up to 50 GB of data.
}
\label{fig:throughput}
\end{figure}
\subsection{Latency}
\begin{figure}
\includegraphics[width=\textwidth]{figures/latency-michele}
\caption{%
Relative frequency of measured latencies for a single 4 KB packet transferred
from the GPU to the FPGA.
}
\label{fig:latency-distribution}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=0.6\textwidth]{figures/latency}
\caption{%
For data transfers larger than XX MB, latency is decreased by XXX percent with
respect to the traditional approach (a) by using our implementation (b).
}
\label{fig:latency}
\end{figure}
\figref{fig:latency} shows the comparison between the traditional approach and
the direct GPU DMA data transfer; the total latency is decreased because the
intermediate copy into system main memory is avoided. The distribution of the
measured latencies is shown in \figref{fig:latency-distribution}.
The round-trip time of a memory read request issued from the CPU to the FPGA is
less than 1 $\mu$s. Therefore, the current performance bottleneck lies in the
execution of the DirectGMA functions.
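As a reference for this figure, the round-trip time can be estimated with a
simple read-back loop over the driver-mapped BAR; the device node
\texttt{/dev/fpga0} and the register offset are hypothetical placeholders for
our driver interface:
\begin{verbatim}
/* Time the CPU -> FPGA -> CPU round trip of a PIO register read through the
 * driver-mapped BAR.  /dev/fpga0 and REG_SCRATCH are hypothetical names. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define REG_SCRATCH 0x0               /* any readable register works */

int main(void)
{
    int fd = open("/dev/fpga0", O_RDWR);
    volatile uint32_t *bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < 1000; i++)
        (void) bar[REG_SCRATCH / 4];  /* non-posted read waits for completion */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("average round trip: %.0f ns\n", ns / 1000.0);

    munmap((void *) bar, 4096);
    close(fd);
    return 0;
}
\end{verbatim}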
\section{Conclusion}
We developed a complete hardware and software solution that enables DMA
transfers between FPGA-based readout boards and GPU computing clusters. The net
throughput is primarily limited by the PCIe bus, reaching 6.x GB/s for a 256 B
payload. By writing directly into GPU memory instead of routing data through
system main memory, the overall latency is reduced by a factor of 2. Moreover,
the proposed solution enables high performance GPU computing due to the
integration of the DMA transfer setup into our streamed computing framework.
Integration with different DAQ systems and custom algorithms is therefore
straightforward.
\subsection{Outlook}
An optimization of the OpenCL code is ongoing, with the help of AMD technical
support. With a better understanding of the hardware and software aspects of
DirectGMA, we expect a significant improvement in latency.
Support for NVIDIA's GPUDirect technology is foreseen in the coming months; it
will lift the restriction to a single GPU vendor and enable a direct performance
comparison.
A custom FPGA evaluation board is currently under development in order to
increase the total throughput. The board mounts a Virtex-7 chip and features two
fully populated FMC connectors, a 119 Gb/s DDR memory interface and a PCIe Gen3
x16 connection. Two PCIe x8 cores, instantiated on the board, will be mapped as
a single x16 device by using an external PCIe switch. With two cores operating
in parallel, we foresee an increase in data throughput by a factor of 2 (as
demonstrated in~\cite{rota2015dma}).
We also intend to include InfiniBand support in the architecture.
Our goal is to develop a unique hybrid solution, based
on commercial standards, that includes fast data transmission protocols and a
high performance GPU computing framework.
\acknowledgments
% TODO: add funding acknowledgments (UFO project, KSETA).
\bibliographystyle{JHEP}
\bibliography{literature}
\end{document}