@@ -23,29 +23,30 @@
L.E.~Ardila Perez$^a$,
M.~Balzer$^a$,
M.~Weber$^a$\\
- \llap{$^a$}All authors Institute for Data Processing and Electronics,\\
+ \llap{$^a$}Institute for Data Processing and Electronics,\\
Karlsruhe Institute of Technology (KIT),\\
Hermann-von-Helmholtz-Platz 1, Karlsruhe, Germany \\
- E-mail: \email{lorenzo.rota@kit.edu}
+ E-mail: \email{lorenzo.rota@kit.edu}, \email{matthias.vogelgesang@kit.edu}
}
\abstract{%
Modern physics experiments have reached multi-GB/s data rates. Fast data
links and high performance computing stages are required for continuous
- acquisition and processing. Because of their intrinsic parallelism and
+ data acquisition and processing. Because of their intrinsic parallelism and
computational power, GPUs emerged as an ideal solution for high
- performance computing applications. To connect a fast data acquisition stage
- with a GPU's processing power, we developed an architecture consisting of a
- FPGA that includes a Direct Memory Access (DMA) engine compatible with the
+ performance computing applications. In this paper we present a high-throughput
+ platform based on direct FPGA-GPU communication and report preliminary
+ results.
+ The architecture consists of a Direct Memory Access (DMA) engine compatible with the
Xilinx PCI-Express core, a Linux driver for register access, and high-level
software to manage direct memory transfers using AMD's DirectGMA technology.
- Measurements with a Gen3 x8 link shows a throughput of up to 6.4 GB/s. Our
- implementation is suitable for real-time DAQ system applications ranging
+ Measurements with a Gen3 x8 link show a throughput of up to 6.4 GB/s and a latency of
+ XXX \textmu s.
+ Our implementation is suitable for real-time DAQ system applications ranging
from photon science and medical imaging to High Energy Physics (HEP)
trigger systems.
}
-
-\keywords{AMD directGMA; FPGA; Readout architecture}
+\keywords{FPGA; GPU; PCI-Express; OpenCL; GPUDirect; DirectGMA}

\begin{document}
@@ -61,66 +62,69 @@ GPU computing has become the main driving force for high performance computing
due to an unprecedented parallelism and a low cost-benefit factor. GPU
acceleration has found its way into numerous applications, ranging from
simulation to image processing. Recent years have also seen an increasing
-interest in GPU-based systems for HEP experiments (\emph{e.g.}
+interest in GPU-based systems for High Energy Physics (HEP) experiments (\emph{e.g.}
ATLAS~\cite{atlas_gpu}, ALICE~\cite{alice_gpu}, Mu3e~\cite{mu3e_gpu},
PANDA~\cite{panda_gpu}). In a typical HEP scenario,
-data is acquired by one or more read-out boards and then
-transmitted in short bursts or in a continuous streaming mode to a computation stage.
-With expected data rates of several GB/s, the data transmission link between the
-read-out boards and the host system may partially limit the overall system
-performance. In particular, latency becomes the most stringent specification if
-a time-deterministic feedback is required, \emph{e.g.} in Low/High-level trigger
-systems. Moreover, the volumes of data produced in recent photon
+data are acquired by back-end readout systems and then
+transmitted in short bursts or in a continuous streaming mode to a computing stage.
+
+With expected data rates of several GB/s, the data transmission link may partially
+limit the overall system performance.
+In particular, latency becomes the most stringent requirement for time-deterministic applications,
+\emph{e.g.} in Low/High-level trigger systems.
+Furthermore, the volumes of data produced in recent photon
science facilities have become comparable to those traditionally associated with
HEP.

In order to achieve the best performance in terms of latency and bandwidth,
-data transfers are handled by a dedicated DMA controller, at the cost of higher
+data transfers must be handled by a dedicated DMA controller, at the cost of higher
system complexity.

To address these problems we propose a complete hardware/software stack
-architecture based on our own DMA engine, and integration
-of AMD's DirectGMA technology into our processing pipeline.
+architecture based on a high-performance DMA engine implemented on Xilinx FPGAs,
+and integration of AMD's DirectGMA technology into our processing pipeline.
+
+In our solution, PCI-Express (PCIe) has been chosen as a direct data link between
+FPGA boards and the host computer. Due to its high bandwidth and modularity,
+PCIe quickly became the commercial standard for connecting high-throughput
+peripherals such as GPUs or solid state disks.
+
+Moreover, optical PCIe networks were demonstrated a decade ago~\cite{optical_pcie},
+opening the possibility of using PCIe as a communication bus over long distances.

\section{Background}
-Several solutions for direct FPGA/GPU communication are reported in literature.
-All these are based on NVIDIA's GPUDirect technology.
+Several solutions for direct FPGA/GPU communication are reported in the literature,
+and all of them are based on NVIDIA's GPUDirect technology.
+
+In the implementation of Bittner and Ruf~\cite{bittner}, the GPU acts as master
+during an FPGA-to-GPU data transfer, reading data from the FPGA.
+This solution limits the reported bandwidth and
+latency to, respectively, 514 MB/s and 40~\textmu s.

-The first implementation was realized by Bittner and Ruf with the Speedy
-PCIe Core~\cite{bittner}. In their design, during an FPGA-to-GPU data transfers,
-the GPU acts as master and reading data
-from the FPGA. This solution limits the reported bandwidth and
-latency to, respectively, 514 MB/s and 40~$\mu$s.
+When the FPGA is used as a master, a higher throughput can be achieved.
+An example of this approach is the FPGA\textsuperscript{2} framework
+by Thoma et~al.~\cite{thoma}, which reaches 2454 MB/s using an 8x Gen2.0 data link.

-Lonardo et~al.\ achieved lower latencies with their NaNet design, an FPGA-based
+Lonardo et~al.\ achieved low latencies with their NaNet design, an FPGA-based
PCIe network interface card~\cite{lonardo2015nanet}.
-The Gbe link limits the latency performance of the system to a few tens of $\mu$s.
+The GbE link, however, limits the latency performance of the system to a few tens of \textmu s.
If only the FPGA-to-GPU latency is considered, the measured values span between
-1~$\mu$s and 6~$\mu$s, depending on the datagram size. Due to its design,
- the bandwidth saturates at 120 MB/s.
+1~\textmu s and 6~\textmu s, depending on the datagram size. Moreover,
+the bandwidth saturates at 120 MB/s.

Nieto et~al.\ presented a system based on a PXIexpress data link that makes use
of four PCIe 1.0 links~\cite{nieto2015high}.
Their system (as limited by the interconnect) achieves an average throughput of
870 MB/s with 1 KB block transfers.

-A higher throughput has been achieved with the FPGA\textsuperscript{2} framework
-by Thoma et~al.\cite{thoma}: 2454 MB/s using a 8x Gen2.0 data link.
-
-\section{Basic Concepts}
-
-In our solution, PCI-express (PCIe) has been chosen as a direct data link between FPGA boards and the
-host computer. Due to its high bandwidth and modularity, PCIe quickly became the com-
-mercial standard for connecting high-throughput peripherals such as GPUs or solid state disks.
-Moreover, optical PCIe networks have been demonstrated a decade ago [5], opening the possibility
-of using PCIe as a communication bus over long distances.
+\section{Basic Concept}
\begin{figure}[t]
\centering
\includegraphics[width=1.0\textwidth]{figures/transf}
\caption{%
- In a traditional DMA architecture (a), data is first written to the main
+ In a traditional DMA architecture (a), data are first written to the main
system memory and then sent to the GPUs for final processing. By using
GPUDirect/DirectGMA technology (b), the DMA engine has direct access to
GPU's internal memory.
@@ -130,14 +134,14 @@ of using PCIe as a communication bus over long distances.

As shown in \figref{fig:trad-vs-dgpu} (a), traditional FPGA-GPU systems route
data through system main memory by copying data from the FPGA into intermediate
-buffers and then finally into the GPU's main memory. Thus, the total throughput
-of the system is limited by the main memory bandwidth. NVIDIA's GPUDirect and
-AMD's DirectGMA technologies allow direct communication between GPUs and
-auxiliary devices over the PCIe bus. By combining this technology with DMA data
-transfers (see \figref{fig:trad-vs-dgpu} (b)), the overall latency of the system
-is reduced and total throughput increased. Moreover, the CPU and main system
-memory are relieved from processing because they are not directly involved in
-the data transfer anymore.
+buffers and then finally into the GPU's main memory.
+Thus, the total throughput and latency of the system are limited by the main
+memory bandwidth. NVIDIA's GPUDirect and AMD's DirectGMA technologies allow
+direct communication between GPUs and auxiliary devices over the PCIe bus.
+By combining this technology with DMA data transfers (see \figref{fig:trad-vs-dgpu} (b)),
+the overall latency of the system is reduced and total throughput increased.
+Moreover, the CPU and main system memory are relieved from processing because
+they are not directly involved in the data transfer anymore.
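+
+As an illustration, a minimal host-side sketch of this mechanism is given below.
+It allocates a GPU buffer that an external PCIe device can write to directly and
+queries its bus address, using AMD's \texttt{cl\_amd\_bus\_addressable\_memory}
+extension; names follow \texttt{CL/cl\_ext.h}, error handling is omitted, and the
+extension entry point may have to be loaded with
+\texttt{clGetExtensionFunctionAddressForPlatform}.
+
+\begin{verbatim}
+#include <CL/cl.h>
+#include <CL/cl_ext.h>
+
+/* Sketch: expose a GPU buffer on the PCIe bus so that the FPGA's
+ * DMA engine can write into it directly (DirectGMA). */
+cl_mem alloc_fpga_writable_buffer(cl_context ctx, cl_command_queue queue,
+                                  size_t size, cl_bus_address_amd *bus_addr)
+{
+    cl_int err;
+
+    /* GPU memory whose physical bus address can be queried. */
+    cl_mem buf = clCreateBuffer(ctx, CL_MEM_BUS_ADDRESSABLE_AMD,
+                                size, NULL, &err);
+
+    /* Pin the buffer and obtain its bus address. */
+    err = clEnqueueMakeBuffersResidentAMD(queue, 1, &buf, CL_TRUE,
+                                          bus_addr, 0, NULL, NULL);
+
+    /* bus_addr->surface_bus_address is then programmed into the
+     * FPGA DMA engine as the transfer destination. */
+    return buf;
+}
+\end{verbatim}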
\subsection{DMA engine implementation on the FPGA}
@@ -155,8 +159,23 @@ engine are configured by the host through PIO registers.
The physical addresses of the host's memory buffers are stored into an internal
memory and are dynamically updated by the driver or user, allowing highly
efficient zero-copy data transfers. The maximum size associated with each
-address is 2 GB.
-
+address is 2 GB. The engine fully supports 64-bit addresses. The resource utilization
+on a Virtex-7 device is reported in table~\ref{table:utilization}.
+
+\begin{table}[htb]
+\centering
+\caption{Resource utilization of the DMA engine on a Xilinx Virtex-7 device.}
+\label{table:utilization}
+\begin{tabular}{@{}llll@{}}
+\hline
+Resource & Used & Available & Utilization (\%) \\\hline
+LUT & 5331 & 433200 & 1.23 \\
+LUTRAM & 56 & 174200 & 0.03 \\
+FF & 5437 & 866400 & 0.63 \\
+BRAM & 20.50 & 1470 & 1.39 \\\hline
+\end{tabular}
+\end{table}
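+
+%% Illustration only -- not the actual driver interface.
+As an illustration, the following sketch shows how user space could update the
+descriptor memory through the PIO registers of a memory-mapped BAR; the device
+node, register offsets and descriptor layout are placeholders.
+
+\begin{verbatim}
+#include <fcntl.h>
+#include <stddef.h>
+#include <stdint.h>
+#include <sys/mman.h>
+
+#define DESC_BASE   0x1000  /* placeholder: descriptor memory offset */
+#define DESC_STRIDE 0x10    /* placeholder: bytes per descriptor     */
+
+/* Write the physical address and length of one target buffer into the
+ * FPGA's internal descriptor memory (zero-copy, no data is moved). */
+static void write_descriptor(volatile uint32_t *bar, unsigned idx,
+                             uint64_t phys_addr, uint32_t length)
+{
+    volatile uint32_t *d = bar + (DESC_BASE + idx * DESC_STRIDE) / 4;
+    d[0] = (uint32_t)(phys_addr & 0xffffffffu);  /* address, low  32 bit */
+    d[1] = (uint32_t)(phys_addr >> 32);          /* address, high 32 bit */
+    d[2] = length;                               /* transfer size        */
+}
+
+int main(void)
+{
+    int fd = open("/dev/fpga0", O_RDWR | O_SYNC);    /* placeholder node */
+    volatile uint32_t *bar = mmap(NULL, 0x10000, PROT_READ | PROT_WRITE,
+                                  MAP_SHARED, fd, 0);
+    /* Point descriptor 0 at a 4 MB buffer whose physical address was
+     * obtained from the driver or from the GPU residency call. */
+    write_descriptor(bar, 0, 0x800000000ULL, 4 * 1024 * 1024);
+    return 0;
+}
+\end{verbatim}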
\subsection{OpenCL management on host side}
\label{sec:host}
@@ -182,11 +201,11 @@ GPU's address space passing a special AMD-specific flag and passing the physical
BAR address of the FPGA configuration memory to the \texttt{cl\-Create\-Buffer}
function. From the GPU, this memory is seen transparently as regular GPU memory
and can be written accordingly (3). In our setup, trigger registers are used to
-notify the FPGA on successful or failed evaluation of the data. These individual
-write accesses are issued as PIO commands which can at most transfer 32-bit
-words. Using the \texttt{cl\-Enqueue\-Copy\-Buffer} function call it is possible
-to circumvent this limitation and write entire memory regions in DMA fashion to
-the FPGA. In this case, the GPU acts as bus master and pushes data to the FPGA.
+notify the FPGA on successful or failed evaluation of the data.
+
+Using the \texttt{cl\-Enqueue\-Copy\-Buffer} function call, it is possible
+to write entire memory regions in DMA fashion to the FPGA.
+In this case, the GPU acts as bus master and pushes data to the FPGA.
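+
+A minimal sketch of this write-back path is shown below; the flag and the
+\texttt{cl\_bus\_address\_amd} struct come from the
+\texttt{cl\_amd\_bus\_addressable\_memory} extension, the BAR address is a
+placeholder, and error handling as well as any additional residency or
+migration calls required by the runtime are omitted.
+
+\begin{verbatim}
+#include <CL/cl.h>
+#include <CL/cl_ext.h>
+
+/* Sketch: map the FPGA BAR as an OpenCL buffer and push results back
+ * to the FPGA with a DMA copy issued by the GPU. */
+void push_results_to_fpga(cl_context ctx, cl_command_queue queue,
+                          cl_mem results, size_t size,
+                          cl_ulong fpga_bar_address /* placeholder */)
+{
+    cl_int err;
+    cl_bus_address_amd addr = { 0 };
+    addr.surface_bus_address = fpga_bar_address;
+
+    /* The FPGA memory window appears as a regular cl_mem object. */
+    cl_mem fpga_window = clCreateBuffer(ctx, CL_MEM_EXTERNAL_PHYSICAL_AMD,
+                                        size, &addr, &err);
+
+    /* The GPU acts as bus master: the copy is a DMA write to the FPGA. */
+    err = clEnqueueCopyBuffer(queue, results, fpga_window, 0, 0, size,
+                              0, NULL, NULL);
+    clFinish(queue);
+    clReleaseMemObject(fpga_window);
+}
+\end{verbatim}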
\begin{figure}
\centering
@@ -203,9 +222,11 @@ framework allows for an easy construction of streamed data processing on
heterogeneous multi-GPU systems. For example, to read data from the FPGA, decode
from its specific data format and run a Fourier transform on the GPU as well as
writing back the results to disk, one can run the following on the command line:
+
\begin{verbatim}
ufo-launch direct-gma ! decode ! fft ! write filename=out.raw
\end{verbatim}
+
The framework takes care of scheduling the tasks and distributing the data items
to one or more GPUs. High throughput is achieved by the combination of fine-
and coarse-grained data parallelism, \emph{i.e.} processing a single data item
@@ -309,12 +330,15 @@ run-time behaviour of the operating system scheduler.
\section{Conclusion and outlook}

We developed a complete hardware and software solution that enables DMA
-transfers between FPGA-based readout boards and GPU computing clusters with
-reasonable performance characteristics. The net throughput is primarily limited
+transfers between FPGA-based readout boards and GPU computing clusters.
+
+The net throughput is primarily limited
by the PCIe bus, reaching 6.4 GB/s for a 256 B payload and surpassing our
-CPU-based data transfer. Furthermore, by writing directly into GPU memory instead
-of routing data through system main memory, the overall latency can be reduced
+CPU-based data transfer.
+
+By writing directly into GPU memory instead of routing data through system main memory, the overall latency can be reduced
by a factor of two allowing close massively parallel computation on GPUs.
+
Moreover, the software solution that we proposed allows seamless multi-GPU
processing of the incoming data, due to the integration in our streamed computing
framework. This allows straightforward integration with different DAQ systems
@@ -333,7 +357,10 @@ fully populated FMC connectors, a 119 Gb/s DDR memory interface and a PCIe Gen3
x16 connection. Two PCIe x8 cores, instantiated on the board, will be mapped as
a single x16 device by using an external PCIe switch. With two cores operating
in parallel, we foresee an increase in the data throughput by a factor of 2 (as
-demonstrated in~\cite{rota2015dma}). Further improvements are expected by
+demonstrated in~\cite{rota2015dma}).
+
+%% Where do we get these values? Any reference?
+Further improvements are expected by
generalizing the transfer mechanism and including InfiniBand support besides the
existing PCIe connection. This allows speeds of up to 290 Gbit/s and latencies as
low as 0.5 \textmu s.