@@ -14,15 +14,19 @@

\title{A high-throughput readout architecture based on PCI-Express Gen3 and DirectGMA technology}

-\author{M.~Vogelgesang$^a$,
+\author{
L.~Rota$^a$,
+ M.~Vogelgesang$^a$,
N.~Zilio$^a$,
M.~Caselle$^a$,
+ S.~Chilingaryan$^a$,
L.E.~Ardila Perez$^a$,
+ M.~Balzer$^a$,
M.~Weber$^a$\\
- \llap{$^a$}Institute for Data Processing and Electronics,\\
+ \llap{$^a$}All authors: Institute for Data Processing and Electronics,\\
Karlsruhe Institute of Technology (KIT),\\
- Herrmann-von-Helmholtz-Platz 1, Karlsruhe, Germany
+ Herrmann-von-Helmholtz-Platz 1, Karlsruhe, Germany \\
+ E-mail: \email{lorenzo.rota@kit.edu}
}

\abstract{%
@@ -41,6 +45,7 @@
trigger systems.
}

+\keywords{AMD DirectGMA; FPGA; Readout architecture}

\begin{document}

@@ -50,52 +55,66 @@
\fi


-\section{Motivation}
+\section{Introduction}

GPU computing has become the main driving force for high performance computing
due to an unprecedented parallelism and a low cost-benefit factor. GPU
acceleration has found its way into numerous applications, ranging from
simulation to image processing. Recent years have also seen an increasing
-interest in GPU-based systems for HEP applications, which require a combination
-of high data rates, high computational power and low latency (\emph{e.g.}
+interest in GPU-based systems for HEP experiments (\emph{e.g.}
ATLAS~\cite{atlas_gpu}, ALICE~\cite{alice_gpu}, Mu3e~\cite{mu3e_gpu},
-PANDA~\cite{panda_gpu}). Moreover, the volumes of data produced in recent photon
+PANDA~\cite{panda_gpu}). In a typical HEP scenario,
+data is acquired by one or more read-out boards and then
+transmitted in short bursts or in a continuous streaming mode to a computation stage.
+With expected data rates of several GB/s, the data transmission link between the
+read-out boards and the host system may partially limit the overall system
+performance. In particular, latency becomes the most stringent specification if
+time-deterministic feedback is required, \emph{e.g.} in Low/High-level trigger
+systems. Moreover, the volumes of data produced in recent photon
science facilities have become comparable to those traditionally associated with
HEP.

-In HEP experiments data is acquired by one or more read-out boards and then
-transmitted to GPUs in short bursts or in a continuous streaming mode. With
-expected data rates of several GB/s, the data transmission link between the
-read-out boards and the host system may partially limit the overall system
-performance. In particular, latency becomes the most stringent specification if
-a time-deterministic feedback is required, \emph{e.g.} Low/High-level Triggers.
+In order to achieve the best performance in terms of latency and bandwidth,
+data transfers are handled by a dedicated DMA controller, at the cost of higher
+system complexity.

To address these problems we propose a complete hardware/software stack
-architecture based on our own DMA design, and integration
-of AMD's DirectGMA technology into our processing pipeline. In our solution,
-PCI-express (PCIe) has been chosen as a data link between FPGA boards and the
-host computer. Due to its high bandwidth and modularity, PCIe quickly became the
-commercial standard for connecting high-throughput peripherals such as GPUs or
-solid state disks. Moreover, optical PCIe networks have been demonstrated
-a decade ago~\cite{optical_pcie}, opening the possibility of using PCIe
-as a communication bus over long distances. In particular, in HEP DAQ systems,
-optical links are preferred over electrical ones because of their superior
-radiation hardness, lower power consumption and higher density.
-
-Lonardo et~al.\ lifted this limitation with their NaNet design, an FPGA-based
-PCIe network interface card with NVIDIA's GPUDirect
-integration~\cite{lonardo2015nanet}. Due to its design, the bandwidth saturates
-at 120 MB/s for a 1472 byte large UDP datagram.
-Nieto et~al.\ presented a system that moves data from an FPGA to a GPU using
-GPUDirect and a PXIexpress data link that makes use of four PCIe 1.0 links
-\cite{nieto2015high}. Their system (as limited by the interconnect) achieves an
-average throughput of 870 MB/s with 1 KB block transfers.
-
-\section{Architecture}
-
-DMA data transfers are handled by dedicated hardware, which compared with
-Programmed Input Output (PIO) access, offer lower latency and higher throughput
-at the cost of higher system complexity.
+architecture based on our own DMA engine, and integration
+of AMD's DirectGMA technology into our processing pipeline.
+
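+As an illustration, the listing below sketches one possible host-side sequence
+for exposing a GPU buffer to an FPGA DMA engine with DirectGMA, through the
+\texttt{cl\_amd\_bus\_addressable\_memory} OpenCL extension. The OpenCL context,
+command queue, platform and buffer size are assumed to be set up already; the
+FPGA register interface (\texttt{fpga\_write\_register}, \texttt{DMA\_DST\_ADDR},
+\texttt{DMA\_START}) is a placeholder rather than an actual driver API.
+\begin{verbatim}
+/* Sketch: expose a GPU buffer to an FPGA DMA engine via DirectGMA
+ * (cl_amd_bus_addressable_memory). Error handling omitted;
+ * fpga_write_register() and its register names are placeholders. */
+#include <CL/cl.h>
+#include <CL/cl_ext.h>
+
+typedef cl_int (*resident_fn)(cl_command_queue, cl_uint, cl_mem *,
+                              cl_bool, cl_bus_address_amd *,
+                              cl_uint, const cl_event *, cl_event *);
+
+cl_int err;
+cl_bus_address_amd addr;
+cl_mem buf = clCreateBuffer(context, CL_MEM_BUS_ADDRESSABLE_AMD,
+                            size, NULL, &err);
+
+/* Pin the buffer and query its PCIe bus address. */
+resident_fn make_resident = (resident_fn)
+    clGetExtensionFunctionAddressForPlatform(platform,
+        "clEnqueueMakeBuffersResidentAMD");
+make_resident(queue, 1, &buf, CL_TRUE, &addr, 0, NULL, NULL);
+
+/* Program the FPGA DMA engine with the GPU bus address and start it. */
+fpga_write_register(DMA_DST_ADDR, addr.surface_bus_address);
+fpga_write_register(DMA_START, 1);
+\end{verbatim}
+Once the DMA engine knows this bus address, it writes payload data directly
+into GPU memory across PCIe, without passing through the host's main memory.
+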
+\section{Background}
+
+Several solutions for direct FPGA/GPU communication are reported in the literature.
+All of these are based on NVIDIA's GPUDirect technology.
+
+The first implementation was realized by Bittner and Ruf with the Speedy
+PCIe Core~\cite{bittner}. In their design, the GPU acts as bus master and reads
+data from the FPGA during FPGA-to-GPU transfers. This solution limits the
+reported bandwidth and latency to 514 MB/s and 40~$\mu$s, respectively.
+
+Lonardo et~al.\ achieved lower latencies with their NaNet design, an FPGA-based
+PCIe network interface card~\cite{lonardo2015nanet}.
+The GbE link limits the latency performance of the system to a few tens of $\mu$s.
+If only the FPGA-to-GPU latency is considered, the measured values span between
+1~$\mu$s and 6~$\mu$s, depending on the datagram size. Due to its design,
+the bandwidth saturates at 120 MB/s.
+
+Nieto et~al.\ presented a system based on a PXIexpress data link that makes use
+of four PCIe 1.0 links~\cite{nieto2015high}.
+Their system (as limited by the interconnect) achieves an average throughput of
+870 MB/s with 1 KB block transfers.
+
+A higher throughput has been achieved with the FPGA\textsuperscript{2} framework
+by Thoma et~al.~\cite{thoma}: 2454 MB/s using an 8x Gen2.0 data link.
+
+\section{Basic Concepts}
+
+In our solution, PCI-express (PCIe) has been chosen as a direct data link between
+FPGA boards and the host computer. Due to its high bandwidth and modularity, PCIe
+quickly became the commercial standard for connecting high-throughput peripherals
+such as GPUs or solid state disks. Moreover, optical PCIe networks were demonstrated
+a decade ago~\cite{optical_pcie}, opening the possibility of using PCIe as a
+communication bus over long distances.
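+
+As a rough reference (assuming an x8 Gen3 endpoint and accounting only for the
+128b/130b line encoding, not for packet and flow-control overhead), the
+theoretical net throughput of such a link in each direction is
+\[
+  8~\mathrm{lanes} \times 8~\mathrm{GT/s} \times \frac{128}{130}
+  \approx 63~\mathrm{Gbit/s} \approx 7.9~\mathrm{GB/s}.
+\]
+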
\begin{figure}[t]
\centering
@@ -120,7 +139,6 @@ is reduced and total throughput increased. Moreover, the CPU and main system
memory are relieved from processing because they are not directly involved in
the data transfer anymore.

-
\subsection{DMA engine implementation on the FPGA}
We have developed a DMA architecture that minimizes resource utilization while
|