@@ -17,16 +17,17 @@
\author{
L.~Rota$^a$,
M.~Vogelgesang$^a$,
- N.~Zilio$^a$,
- M.~Caselle$^a$,
- S.~Chilingaryan$^a$,
L.E.~Ardila Perez$^a$,
M.~Balzer$^a$,
+ M.~Caselle$^a$,
+ S.~Chilingaryan$^a$,
+ T.~Dritschler$^a$,
+ N.~Zilio$^a$,
M.~Weber$^a$\\
\llap{$^a$}Institute for Data Processing and Electronics,\\
Karlsruhe Institute of Technology (KIT),\\
Herrmann-von-Helmholtz-Platz 1, Karlsruhe, Germany \\
- E-mail: \email{lorenzo.rota@kit.edu}, \email{matthias.vogelgesang@kit.edu}
+ E-mail: \email{lorenzo.rota@kit.edu}, \email{matthias.vogelgesang@kit.edu}

\abstract{%
@@ -35,12 +36,12 @@
acquisition and processing. Because of their intrinsic parallelism and
computational power, GPUs emerged as an ideal solution to process this data in
high performance computing applications. In this paper we present a
- high-throughput platform based on direct FPGA-GPU communication and
- preliminary latency and throughput results. The architecture consists of a
+ high-throughput platform based on direct FPGA-GPU communication.
+ The architecture consists of a
Direct Memory Access (DMA) engine compatible with the Xilinx PCI-Express core,
a Linux driver for register access, and high-level software to manage direct
- memory transfers using AMD's DirectGMA technology. Measurements with a Gen3
- x8 link shows a throughput of up to 6.4 GB/s and a latency of XXX \textmu s.
+ memory transfers using AMD's DirectGMA technology. Preliminary measurements with a Gen3
+ x8 link show a throughput of up to 6.4 GB/s and a latency of 40 \textmu s.
Our implementation is suitable for real-time DAQ system applications ranging
from photon science and medical imaging to High Energy Physics (HEP) trigger
systems.
@@ -74,67 +75,48 @@ in Low/High-level trigger systems. Furthermore, the amount of data produced in
current generation photon science facilities have become comparable to those
traditionally associated with HEP.

-% MV: too many little paragraphs in my opinion
-In order to achieve the best performance in terms of latency and bandwidth,
-data transfers must be handled by a dedicated DMA controller, at the cost of higher
-system complexity.
-
-To address these problems we propose a complete hardware/software stack
-architecture based on a high-performance DMA engine implemented on Xilinx FPGAs,
-and integration of AMD's DirectGMA technology into our processing pipeline.
-
-In our solution, PCI-express (PCIe) has been chosen as a direct data link between
-FPGA boards and the host computer. Due to its high bandwidth and modularity,
+Due to its high bandwidth and modularity,
PCIe quickly became the commercial standard for connecting high-throughput
-peripherals such as GPUs or solid state disks.
-
-Moreover, optical PCIe networks have been demonstrated a decade ago~\cite{optical_pcie},
-opening the possibility of using PCIe as a communication bus over long distances.
-
-
-\section{Background}
-
-Several solutions for direct FPGA/GPU communication are reported in literature,
-and all of them are based on NVIDIA's GPUdirect technology.
+peripherals such as GPUs or solid state disks. Moreover, optical PCIe networks were demonstrated a decade ago~\cite{optical_pcie}, opening the possibility of using PCIe as a communication bus over long distances.

+Several solutions for direct FPGA/GPU communication based on PCIe are reported
+in literature, and all of them are based on NVIDIA's GPUdirect technology.
In the implementation of Bittner and Ruf ~\cite{bittner} the GPU acts as master
during an FPGA-to-GPU data transfer, reading data from the FPGA. This solution
limits the reported bandwidth and latency to 514 MB/s and 40~\textmu s,
respectively.
-
+%LR: FPGA^2 it's the name of their thing...
When the FPGA is used as a master, a higher throughput can be achieved. An
-example of this approach is the FPGAs
-% MV: what was this superscript about?
-%\textsuperscript{2}
+example of this approach is the FPGA\textsuperscript{2}
framework by Thoma et~al.\cite{thoma}, which reaches 2454 MB/s using a 8x Gen2.0
data link.
-
Lonardo et~al.\ achieved low latencies with their NaNet design, an FPGA-based
PCIe network interface card~\cite{lonardo2015nanet}. The Gbe link however
limits the latency performance of the system to a few tens of \textmu s. If only
the FPGA-to-GPU latency is considered, the measured values span between
1~\textmu s and 6~\textmu s, depending on the datagram size. Moreover, the
bandwidth saturates at 120 MB/s.
-
Nieto et~al.\ presented a system based on a PXIexpress data link that makes use
of four PCIe 1.0 links~\cite{nieto2015high}.
Their system (as limited by the interconnect) achieves an average throughput of
870 MB/s with 1 KB block transfers.

+In order to achieve the best performance in terms of latency and bandwidth,
+we developed a high-performance DMA engine based on Xilinx's PCIe Gen3 Core.

-\section{Basic concept}
+To process the data, we encapsulated the DMA setup and memory mapping in a
+plugin for our scalable GPU processing framework~\cite{vogelgesang2012ufo}. This
+framework allows for easy construction of streamed data processing on
+heterogeneous multi-GPU systems. However, the framework is based on OpenCL,
+and therefore integration with NVIDIA's CUDA functions for GPUDirect technology
+is not possible.

-\begin{figure}[t]
- \centering
- \includegraphics[width=1.0\textwidth]{figures/transf}
- \caption{%
- In a traditional DMA architecture (a), data are first written to the main
- system memory and then sent to the GPUs for final processing. By using
- GPUDirect/DirectGMA technology (b), the DMA engine has direct access to
- GPU's internal memory.
- }
- \label{fig:trad-vs-dgpu}
-\end{figure}
+We therefore integrated direct FPGA-to-GPU communication into our processing pipeline
+using AMD's DirectGMA technology. In this paper we report the performance of our
+DMA engine for FPGA-to-CPU communication and the first preliminary results with
+DirectGMA technology.
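+
+As an illustration of how GPU memory can be exposed to the FPGA, the sketch
+below uses OpenCL with the \texttt{cl\_amd\_bus\_addressable\_memory}
+extension. The flag and function names come from that extension; the wrapper
+itself and its error handling are simplified assumptions, not our final
+implementation.
+
+\begin{verbatim}
+#include <CL/cl.h>
+#include <CL/cl_ext.h>  /* AMD bus-addressable memory extension */
+
+/* Allocate a GPU buffer that a PCIe bus master (here, the FPGA DMA
+ * engine) can write to directly, and return its physical bus address. */
+static cl_ulong make_fpga_writable_buffer(cl_context ctx,
+                                          cl_command_queue queue,
+                                          size_t size, cl_mem *buf)
+{
+    cl_int err;
+    cl_bus_address_amd addr;
+
+    *buf = clCreateBuffer(ctx, CL_MEM_BUS_ADDRESSABLE_AMD, size, NULL, &err);
+    /* Pin the buffer and query the address visible on the PCIe bus. */
+    clEnqueueMakeBuffersResidentAMD(queue, 1, buf, CL_TRUE, &addr,
+                                    0, NULL, NULL);
+    /* This address is then programmed into the DMA descriptors via PIO. */
+    return addr.surface_bus_address;
+}
+\end{verbatim}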
+
+\section{Architecture}

As shown in \figref{fig:trad-vs-dgpu} (a), traditional FPGA-GPU systems route
data through system main memory by copying data from the FPGA into intermediate
@@ -147,6 +129,18 @@ the overall latency of the system is reduced and total throughput increased.
Moreover, the CPU and main system memory are relieved from processing because
they are not directly involved in the data transfer anymore.
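+
+For contrast, the traditional two-hop path of \figref{fig:trad-vs-dgpu} (a)
+amounts to the sketch below; \texttt{fpga\_fd}, \texttt{FPGA\_DMA\_READ} and
+the wrapper are hypothetical illustrations, not our driver's actual interface:
+
+\begin{verbatim}
+#include <sys/ioctl.h>
+#include <CL/cl.h>
+
+#define FPGA_DMA_READ _IOW('f', 1, void *)  /* hypothetical ioctl number */
+
+/* Traditional path: FPGA -> pinned host buffer -> GPU memory. */
+static void two_hop_transfer(int fpga_fd, cl_command_queue queue,
+                             cl_mem gpu_buf, void *host_buf, size_t size)
+{
+    /* First hop: the FPGA DMA engine fills the pinned host buffer. */
+    ioctl(fpga_fd, FPGA_DMA_READ, host_buf);
+    /* Second hop: the CPU copies the data into GPU memory. */
+    clEnqueueWriteBuffer(queue, gpu_buf, CL_TRUE, 0, size, host_buf,
+                         0, NULL, NULL);
+}
+\end{verbatim}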

+\begin{figure}[t]
+ \centering
+ \includegraphics[width=1.0\textwidth]{figures/transf}
+ \caption{%
+ In a traditional DMA architecture (a), data are first written to the main
+ system memory and then sent to the GPUs for final processing. By using
+ GPUDirect/DirectGMA technology (b), the DMA engine has direct access to
+ the GPU's internal memory.
+ }
+ \label{fig:trad-vs-dgpu}
+\end{figure}
+
\subsection{DMA engine implementation on the FPGA}

We have developed a DMA architecture that minimizes resource utilization while
@@ -155,9 +149,7 @@ policy~\cite{rota2015dma}. The engine is compatible with the Xilinx PCIe Gen2/3
IP-Core~\cite{xilinxgen3} for Xilinx FPGA families 6 and 7. DMA transmissions to
main system memory and GPU memory are both supported. Two FIFOs, with a data
width of 256 bits and operating at 250 MHz, act as user-friendly interfaces with
-the custom logic. The resulting input bandwidth of 7.8 GB/s is enough to
-saturate a PCIe Gen3 x8 link\footnote{The theoretical net bandwidth of a PCIe
-3.0 x8 link with a payload of 1024 B is 7.6 GB/s}. The user logic and the DMA
+the custom logic with an input bandwidth of 7.45 GB/s. The user logic and the DMA
engine are configured by the host through PIO registers.
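+
+The quoted input bandwidth follows directly from the FIFO geometry; as a
+sanity check (binary units, i.e.\ $1\,\mathrm{GB} = 2^{30}$ bytes here):
+\begin{equation*}
+256\,\mathrm{bit} \times 250\,\mathrm{MHz} = 64\,\mathrm{Gbit/s}
+= 8\times10^{9}\,\mathrm{B/s} \approx 7.45\,\mathrm{GB/s}.
+\end{equation*}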

The physical addresses of the host's memory buffers are stored into an internal
@@ -245,6 +237,8 @@ develop custom applications written in C or high-level languages such as Python.
We carried out performance measurements on a machine with an Intel Xeon E5-1630
at 3.7 GHz, Intel C612 chipset running openSUSE 13.1 with Linux 3.11.10. The
Xilinx VC709 evaluation board was plugged into one of the PCIe 3.0 x8 slots.
+For FPGA-to-CPU data transfers, we used the software implementation described
+in~\cite{rota2015dma}.

\begin{figure}
\centering
@@ -261,7 +255,7 @@ Xilinx VC709 evaluation board was plugged into one of the PCIe 3.0 x8 slots.
\caption{%
Latency distribution.
% for a single 4 KB packet transferred
- % from FPGA to CPU and FPGA to GPU.
+ % from FPGA-to-CPU and FPGA-to-GPU.
}
\label{fig:latency}
\end{subfigure}
@@ -333,27 +327,22 @@ run-time behaviour of the operating system scheduler.

\section{Conclusion and outlook}

-We developed a complete hardware and software solution that enables DMA
+We developed a hardware and software solution that enables DMA
transfers between FPGA-based readout boards and GPU computing clusters.
-
-The net throughput is primarily limited
-by the PCIe bus, reaching 6.4 GB/s for a 256 B payload and surpassing our
-CPU-based data transfer.
-
-By writing directly into GPU memory instead of routing data through system main memory, the overall latency can be reduced
-by a factor of two allowing close massively parallel computation on GPUs.
-
-Moreover, the software solution that we proposed allows seamless multi-GPU
+The software solution that we proposed allows seamless multi-GPU
processing of the incoming data, due to the integration in our streamed computing
framework. This allows straightforward integration with different DAQ systems
and introduction of custom data processing algorithms.

+The net throughput is primarily limited by the PCIe bus, reaching 6.4 GB/s
+for an FPGA-to-GPU data transfer and 6.6 GB/s for an FPGA-to-CPU data transfer.
+By writing directly into GPU memory instead of routing data through system
+main memory, the overall latency can be reduced, thus allowing massively
+parallel computation on GPUs close to the data source.
+
Optimization of the GPU DMA interfacing code is ongoing with the help of
-technical support by AMD. With a better understanding of the hardware and software aspects of
-DirectGMA, we expect a significant improvement in latency performance. Support
-for NVIDIA's GPUDirect technology is foreseen in the next months to lift the
-limitation of one specific GPU vendor and compare the performance of hardware by
-different vendors.
+technical support by AMD. With a better understanding of the hardware and
+software aspects of DirectGMA, we expect a significant improvement in latency
+performance.

In order to increase the total throughput, a custom FPGA evaluation board is
currently under development. The board mounts a Virtex-7 chip and features two
@@ -363,10 +352,14 @@ a single x16 device by using an external PCIe switch. With two cores operating
in parallel, we foresee an increase in the data throughput by a factor of 2 (as
demonstrated in~\cite{rota2015dma}).

-%% Where do we get this values? Any reference?
+Support for NVIDIA's GPUDirect technology is also foreseen in the coming months to
+lift the restriction to one specific GPU vendor and to compare the performance of
+hardware from different vendors.
Further improvements are expected by generalizing the transfer mechanism and
-include Infiniband support besides the existing PCIe connection. This allows
-speeds of up to 290 Gb/s and latencies as low as 0.5 \textmu s.
+including InfiniBand support besides the existing PCIe connection.
+%% Where do we get these values? Any reference?
+%This allows
+%speeds of up to 290 Gb/s and latencies as low as 0.5 \textmu s.

Our goal is to develop a unique hybrid solution, based on commercial standards,
that includes fast data transmission protocols and a high performance GPU
|