
Added Suren suggestions

Lorenzo, 8 years ago
commit d280e50f54

1 changed file with 62 additions and 69 deletions
  paper.tex: +62 -69

@@ -17,16 +17,17 @@
 \author{
   L.~Rota$^a$,
   M.~Vogelgesang$^a$,
-  N.~Zilio$^a$,
-  M.~Caselle$^a$,
-  S.~Chilingaryan$^a$,
   L.E.~Ardila Perez$^a$,
   M.~Balzer$^a$,
+  M.~Caselle$^a$,
+  S.~Chilingaryan$^a$,
+  T.~Dritschler$^a$,
-  M.~Weber$^a$\\
+  M.~Weber$^a$,
+  N.~Zilio$^a$\\
   \llap{$^a$}Institute for Data Processing and Electronics,\\
     Karlsruhe Institute of Technology (KIT),\\
     Herrmann-von-Helmholtz-Platz 1, Karlsruhe, Germany \\
-  E-mail: \email{lorenzo.rota@kit.edu}, \email{matthias.vogelgesang@kit.edu}
+    E-mail: \email{lorenzo.rota@kit.edu}, \email{matthias.vogelgesang@kit.edu}
 }
 
 \abstract{%
@@ -35,12 +36,12 @@
   acquisition and processing. Because of their intrinsic parallelism and
   computational power, GPUs emerged as an ideal solution to process this data in
   high performance computing applications. In this paper we present a
-  high-throughput platform based on direct FPGA-GPU communication and
-  preliminary latency and throughput results. The architecture consists of a
+  high-throughput platform based on direct FPGA-GPU communication. 
+  The architecture consists of a
   Direct Memory Access (DMA) engine compatible with the Xilinx PCI-Express core,
   a Linux driver for register access, and high-level software to manage direct
-  memory transfers using AMD's DirectGMA technology.  Measurements with a Gen3
-  x8 link shows a throughput of up to 6.4 GB/s and a latency of XXX \textmu s.
+  memory transfers using AMD's DirectGMA technology. Preliminary measurements with a Gen3
+  x8 link show a throughput of up to 6.4 GB/s and a latency of 40 \textmu s.
   Our implementation is suitable for real-time DAQ system applications ranging
   from photon science and medical imaging to High Energy Physics (HEP) trigger
   systems.
@@ -74,67 +75,48 @@ in Low/High-level trigger systems.  Furthermore, the amount of data produced in
 current-generation photon science facilities has become comparable to that
 traditionally associated with HEP.
 
-% MV: too many little paragraphs in my opinion
-In order to achieve the best performance in terms of latency and bandwidth,
-data transfers must be handled by a dedicated DMA controller, at the cost of higher
-system complexity.
-
-To address these problems we propose a complete hardware/software stack
-architecture based on a high-performance DMA engine implemented on Xilinx FPGAs,
-and integration of AMD's DirectGMA technology into our processing pipeline.
-
-In our solution, PCI-express (PCIe) has been chosen as a direct data link between
-FPGA boards and the host computer. Due to its high bandwidth and modularity,
+Due to its high bandwidth and modularity,
 PCIe quickly became the commercial standard for connecting high-throughput
-peripherals such as GPUs or solid state disks.
-
-Moreover, optical PCIe networks have been demonstrated a decade ago~\cite{optical_pcie},
-opening the possibility of using PCIe as a communication bus over long distances.
-
-
-\section{Background}
-
-Several solutions for direct FPGA/GPU communication are reported in literature,
-and all of them are based on NVIDIA's GPUdirect technology.
+peripherals such as GPUs or solid-state disks. Moreover, optical PCIe networks were demonstrated a decade ago~\cite{optical_pcie}, opening the possibility of using PCIe as a communication bus over long distances.
 
+Several solutions for direct FPGA/GPU communication based on PCIe are reported
+in the literature, and all of them rely on NVIDIA's GPUDirect technology.
 In the implementation of Bittner and Ruf~\cite{bittner}, the GPU acts as master
 during an FPGA-to-GPU data transfer, reading data from the FPGA.  This solution
 limits the reported bandwidth and latency to 514 MB/s and 40~\textmu s,
 respectively.
-
+%LR: FPGA^2 it's the name of their thing...
 When the FPGA is used as a master, a higher throughput can be achieved.  An
-example of this approach is the FPGAs
-% MV: what was this superscript about?
-%\textsuperscript{2}
+example of this approach is the FPGA\textsuperscript{2}
 framework by Thoma et~al.~\cite{thoma}, which reaches 2454 MB/s using an 8x Gen2.0
 data link.
-
 Lonardo et~al.\ achieved low latencies with their NaNet design, an FPGA-based
 PCIe network interface card~\cite{lonardo2015nanet}. The GbE link, however,
 limits the latency performance of the system to a few tens of \textmu s. If only
 the FPGA-to-GPU latency is considered, the measured values span between
 1~\textmu s and 6~\textmu s, depending on the datagram size. Moreover, the
 bandwidth saturates at 120 MB/s.
-
 Nieto et~al.\ presented a system based on a PXIexpress data link that makes use
 of four PCIe 1.0 links~\cite{nieto2015high}.
 Their system (as limited by the interconnect) achieves an average throughput of
 870 MB/s with 1 KB block transfers.
 
+In order to achieve the best performance in terms of latency and bandwidth,
+we developed a high-performance DMA engine based on Xilinx's PCIe Gen3 Core.
 
-\section{Basic concept}
+To process the data, we encapsulated the DMA setup and memory mapping in a
+plugin for our scalable GPU processing framework~\cite{vogelgesang2012ufo}. This
+framework allows for the easy construction of streamed data processing on
+heterogeneous multi-GPU systems. However, since the framework is based on
+OpenCL, integration with NVIDIA's CUDA functions for GPUDirect technology is
+not possible.
 
-\begin{figure}[t]
-  \centering
-  \includegraphics[width=1.0\textwidth]{figures/transf}
-  \caption{%
-    In a traditional DMA architecture (a), data are first written to the main
-    system memory and then sent to the GPUs for final processing.  By using
-    GPUDirect/DirectGMA technology (b), the DMA engine has direct access to
-    GPU's internal memory.
-  }
-  \label{fig:trad-vs-dgpu}
-\end{figure}
+We therefore integrated direct FPGA-to-GPU communication into our processing pipeline
+using AMD's DirectGMA technology. In this paper we report the performance of our
+DMA engine for FPGA-to-CPU communication and the first preliminary results with 
+DirectGMA technology.
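
For illustration only, the listing below sketches how a bus-addressable GPU
buffer can be obtained on the host through AMD's
\texttt{cl\_amd\_bus\_addressable\_memory} OpenCL extension, which DirectGMA
builds on. This is not our plugin code: error handling, buffer sizing, and the
programming of the returned address into the DMA engine are omitted, and the
exact entry points and struct layout are assumed to match the declarations
shipped in AMD's \texttt{CL/cl\_ext.h}.
\begin{verbatim}
/* Sketch only: obtain a bus-addressable GPU buffer with the
 * cl_amd_bus_addressable_memory OpenCL extension (the mechanism
 * DirectGMA is exposed through).  Error handling is omitted and the
 * extension declarations are assumed to come from <CL/cl_ext.h>. */
#include <stdio.h>
#include <CL/cl.h>
#include <CL/cl_ext.h>

/* Signature as documented for clEnqueueMakeBuffersResidentAMD. */
typedef cl_int (*MakeBuffersResidentAMD_fn)(
    cl_command_queue, cl_uint, cl_mem *, cl_bool,
    cl_bus_address_amd *, cl_uint, const cl_event *, cl_event *);

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_int err;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

    /* Buffer in GPU memory that can be made visible on the PCIe bus. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_BUS_ADDRESSABLE_AMD,
                                4096, NULL, &err);

    /* Extension entry points are resolved at run time. */
    MakeBuffersResidentAMD_fn make_resident = (MakeBuffersResidentAMD_fn)
        clGetExtensionFunctionAddressForPlatform(
            platform, "clEnqueueMakeBuffersResidentAMD");

    cl_bus_address_amd bus_addr;
    err = make_resident(queue, 1, &buf, CL_TRUE, &bus_addr, 0, NULL, NULL);

    /* This physical bus address is what the FPGA DMA engine writes to. */
    printf("GPU buffer bus address: 0x%llx\n",
           (unsigned long long) bus_addr.surface_bus_address);

    clReleaseMemObject(buf);
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
    return 0;
}
\end{verbatim}
In the actual system, this bus address takes the place of a physical
host-memory address in the DMA descriptors, so that the engine writes directly
into GPU memory.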
+
+\section{Architecture}
 
 As shown in \figref{fig:trad-vs-dgpu} (a), traditional FPGA-GPU systems route
 data through system main memory by copying data from the FPGA into intermediate
@@ -147,6 +129,18 @@ the overall latency of the system is reduced and total throughput increased.
 Moreover, the CPU and main system memory are relieved from processing because
 they are not directly involved in the data transfer anymore.
 
+\begin{figure}[t]
+  \centering
+  \includegraphics[width=1.0\textwidth]{figures/transf}
+  \caption{%
+    In a traditional DMA architecture (a), data are first written to the main
+    system memory and then sent to the GPUs for final processing.  By using
+    GPUDirect/DirectGMA technology (b), the DMA engine has direct access to
+    the GPU's internal memory.
+  }
+  \label{fig:trad-vs-dgpu}
+\end{figure}
+
 \subsection{DMA engine implementation on the FPGA}
 
 We have developed a DMA architecture that minimizes resource utilization while
@@ -155,9 +149,7 @@ policy~\cite{rota2015dma}. The engine is compatible with the Xilinx PCIe Gen2/3
 IP-Core~\cite{xilinxgen3} for Xilinx FPGA families 6 and 7. DMA transmissions to
 main system memory and GPU memory are both supported. Two FIFOs, with a data
 width of 256 bits and operating at 250 MHz, act as user-friendly interfaces with
-the custom logic. The resulting input bandwidth of 7.8 GB/s is enough to
-saturate a PCIe Gen3 x8 link\footnote{The theoretical net bandwidth of a PCIe
-3.0 x8 link with a payload of 1024 B is 7.6 GB/s}. The user logic and the DMA
+the custom logic and provide an input bandwidth of 7.45 GB/s. The user logic and the DMA
 engine are configured by the host through PIO registers.
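
For reference, the quoted input bandwidth of 7.45 GB/s follows directly from
the FIFO parameters, assuming the value is expressed in binary units (GiB/s):
\[
  256\,\mathrm{bit} \times 250\,\mathrm{MHz}
  = 64\,\mathrm{Gbit/s}
  = 8 \times 10^{9}\,\mathrm{B/s}
  \approx 7.45\,\mathrm{GiB/s}.
\]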
 
 The physical addresses of the host's memory buffers are stored in an internal
@@ -245,6 +237,8 @@ develop custom applications written in C or high-level languages such as Python.
 We carried out performance measurements on a machine with an Intel Xeon E5-1630
 at 3.7 GHz, Intel C612 chipset running openSUSE 13.1 with Linux 3.11.10. The
 Xilinx VC709 evaluation board was plugged into one of the PCIe 3.0 x8 slots.
+For FPGA-to-CPU data transfers, we used the software implementation
+described in~\cite{rota2015dma}.
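
As a rough illustration only, the sketch below shows how the host-side latency
of a single FPGA-to-CPU transfer could be timed. This is not the benchmark
code of~\cite{rota2015dma}: the device node \texttt{/dev/fpga\_dma}, the
blocking \texttt{read()} semantics, and the fixed 4~KB block size are
placeholders for the actual driver interface.
\begin{verbatim}
/* Hypothetical sketch: time one 4 KB FPGA-to-CPU transfer through a
 * character device exposed by the DMA driver.  The node name
 * /dev/fpga_dma and the blocking read() semantics are placeholders,
 * not the actual interface of our driver. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    struct timespec start, end;

    int fd = open("/dev/fpga_dma", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return EXIT_FAILURE;
    }

    clock_gettime(CLOCK_MONOTONIC, &start);
    ssize_t n = read(fd, buf, sizeof buf);   /* one DMA block */
    clock_gettime(CLOCK_MONOTONIC, &end);

    double us = (end.tv_sec - start.tv_sec) * 1e6 +
                (end.tv_nsec - start.tv_nsec) / 1e3;
    printf("read %zd bytes in %.2f us\n", n, us);

    close(fd);
    return EXIT_SUCCESS;
}
\end{verbatim}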
 
 \begin{figure}
   \centering
@@ -261,7 +255,7 @@ Xilinx VC709 evaluation board was plugged into one of the PCIe 3.0 x8 slots.
     \caption{%
       Latency distribution.
       % for a single 4 KB packet transferred
-      % from FPGA to CPU and FPGA to GPU.
+      % from FPGA-to-CPU and FPGA-to-GPU.
     }
     \label{fig:latency}
   \end{subfigure}
@@ -333,27 +327,22 @@ run-time behaviour of the operating system scheduler.
 
 \section{Conclusion and outlook}
 
-We developed a complete hardware and software solution that enables DMA
+We developed a hardware and software solution that enables DMA
 transfers between FPGA-based readout boards and GPU computing clusters.
-
-The net throughput is primarily limited
-by the PCIe bus, reaching 6.4 GB/s for a 256 B payload and surpassing our
-CPU-based data transfer.
-
-By writing directly into GPU memory instead of routing data through system main memory, the overall latency can be reduced
-by a factor of two allowing close massively parallel computation on GPUs.
-
-Moreover, the software solution that we proposed allows seamless multi-GPU
+The software solution we propose allows seamless multi-GPU
 processing of the incoming data, thanks to its integration into our streamed
 computing framework. This enables straightforward integration with different
 DAQ systems and the introduction of custom data processing algorithms.
 
+The net throughput is primarily limited by the PCIe bus, reaching 6.4 GB/s
+for an FPGA-to-GPU data transfer and 6.6 GB/s for an FPGA-to-CPU data transfer.
+By writing directly into GPU memory instead of routing data through system
+main memory, the overall latency can be reduced, thus enabling tightly
+coupled, massively parallel computation on GPUs.
 Optimization of the GPU DMA interfacing code is ongoing with the help of
-technical support by AMD. With a better understanding of the hardware and software aspects of
-DirectGMA, we expect a significant improvement in latency performance.  Support
-for NVIDIA's GPUDirect technology is foreseen in the next months to lift the
-limitation of one specific GPU vendor and compare the performance of hardware by
-different vendors.
+technical support from AMD. With a better understanding of the hardware and
+software aspects of DirectGMA, we expect a significant improvement in latency
+performance.
 
 In order to increase the total throughput, a custom FPGA evaluation board is
 currently under development. The board mounts a Virtex-7 chip and features two
@@ -363,10 +352,14 @@ a single x16 device by using an external PCIe switch. With two cores operating
 in parallel, we foresee an increase in the data throughput by a factor of 2 (as
 demonstrated in~\cite{rota2015dma}).
 
-%% Where do we get this values? Any reference?
+Support for NVIDIA's GPUDirect technology is also foreseen in the coming months,
+both to lift the restriction to a single GPU vendor and to compare the
+performance of hardware from different vendors.
 Further improvements are expected by generalizing the transfer mechanism and
-include Infiniband support besides the existing PCIe connection. This allows
-speeds of up to 290 Gb/s and latencies as low as 0.5 \textmu s.
+including InfiniBand support in addition to the existing PCIe connection.
+%% Where do we get this values? Any reference?
+%This allows
+%speeds of up to 290 Gb/s and latencies as low as 0.5 \textmu s.
 
 Our goal is to develop a unique hybrid solution, based on commercial standards,
 that includes fast data transmission protocols and a high performance GPU