@@ -23,29 +23,30 @@
L.E.~Ardila Perez$^a$,
M.~Balzer$^a$,
M.~Weber$^a$\\
- \llap{$^a$}All authors Institute for Data Processing and Electronics,\\
+ \llap{$^a$}Institute for Data Processing and Electronics,\\
Karlsruhe Institute of Technology (KIT),\\
Hermann-von-Helmholtz-Platz 1, Karlsruhe, Germany \\
- E-mail: \email{lorenzo.rota@kit.edu}
+ E-mail: \email{lorenzo.rota@kit.edu}, \email{matthias.vogelgesang@kit.edu}
}
\abstract{%
Modern physics experiments have reached multi-GB/s data rates. Fast data
links and high performance computing stages are required for continuous
- acquisition and processing. Because of their intrinsic parallelism and
+ data acquisition and processing. Because of their intrinsic parallelism and
computational power, GPUs emerged as an ideal solution for high
- performance computing applications. To connect a fast data acquisition stage
- with a GPU's processing power, we developed an architecture consisting of a
- FPGA that includes a Direct Memory Access (DMA) engine compatible with the
+ performance computing applications. In this paper we present a high-throughput
+ platform based on direct FPGA-GPU communication and report preliminary
+ results.
+ The architecture consists of a Direct Memory Access (DMA) engine compatible with the
Xilinx PCI-Express core, a Linux driver for register access, and high-level
software to manage direct memory transfers using AMD's DirectGMA technology.
- Measurements with a Gen3 x8 link shows a throughput of up to 6.4 GB/s. Our
- implementation is suitable for real-time DAQ system applications ranging
+ Measurements with a Gen3 x8 link show a throughput of up to 6.4 GB/s and a latency of
+ XXX \textmu s.
+ Our implementation is suitable for real-time DAQ system applications ranging
from photon science and medical imaging to High Energy Physics (HEP)
trigger systems.
}
-
-\keywords{AMD directGMA; FPGA; Readout architecture}
+\keywords{FPGA; GPU; PCI-Express; OpenCL; GPUDirect; DirectGMA}

\begin{document}
@@ -61,66 +62,69 @@ GPU computing has become the main driving force for high performance computing
due to an unprecedented parallelism and a low cost-benefit factor. GPU
acceleration has found its way into numerous applications, ranging from
simulation to image processing. Recent years have also seen an increasing
-interest in GPU-based systems for HEP experiments (\emph{e.g.}
+interest in GPU-based systems for High Energy Physics (HEP) experiments (\emph{e.g.}
ATLAS~\cite{atlas_gpu}, ALICE~\cite{alice_gpu}, Mu3e~\cite{mu3e_gpu},
PANDA~\cite{panda_gpu}). In a typical HEP scenario,
-data is acquired by one or more read-out boards and then
-transmitted in short bursts or in a continuous streaming mode to a computation stage.
-With expected data rates of several GB/s, the data transmission link between the
-read-out boards and the host system may partially limit the overall system
-performance. In particular, latency becomes the most stringent specification if
-a time-deterministic feedback is required, \emph{e.g.} in Low/High-level trigger
-systems. Moreover, the volumes of data produced in recent photon
+data are acquired by back-end readout systems and then
+transmitted in short bursts or in a continuous streaming mode to a computing stage.
+
+With expected data rates of several GB/s, the data transmission link may partially
+limit the overall system performance.
+In particular, latency becomes the most stringent requirement for time-deterministic applications,
+\emph{e.g.} in Low/High-level trigger systems.
+Furthermore, the volumes of data produced in recent photon
science facilities have become comparable to those traditionally associated with
HEP.

In order to achieve the best performance in terms of latency and bandwidth,
-data transfers are handled by a dedicated DMA controller, at the cost of higher
+data transfers must be handled by a dedicated DMA controller, at the cost of higher
system complexity.

To address these problems we propose a complete hardware/software stack
-architecture based on our own DMA engine, and integration
-of AMD's DirectGMA technology into our processing pipeline.
+architecture based on a high-performance DMA engine implemented on Xilinx FPGAs,
+and integration of AMD's DirectGMA technology into our processing pipeline.
+
+In our solution, PCI-Express (PCIe) has been chosen as a direct data link between
+FPGA boards and the host computer. Due to its high bandwidth and modularity,
+PCIe quickly became the commercial standard for connecting high-throughput
+peripherals such as GPUs or solid state disks.
+
+Moreover, optical PCIe networks were demonstrated a decade ago~\cite{optical_pcie},
+opening the possibility of using PCIe as a communication bus over long distances.

\section{Background}
-Several solutions for direct FPGA/GPU communication are reported in literature.
-All these are based on NVIDIA's GPUDirect technology.
+Several solutions for direct FPGA/GPU communication are reported in the literature,
+and all of them are based on NVIDIA's GPUDirect technology.
+
+In the implementation of Bittner and Ruf~\cite{bittner}, the GPU acts as master
+during an FPGA-to-GPU data transfer, reading data from the FPGA.
+This solution limits the reported bandwidth and
+latency to, respectively, 514 MB/s and 40~\textmu s.

-The first implementation was realized by Bittner and Ruf with the Speedy
-PCIe Core~\cite{bittner}. In their design, during an FPGA-to-GPU data transfers,
-the GPU acts as master and reading data
-from the FPGA. This solution limits the reported bandwidth and
-latency to, respectively, 514 MB/s and 40~$\mu$s.
+When the FPGA is used as a master, a higher throughput can be achieved.
+An example of this approach is the FPGA\textsuperscript{2} framework
+by Thoma et~al.~\cite{thoma}, which reaches 2454 MB/s using an 8x Gen2.0 data link.

-Lonardo et~al.\ achieved lower latencies with their NaNet design, an FPGA-based
+Lonardo et~al.\ achieved low latencies with their NaNet design, an FPGA-based
PCIe network interface card~\cite{lonardo2015nanet}.
-The Gbe link limits the latency performance of the system to a few tens of $\mu$s.
+The GbE link, however, limits the latency performance of the system to a few tens of \textmu s.
If only the FPGA-to-GPU latency is considered, the measured values span between
-1~$\mu$s and 6~$\mu$s, depending on the datagram size. Due to its design,
- the bandwidth saturates at 120 MB/s.
+1~\textmu s and 6~\textmu s, depending on the datagram size. Moreover,
+the bandwidth saturates at 120 MB/s.

Nieto et~al.\ presented a system based on a PXIexpress data link that makes use
of four PCIe 1.0 links~\cite{nieto2015high}.
Their system (as limited by the interconnect) achieves an average throughput of
870 MB/s with 1 KB block transfers.

-A higher throughput has been achieved with the FPGA\textsuperscript{2} framework
-by Thoma et~al.\cite{thoma}: 2454 MB/s using a 8x Gen2.0 data link.
-
-\section{Basic Concepts}
-
-In our solution, PCI-express (PCIe) has been chosen as a direct data link between FPGA boards and the
-host computer. Due to its high bandwidth and modularity, PCIe quickly became the com-
-mercial standard for connecting high-throughput peripherals such as GPUs or solid state disks.
-Moreover, optical PCIe networks have been demonstrated a decade ago [5], opening the possibility
-of using PCIe as a communication bus over long distances.
+\section{Basic Concept}
\begin{figure}[t]
\centering
\includegraphics[width=1.0\textwidth]{figures/transf}
\caption{%
- In a traditional DMA architecture (a), data is first written to the main
+ In a traditional DMA architecture (a), data are first written to the main
system memory and then sent to the GPUs for final processing. By using
GPUDirect/DirectGMA technology (b), the DMA engine has direct access to
GPU's internal memory.
@@ -130,14 +134,14 @@ of using PCIe as a communication bus over long distances.

As shown in \figref{fig:trad-vs-dgpu} (a), traditional FPGA-GPU systems route
data through system main memory by copying data from the FPGA into intermediate
-buffers and then finally into the GPU's main memory. Thus, the total throughput
-of the system is limited by the main memory bandwidth. NVIDIA's GPUDirect and
-AMD's DirectGMA technologies allow direct communication between GPUs and
-auxiliary devices over the PCIe bus. By combining this technology with DMA data
-transfers (see \figref{fig:trad-vs-dgpu} (b)), the overall latency of the system
-is reduced and total throughput increased. Moreover, the CPU and main system
-memory are relieved from processing because they are not directly involved in
-the data transfer anymore.
+buffers and then finally into the GPU's main memory.
+Thus, the total throughput and latency of the system are limited by the main
+memory bandwidth. NVIDIA's GPUDirect and AMD's DirectGMA technologies allow
+direct communication between GPUs and auxiliary devices over the PCIe bus.
+By combining this technology with DMA data transfers (see \figref{fig:trad-vs-dgpu} (b)),
+the overall latency of the system is reduced and total throughput increased.
+Moreover, the CPU and main system memory are relieved from processing because
+they are not directly involved in the data transfer anymore.
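+
+As an illustration, a minimal host-side sketch of this mechanism is given below.
+It allocates a GPU buffer that an external PCIe device can write to directly and
+queries its bus address, using AMD's \texttt{cl\_amd\_bus\_addressable\_memory}
+extension; names follow \texttt{CL/cl\_ext.h}, error handling is omitted, and the
+extension entry point may have to be loaded with
+\texttt{clGetExtensionFunctionAddressForPlatform}.
+
+\begin{verbatim}
+#include <CL/cl.h>
+#include <CL/cl_ext.h>
+
+/* Sketch: expose a GPU buffer on the PCIe bus so that the FPGA's
+ * DMA engine can write into it directly (DirectGMA). */
+cl_mem alloc_fpga_writable_buffer(cl_context ctx, cl_command_queue queue,
+                                  size_t size, cl_bus_address_amd *bus_addr)
+{
+    cl_int err;
+
+    /* GPU memory whose physical bus address can be queried. */
+    cl_mem buf = clCreateBuffer(ctx, CL_MEM_BUS_ADDRESSABLE_AMD,
+                                size, NULL, &err);
+
+    /* Pin the buffer and obtain its bus address. */
+    err = clEnqueueMakeBuffersResidentAMD(queue, 1, &buf, CL_TRUE,
+                                          bus_addr, 0, NULL, NULL);
+
+    /* bus_addr->surface_bus_address is then programmed into the
+     * FPGA DMA engine as the transfer destination. */
+    return buf;
+}
+\end{verbatim}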
\subsection{DMA engine implementation on the FPGA}
@@ -155,8 +159,23 @@ engine are configured by the host through PIO registers.
The physical addresses of the host's memory buffers are stored into an internal
memory and are dynamically updated by the driver or user, allowing highly
efficient zero-copy data transfers. The maximum size associated with each
-address is 2 GB.
-
+address is 2 GB. The engine fully supports 64-bit addresses. The resource utilization
+on a Virtex-7 device is reported in table~\ref{table:utilization}.
+
+\begin{table}[htb]
+\centering
+\caption{Resource utilization of the DMA engine on a Xilinx Virtex-7 device.}
+\label{table:utilization}
+\begin{tabular}{@{}llll@{}}
+\hline
+Resource & Used & Available & Utilization (\%) \\\hline
+LUT & 5331 & 433200 & 1.23 \\
+LUTRAM & 56 & 174200 & 0.03 \\
+FF & 5437 & 866400 & 0.63 \\
+BRAM & 20.50 & 1470 & 1.39 \\\hline
+\end{tabular}
+\end{table}
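+
+%% Illustration only -- not the actual driver interface.
+As an illustration, the following sketch shows how user space could update the
+descriptor memory through the PIO registers of a memory-mapped BAR; the device
+node, register offsets and descriptor layout are placeholders.
+
+\begin{verbatim}
+#include <fcntl.h>
+#include <stddef.h>
+#include <stdint.h>
+#include <sys/mman.h>
+
+#define DESC_BASE   0x1000  /* placeholder: descriptor memory offset */
+#define DESC_STRIDE 0x10    /* placeholder: bytes per descriptor     */
+
+/* Write the physical address and length of one target buffer into the
+ * FPGA's internal descriptor memory (zero-copy, no data is moved). */
+static void write_descriptor(volatile uint32_t *bar, unsigned idx,
+                             uint64_t phys_addr, uint32_t length)
+{
+    volatile uint32_t *d = bar + (DESC_BASE + idx * DESC_STRIDE) / 4;
+    d[0] = (uint32_t)(phys_addr & 0xffffffffu);  /* address, low  32 bit */
+    d[1] = (uint32_t)(phys_addr >> 32);          /* address, high 32 bit */
+    d[2] = length;                               /* transfer size        */
+}
+
+int main(void)
+{
+    int fd = open("/dev/fpga0", O_RDWR | O_SYNC);    /* placeholder node */
+    volatile uint32_t *bar = mmap(NULL, 0x10000, PROT_READ | PROT_WRITE,
+                                  MAP_SHARED, fd, 0);
+    /* Point descriptor 0 at a 4 MB buffer whose physical address was
+     * obtained from the driver or from the GPU residency call. */
+    write_descriptor(bar, 0, 0x800000000ULL, 4 * 1024 * 1024);
+    return 0;
+}
+\end{verbatim}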
\subsection{OpenCL management on host side}
\label{sec:host}
@@ -182,11 +201,11 @@ GPU's address space passing a special AMD-specific flag and passing the physical
BAR address of the FPGA configuration memory to the \texttt{cl\-Create\-Buffer}
function. From the GPU, this memory is seen transparently as regular GPU memory
and can be written accordingly (3). In our setup, trigger registers are used to
-notify the FPGA on successful or failed evaluation of the data. These individual
-write accesses are issued as PIO commands which can at most transfer 32-bit
-words. Using the \texttt{cl\-Enqueue\-Copy\-Buffer} function call it is possible
-to circumvent this limitation and write entire memory regions in DMA fashion to
-the FPGA. In this case, the GPU acts as bus master and pushes data to the FPGA.
+notify the FPGA on successful or failed evaluation of the data.
+
+Using the \texttt{cl\-Enqueue\-Copy\-Buffer} function call, it is possible
+to write entire memory regions in DMA fashion to the FPGA.
+In this case, the GPU acts as bus master and pushes data to the FPGA.
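+
+A minimal sketch of this write-back path is shown below; the flag and the
+\texttt{cl\_bus\_address\_amd} struct come from the
+\texttt{cl\_amd\_bus\_addressable\_memory} extension, the BAR address is a
+placeholder, and error handling as well as any additional residency or
+migration calls required by the runtime are omitted.
+
+\begin{verbatim}
+#include <CL/cl.h>
+#include <CL/cl_ext.h>
+
+/* Sketch: map the FPGA BAR as an OpenCL buffer and push results back
+ * to the FPGA with a DMA copy issued by the GPU. */
+void push_results_to_fpga(cl_context ctx, cl_command_queue queue,
+                          cl_mem results, size_t size,
+                          cl_ulong fpga_bar_address /* placeholder */)
+{
+    cl_int err;
+    cl_bus_address_amd addr = { 0 };
+    addr.surface_bus_address = fpga_bar_address;
+
+    /* The FPGA memory window appears as a regular cl_mem object. */
+    cl_mem fpga_window = clCreateBuffer(ctx, CL_MEM_EXTERNAL_PHYSICAL_AMD,
+                                        size, &addr, &err);
+
+    /* The GPU acts as bus master: the copy is a DMA write to the FPGA. */
+    err = clEnqueueCopyBuffer(queue, results, fpga_window, 0, 0, size,
+                              0, NULL, NULL);
+    clFinish(queue);
+    clReleaseMemObject(fpga_window);
+}
+\end{verbatim}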
\begin{figure}
\centering
@@ -203,9 +222,11 @@ framework allows for an easy construction of streamed data processing on
heterogeneous multi-GPU systems. For example, to read data from the FPGA, decode
from its specific data format and run a Fourier transform on the GPU as well as
writing back the results to disk, one can run the following on the command line:
+
\begin{verbatim}
ufo-launch direct-gma ! decode ! fft ! write filename=out.raw
\end{verbatim}
+
The framework takes care of scheduling the tasks and distributing the data items
to one or more GPUs. High throughput is achieved by the combination of fine-
and coarse-grained data parallelism, \emph{i.e.} processing a single data item
@@ -309,12 +330,15 @@ run-time behaviour of the operating system scheduler.
\section{Conclusion and outlook}

We developed a complete hardware and software solution that enables DMA
-transfers between FPGA-based readout boards and GPU computing clusters with
-reasonable performance characteristics. The net throughput is primarily limited
+transfers between FPGA-based readout boards and GPU computing clusters.
+
+The net throughput is primarily limited
by the PCIe bus, reaching 6.4 GB/s for a 256 B payload and surpassing our
-CPU-based data transfer. Furthermore, by writing directly into GPU memory instead
-of routing data through system main memory, the overall latency can be reduced
+CPU-based data transfer.
+
+By writing directly into GPU memory instead of routing data through system main memory, the overall latency can be reduced
by a factor of two allowing close massively parallel computation on GPUs.
+
Moreover, the software solution that we proposed allows seamless multi-GPU
processing of the incoming data, due to the integration in our streamed computing
framework. This allows straightforward integration with different DAQ systems
@@ -333,7 +357,10 @@ fully populated FMC connectors, a 119 Gb/s DDR memory interface and a PCIe Gen3
x16 connection. Two PCIe x8 cores, instantiated on the board, will be mapped as
a single x16 device by using an external PCIe switch. With two cores operating
in parallel, we foresee an increase in the data throughput by a factor of 2 (as
-demonstrated in~\cite{rota2015dma}). Further improvements are expected by
+demonstrated in~\cite{rota2015dma}).
+
+%% Where do we get these values? Any reference?
+Further improvements are expected by
generalizing the transfer mechanism and including InfiniBand support besides the
existing PCIe connection. This allows speeds of up to 290 Gbit/s and latencies as
low as 0.5 \textmu s.