
Added utilization and fixed some stuff

Lorenzo 8 years ago
parent
commit
710926ca71
1 changed file with 90 additions and 63 deletions

+ 90 - 63
paper.tex

@@ -23,29 +23,30 @@
   L.E.~Ardila Perez$^a$,
   M.~Balzer$^a$,
   M.~Weber$^a$\\
-  \llap{$^a$}All authors Institute for Data Processing and Electronics,\\
+  \llap{$^a$}Institute for Data Processing and Electronics,\\
     Karlsruhe Institute of Technology (KIT),\\
     Herrmann-von-Helmholtz-Platz 1, Karlsruhe, Germany \\   
-  E-mail: \email{lorenzo.rota@kit.edu}
+  E-mail: \email{lorenzo.rota@kit.edu}, \email{matthias.vogelgesang@kit.edu}
 }
 
 \abstract{%
   Modern physics experiments have reached multi-GB/s data rates.  Fast data
   links and high performance computing stages are required for continuous
-  acquisition and processing. Because of their intrinsic parallelism and
+  data acquisition and processing. Because of their intrinsic parallelism and
   computational power, GPUs emerged as an ideal solution for high
-  performance computing applications. To connect a fast data acquisition stage
-  with a GPU's processing power, we developed an architecture consisting of a
-  FPGA that includes a Direct Memory Access (DMA) engine compatible with the
+  performance computing applications. In this paper, we present a high-throughput
+  platform based on direct FPGA-GPU communication and report preliminary 
+  results.
+  The architecture consists of a Direct Memory Access (DMA) engine compatible with the
   Xilinx PCI-Express core, a Linux driver for register access, and high-level
   software to manage direct memory transfers using AMD's DirectGMA technology.
-  Measurements with a Gen3 x8 link shows a throughput of up to 6.4 GB/s. Our
-  implementation is suitable for real-time DAQ system applications ranging
+  Measurements with a Gen3 x8 link show a throughput of up to 6.4 GB/s and a latency of
+  XXX \textmu s. 
+  Our implementation is suitable for real-time DAQ system applications ranging
   from photon science and medical imaging to High Energy Physics (HEP) 
   trigger systems.
 }
-
-\keywords{AMD directGMA; FPGA; Readout architecture}
+\keywords{FPGA; GPU; PCI-Express; OpenCL; GPUDirect; DirectGMA}
 
 \begin{document}
 
@@ -61,66 +62,69 @@ GPU computing has become the main driving force for high performance computing
 due to an unprecedented parallelism and a low cost-benefit factor. GPU
 acceleration has found its way into numerous applications, ranging from
 simulation to image processing. Recent years have also seen an increasing
-interest in GPU-based systems for HEP experiments (\emph{e.g.}
+interest in GPU-based systems for High Energy Physics (HEP) experiments (\emph{e.g.}
 ATLAS~\cite{atlas_gpu}, ALICE~\cite{alice_gpu}, Mu3e~\cite{mu3e_gpu},
 PANDA~\cite{panda_gpu}). In a typical HEP scenario,
-data is acquired by one or more read-out boards and then
-transmitted in short bursts or in a continuous streaming mode to a computation stage.
-With expected data rates of several GB/s, the data transmission link between the
-read-out boards and the host system may partially limit the overall system
-performance. In particular, latency becomes the most stringent specification if
-a time-deterministic feedback is required, \emph{e.g.} in Low/High-level trigger
-systems. Moreover, the volumes of data produced in recent photon
+data are acquired by back-end readout systems and then
+transmitted in short bursts or in a continuous streaming mode to a computing stage.
+
+With expected data rates of several GB/s, the data transmission link may partially 
+limit the overall system performance. 
+In particular, latency becomes the most stringent requirement for time-deterministic applications,
+\emph{e.g.} in Low/High-level trigger systems. 
+Furthermore, the volumes of data produced in recent photon
 science facilities have become comparable to those traditionally associated with
 HEP.
 
 In order to achieve the best performance in terms of latency and bandwidth, 
-data transfers are handled by a dedicated DMA controller, at the cost of higher
+data transfers must be handled by a dedicated DMA controller, at the cost of higher
 system complexity.
 
 To address these problems we propose a complete hardware/software stack
-architecture based on our own DMA engine, and integration
-of AMD's DirectGMA technology into our processing pipeline.
+architecture based on a high-performance DMA engine implemented on Xilinx FPGAs, 
+and the integration of AMD's DirectGMA technology into our processing pipeline.
+
+In our solution, PCI-Express (PCIe) has been chosen as a direct data link between 
+FPGA boards and the host computer. Due to its high bandwidth and modularity, 
+PCIe quickly became the commercial standard for connecting high-throughput 
+peripherals such as GPUs or solid state disks.
+
+Moreover, optical PCIe networks were demonstrated a decade ago~\cite{optical_pcie}, 
+opening the possibility of using PCIe as a communication bus over long distances.
 
 \section{Background}
 
-Several solutions for direct FPGA/GPU communication are reported in literature.
-All these are based on NVIDIA's GPUDirect technology. 
+Several solutions for direct FPGA/GPU communication are reported in the literature, 
+and all of them are based on NVIDIA's GPUDirect technology.
+
+In the implementation of Bittner and Ruf~\cite{bittner} the GPU acts as the master 
+during an FPGA-to-GPU data transfer, reading data from the FPGA. 
+This solution limits the reported bandwidth and 
+latency to 514 MB/s and 40~\textmu s, respectively.
 
-The first implementation was realized by Bittner and Ruf with the Speedy 
-PCIe Core~\cite{bittner}. In their design, during an FPGA-to-GPU data transfers,
-the GPU acts as master and reading data
-from the FPGA. This solution limits the reported bandwidth and 
-latency to, respectively, 514 MB/s and 40~$\mu$s.
+When the FPGA is used as the master, a higher throughput can be achieved.
+An example of this approach is the FPGA\textsuperscript{2} framework
+by Thoma et~al.~\cite{thoma}, which reaches 2454 MB/s using an 8x Gen2.0 data link.
 
-Lonardo et~al.\ achieved lower latencies with their NaNet design, an FPGA-based
+Lonardo et~al.\ achieved low latencies with their NaNet design, an FPGA-based
 PCIe network interface card~\cite{lonardo2015nanet}. 
-The Gbe link limits the latency performance of the system to a few tens of $\mu$s.
+The GbE link, however, limits the latency performance of the system to a few tens of \textmu s.
 If only the FPGA-to-GPU latency is considered, the measured values span between 
-1~$\mu$s and 6~$\mu$s, depending on the datagram size. Due to its design,
- the bandwidth saturates at 120 MB/s. 
+1~\textmu s and 6~\textmu s, depending on the datagram size. Moreover, 
+the bandwidth saturates at 120 MB/s. 
 
 Nieto et~al.\ presented a system based on a PXIexpress data link that makes use
 of four PCIe 1.0 links~\cite{nieto2015high}.
 Their system (as limited by the interconnect) achieves an average throughput of
 870 MB/s with 1 KB block transfers.
 
-A higher throughput has been achieved with the FPGA\textsuperscript{2} framework
-by Thoma et~al.\cite{thoma}: 2454 MB/s using a 8x Gen2.0 data link. 
-
-\section{Basic Concepts}
-
-In our solution, PCI-express (PCIe) has been chosen as a direct data link between FPGA boards and the
-host computer. Due to its high bandwidth and modularity, PCIe quickly became the com-
-mercial standard for connecting high-throughput peripherals such as GPUs or solid state disks.
-Moreover, optical PCIe networks have been demonstrated a decade ago [5], opening the possibility
-of using PCIe as a communication bus over long distances.
+\section{Basic Concept}
 
 \begin{figure}[t]
   \centering
   \includegraphics[width=1.0\textwidth]{figures/transf}
   \caption{%
-    In a traditional DMA architecture (a), data is first written to the main
+    In a traditional DMA architecture (a), data are first written to the main
     system memory and then sent to the GPUs for final processing.  By using
     GPUDirect/DirectGMA technology (b), the DMA engine has direct access to
     GPU's internal memory.
@@ -130,14 +134,14 @@ of using PCIe as a communication bus over long distances.
 
 As shown in \figref{fig:trad-vs-dgpu} (a), traditional FPGA-GPU systems route
 data through system main memory by copying data from the FPGA into intermediate
-buffers and then finally into the GPU's main memory. Thus, the total throughput
-of the system is limited by the main memory bandwidth. NVIDIA's GPUDirect and
-AMD's DirectGMA technologies allow direct communication between GPUs and
-auxiliary devices over the PCIe bus. By combining this technology with DMA data
-transfers (see \figref{fig:trad-vs-dgpu} (b)), the overall latency of the system
-is reduced and total throughput increased. Moreover, the CPU and main system
-memory are relieved from processing because they are not directly involved in
-the data transfer anymore.
+buffers and then finally into the GPU's main memory. 
+Thus, the total throughput and latency of the system are limited by the main 
+memory bandwidth. NVIDIA's GPUDirect and AMD's DirectGMA technologies allow 
+direct communication between GPUs and auxiliary devices over the PCIe bus. 
+By combining this technology with DMA data transfers (see \figref{fig:trad-vs-dgpu} (b)),
+the overall latency of the system is reduced and total throughput increased. 
+Moreover, the CPU and main system memory are relieved of processing load because 
+they are no longer directly involved in the data transfer.
 
 \subsection{DMA engine implementation on the FPGA}
 
@@ -155,8 +159,23 @@ engine are configured by the host through PIO registers.
 The physical addresses of the host's memory buffers are stored into an internal
 memory and are dynamically updated by the driver or user, allowing highly
 efficient zero-copy data transfers. The maximum size associated with each
-address is 2 GB.
-
+address is 2 GB. The engine fully supports 64-bit addresses. The resource utilization
+on a Xilinx Virtex-7 device is reported in Table~\ref{table:utilization}.
+
+\begin{table}[t]
+\centering
+\caption{Resource utilization of the DMA engine on a Xilinx Virtex-7 device.}
+\label{table:utilization}
+\begin{tabular}{@{}lrrr@{}}
+\hline
+Resource & Used  & Available & Utilization (\%) \\\hline
+LUT      & 5331  & 433200    & 1.23             \\
+LUTRAM   & 56    & 174200    & 0.03             \\
+FF       & 5437  & 866400    & 0.63             \\
+BRAM     & 20.50 & 1470      & 1.39             \\\hline
+\end{tabular}
+\end{table}
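+
+As an illustration of the descriptor update mechanism, the following sketch shows how a
+Linux driver could write a host buffer's bus address and length into the engine's
+internal descriptor memory through PIO accesses. The register offsets, descriptor layout
+and helper name are hypothetical and only convey the zero-copy handover; they do not
+reflect the actual register map.
+
+\begin{verbatim}
+#include <linux/io.h>
+#include <linux/kernel.h>
+#include <linux/dma-mapping.h>
+
+/* Hypothetical register map: offsets are placeholders. */
+#define DESC_BASE   0x1000  /* start of descriptor memory */
+#define DESC_STRIDE 0x10    /* bytes per descriptor slot  */
+
+static void write_descriptor(void __iomem *bar, unsigned int slot,
+                             dma_addr_t bus_addr, u32 length)
+{
+        void __iomem *desc = bar + DESC_BASE + slot * DESC_STRIDE;
+
+        /* 64-bit bus address split into two 32-bit PIO writes */
+        iowrite32(lower_32_bits(bus_addr), desc + 0x0);
+        iowrite32(upper_32_bits(bus_addr), desc + 0x4);
+        /* transfer length in bytes (up to 2 GB per descriptor) */
+        iowrite32(length, desc + 0x8);
+}
+\end{verbatim}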
 
 \subsection{OpenCL management on host side}
 \label{sec:host}
@@ -182,11 +201,11 @@ GPU's address space passing a special AMD-specific flag and passing the physical
 BAR address of the FPGA configuration memory to the \texttt{cl\-Create\-Buffer}
 function. From the GPU, this memory is seen transparently as regular GPU memory
 and can be written accordingly (3). In our setup, trigger registers are used to
-notify the FPGA on successful or failed evaluation of the data. These individual
-write accesses are issued as PIO commands which can at most transfer 32-bit
-words. Using the \texttt{cl\-Enqueue\-Copy\-Buffer} function call it is possible
-to circumvent this limitation and write entire memory regions in DMA fashion to
-the FPGA. In this case, the GPU acts as bus master and pushes data to the FPGA.
+notify the FPGA on successful or failed evaluation of the data. 
+
+Using the \texttt{cl\-Enqueue\-Copy\-Buffer} function call, it is possible
+to write entire memory regions in DMA fashion to the FPGA. 
+In this case, the GPU acts as bus master and pushes data to the FPGA.
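+
+A minimal host-side sketch of this mechanism is given below, assuming AMD's
+\texttt{cl\_amd\_bus\_addressable\_memory} OpenCL extension; the BAR address and buffer
+size are placeholders and error handling is reduced to the essentials.
+
+\begin{verbatim}
+#include <CL/cl.h>
+#include <CL/cl_ext.h>  /* cl_amd_bus_addressable_memory definitions */
+
+/* Maps the FPGA BAR into the GPU's address space and pushes a GPU
+ * buffer to it, with the GPU acting as bus master. */
+static cl_int push_to_fpga(cl_context ctx, cl_command_queue queue,
+                           cl_mem gpu_buf, cl_ulong fpga_bar, size_t size)
+{
+    cl_int err;
+    cl_bus_address_amd bar = { .surface_bus_address = fpga_bar,
+                               .marker_bus_address  = fpga_bar };
+
+    /* The FPGA memory region appears as a regular cl_mem object. */
+    cl_mem fpga_mem = clCreateBuffer(ctx, CL_MEM_EXTERNAL_PHYSICAL_AMD,
+                                     size, &bar, &err);
+    if (err != CL_SUCCESS)
+        return err;
+
+    /* DMA-style write of an entire region instead of single PIO words. */
+    err = clEnqueueCopyBuffer(queue, gpu_buf, fpga_mem,
+                              0, 0, size, 0, NULL, NULL);
+    clFinish(queue);
+    clReleaseMemObject(fpga_mem);
+    return err;
+}
+\end{verbatim}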
 
 \begin{figure}
   \centering
@@ -203,9 +222,11 @@ framework allows for an easy construction of streamed data processing on
 heterogeneous multi-GPU systems. For example, to read data from the FPGA, decode
 from its specific data format and run a Fourier transform on the GPU as well as
 writing back the results to disk, one can run the following on the command line:
+
 \begin{verbatim}
 ufo-launch direct-gma ! decode ! fft ! write filename=out.raw
 \end{verbatim}
+
 The framework takes care of scheduling the tasks and distributing the data items
 to one or more GPUs. High throughput is achieved by the combination of fine-
 and coarse-grained data parallelism, \emph{i.e.} processing a single data item
@@ -309,12 +330,15 @@ run-time behaviour of the operating system scheduler.
 \section{Conclusion and outlook}
 
 We developed a complete hardware and software solution that enables DMA
-transfers between FPGA-based readout boards and GPU computing clusters with
-reasonable performance characteristics. The net throughput is primarily limited
+transfers between FPGA-based readout boards and GPU computing clusters.
+
+The net throughput is primarily limited
 by the PCIe bus, reaching 6.4 GB/s for a 256 B payload and surpassing our
-CPU-based data transfer. Furthermore, by writing directly into GPU memory instead
-of routing data through system main memory, the overall latency can be reduced
+CPU-based data transfer. 
+
+By writing directly into GPU memory instead of routing data through system main memory, the overall latency can be reduced
 by a factor of two, allowing closely coupled massively parallel computation on GPUs.
+
 Moreover, the software solution that we proposed allows seamless multi-GPU
 processing of the incoming data, due to the integration in our streamed computing
 framework. This allows straightforward integration with different DAQ systems
@@ -333,7 +357,10 @@ fully populated FMC connectors, a 119 Gb/s DDR memory interface and a PCIe Gen3
 x16 connection. Two PCIe x8 cores, instantiated on the board, will be mapped as
 a single x16 device by using an external PCIe switch. With two cores operating
 in parallel, we foresee an increase in the data throughput by a factor of 2 (as
-demonstrated in~\cite{rota2015dma}). Further improvements are expected by
+demonstrated in~\cite{rota2015dma}). 
+
+%% Where do we get these values? Any reference?
+Further improvements are expected by
 generalizing the transfer mechanism and including InfiniBand support besides the
 existing PCIe connection. This allows speeds of up to 290 Gbit/s and latencies as
 low as 0.5 \textmu s.