
Fix that language

Matthias Vogelgesang committed 8 years ago
commit a8a2b3ad08
1 changed file with 92 additions and 79 deletions

+ 92 - 79
paper.tex

@@ -19,22 +19,22 @@
 }
 
 \abstract{%
-  %% Old   
-  \emph{A growing number of physics experiments requires DAQ systems with multi-GB/s
-  data links.}
+  %% Old
+  % \emph{A growing number of physics experiments requires DAQ systems with multi-GB/s
+  % data links.}
   %proposal for new abstract, including why do we need GPUs
-  Modern physics experiments have reached multi-GB/s data rates.
-  Fast data links and high performance computing stages are required 
-  to enable continuous acquisition. Because of their intrinsic parallelism 
-  and high computational power, GPUs emerged as the ideal computing solution.
-  We developed an architecture consisting of a hardware/software
-  stack which includes a Direct Memory Access (DMA) engine compatible with the Xilinx
-  PCI-Express core and an accompanying Linux driver. Measurements with a
-  Gen3 x8 link show a throughput of up to 6.7 GB/s. We also enable direct
-  communication between FPGA-based DAQ electronics and second-level data
-  processing GPUs via AMD DirectGMA technology. Our implementation finds its
-  application in real-time DAQ systems for photon science and medical imaging
-  and in triggers for HEP experiments. 
+  Modern physics experiments have reached multi-GB/s data rates. Fast data
+  links and high performance computing stages are required for continuous
+  acquisition and processing. Because of their intrinsic parallelism and
+  computational power, GPUs emerged as an ideal solution for high performance
+  computing applications. To connect a fast data acquisition stage with a GPU's
+  processing power, we developed an architecture consisting of an FPGA design
+  that includes a Direct Memory Access (DMA) engine compatible with the Xilinx
+  PCI-Express core, a Linux driver for register access, and high-level software
+  to manage direct memory transfers using AMD's DirectGMA technology.
+  Measurements with a Gen3 x8 link show a throughput of up to 6.7 GB/s. Our
+  implementation is suitable for real-time DAQ applications ranging from photon
+  science and medical imaging to HEP experiment triggers.
 }
 
 
@@ -48,7 +48,7 @@
 
 \section{Motivation}
 
-GPU computing has become the main driving force for high-performance computing
+GPU computing has become the main driving force for high performance computing
 due to an unprecedented parallelism and a low cost-benefit factor. GPU
 acceleration has found its way into numerous applications, ranging from
 simulation to image processing. Recent years have also seen an increasing
@@ -59,7 +59,7 @@ PANDA~\cite{panda_gpu}). Moreover, the volumes of data produced in recent photon
 science facilities have become comparable to those traditionally associated with
 HEP.
 
-In such experiments data is acquired by one or more read-out boards and then
+In HEP experiments data is acquired by one or more read-out boards and then
 transmitted to GPUs in short bursts or in a continuous streaming mode. With
 expected data rates of several GB/s, the data transmission link between the
 read-out boards and the host system may partially limit the overall system
@@ -67,31 +67,33 @@ performance. In particular, latency becomes the most stringent specification if
 a time-deterministic feedback is required, \emph{e.g.} Low/High-level Triggers.
 
 To address these problems we propose a complete hardware/software stack
-architecture based on our own DMA design and integration of AMD's DirectGMA
-technology into our processing pipeline. In our solution, PCIe has been chosen
-as data link between FPGA boards and external computing. Due to its high
-bandwidth and modularity, it quickly became the commercial standard for
-connecting high-performance peripherals such as GPUs or solid state disks.
-Optical PCIe networks have been demonstrated since nearly a
-decade~\cite{optical_pcie}, opening the possibility of using PCIe as a
-communication bus over long distances. In particular, in HEP DAQ systems,
+architecture based on our own Direct Memory Access (DMA) design and integration
+of AMD's DirectGMA technology into our processing pipeline. In our solution,
+PCI-Express (PCIe) has been chosen as a data link between FPGA boards and the
+host computer. Due to its high bandwidth and modularity, PCIe quickly became the
+commercial standard for connecting high-throughput peripherals such as GPUs or
+solid state disks. Optical PCIe networks have been demonstrated
+% JESUS: time span -> for, point in time -> since ...
+for nearly a decade~\cite{optical_pcie}, opening the possibility of using PCIe
+as a communication bus over long distances. In particular, in HEP DAQ systems,
 optical links are preferred over electrical ones because of their superior
 radiation hardness, lower power consumption and higher density.
 
 %% Added some more here, I need better internet to find the correct references
 Lonardo et~al.\ lifted this limitation with their NaNet design, an FPGA-based
-PCIe network interface card with NVIDIA's GPUdirect integration~\cite{lonardo2015nanet}.
-Due to its design, the bandwidth saturates at 120 MB/s for a 1472 byte large UDP
-datagram. Moreover, the system is based on a commercial PCIe engine.
-Other solutions achieve higher throughput based on Xilinx (CITE TWEPP DMA WURTT??)
-or Altera devices (CITENICHOLASPAPER TNS), but they do not provide support for direct 
-FPGA-GPU communication.
+PCIe network interface card with NVIDIA's GPUDirect
+integration~\cite{lonardo2015nanet}. Due to its design, the bandwidth saturates
+at 120 MB/s for a 1472-byte UDP datagram. Moreover, the system is based on a
+commercial PCIe engine. Other solutions achieve higher throughput based on
+Xilinx (CITE TWEPP DMA WURTT??) or Altera devices (CITENICHOLASPAPER TNS), but
+they do not provide support for direct FPGA-GPU communication.
+
 
 \section{Architecture}
 
-Direct Memory Access (DMA) data transfers are handled by dedicated hardware, 
-and, when compared with Programmed Input Output (PIO) access, they offer 
-lower latency and higher throughput at the cost of a higher system's complexity. 
+DMA data transfers are handled by dedicated hardware and, compared with
+Programmed Input Output (PIO) access, offer lower latency and higher throughput
+at the cost of higher system complexity.
 
 
 \begin{figure}[t]
@@ -106,49 +108,59 @@ lower latency and higher throughput at the cost of a higher system's complexity.
   \label{fig:trad-vs-dgpu}
 \end{figure}
 
-As shown in \figref{fig:trad-vs-dgpu}, traditional FPGA-GPU architectures 
-route data through system main memory. The main memory is involved in a certain 
-number of read/write operations, depending on the specific implementation. 
-The total throughput of the system is therefore limited by the main memory 
-bandwidth. NVIDIA's GPUDirect and AMD's DirectGMA technlogies allow direct 
-communication between different devices over the PCIe bus. 
-By combining this technology with a DMA data
-transfer, the overall latency of the system is reduced and total throughput
-increased. Moreover, the CPU and main system memory are relieved because they
-are not directly involved in the data transfer anymore.
+As shown in \figref{fig:trad-vs-dgpu} (a), traditional FPGA-GPU systems route
+data through system main memory by copying data from the FPGA into intermediate
+buffers and finally into the GPU's main memory. Thus, the total throughput of
+the system is limited by the main memory bandwidth. NVIDIA's GPUDirect and
+AMD's DirectGMA technologies allow direct communication between GPUs and
+auxiliary devices over the PCIe bus. By combining these technologies with DMA
+data transfers (see \figref{fig:trad-vs-dgpu} (b)), the overall latency of the
+system is reduced and the total throughput increased. Moreover, the CPU and
+main system memory are relieved from processing because they are not directly
+involved in the data transfer anymore.
+
 
-\subsection{Implementation of the DMA engine on FPGA}
+\subsection{DMA engine implementation on the FPGA}
 
 We have developed a DMA architecture that minimizes resource utilization while
 maintaining the flexibility of a Scatter-Gather memory
 policy~\cite{rota2015dma}. The engine is compatible with the Xilinx PCIe Gen2/3
-IP-Core~\cite{xilinxgen3} for Xilinx FPGA families 6 and 7. DMA transmissions to main
-system memory and GPU memory are both supported. Two FIFOs, with a data width of 256
-bits and operating at 250 MHz, act as user-friendly interfaces with the custom logic.
-The resulting input bandwidth is 7.8 GB/s, enough to saturate a PCIe Gen3 x8
-link\footnote{The theoretical net bandwidth of a PCIe 3.0 x8 link with a payload
-of 1024 B is 7.6 GB/s}.  The user logic and the DMA engine are configured by the
-host through PIO registers.
+IP-Core~\cite{xilinxgen3} for Xilinx FPGA families 6 and 7. DMA transmissions to
+main system memory and GPU memory are both supported. Two FIFOs, with a data
+width of 256 bits and operating at 250 MHz, act as user-friendly interfaces with
+the custom logic. The resulting input bandwidth of 7.8 GB/s is enough to saturate
+a PCIe Gen3 x8 link\footnote{The theoretical net bandwidth of a PCIe 3.0 x8 link
+with a payload of 1024 B is 7.6 GB/s}. The user logic and the DMA engine are
+configured by the host through PIO registers.
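The quoted figures can be cross-checked with a rough estimate: a Gen3 x8 link runs at 8 GT/s per lane with 128b/130b encoding, and each 1024 B write request carries on the order of 30 B of header and link-layer overhead (the exact overhead depends on the core configuration, so the numbers below are approximate):

\begin{align*}
  B_\mathrm{raw} &= 8\,\mathrm{GT/s} \times 8~\mathrm{lanes} \times \tfrac{128}{130} \approx 7.9~\mathrm{GB/s},\\
  B_\mathrm{net} &\approx B_\mathrm{raw} \times \frac{1024~\mathrm{B}}{1024~\mathrm{B} + 30~\mathrm{B}} \approx 7.6~\mathrm{GB/s}.
\end{align*}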
 
 
 The physical addresses of the host's memory buffers are stored into an internal
-memory and are dynamically updated by the driver, allowing highly efficient
-zero-copy data transfers. The maximum size associated with each address is 2 GB. 
+memory and are dynamically updated by the driver or user, allowing highly
+efficient zero-copy data transfers. The maximum size associated with each
+address is 2 GB.
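As an illustration, updating this address table amounts to a handful of writes through the memory-mapped PIO region. The register names and offsets in the following sketch are hypothetical placeholders, not the actual layout of our engine:

\begin{verbatim}
#include <stdint.h>
#include <stddef.h>

/* Hypothetical PIO register layout, for illustration only. */
#define DMA_REG_ADDR_LO(i)   (0x100 + 16 * (i))  /* bits 31:0 of buffer address  */
#define DMA_REG_ADDR_HI(i)   (0x104 + 16 * (i))  /* bits 63:32 of buffer address */
#define DMA_REG_SIZE(i)      (0x108 + 16 * (i))  /* buffer size, up to 2 GB      */
#define DMA_REG_NUM_BUFFERS  0x0fc

static void write_reg(volatile uint32_t *bar, size_t offset, uint32_t value)
{
    bar[offset / sizeof(uint32_t)] = value;      /* MMIO write into the PIO region */
}

/* Store the physical addresses of the destination buffers in the engine's
 * internal memory so that it can stream data without further CPU involvement. */
void dma_set_buffers(volatile uint32_t *bar, const uint64_t *phys,
                     const uint32_t *sizes, unsigned n)
{
    for (unsigned i = 0; i < n; i++) {
        write_reg(bar, DMA_REG_ADDR_LO(i), (uint32_t) phys[i]);
        write_reg(bar, DMA_REG_ADDR_HI(i), (uint32_t) (phys[i] >> 32));
        write_reg(bar, DMA_REG_SIZE(i), sizes[i]);
    }
    write_reg(bar, DMA_REG_NUM_BUFFERS, n);
}
\end{verbatim}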
 
 
 
 
 \subsection{OpenCL management on host side}
 \label{sec:host}
 
 On the host side, AMD's DirectGMA technology, an implementation of the
-bus-addressable memory extension for OpenCL 1.1 and later, is used to write to
-GPU memory from the FPGA and FPGA control registers from the GPU.
+bus-addressable memory extension for OpenCL 1.1 and later, is used to write from
+the FPGA to GPU memory and from the GPU to the FPGA's control registers.
 \figref{fig:opencl-setup} illustrates the main mode of operation: To write into
 the GPU, the physical bus address of the GPU buffer is determined with a call to
-\texttt{clEnqueue\-Make\-Buffers\-Resident\-AMD} and set in a control register
-of the FPGA (1). The FPGA then writes data blocks autonomously in DMA fashion
-(2). Due to hardware restrictions the largest possible GPU buffer sizes are
-about 95 MB but larger transfers can be achieved using a double buffering
-mechanism. To signal events to the FPGA (4), the control registers are mapped
-into the GPU's address space and seen as regular memory (3).
+\texttt{clEnqueue\-Make\-Buffers\-Resident\-AMD} and set by the host CPU in a
+control register of the FPGA (1). The FPGA then writes data blocks autonomously
+in DMA fashion (2). Due to hardware restrictions, the largest possible GPU
+buffer size is about 95 MB, but larger transfers can be achieved using a double
+buffering mechanism. Because the GPU provides a flat memory address space and
+our DMA engine allows multiple destination addresses to be set in advance, we
+can determine all addresses before the actual transfers, thus keeping the CPU
+out of the transfer loop.
+
+To signal events to the FPGA (4), the control registers can be mapped into the
+GPU's address space by passing a special AMD-specific flag and the physical BAR
+address of the FPGA configuration memory to the \texttt{cl\-Create\-Buffer}
+function. From the GPU, this memory is seen transparently as regular GPU memory
+and can be written accordingly (3).
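A condensed host-side sketch of both directions is shown below. The flags and the \texttt{cl\_bus\_address\_amd} structure come from AMD's \texttt{cl\_amd\_bus\_addressable\_memory} extension as declared in \texttt{cl\_ext.h}; \texttt{fpga\_write\_register}, the register offset and the BAR address are placeholders for our driver interface, and error handling as well as the extension-function lookup via \texttt{clGetExtensionFunctionAddressForPlatform} are omitted:

\begin{verbatim}
#include <CL/cl.h>
#include <CL/cl_ext.h>       /* cl_amd_bus_addressable_memory definitions */

/* Placeholders standing in for our driver's PIO interface and register map. */
extern void fpga_write_register(int fd, unsigned reg, unsigned long value);
#define REG_DMA_DESTINATION  0x10
#define FPGA_BAR_PHYS_ADDR   0xf7e00000UL

#define GPU_BUFFER_SIZE      (64 << 20)   /* 64 MB, below the ~95 MB limit */

void setup_directgma(cl_context ctx, cl_command_queue queue, int fpga_fd)
{
    /* (1) Allocate a GPU buffer that a bus master may write to, pin it and
     * query its physical bus address ... */
    cl_mem gpu_buf = clCreateBuffer(ctx, CL_MEM_BUS_ADDRESSABLE_AMD,
                                    GPU_BUFFER_SIZE, NULL, NULL);
    cl_bus_address_amd addr;
    clEnqueueMakeBuffersResidentAMD(queue, 1, &gpu_buf, CL_TRUE, &addr,
                                    0, NULL, NULL);

    /* ... and hand that address to the FPGA, which then writes data blocks
     * autonomously in DMA fashion (2). */
    fpga_write_register(fpga_fd, REG_DMA_DESTINATION, addr.surface_bus_address);

    /* (3) Map the FPGA control registers (their physical BAR address) as a
     * regular-looking GPU buffer, so that kernels can signal the FPGA (4). */
    cl_bus_address_amd bar = { .surface_bus_address = FPGA_BAR_PHYS_ADDR,
                               .marker_bus_address  = 0 };
    cl_mem fpga_regs = clCreateBuffer(ctx, CL_MEM_EXTERNAL_PHYSICAL_AMD,
                                      4096, &bar, NULL);
    (void) fpga_regs;
}
\end{verbatim}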
 
 
 \begin{figure}
   \centering
@@ -163,16 +175,17 @@ To process the data, we encapsulated the DMA setup and memory mapping in a
 plugin for our scalable GPU processing framework~\cite{vogelgesang2012ufo}. This
 framework allows for an easy construction of streamed data processing on
 heterogeneous multi-GPU systems. For example, to read data from the FPGA, decode
-it and run a Fourier transform on the GPU and write back the results, one can
-run \verb|ufo-launch direct-gma ! decode ! fft ! write filename=out.raw| on the
-command line. The accompanying framework application programming interface
-allows for straightforward integration into custom applications written in C or
-high-level languages such as Python. High throughput is achieved by the
-combination of fine- and coarse-grained data parallelism, \emph{i.e.} processing
-a single data item on a GPU using thousands of threads and by splitting the data
-stream and feeding individual data items to separate GPUs. None of this requires
-any user intervention and is solely determined by the framework in an
-automatized fashion.
+its specific format, run a Fourier transform on the GPU and write the results
+back to disk, one can run \texttt{ufo-launch direct-gma ! decode ! fft ! write
+filename=out.raw} on the command line. The framework takes care of scheduling
+the tasks and distributing the data items accordingly. A complementary
+application programming interface allows users to develop custom applications
+written in C or high-level languages such as Python. High throughput is
+achieved by combining fine- and coarse-grained data parallelism, \emph{i.e.}
+processing a single data item on a GPU using thousands of threads and splitting
+the data stream to feed individual data items to separate GPUs. None of this
+requires any user intervention and is solely determined by the framework in an
+automated fashion.
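The same pipeline can also be assembled programmatically. The sketch below follows the GObject-based C API of our framework, with task and property names taken from the \texttt{ufo-launch} example above; the function names are quoted from memory and should be checked against the \texttt{ufo-core} headers:

\begin{verbatim}
#include <ufo/ufo.h>

int main(void)
{
    GError *error = NULL;

    UfoPluginManager *pm = ufo_plugin_manager_new();
    UfoTaskGraph *graph = UFO_TASK_GRAPH(ufo_task_graph_new());
    UfoBaseScheduler *sched = UFO_BASE_SCHEDULER(ufo_scheduler_new());

    /* Instantiate the same tasks as in the ufo-launch example. */
    UfoTaskNode *dma   = ufo_plugin_manager_get_task(pm, "direct-gma", &error);
    UfoTaskNode *dec   = ufo_plugin_manager_get_task(pm, "decode", &error);
    UfoTaskNode *fft   = ufo_plugin_manager_get_task(pm, "fft", &error);
    UfoTaskNode *write = ufo_plugin_manager_get_task(pm, "write", &error);

    g_object_set(write, "filename", "out.raw", NULL);

    /* direct-gma ! decode ! fft ! write */
    ufo_task_graph_connect_nodes(graph, dma, dec);
    ufo_task_graph_connect_nodes(graph, dec, fft);
    ufo_task_graph_connect_nodes(graph, fft, write);

    ufo_base_scheduler_run(sched, graph, &error);
    return error == NULL ? 0 : 1;
}
\end{verbatim}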
 
 
 
 
 \section{Results}
@@ -226,7 +239,7 @@ compared to the previous setup, making it a more cost-effective solution.
   \includegraphics[width=0.6\textwidth]{figures/through_plot}
   \caption{
     Writing from the FPGA to either system or GPU memory is primarily limited by
-    the PCIe bus. Higher payloads introduce less overhead, thus increasing the net bandwidth. 
+    the PCIe bus. Higher payloads introduce less overhead, thus increasing the net bandwidth.
     Up until 2 MB transfer size, the performance is almost the same; after that,
     the GPU transfer shows a slightly better slope. Data transfers larger than
     1 GB saturate the PCIe bus.
@@ -244,7 +257,7 @@ compared to the previous setup, making it a more cost-effective solution.
   \includegraphics[width=0.6\textwidth]{figures/latency}
   \caption{%
     The data transmission latency is decreased by XXX percent with respect to the traditional
-    approach (a) by using DirectGMA (b). The latency has been measured by taking the round-trip time 
+    approach (a) by using DirectGMA (b). The latency has been measured by taking the round-trip time
     for a 4k packet.
   }
   \label{fig:latency}
@@ -258,7 +271,7 @@ between FPGA-based readout boards and GPU computing clusters. The net throughput
 limited by the PCIe bus, reaching 6.7 GB/s for a 256 B payload. By writing directly into GPU
 memory instead of routing data through system main memory, latency is reduced by a factor of 2.
 The solution proposed here allows high performance GPU computing thanks to the support of the
-framework. Integration with different DAQ systems and custom algorithms is therefore immediate.
+framework. Integration with different DAQ systems and custom algorithms is therefore immediate.
 
 
 \subsection{Outlook}
@@ -270,18 +283,18 @@ A custom FPGA evaluation board is currently under development in order to
 increase the total throughput. The board mounts a Virtex-7 chip and features 2
 fully populated FMC connectors, a 119 Gb/s DDR memory interface and a PCIe Gen3
 x16 connection. Two PCIe x8 cores, instantiated on the board, will be mapped as a
-single x16 device by using an external PCIe switch. With two cores operating in parallel, 
-we foresee an increase in the data throughput by a factor of 2 (as demonstrated in~\cite{rota2015dma}). 
+single x16 device by using an external PCIe switch. With two cores operating in parallel,
+we foresee an increase in the data throughput by a factor of 2 (as demonstrated in~\cite{rota2015dma}).
 
 \textbf{LR: Instead of swapping PCIe-infinib, I would say include it in the architecture.
 A big house for all these love-lacking protocols.}
 
-It is our intention to add Infiniband support. I NEED TO READ 
+It is our intention to add Infiniband support. I NEED TO READ
 WHAT ARE THE ADVANTAGES VS PCIe.
 
 \textbf{LR:Here comes the visionary Luigi...}
 Our goal is to develop a unique hybrid solution, based
-on commercial standards, that includes fast data transmission protocols and a high performance 
+on commercial standards, that includes fast data transmission protocols and a high performance
 GPU computing framework.
 
 \acknowledgments