lorenzo 8 years ago
parent commit cd05b94fe1
1 file changed, 197 insertions(+), 189 deletions(-)

paper.tex

@@ -34,23 +34,25 @@
     E-mail: \email{lorenzo.rota@kit.edu}, \email{matthias.vogelgesang@kit.edu}
 }

-\abstract{ Modern physics experiments have reached multi-GB/s data rates. Fast
-data   links and high performance computing stages are required for continuous
-data   acquisition and processing. Because of their intrinsic parallelism and
-computational power, GPUs emerged as an ideal solution to process this data in
-high performance computing applications. In this paper we present a   high-
-throughput platform based on direct FPGA-GPU communication.    The
-architecture consists of a   Direct Memory Access (DMA) engine compatible with
-the Xilinx PCI-Express core,   a Linux driver for register access, and high-
-level software to manage direct   memory transfers using AMD's DirectGMA
-technology. Measurements with a Gen3\,x8 link show a throughput of 6.4~GB/s
-for transfers to GPU memory and 6.6~GB/s to system memory.  We also assesed
-the possibility of using our architecture in low latency systems: preliminary
-measurements show a round-trip latency as low as 1 \textmu s for data
-transfers to system memory, while the additional latency introduced by OpenCL
-scheduling is the current limitation for GPU based systems.  Our
-implementation is suitable for real- time DAQ system applications ranging from
-photon science and medical imaging to High Energy Physics (HEP) systems.}
+\abstract{%
+  Modern physics experiments have reached multi-GB/s data rates. Fast data links
+  and high performance computing stages are required for continuous data
+  acquisition and processing. Because of their intrinsic parallelism and
+  computational power, GPUs emerged as an ideal solution to process this data in
+  high performance computing applications. In this paper we present a
+  high-throughput platform based on direct FPGA-GPU communication. The
+  architecture consists of a Direct Memory Access (DMA) engine compatible with
+  the Xilinx PCI-Express core, a Linux driver for register access, and
+  high-level software to manage direct memory transfers using AMD's DirectGMA
+  technology. Measurements with a Gen3\,x8 link show a throughput of 6.4~GB/s
+  for transfers to GPU memory and 6.6~GB/s to system memory. We also assessed
+  the possibility of using the architecture in low-latency systems: preliminary
+  measurements show a round-trip latency as low as 1 \textmu s for data
+  transfers to system memory, while the additional latency introduced by OpenCL
+  scheduling is the current limitation for GPU-based systems. Our
+  implementation is suitable for
+  real-time DAQ system applications ranging from photon science and medical
+  imaging to High Energy Physics (HEP) systems.
+}
 
 
 \keywords{FPGA; GPU; PCI-Express; OpenCL; DirectGMA}

@@ -69,69 +71,67 @@ due to an unprecedented parallelism and a low cost-benefit factor. GPU
 acceleration has found its way into numerous applications, ranging from
 simulation to image processing.

-The data rates of bio-imaging or beam-monitoring experiments running in
-current generation photon science facilities have reached tens of
-GB/s~\cite{ufo_camera, caselle}. In a typical scenario, data are acquired by
-back-end readout systems and then transmitted in short bursts or in a
-continuous streaming mode to a computing stage. In order to collect data over
-long observation times, the readout architecture and the computing stages must
-be able to sustain high data rates. Recent years have also seen an increasing
-interest in GPU-based systems for High Energy Physics (HEP)  (\emph{e.g.}
-ATLAS~\cite{atlas_gpu}, ALICE~\cite{alice_gpu}, Mu3e~\cite{mu3e_gpu},
-PANDA~\cite{panda_gpu}) and photon science experiments. In time-deterministic
-applications, such as Low/High-level trigger systems, latency becomes
-the most stringent requirement.
-
-Due to its high bandwidth and modularity, PCIe quickly became the commercial
-standard for connecting high-throughput peripherals such as GPUs or solid
-state disks. Moreover, optical PCIe networks have been demonstrated a decade
+The data rates of bio-imaging or beam-monitoring experiments running in current
+generation photon science facilities have reached tens of GB/s~\cite{ufo_camera,
+caselle}. In a typical scenario, data are acquired by back-end readout systems
+and then transmitted in short bursts or continuously streamed to a computing
+stage. In order to collect data over long observation times, the readout
+architecture and the computing stages must be able to sustain high data rates.
+Recent years have also seen an increasing interest in GPU-based systems for High
+Energy Physics (HEP) (\emph{e.g.} ATLAS~\cite{atlas_gpu},
+ALICE~\cite{alice_gpu}, Mu3e~\cite{mu3e_gpu}, PANDA~\cite{panda_gpu}) and photon
+science experiments. In time-deterministic applications, such as low- and
+high-level trigger systems, latency becomes the most stringent requirement.
+
+Due to its high bandwidth and modularity, PCIe has become the \emph{de facto}
+commercial standard for connecting high-throughput peripherals such as GPUs or
+solid state disks. Moreover, optical PCIe networks were demonstrated a decade
 ago~\cite{optical_pcie}, opening the possibility of using PCIe as a
 communication link over long distances.

-Several solutions for direct FPGA-GPU communication based on PCIe are reported
-in literature, and all of them are based on NVIDIA's GPUdirect technology. In
-the implementation of Bittnerner and Ruf ~\cite{bittner} the GPU acts as
-master during an FPGA-to-GPU data transfer, reading data from the FPGA.  This
-solution limits the reported bandwidth and latency to 514 MB/s and 40~\textmu
-s, respectively. When the FPGA is used as a master, a higher throughput can be
-achieved.  An example of this approach is the \emph{FPGA\textsuperscript{2}}
-framework by Thoma et~al.\cite{thoma}, which reaches 2454 MB/s using a 8x
-Gen2.0 data link. Lonardo et~al.\ achieved low latencies with their NaNet
-design, an FPGA-based PCIe network interface card~\cite{lonardo2015nanet}. The
-Gbe link however limits the latency performance of the system to a few tens of
-\textmu s. If only the FPGA-to-GPU latency is considered, the measured values
-span between 1~\textmu s and 6~\textmu s, depending on the datagram size.
-Nieto et~al.\ presented a system based on a PXIexpress data link that makes
-use of four PCIe 1.0 links~\cite{nieto2015high}. Their system (as limited by
-the interconnect) achieves an average throughput of 870 MB/s with 1 KB block
-transfers.
+Several solutions for direct FPGA-GPU communication based on PCIe and NVIDIA's
+proprietary GPUdirect technology are reported in the literature. In the
+implementation of Bittner and Ruf the GPU acts as master during an FPGA-to-GPU
+read data transfer~\cite{bittner}. This solution limits the reported bandwidth
+and latency to 514 MB/s and 40~\textmu s, respectively. When the FPGA is used
+as a master, a higher throughput can be achieved. An example of this approach
+is the \emph{FPGA\textsuperscript{2}} framework by Thoma et~al.~\cite{thoma},
+which reaches 2454 MB/s using a PCIe 2.0 x8 data link. Lonardo et~al.\ achieved
+low latencies with their NaNet design, an FPGA-based PCIe network interface
+card~\cite{lonardo2015nanet}. The GbE link, however, limits the latency
+performance of the system to a few tens of \textmu s. If only the FPGA-to-GPU
+latency is considered, the measured values span between 1~\textmu s and
+6~\textmu s, depending on the datagram size. Nieto et~al.\ presented a system
+based on a PXI Express data link that makes use of four PCIe 1.0
+links~\cite{nieto2015high}. Their system, as limited by the interconnect,
+achieves an average throughput of 870 MB/s with 1 kB block transfers.
 
 
 In order to achieve the best performance in terms of latency and bandwidth, we
-developed a high-performance DMA engine based on Xilinx's PCIe Gen3 Core. To
+developed a high-performance DMA engine based on Xilinx's PCIe 3.0 Core. To
 process the data, we encapsulated the DMA setup and memory mapping in a plugin
 for our scalable GPU processing framework~\cite{vogelgesang2012ufo}. This
 framework allows for an easy construction of streamed data processing on
 heterogeneous multi-GPU systems. Because the framework is based on OpenCL,
 integration with NVIDIA's CUDA functions for GPUDirect technology is not
 possible at the moment. We therefore used AMD's DirectGMA technology to
-integrate direct FPGA-to-GPU communication into our processing pipeline. In
-this paper we report the throughput performance of our architecture together
-with some preliminary measurements about DirectGMA's applicability in low-
-latency applications.
+integrate direct FPGA-to-GPU communication into our processing pipeline. In this
+paper we present the hardware/software interface and report the throughput
+performance of our architecture together with preliminary measurements of
+DirectGMA's applicability in low-latency applications.
 
 
 %% LR: this part -> OK
 \section{Architecture}

 As shown in \figref{fig:trad-vs-dgpu} (a), traditional FPGA-GPU systems route
-data through system main memory by copying data from the FPGA into
-intermediate buffers and then finally into the GPU's main memory. Thus, the
-total throughput and latency of the system is limited by the main memory
-bandwidth. NVIDIA's GPUDirect and AMD's DirectGMA technologies allow direct
-communication between GPUs and auxiliary devices over PCIe. By combining this
-technology with DMA data transfers (see \figref{fig:trad-vs-dgpu} (b)), the
-overall latency of the system is reduced and total throughput increased.
-Moreover, the CPU and main system memory are relieved from processing because
-they are not directly involved in the data transfer anymore.
+data through system main memory by copying data from the FPGA into intermediate
+buffers and then finally into the GPU's main memory. Thus, the total throughput
+and latency of the system are limited by the main memory bandwidth. NVIDIA's
+GPUDirect and AMD's DirectGMA technologies allow direct communication between
+GPUs and auxiliary devices over PCIe. By combining these technologies with DMA
+data transfers as shown in \figref{fig:trad-vs-dgpu} (b), the overall latency of
+the system is reduced and the total throughput increased. Moreover, the CPU and
+main system memory are relieved because they are no longer directly involved in
+the data transfer.
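+
+For illustration, the conventional path of \figref{fig:trad-vs-dgpu} (a) can be
+sketched with plain OpenCL host code as follows (a schematic example only:
+\texttt{read\_from\_fpga} stands in for a driver-specific read-out call and is
+not part of our implementation):
+
+\begin{verbatim}
+/* Conventional path: FPGA DMA -> intermediate host buffer -> GPU memory. */
+read_from_fpga(host_buf, size);               /* placeholder driver call  */
+clEnqueueWriteBuffer(queue, gpu_buf, CL_TRUE, /* blocking staging copy    */
+                     0, size, host_buf, 0, NULL, NULL);
+\end{verbatim}
+
+Every data block thus crosses the PCIe bus twice and traverses system memory
+once, which is exactly the overhead the direct path avoids.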
 
 
 \begin{figure}[t]
   \centering
@@ -150,16 +150,14 @@ they are not directly involved in the data transfer anymore.
 
 
 We have developed a DMA engine that minimizes resource utilization while
 maintaining the flexibility of a Scatter-Gather memory
-policy~\cite{rota2015dma}. The main blocks are shown in \figref{fig:fpga-arch}. The engine is compatible with the Xilinx PCIe
-Gen2/3 IP- Core~\cite{xilinxgen3} for Xilinx FPGA families 6 and 7. DMA data
-transfers to/from main system memory and GPU memory are supported. Two FIFOs,
-with a data width of 256 bits and operating at 250 MHz, act as user- friendly
-interfaces with the custom logic with an input bandwidth of 7.45 GB/s. The
-user logic and the DMA engine are configured by the host through PIO
-registers. The resource
-utilization on a Virtex 7 device is reported in Table~\ref{table:utilization}.
-
-
+policy~\cite{rota2015dma}. The main blocks are shown in \figref{fig:fpga-arch}.
+The engine is compatible with the Xilinx PCIe 2.0/3.0 IP-Core~\cite{xilinxgen3}
+for Xilinx FPGA families 6 and 7. DMA data transfers to and from both main
+system memory and GPU memory are supported. Two FIFOs, operating at 250 MHz
+with a data width of 256 bits, act as user-friendly interfaces to the custom logic at an input
+bandwidth of 7.45 GB/s. The user logic and the DMA engine are configured by the
+host system through PIO registers. The resource utilization on a Virtex 7 device
+is reported in Table~\ref{table:utilization}.
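+
+As a quick consistency check (our derivation; it assumes the 7.45 GB/s figure
+is expressed in binary gigabytes), the FIFO input bandwidth follows directly
+from the interface geometry:
+\[
+  256\,\mathrm{bit} \times 250\,\mathrm{MHz}
+  = 8 \times 10^{9}\,\mathrm{B/s} \approx 7.45\,\mathrm{GiB/s}.
+\]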
 
 
 \begin{figure}[t]
 \small
@@ -182,7 +180,7 @@ utilization on a Virtex 7 device is reported in Table~\ref{table:utilization}.
     \bottomrule
   \end{tabular}
 }{%
-  \caption{Resource utilization on a xc7vx690t-ffg1761 device}%
+  \caption{Resource utilization on a xc7vx690t-ffg1761 device.}%
   \label{table:utilization}
 }
 \end{floatrow}
@@ -210,63 +208,59 @@ address is 2 GB.
 \end{figure}

 %% Description of figure
-On the host side, AMD's DirectGMA technology, an implementation of the bus-
-addressable memory extension for OpenCL 1.1 and later, is used to write from
+On the host side, AMD's DirectGMA technology, an implementation of the
+bus-addressable memory extension for OpenCL 1.1 and later, is used to write from
 the FPGA to GPU memory and from the GPU to the FPGA's control registers.
-\figref{fig:opencl-setup} illustrates the main mode of operation: to write
-into the GPU, the physical bus addresses of the GPU buffers are determined
-with a call to \texttt{clEnqueue\-Make\-Buffers\-Resident\-AMD} and set by the
-host CPU in a control register of the FPGA (1). The FPGA then writes data
-blocks autonomously in DMA fashion (2).  To signal events to the FPGA (4), the
-control registers can be mapped into the GPU's address space passing a special
-AMD-specific flag and passing the physical BAR address of the FPGA
-configuration memory to the \texttt{cl\-Create\-Buffer} function. From the
-GPU, this memory is seen transparently as regular GPU memory and can be
-written accordingly (3). In our setup, trigger registers are used to notify
-the FPGA on successful or failed evaluation of the data. Using the
-\texttt{cl\-Enqueue\-Copy\-Buffer} function call it is possible to write
-entire memory regions in DMA fashion to the FPGA. In this case, the GPU acts
-as bus master and pushes data to the FPGA.
+\figref{fig:opencl-setup} illustrates the main mode of operation: to write into
+the GPU, the physical bus addresses of the GPU buffers are determined with a
+call to \texttt{clEnqueue\-Make\-Buffers\-Resident\-AMD} and set by the host CPU
+in a control register of the FPGA (1). The FPGA then writes data blocks
+autonomously in DMA fashion (2). To signal events to the FPGA (4), the control
+registers can be mapped into the GPU's address space by passing a special
+AMD-specific flag and the physical BAR address of the FPGA configuration
+memory to the \texttt{cl\-Create\-Buffer} function. From the GPU, this memory is
+seen transparently as regular GPU memory and can be written accordingly (3). In
+our setup, trigger registers are used to notify the FPGA of successful or failed
+evaluation of the data. Using the \texttt{cl\-Enqueue\-Copy\-Buffer} function
+call it is possible to write entire memory regions in DMA fashion to the FPGA.
+In this case, the GPU acts as bus master and pushes data to the FPGA.
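+
+For illustration, a condensed host-side sketch of steps (1)--(2) is given below.
+It assumes the prototypes of the \texttt{cl\_amd\_bus\_addressable\_memory}
+extension as declared in AMD's \texttt{cl\_ext.h}; \texttt{write\_fpga\_register}
+is a placeholder for our driver-specific PIO access, and all sizes and register
+offsets are illustrative only:
+
+\begin{verbatim}
+#include <CL/cl.h>
+#include <CL/cl_ext.h>  /* cl_amd_bus_addressable_memory extension */
+
+/* Placeholder: write a 64-bit value into an FPGA PIO register. */
+void write_fpga_register(unsigned reg, cl_ulong value);
+
+void expose_gpu_buffer(cl_context ctx, cl_command_queue queue, size_t size)
+{
+    cl_int err;
+
+    /* (1) Create a bus-addressable GPU buffer, pin it and obtain its
+     *     physical bus address, which is handed to the FPGA.        */
+    cl_mem dma_buf = clCreateBuffer(ctx, CL_MEM_BUS_ADDRESSABLE_AMD,
+                                    size, NULL, &err);
+    cl_bus_address_amd addr;
+    clEnqueueMakeBuffersResidentAMD(queue, 1, &dma_buf, CL_TRUE,
+                                    &addr, 0, NULL, NULL);
+    write_fpga_register(0x10 /* illustrative offset */,
+                        addr.surface_bus_address);
+
+    /* (2) The FPGA now writes data blocks into dma_buf autonomously. */
+}
+\end{verbatim}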
 
 
 %% Double Buffering strategy.

-Due to hardware restrictions the largest possible GPU buffer sizes are about
-95 MB but larger transfers can be achieved by using a double buffering
-mechanism: data are copied from the buffer exposed to the FPGA into a
-different location in GPU memory. To verify that we can keep up with the
-incoming data throughput using this strategy, we measured the data throughput
-within a GPU by copying data from a smaller sized buffer representing the DMA
-buffer to a larger destination buffer. At a block size of about 384 KB the
-throughput surpasses the maximum possible PCIe bandwidth, and it reaches 40
-GB/s for blocks bigger than 5 MB. Double buffering is therefore a viable
-solution for very large data transfers, where throughput performance is
-favoured over latency. For data sizes less than 95 MB, we can determine all
-addresses before the actual transfers thus keeping the CPU out of the transfer
-loop.
+Due to hardware restrictions with AMD W9100 FirePro cards, the largest possible
+GPU buffer sizes are about 95 MB. However, larger transfers can be achieved by
+using a double buffering mechanism: data are copied from the buffer exposed to
+the FPGA into a different location in GPU memory. To verify that we can keep up
+with the incoming data throughput using this strategy, we measured the data
+throughput within a GPU by copying data from a smaller sized buffer representing
+the DMA buffer to a larger destination buffer. At a block size of about 384 KB
+the throughput surpasses the maximum possible PCIe bandwidth. Block transfers
+larger than 5 MB saturate the bandwidth at 40 GB/s. Double buffering is
+therefore a viable solution for very large data transfers, where throughput
+performance is favoured over latency. For data sizes less than 95 MB, we can
+determine all addresses before the actual transfers, thus keeping the CPU out of
+the transfer loop.
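+
+A minimal sketch of this double-buffering copy is shown below (our illustration,
+not the verbatim implementation; \texttt{dma\_buf}, \texttt{big\_buf} and the
+FPGA synchronization are placeholders):
+
+\begin{verbatim}
+/* Drain the exposed DMA buffer into a large destination buffer so the
+ * FPGA can immediately reuse it for the next block.                   */
+size_t offset = 0;
+for (unsigned i = 0; i < num_blocks; i++) {
+    /* ... wait until the FPGA signals that dma_buf has been filled ... */
+    clEnqueueCopyBuffer(queue, dma_buf, big_buf,
+                        0, offset, dma_buf_size, 0, NULL, NULL);
+    offset += dma_buf_size;
+    /* ... notify the FPGA that dma_buf may be overwritten again ...    */
+}
+clFinish(queue);
+\end{verbatim}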
 
 
 %% Ufo Framework
 To process the data, we encapsulated the DMA setup and memory mapping in a
 plugin for our scalable GPU processing framework~\cite{vogelgesang2012ufo}.
 This framework allows for an easy construction of streamed data processing on
-heterogeneous multi-GPU systems. For example, to read data from the FPGA,
-decode from its specific data format and run a Fourier transform on the GPU as
-well as writing back the results to disk, one can run the following on the
-command line:
-
+heterogeneous multi-GPU systems. For example, to read data from the FPGA, decode
+its specific data format, run a Fourier transform on the GPU and write the
+results back to disk, one can run the following on the command line:
 
 
 \begin{verbatim}
 ufo-launch direct-gma ! decode ! fft ! write filename=out.raw
 \end{verbatim}

-The framework takes care of scheduling the tasks and distributing the data
-items to one or more GPUs. High throughput is achieved by the combination of
-fine- and coarse-grained data parallelism, \emph{i.e.} processing a single
-data item on a GPU using thousands of threads and by splitting the data stream
-and feeding individual data items to separate GPUs. None of this requires any
-user intervention and is solely determined by the framework in an automatized
+The framework takes care of scheduling the tasks and distributing the data items
+to one or more GPUs. High throughput is achieved by the combination of fine- and
+coarse-grained data parallelism, \emph{i.e.} processing a single data item on a
+GPU using thousands of threads and by splitting the data stream and feeding
+individual data items to separate GPUs. None of this requires any user
+intervention and is solely determined by the framework in an automated
 fashion. A complementary application programming interface allows users to
-develop custom applications written in C or high-level languages such as
-Python.
+develop custom applications in C or high-level languages such as Python.
 
 
 
 
 %% --------------------------------------------------------------------------
@@ -276,7 +270,7 @@ Python.
 \begin{table}[b]
 \centering
 \small
-\caption{Setups used for throughput and latency measurements}
+\caption{Setups used for throughput and latency measurements.}
 \label{table:setups}
 \tabcolsep=0.11cm
 \begin{tabular}{@{}llll@{}}
@@ -293,6 +287,7 @@ PCIe slot: FPGA \& GPU    & x8 Gen3 (different RC) & x8 Gen3 (same RC)    \\
 \end{table}

 We carried out performance measurements on two different setups, which are
-described in table~\ref{table:setups}. In both setups, a Xilinx VC709
-evaluation board was used. In Setup 1, the FPGA board and the GPU were plugged
-into a PCIe 3.0 slot, but they were connected to different PCIe Root Complexes
@@ -303,8 +298,19 @@ NVIDIA's GPUDirect documentation, the devices must share the same RC to
-achieve the best performance.  In case of FPGA-to-CPU data
-transfers, the software implementation is the one described
-in~\cite{rota2015dma}.
+described in Table~\ref{table:setups}. In both setups, a Xilinx VC709 evaluation
+board was used. In Setup 1, the FPGA board and the GPU were plugged into a PCIe
+3.0 slot, but they were connected to different PCIe Root Complexes (RC). In
+Setup 2, a low-end Supermicro X7SPA-HF-D525 system was connected to a Netstor
+NA255A external PCIe enclosure. As opposed to Setup 1, both the FPGA board and
+the GPU were connected to the same RC. As stated in NVIDIA's GPUDirect
+documentation, the devices must share the same RC to achieve the best
+performance~\cite{cuda_doc}. In case of FPGA-to-CPU data transfers, the software
+implementation is the one described in~\cite{rota2015dma}.
 
 
-%% --------------------------------------------------------------------------
 \subsection{Throughput}

 \begin{figure}[t]
@@ -316,23 +322,22 @@ in~\cite{rota2015dma}.
 \label{fig:throughput}
 \end{figure}

-In order to evaluate the maximum performance of the DMA engine, measurements
-of pure data throughput were carried out using Setup 1. The results are shown
-in \figref{fig:throughput} for transfers to the system's main memory as well
-as to the global memory. For FPGA-to-GPU data transfers bigger than 95 MB, the
-double buffering mechanism was used. As one can see, in both cases the write
+In order to evaluate the maximum performance of the DMA engine, measurements of
+pure data throughput were carried out using Setup 1. The results are shown in
+\figref{fig:throughput} for transfers to the system's main memory as well as to
+the GPU's global memory. For FPGA-to-GPU data transfers larger than 95 MB, the double
+buffering mechanism was used. As one can see, in both cases the write
 performance is primarily limited by the PCIe bus. Up until 2 MB data transfer
-size, the throughput to the GPU is approaching slowly 100 MB/s. From there on,
+size, the throughput to the GPU is slowly approaching 100 MB/s. From there on,
 the throughput increases up to 6.4 GB/s at about 1 GB data size. The CPU
-throughput saturates earlier and the maximum throughput is 6.6 GB/s. The slope
-and maximum performance depend on the different implementation of the
-handshaking sequence between DMA engine and the hosts. With Setup 2, the PCIe
-Gen1 link limits the throughput to system main memory to around 700 MB/s.
-However, transfers to GPU memory yielded the same results as Setup 1.
+throughput saturates earlier, at a maximum of 6.6 GB/s. The slope and the
+maximum performance depend on the different implementations of the handshaking
+sequence between the DMA engine and the two hosts. With Setup 2, the PCIe 1.0
+link limits the throughput to system main memory to around 700 MB/s. However,
+transfers to GPU memory yielded the same results as Setup 1.
 
 
-%% --------------------------------------------------------------------------
-\subsection{Latency}

+\subsection{Latency}

 \begin{figure}[t]
   \centering
@@ -342,52 +347,57 @@ However, transfers to GPU memory yielded the same results as Setup 1.
   
   
     \label{fig:latency-cpu}
   \vspace{-0.4\baselineskip}
-    \caption{}
+    \caption{System memory transfer latency}
   \end{subfigure}
   \begin{subfigure}[b]{.49\textwidth}
     \includegraphics[width=\textwidth]{figures/latency-gpu}

     \label{fig:latency-gpu}
   \vspace{-0.4\baselineskip}
-    \caption{}
+    \caption{GPU memory transfer latency}
     \end{subfigure}
-  \caption{Measured round-trip latency for data transfers to system main memory (a) and GPU memory (b).}
+    \caption{%
+      Measured round-trip latency for data transfers to system main memory (a) and GPU memory (b).
+    }
   \label{fig:latency}
 \end{figure}

-
-We conducted the following test in order to measure the latency introduced by the DMA engine : 
-1) the host starts a DMA transfer by issuing the \emph{start\_dma} command.
-2) the DMA engine transmits data into the system main memory.
-3) when all the data has been transferred, the DMA engine notifies the host that new data is present by writing into a specific address in the system main memory.
-4) the host acknowledges that data has been received by issuing the the \emph{stop\_dma} command.
+We conducted the following test in order to measure the latency introduced by the DMA engine: 
+1) the host starts a DMA transfer by issuing the \emph{start\_dma} command,
+2) the DMA engine transmits data into the system main memory,
+3) when all the data has been transferred, the DMA engine notifies the host that
+new data is present by writing into a specific address in the system main
+memory,
+4) the host acknowledges that data has been received by issuing the \emph{stop\_dma} command.
 
 
 A counter on the FPGA measures the time interval between the \emph{start\_dma}
-and \emph{stop\_dma} commands with a resolution of 4 ns, therefore measuring
-the round-trip latency of the system. The correct ordering of the packets is
-assured by the PCIe protocol. The measured round-trip latencies for data transfers to
-system main memory and GPU memory are reported in \figref{fig:latency}.
+and \emph{stop\_dma} commands with a resolution of 4 ns, therefore measuring the
+round-trip latency of the system. The correct ordering of the packets is
+guaranteed by the PCIe protocol. The measured round-trip latencies for data
+transfers to system main memory and GPU memory are shown in
+\figref{fig:latency}.
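+
+The host side of this handshake can be summarized by the following sketch
+(illustrative only: \texttt{write\_fpga\_register}, the register names and the
+location of the notification word are placeholders for our driver interface):
+
+\begin{verbatim}
+volatile uint64_t *notify = notify_addr;      /* written by the FPGA (3) */
+*notify = 0;
+write_fpga_register(REG_CONTROL, START_DMA);  /* (1) start the transfer  */
+while (*notify == 0)                          /* (2) FPGA writes data,   */
+    ;                                         /* (3) then sets the flag  */
+write_fpga_register(REG_CONTROL, STOP_DMA);   /* (4) acknowledge         */
+/* The FPGA-side counter (4 ns resolution) now holds the round trip.    */
+\end{verbatim}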
 
 
-When system main memory is used, latencies as low as 1.1 \textmu s are
-achieved with Setup 1 for a packet size of 1024 B. The higher latency and the
-dependance on size measured with Setup 2 are caused by the slower PCIe x4 Gen1
-link connecting the FPGA board to the system main memory.
+With Setup 1 and system memory, latencies as low as 1.1 \textmu s can be
+achieved for a packet size of 1024 B. The higher latencies and the dependence on
+packet size measured with Setup 2 are caused by the slower PCIe x4 1.0 link
+connecting the FPGA board to the system main memory.
 
 
 The same test was performed when transferring data inside GPU memory. Like in
-the previous case, the notification was written into systen main memory. This
-approach was used because the latency introduced by OpenCL scheduling in our
-implementation (\~ 100-200 \textmu s) did not allow a precise measurement
-based only on FPGA-GPU communication. When connecting the devices to the same
-RC, as in Setup 2, a latency of 2 \textmu is achieved (limited by the latency
-to system main memory, as seen in \figref{fig:latency}.a). On the contrary, if
-the FPGA board and the GPU are connected to different RC as in Setup 1, the
-latency increases significantly with packet size. It must be noted that the
-low latencies measured with Setup 1 for packet sizes below 1 kB seem to be due
-to a caching mechanism  inside the PCIe switch, and it is not clear whether
-data has been successfully written into GPU memory when the notification is
-delivered to the CPU. This effect must be taken into account in future
-implementations as it could potentially lead to data corruption.
+the previous case, the notification was written into main memory. This approach
+was used because a latency of 100 to 200 \textmu s introduced by OpenCL
+scheduling did not allow a precise measurement based only on FPGA-to-GPU
+communication. When connecting the devices to the same RC, as in Setup 2, a
+latency of 2 \textmu s is achieved and limited by the latency to system main
+memory, as seen in \figref{fig:latency} (a). On the contrary, if the FPGA board
+and the GPU are connected to different RCs, as in Setup 1, the latency increases
+significantly with packet size. It must be noted that the low latencies measured
+with Setup 1 for packet sizes below 1 kB seem to be due to a caching mechanism
+inside the PCIe switch, and it is not clear whether data has been successfully
+written into GPU memory when the notification is delivered to the CPU. This
+effect must be taken into account in future implementations as it could
+potentially lead to data corruption.
  
  
+
 \section{Conclusion and outlook}

 We developed a hardware and software solution that enables DMA transfers
@@ -400,39 +410,37 @@ showed no significant difference in throughput performance. Depending on the
 application and computing requirements, this result makes smaller acquisition
 systems a cost-effective alternative to larger workstations.

-We measured a round-trip latency of 1 \textmu s when transfering data between
-the DMA engine with system main memory. We also assessed the applicability of
-DirectGMA in low latency applications: preliminary results shows that
-latencies as low as 2 \textmu s can by achieved during data transfers to GPU
-memory.  However, at the time of writing this paper, the latency introduced by
-OpenCL scheduling is in the range of hundreds of \textmu s. Optimization of
-the GPU-DMA interfacing OpenCL code is ongoing with the help of technical
-support by AMD, in order to lift the current limitation and enable the use of
-our implementation in low latency applications. Moreover, measurements show
-that dedicated hardware must be employed in low latency applications.
+We measured a round-trip latency of 1 \textmu s when transferring data between
+the DMA engine and system memory. We also assessed the applicability of
+DirectGMA in low-latency applications: preliminary results show that latencies
+as low as 2 \textmu s can be achieved during data transfers to GPU memory.
+However, at the time of writing this paper, the latency introduced by OpenCL
+scheduling is in the range of hundreds of \textmu s. In order to lift this
+limitation and make our implementation useful in low-latency applications, we
+are currently optimizing the GPU-DMA interfacing OpenCL code with the help
+of technical support from AMD. Moreover, measurements show that dedicated
+connecting hardware must be employed in low-latency applications.
 
 
 In order to increase the total throughput, a custom FPGA evaluation board is
 currently under development. The board mounts a Virtex-7 chip and features two
 fully populated FMC connectors, a 119 Gb/s DDR memory interface and a PCIe x16
-Gen3 connection. Two x8 Gen3 cores, instantiated on the board, will be mapped
+3.0 connection. Two PCIe x8 3.0 cores, instantiated on the board, will be mapped
 as a single x16 device by using an external PCIe switch. With two cores
 operating in parallel, we foresee an increase in the data throughput by a
-factor of 2 (as demonstrated in~\cite{rota2015dma}).
-
-The proposed software solution allows seamless multi-GPU processing of
-the incoming data, due to the integration in our streamed computing framework.
-This allows straightforward integration with different DAQ systems and
-introduction of custom data processing algorithms.
-
-Support for NVIDIA's GPUDirect technology is also foreseen in the next months
-to lift the limitation of one specific GPU vendor and compare the performance
-of hardware by different vendors. Further improvements are expected by
-generalizing the transfer mechanism and include Infiniband support besides the
-existing PCIe connection.
-
-Our goal is to develop a unique hybrid solution,
-based on commercial standards, that includes fast data transmission protocols
-and a high performance GPU computing framework.
+factor of two as demonstrated in~\cite{rota2015dma}.
+
+The proposed software solution allows seamless multi-GPU processing of the
+incoming data, thanks to its integration in our streamed computing framework.
+This enables straightforward integration with different DAQ systems and the
+introduction of custom data processing algorithms. Support for NVIDIA's
+GPUDirect technology is also foreseen in the coming months, to lift the
+restriction to one specific GPU vendor and to compare the performance of
+hardware from different vendors. Further improvements are expected from
+generalizing the transfer mechanism and including InfiniBand support besides
+the existing PCIe connection.
+
+Our goal is to develop a unique hybrid solution, based on commercial standards,
+that includes fast data transmission protocols and a high-performance GPU
+computing framework.
 
 
 
 
 \acknowledgments