8 years ago · 59921d3383
--- a/paper.tex
+++ b/paper.tex
@@ -45,8 +45,7 @@ the Xilinx PCI-Express core,   a Linux driver for register access, and high-
 
				 level software to manage direct memory transfers using AMD's DirectGMA
			
 
				 technology. Measurements with a Gen3\,x8 link show a throughput of 6.4~GB/s
			
 
				 for transfers to GPU memory and 6.6~GB/s to system memory.  We
			
 
				-also evaluated DirectGMA performance for low latency applications: preliminary
			
 
				-results show a round-trip latency of 2 \textmu s for data sizes up to 4 kB.
			
 
				+also evaluated DirectGMA performance for low latency applications: preliminary measurements show a round-trip latency of 2 \textmu s for data transfers up to 4 kB. However, the latency introduced by OpenCL scheduling is on the order of 100 \textmu s.
			
 
				 Our implementation is suitable for real-time DAQ system applications ranging
			
 
				 from photon science and medical imaging to High Energy Physics (HEP) trigger
			
 
				 systems. }
			
@@ -348,18 +347,17 @@ The different slope and saturation point is a direct consequence of the differen
 
				 
			
 
				 \begin{figure}[t]
			
 
				   \centering
			
 
				-  \begin{subfigure}[b]{.45\textwidth}
			
 
				+  \begin{subfigure}[b]{.49\textwidth}
			
 
				     \centering
			
 
				     \includegraphics[width=\textwidth]{figures/latency}
			
 
				-    \caption{Latency }
			
 
				     \label{fig:latency_vs_size}
			
 
				   \end{subfigure}
			
 
				-  \begin{subfigure}[b]{.45\textwidth}
			
 
				+  \begin{subfigure}[b]{.49\textwidth}
			
 
				     \includegraphics[width=\textwidth]{figures/latency-hist}
			
 
				-    \caption{Latency distribution.}
			
 
				     \label{fig:latency_hist}
			
 
				   \end{subfigure}
			
 
				-  \label{fig:latency}
			
 
				+  \caption{Measured round-trip latency for FPGA to main memory data transfers.}
			
 
				+  \label{fig:cpu_latency}
			
 
				 \end{figure}
			
 
				 
			
 
				 
			
@@ -373,14 +371,12 @@ A counter on the FPGA measures the time interval between the \emph{start\_dma}
 
				 and \emph{stop\_dma} commands with a resolution of 4 ns, therefore measuring
			
 
				 the round-trip latency of the system.
			
 
				 
			
 
				-The distribution of 10000 measurements of the one-way latency is shown in
			
 
				-\figref{fig:latency-hist}.  The GPU latency has a mean value of 84.38 \textmu
			
 
				-s and a standard variation of 6.34 \textmu s. This is 9.73 \% slower than the
			
 
				-CPU latency of 76.89 \textmu s that was measured using the same driver and
			
 
				-measuring procedure. The non-Gaussian distribution with two distinct peaks
			
 
				-indicates a systemic influence that we cannot control and is most likely
			
 
				-caused by the non-deterministic run-time behaviour of the operating system
			
 
				-scheduler.
			
 
				+The results of 1000 measurements of the round-trip latency are shown in
			
 
				+\figref{fig:latency_hist}. The latency has a mean value of XXX \textmu s and a
			
 
				+standard deviation of XXX \textmu s. The non-Gaussian distribution indicates a
			
 
				+systemic influence that we cannot control and is most likely caused by the
			
 
				+non-deterministic run-time behaviour of the operating system scheduler.
			
 
				+
			
 
				 
			
 
				 \section{Conclusion and outlook}
			
 
				 
			
@@ -396,10 +392,14 @@ system a cost-effective alternative to larger workstations.
 
				  
			
 
				 We also evaluated the performance of DirectGMA technology for low latency
			
 
				 applications. Preliminary results indicate that latencies as low as 2 \textmu
			
 
				-s can be achieved in  data transfer to GPU memory. As opposed to the previous
			
 
				-case, measurements show that, for latency applications, dedicated hardware
			
 
				-must be used in order to achieve the best performance. Optimization of the
			
 
				-GPU-DMA interfacing code is ongoing with the help of technical support by AMD.
			
 
				+s can be achieved in data transfer to system main memory. However, at the time
			
 
				+of writing this paper, the latency introduced by DirectGMA OpenCL functions is
			
 
				+in the range of hundreds of \textmu s.  Optimization of the GPU-DMA
			
 
				+interfacing code is ongoing with the help of technical support by AMD, in
			
 
				+order to lift the limitation introduced by OpenCL scheduling and make our
			
 
				+implementation suitable for low latency applications. Moreover, as opposed to
			
 
				+the previous case, measurements show that for low latency applications
			
 
				+dedicated hardware must be used in order to achieve the best performance.
			
 
				 
			
 
				 In order to increase the total throughput, a custom FPGA evaluation board is
			
 
				 currently under development. The board mounts a Virtex-7 chip and features two