
everything sucks

Lorenzo 8 years ago
parent
commit
59921d3383
1 changed file with 19 additions and 19 deletions

+ 19 - 19
paper.tex

@@ -45,8 +45,7 @@ the Xilinx PCI-Express core,   a Linux driver for register access, and high-
 level software to manage direct   memory transfers using AMD's DirectGMA
 technology. Measurements with a Gen3\,x8 link show a throughput of 6.4~GB/s
 for transfers to GPU memory and 6.6~GB/s to system memory.  We
-also evaluated DirectGMA performance for low latency applications: preliminary
-results show a round-trip latency of 2 \textmu s for data sizes up to 4 kB.
+also evaluated DirectGMA performance for low latency applications: preliminary measurements show a round-trip latency of 2 \textmu s for data transfers up to 4 kB. However, the latency introduced by OpenCL scheduling is on the order of 100 \textmu s.
 Our implementation is suitable for real-time DAQ system applications ranging
 from photon science and medical imaging to High Energy Physics (HEP) trigger
 systems. }
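
For context on the throughput figures quoted above (standard PCIe Gen3 parameters, not stated in the paper): each lane runs at 8 GT/s with 128b/130b encoding, so a x8 link has a theoretical payload bandwidth of about 7.88 GB/s, which puts the measured 6.4 GB/s at roughly 81% of the link limit:

\[
8\,\mathrm{GT/s} \times \frac{128}{130} \times 8~\text{lanes} \times \frac{1\,\mathrm{B}}{8\,\mathrm{b}} \approx 7.88\,\mathrm{GB/s},
\qquad
\frac{6.4}{7.88} \approx 81\,\%.
\]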
@@ -348,18 +347,17 @@ The different slope and saturation point is a direct consequence of the differen
 
 \begin{figure}[t]
   \centering
-  \begin{subfigure}[b]{.45\textwidth}
+  \begin{subfigure}[b]{.49\textwidth}
     \centering
     \includegraphics[width=\textwidth]{figures/latency}
-    \caption{Latency }
     \label{fig:latency_vs_size}
   \end{subfigure}
-  \begin{subfigure}[b]{.45\textwidth}
+  \begin{subfigure}[b]{.49\textwidth}
     \includegraphics[width=\textwidth]{figures/latency-hist}
-    \caption{Latency distribution.}
     \label{fig:latency_hist}
   \end{subfigure}
-  \label{fig:latency}
+  \caption{Measured round-trip latency for FPGA-to-main-memory data transfers.}
+  \label{fig:cpu_latency}
 \end{figure}
 
 
@@ -373,14 +371,12 @@ A counter on the FPGA measures the time interval between the \emph{start\_dma}
 and \emph{stop\_dma} commands with a resolution of 4 ns, therefore measuring
 the round-trip latency of the system.
 
-The distribution of 10000 measurements of the one-way latency is shown in
-\figref{fig:latency-hist}.  The GPU latency has a mean value of 84.38 \textmu
-s and a standard variation of 6.34 \textmu s. This is 9.73 \% slower than the
-CPU latency of 76.89 \textmu s that was measured using the same driver and
-measuring procedure. The non-Gaussian distribution with two distinct peaks
-indicates a systemic influence that we cannot control and is most likely
-caused by the non-deterministic run-time behaviour of the operating system
-scheduler.
+The results of 1000 measurements of the round-trip latency are shown in
+\figref{fig:cpu_latency}. The latency has a mean value of XXX \textmu s and a
+standard deviation of XXX \textmu s. The non-Gaussian distribution indicates a
+systematic influence that we cannot control and is most likely caused by the
+non-deterministic run-time behaviour of the operating system scheduler.
+
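
To make the counter conversion concrete, here is a minimal C sketch (illustrative only, not the authors' code): it turns a reading of the 4 ns FPGA counter into a round-trip latency in microseconds, with a hypothetical tick value chosen to reproduce the 2 \textmu s figure reported in the abstract.

#include <stdint.h>
#include <stdio.h>

#define TICK_NS 4.0  /* counter resolution stated in the text: 4 ns */

/* Convert a raw FPGA counter reading into microseconds. */
static double ticks_to_us(uint64_t ticks)
{
    return ticks * TICK_NS / 1000.0;
}

int main(void)
{
    uint64_t ticks = 500;  /* hypothetical reading: 500 ticks * 4 ns = 2 us */
    printf("round-trip latency: %.3f us\n", ticks_to_us(ticks));
    return 0;
}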
 
 \section{Conclusion and outlook}
 
@@ -396,10 +392,14 @@ system a cost-effective alternative to larger workstations.
  
 We also evaluated the performance of DirectGMA technology for low latency
 applications. Preliminary results indicate that latencies as low as 2 \textmu
-s can be achieved in  data transfer to GPU memory. As opposed to the previous
-case, measurements show that, for latency applications, dedicated hardware
-must be used in order to achieve the best performance. Optimization of the
-GPU-DMA interfacing code is ongoing with the help of technical support by AMD.
+s can be achieved in data transfers to system main memory. However, at the
+time of writing, the latency introduced by the DirectGMA OpenCL functions is
+in the range of hundreds of \textmu s. Optimization of the GPU-DMA
+interfacing code is ongoing, with technical support from AMD, in order to
+lift the limitation introduced by OpenCL scheduling and make our
+implementation suitable for low latency applications. Moreover, in contrast
+to the throughput case, measurements show that dedicated hardware must be
+used in order to achieve the best latency performance.
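
To illustrate how this scheduling overhead can be quantified, the sketch below uses standard OpenCL event profiling; this is an assumed measurement approach, not the authors' benchmark, and error handling is omitted for brevity. The gap between the QUEUED and START timestamps of a 4 kB transfer exposes the delay added by the OpenCL runtime.

#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue q =
        clCreateCommandQueue(ctx, device, CL_QUEUE_PROFILING_ENABLE, NULL);

    char data[4096] = {0};  /* 4 kB, the transfer size used in the paper */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, sizeof data, NULL, NULL);

    /* Blocking write: the call returns only after the transfer completed. */
    cl_event ev;
    clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, sizeof data, data, 0, NULL, &ev);

    cl_ulong queued, start, end;
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_QUEUED,
                            sizeof queued, &queued, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof start, &start, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                            sizeof end, &end, NULL);

    /* Timestamps are in nanoseconds; queued -> start is the scheduling
     * latency of the runtime, start -> end the transfer itself. */
    printf("scheduling delay: %.1f us, transfer: %.1f us\n",
           (start - queued) / 1e3, (end - start) / 1e3);

    clReleaseEvent(ev);
    clReleaseMemObject(buf);
    clReleaseCommandQueue(q);
    clReleaseContext(ctx);
    return 0;
}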
 
 In order to increase the total throughput, a custom FPGA evaluation board is
 currently under development. The board mounts a Virtex-7 chip and features two