@@ -45,8 +45,7 @@ the Xilinx PCI-Express core, a Linux driver for register access, and high-
level software to manage direct memory transfers using AMD's DirectGMA
technology. Measurements with a Gen3\,x8 link show a throughput of 6.4~GB/s
for transfers to GPU memory and 6.6~GB/s to system memory. We
-also evaluated DirectGMA performance for low latency applications: preliminary
-results show a round-trip latency of 2 \textmu s for data sizes up to 4 kB.
+also evaluated DirectGMA performance for low-latency applications: preliminary measurements show a round-trip latency of 2~\textmu s for data transfers up to 4~kB. However, the latency introduced by OpenCL scheduling is on the order of 100~\textmu s.
Our implementation is suitable for real-time DAQ system applications ranging
from photon science and medical imaging to High Energy Physics (HEP) trigger
systems. }
@@ -348,18 +347,17 @@ The different slope and saturation point is a direct consequence of the differen

\begin{figure}[t]
\centering
- \begin{subfigure}[b]{.45\textwidth}
+ \begin{subfigure}[b]{.49\textwidth}
\centering
\includegraphics[width=\textwidth]{figures/latency}
- \caption{Latency }
\label{fig:latency_vs_size}
\end{subfigure}
- \begin{subfigure}[b]{.45\textwidth}
+ \begin{subfigure}[b]{.49\textwidth}
\includegraphics[width=\textwidth]{figures/latency-hist}
- \caption{Latency distribution.}
\label{fig:latency_hist}
\end{subfigure}
- \label{fig:latency}
+ \caption{Measured round-trip latency for FPGA to main memory data transfers.}
+ \label{fig:cpu_latency}
\end{figure}


@@ -373,14 +371,12 @@ A counter on the FPGA measures the time interval between the \emph{start\_dma}
and \emph{stop\_dma} commands with a resolution of 4 ns, therefore measuring
the round-trip latency of the system.
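As a rough illustration of this conversion, and assuming (this is not stated
explicitly above) that the counter advances by one count every 4~ns, a counter
reading of $N_{\mathrm{cnt}}$ would correspond to a round-trip latency of
\[
  t_{\mathrm{rt}} = N_{\mathrm{cnt}} \times 4~\mathrm{ns},
\]
so that, for example, a latency of 2~\textmu s would correspond to
$N_{\mathrm{cnt}} = 500$, and each individual measurement is quantized in steps
of 4~ns. The symbols $t_{\mathrm{rt}}$ and $N_{\mathrm{cnt}}$ are introduced
here for illustration only.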

-The distribution of 10000 measurements of the one-way latency is shown in
-\figref{fig:latency-hist}. The GPU latency has a mean value of 84.38 \textmu
-s and a standard variation of 6.34 \textmu s. This is 9.73 \% slower than the
-CPU latency of 76.89 \textmu s that was measured using the same driver and
-measuring procedure. The non-Gaussian distribution with two distinct peaks
-indicates a systemic influence that we cannot control and is most likely
-caused by the non-deterministic run-time behaviour of the operating system
-scheduler.
+The results of 1000 measurements of the round-trip latency are shown in
+\figref{fig:latency_hist}. The latency has a mean value of XXX \textmu s and a
+standard deviation of XXX \textmu s. The non-Gaussian distribution indicates a
+systemic influence that we cannot control and is most likely caused by the
+non-deterministic run-time behaviour of the operating system scheduler.
+

\section{Conclusion and outlook}

@@ -396,10 +392,14 @@ system a cost-effective alternative to larger workstations.

We also evaluated the performance of DirectGMA technology for low latency
applications. Preliminary results indicate that latencies as low as 2 \textmu
-s can be achieved in data transfer to GPU memory. As opposed to the previous
-case, measurements show that, for latency applications, dedicated hardware
-must be used in order to achieve the best performance. Optimization of the
-GPU-DMA interfacing code is ongoing with the help of technical support by AMD.
+s can be achieved in data transfers to system main memory. However, at the time
+of writing, the latency introduced by the DirectGMA OpenCL functions is
+in the range of hundreds of \textmu s. Optimization of the GPU-DMA
+interfacing code is ongoing with technical support from AMD, in
+order to lift the limitation introduced by OpenCL scheduling and make our
+implementation suitable for low-latency applications. Moreover, as opposed to
+the previous case, measurements show that for low-latency applications
+dedicated hardware must be used in order to achieve the best performance.

In order to increase the total throughput, a custom FPGA evaluation board is
currently under development. The board mounts a Virtex-7 chip and features two