@@ -111,19 +111,16 @@ links~\cite{nieto2015high}. Their system (as limited by the interconnect)
achieves an average throughput of 870 MB/s with 1 KB block transfers.

In order to achieve the best performance in terms of latency and bandwidth, we
-developed a high-performance DMA engine based on Xilinx's PCIe Gen3 Core.
-
-To process the data, we encapsulated the DMA setup and memory mapping in a
-plugin for our scalable GPU processing framework~\cite{vogelgesang2012ufo}.
-This framework allows for an easy construction of streamed data processing on
-heterogeneous multi-GPU systems. The framework is based on OpenCL, and
-integration with NVIDIA's CUDA functions for GPUDirect technology is not
-possible.
-
-We therefore integrated direct FPGA-to-GPU communication into our processing
-pipeline using AMD's DirectGMA technology. In this paper we report the
-performance of our DMA engine for FPGA-to-CPU communication and the first
-preliminary results with DirectGMA technology.
+developed a high-performance DMA engine based on Xilinx's PCIe Gen3 Core. To
+process the data, we encapsulated the DMA setup and memory mapping in a plugin
+for our scalable GPU processing framework~\cite{vogelgesang2012ufo}. This
+framework allows for easy construction of streamed data processing on
+heterogeneous multi-GPU systems. Since the framework is based on OpenCL,
+integration with NVIDIA's CUDA functions for GPUDirect technology is not
+possible. We therefore integrated direct FPGA-to-GPU communication into our
+processing pipeline using AMD's DirectGMA technology. In this paper we report
+the performance of our DMA engine for FPGA-to-CPU communication and some
+preliminary measurements of DirectGMA performance in low-latency applications.
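+
+As a minimal illustration of this mechanism, the following sketch shows how a
+GPU buffer can be exposed to the FPGA with the OpenCL
+\texttt{cl\_amd\_bus\_addressable\_memory} extension underlying DirectGMA. It
+is not our actual plugin code: platform, context and queue setup are assumed,
+error handling is omitted, and \texttt{fpga\_set\_target()} is a hypothetical
+stand-in for programming the DMA engine with the returned bus address.
+
+\begin{verbatim}
+#include <CL/cl.h>
+#include <CL/cl_ext.h> /* AMD extension declarations: cl_bus_address_amd,
+                          CL_MEM_BUS_ADDRESSABLE_AMD */
+
+/* Hypothetical driver helper: point the FPGA DMA engine at a bus address. */
+extern void fpga_set_target(cl_ulong bus_address);
+
+typedef cl_int (*MakeResidentFn)(cl_command_queue, cl_uint, cl_mem *,
+                                 cl_bool, cl_bus_address_amd *, cl_uint,
+                                 const cl_event *, cl_event *);
+
+static cl_mem expose_gpu_buffer(cl_platform_id platform, cl_context ctx,
+                                cl_command_queue queue, size_t size)
+{
+    cl_int err;
+
+    /* Allocate GPU memory that can be mapped onto the PCIe bus. */
+    cl_mem buf = clCreateBuffer(ctx, CL_MEM_BUS_ADDRESSABLE_AMD,
+                                size, NULL, &err);
+
+    /* Resolve the extension entry point at run time. */
+    MakeResidentFn make_resident = (MakeResidentFn)
+        clGetExtensionFunctionAddressForPlatform(
+            platform, "clEnqueueMakeBuffersResidentAMD");
+
+    /* Pin the buffer and obtain its physical bus address. */
+    cl_bus_address_amd addr;
+    make_resident(queue, 1, &buf, CL_TRUE, &addr, 0, NULL, NULL);
+
+    /* The FPGA can now write directly into GPU memory. */
+    fpga_set_target(addr.surface_bus_address);
+    return buf;
+}
+\end{verbatim}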

\section{Architecture}

@@ -254,41 +251,55 @@ E5-1630 at 3.7 GHz, Intel C612 chipset running openSUSE 13.1 with Linux
-3.0 x8 slots. In case of FPGA-to-CPU data transfers, the software
+3.0 x8 slots. In the case of FPGA-to-CPU data transfers, the software
implementation is the one described in~\cite{rota2015dma}.

+\subsection{Throughput}
+
\begin{figure}
- \centering
- \begin{subfigure}[b]{.49\textwidth}
- \centering
- \includegraphics[width=\textwidth]{figures/throughput}
- \caption{%
- DMA data transfer throughput.
- }
- \label{fig:throughput}
- \end{subfigure}
- \begin{subfigure}[b]{.49\textwidth}
- \includegraphics[width=\textwidth]{figures/latency}
- \caption{%
- Latency distribution.
- % for a single 4 KB packet transferred
- % from FPGA-to-CPU and FPGA-to-GPU.
- }
- \label{fig:latency}
- \end{subfigure}
+ \includegraphics[width=\textwidth]{figures/throughput}
\caption{%
Measured results for data transfers from FPGA to main memory
(CPU) and from FPGA to the global GPU memory (GPU).
- }
+}
+\label{fig:throughput}
\end{figure}
-The measured results for the pure data throughput is shown in
+The measured results for the pure data throughput are shown in
\figref{fig:throughput} for transfers from the FPGA to the system's main
-memory as well as to the global memory as explained in \ref{sec:host}. As one
-can see, in both cases the write performance is primarily limited by the PCIe
-bus. Higher payloads make up for the constant overhead thus increasing the net
-bandwidth. Up until 2 MB data transfer size, the throughput to the GPU is
-approaching slowly 100 MB/s. From there on, the throughput increases up to 6.4
-GB/s when PCIe bus saturation sets in at about 1 GB data size. The CPU
-throughput saturates earlier at about 30 MB but the maximum throughput is
-limited to about 6 GB/s losing about 6\% write performance.
+memory as well as to the global memory as explained in Section~\ref{sec:host}.
+
+In the case of FPGA-to-GPU data transfers, the double-buffering solution was
+used. As one can see, in both cases the write performance is primarily limited
+by the PCIe bus. For transfer sizes up to 2 MB, the throughput to the GPU
+slowly approaches 100 MB/s. From there on, the throughput increases up to
+6.4 GB/s, with PCIe bus saturation setting in at a data size of about 1 GB.
+The CPU throughput saturates earlier, but its maximum throughput is 6.6 GB/s.
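+
+A minimal sketch of this double-buffering scheme is shown below. It assumes
+two bus-addressable buffers set up as in the previous listing;
+\texttt{fpga\_fill()}, \texttt{fpga\_wait\_done()} and \texttt{gpu\_process()}
+are hypothetical stand-ins for starting an FPGA transfer, waiting for its
+completion and running the OpenCL processing kernels on the received block.
+
+\begin{verbatim}
+#include <stddef.h>
+#include <CL/cl.h>
+
+extern void fpga_fill(cl_mem buf);      /* hypothetical: start DMA write  */
+extern void fpga_wait_done(cl_mem buf); /* hypothetical: await completion */
+extern void gpu_process(cl_mem buf);    /* hypothetical: process a block  */
+
+/* While the FPGA fills one buffer, the GPU processes the other. */
+static void stream_blocks(cl_mem buf[2], size_t n_blocks)
+{
+    if (n_blocks == 0)
+        return;
+
+    fpga_fill(buf[0]);                    /* start the first transfer */
+
+    for (size_t i = 0; i < n_blocks; i++) {
+        cl_mem cur = buf[i % 2];
+
+        fpga_wait_done(cur);              /* block i has arrived */
+        if (i + 1 < n_blocks)
+            fpga_fill(buf[(i + 1) % 2]);  /* overlap the next transfer */
+        gpu_process(cur);                 /* with processing of block i */
+    }
+}
+\end{verbatim}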

% We repeated the FPGA-to-GPU measurements on a low-end Supermicro X7SPA-HF-D525
% system based on an Intel Atom CPU. The results showed no significant difference

@@ -325,6 +336,8 @@ latency.
% three sizes and an increasing block size.

+\subsection{Latency}
+
\begin{figure}
\includegraphics[width=\textwidth]{figures/latency-hist}
\caption{%
|