|
@@ -279,22 +279,11 @@ Python.
|
|
|
%% --------------------------------------------------------------------------
|
|
|
\section{Results}
|
|
|
|
|
|
-We carried out performance measurements on two different setups, which are
|
|
|
-described in table~\ref{table:setups}. In both setups, a Xilinx VC709
|
|
|
-evaluation board was used. In Setup 1, the FPGA board and the GPU were plugged
|
|
|
-into a PCIe 3.0 slot, but they were connected to different PCIe Root Complexes
|
|
|
-(RC). In Setup 2, a low-end Supermicro X7SPA-HF-D525 system was connected to a
|
|
|
-Netstor NA255A xeternal PCIe enclosure, where both the FPGA board and the GPU
|
|
|
-were connected to the same RC, as opposed to Setup 1. As stated in the
|
|
|
-NVIDIA's GPUDirect documentation, the devices must share the same RC to
|
|
|
-achieve the best performance~\cite{cuda_doc}. In case of FPGA-to-CPU data
|
|
|
-transfers, the software implementation is the one described
|
|
|
-in~\cite{rota2015dma}.
|
|
|
|
|
|
-\begin{table}[]
|
|
|
+\begin{table}[b]
|
|
|
\centering
|
|
|
\small
|
|
|
-\caption{Description of the measurement setup}
|
|
|
+\caption{Setups used for throughput and latency measurements}
|
|
|
\label{table:setups}
|
|
|
\tabcolsep=0.11cm
|
|
|
\begin{tabular}{@{}llll@{}}
|
|
@@ -304,12 +293,24 @@ in~\cite{rota2015dma}.
|
|
|
CPU & Intel Xeon E5-1630 & Intel Atom D525 \\
|
|
|
Chipset & Intel C612 & Intel ICH9R Express \\
|
|
|
GPU & AMD FirePro W9100 & AMD FirePro W9100 \\
|
|
|
-PCIe slot: System memory & x8 Gen3 (same RC) & x4 Gen1 (different RC) \\
|
|
|
+PCIe slot: System memory & x8 Gen3 & x4 Gen1 \\
|
|
|
PCIe slot: FPGA \& GPU & x8 Gen3 (different RC) & x8 Gen3 (same RC) \\
|
|
|
\bottomrule
|
|
|
\end{tabular}
|
|
|
\end{table}
|
|
|
|
|
|
+We carried out performance measurements on two different setups, which are
|
|
|
+described in table~\ref{table:setups}. In both setups, a Xilinx VC709
|
|
|
+evaluation board was used. In Setup 1, the FPGA board and the GPU were each plugged
|
|
|
+into a PCIe 3.0 slot, but they were connected to different PCIe Root Complexes
|
|
|
+(RC). In Setup 2, a low-end Supermicro X7SPA-HF-D525 system was connected to a
|
|
|
+Netstor NA255A external PCIe enclosure, where both the FPGA board and the GPU
|
|
|
+were connected to the same RC, as opposed to Setup 1. As stated in
|
|
|
+NVIDIA's GPUDirect documentation, the devices must share the same RC to
|
|
|
+achieve the best performance~\cite{cuda_doc}. In the case of FPGA-to-CPU data
|
|
|
+transfers, the software implementation is the one described
|
|
|
+in~\cite{rota2015dma}.
|
|
|
+
|
|
|
%% --------------------------------------------------------------------------
|
|
|
\subsection{Throughput}
|
|
|
|
|
@@ -322,7 +323,7 @@ PCIe slot: FPGA \& GPU & x8 Gen3 (different RC) & x8 Gen3 (same RC) \\
|
|
|
\label{fig:throughput}
|
|
|
\end{figure}
|
|
|
|
|
|
-In order to evaluate the performance of the DMA engine, measurements of pure
|
|
|
+In order to evaluate the maximum performance of the DMA engine, measurements of pure
|
|
|
data throughput were carried out using Setup 1. The results are shown in
|
|
|
\figref{fig:throughput} for transfers to the system's main memory as well as
|
|
|
to the global memory. For FPGA-to-GPU data transfers bigger than 95 MB, the
|
|
@@ -342,36 +343,57 @@ handshaking sequence between DMA engine and the hosts.
|
|
|
\centering
|
|
|
\begin{subfigure}[b]{.49\textwidth}
|
|
|
\centering
|
|
|
- \includegraphics[width=\textwidth]{figures/latency}
|
|
|
- \label{fig:latency_vs_size}
|
|
|
+ \includegraphics[width=\textwidth]{figures/latency-cpu}
|
|
|
+
|
|
|
+ \vspace{-0.4\baselineskip}
|
|
|
+ \caption{}
|
|
|
+ \label{fig:latency-cpu}
|
|
|
\end{subfigure}
|
|
|
\begin{subfigure}[b]{.49\textwidth}
|
|
|
- \includegraphics[width=\textwidth]{figures/latency-hist}
|
|
|
- \label{fig:latency_hist}
|
|
|
- \end{subfigure}
|
|
|
- \caption{Measured round-trip latency for FPGA to main memory data transfers.}
|
|
|
- \label{fig:cpu_latency}
|
|
|
+ \includegraphics[width=\textwidth]{figures/latency-gpu}
|
|
|
+
|
|
|
+ \vspace{-0.4\baselineskip}
|
|
|
+ \caption{}
|
|
|
+ \label{fig:latency-gpu}
|
|
|
+ \end{subfigure}
|
|
|
+ \caption{Measured round-trip latency for data transfers to system main memory (a) and GPU memory (b).}
|
|
|
+ \label{fig:latency}
|
|
|
\end{figure}
|
|
|
|
|
|
|
|
|
We conducted the following test in order to measure the latency introduced by the DMA engine:
|
|
|
-1) the host starts a DMA transfer by issuing the \emph{start\_dma} command
|
|
|
-2) the DMA engine transmits data into the system main memory
|
|
|
-3) the DMA engine notifies the host that new data is present by writing into a specific address in the system main memory
|
|
|
-4) the host acknowledges that data has been received by issuing the the \emph{stop\_dma} command
|
|
|
+1) the host starts a DMA transfer by issuing the \emph{start\_dma} command.
|
|
|
+2) the DMA engine transmits data into the system main memory.
|
|
|
+3) when all the data has been transferred, the DMA engine notifies the host that new data is present by writing into a specific address in the system main memory.
|
|
|
+4) the host acknowledges that data has been received by issuing the \emph{stop\_dma} command.
|
|
|
|
|
|
+The correct ordering of the packets is ensured by the PCIe protocol.
|
|
|
A counter on the FPGA measures the time interval between the \emph{start\_dma}
|
|
|
and \emph{stop\_dma} commands with a resolution of 4 ns, therefore measuring
|
|
|
-the round-trip latency of the system.
|
|
|
-
|
|
|
-The results of 1000 measurements of the round-trip latency using system memory
|
|
|
-are shown in \figref{fig:latency-hist}. The latency for Setup 1 and Setup 2
|
|
|
-are, respectively, XXX \textmu s and XXX \textmu s. The non-Gaussian
|
|
|
-distribution indicates a systemic influence that we cannot control and is most
|
|
|
-likely caused by the non-deterministic run-time behaviour of the operating
|
|
|
-system scheduler.
|
|
|
-
|
|
|
-
|
|
|
+the round-trip latency of the system. The round-trip latencies for data
|
|
|
+transfers to system main memory and GPU memory are shown in
|
|
|
+\figref{fig:latency}.
|
|
|
+
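+To make the procedure concrete, a simplified host-side test loop is sketched
+below in Python. The device node, register offsets and buffer layout are
+placeholders chosen for illustration only and do not correspond to the actual
+interface of the driver described in~\cite{rota2015dma}.
+
+\begin{verbatim}
+# Sketch of the round-trip latency test; all names and offsets are
+# hypothetical placeholders for the real driver interface.
+import mmap
+import os
+import struct
+
+DEV        = "/dev/fpga_dma0"  # hypothetical device node
+REG_START  = 0x00              # hypothetical start_dma register
+REG_STOP   = 0x04              # hypothetical stop_dma register
+REG_TIMER  = 0x08              # hypothetical 4 ns round-trip counter
+NOTIFY_OFF = 0x1000            # hypothetical offset of the notification word
+
+fd  = os.open(DEV, os.O_RDWR | os.O_SYNC)
+bar = mmap.mmap(fd, 0x2000)    # control registers followed by the DMA buffer
+
+bar[NOTIFY_OFF:NOTIFY_OFF + 4] = b"\x00" * 4          # clear notification
+bar[REG_START:REG_START + 4]   = struct.pack("<I", 1) # 1) start_dma
+while bar[NOTIFY_OFF:NOTIFY_OFF + 4] == b"\x00" * 4:  # 2)-3) wait for the
+    pass                                              #    notification write
+bar[REG_STOP:REG_STOP + 4] = struct.pack("<I", 1)     # 4) stop_dma
+
+ticks = struct.unpack("<I", bar[REG_TIMER:REG_TIMER + 4])[0]
+print("round-trip latency: %.3f us" % (ticks * 0.004))  # one tick = 4 ns
+\end{verbatim}
+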
|
|
|
+When system main memory is used,
|
|
|
+latencies as low as 1.1 \textmu s are achieved with Setup 1 for a packet size
|
|
|
+of 1024 B. The higher latency and the dependence on size measured with Setup 2
|
|
|
+are caused by the slower PCIe x4 Gen1 link connecting the FPGA board to the system main memory.
|
|
|
+
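+As a rough cross-check, ignoring protocol and handshaking overhead, the x4
+Gen1 link of Setup 2 (2.5 GT/s per lane with 8b/10b encoding) sustains about
+1 GB/s, while the x8 Gen3 link of Setup 1 reaches about 7.9 GB/s, so the wire
+time of a 1024 B packet alone is
+\[
+  t_{\mathrm{x4\,Gen1}} \approx \frac{1024\,\mathrm{B}}{1\,\mathrm{GB/s}}
+  \approx 1\,\mu\mathrm{s},
+  \qquad
+  t_{\mathrm{x8\,Gen3}} \approx \frac{1024\,\mathrm{B}}{7.9\,\mathrm{GB/s}}
+  \approx 0.13\,\mu\mathrm{s}.
+\]
+The packet transmission time is therefore a visible fraction of the measured
+round-trip latency only in Setup 2, consistent with the observed dependence
+on size.
+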
|
|
|
+The same test was performed when transferring data to GPU memory, but also
|
|
|
+in this case the notification is written to system main memory. This approach
|
|
|
+was used because the latency introduced by OpenCL scheduling ($\sim$100~\textmu
|
|
|
+s) does not allow for a direct measurement based only on DirectGMA
|
|
|
+communication. When connecting the devices to the same RC, as in Setup 2, a
|
|
|
+latency of 2 \textmu s is achieved (limited by the latency to system main
|
|
|
+memory, as seen in \figref{fig:latency}a). On the contrary, if the FPGA board
|
|
|
+and the GPU are connected to different RCs, as in Setup 1, the latency increases
|
|
|
+significantly. It must be noted that the low latencies measured with Setup 1
|
|
|
+for packet sizes below 1 kB seem to be due to a caching mechanism inside the
|
|
|
+PCIe switch, and it is not clear whether data has been successfully written
|
|
|
+into GPU memory when the notification is delivered to the CPU. This effect
|
|
|
+must be taken into account for future implementations, as it could potentially
|
|
|
+lead to data corruption.
|
|
|
+
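+One possible safeguard, sketched below with \texttt{pyopencl}, is to have the
+FPGA append a known marker word to each transfer and to read back the end of
+the GPU buffer before trusting the notification; both the marker convention
+and the buffer handling are assumptions made for illustration, and the
+creation of the DirectGMA buffer itself is omitted.
+
+\begin{verbatim}
+# Sketch of a consistency check: trust the notification only if the last
+# word of the GPU buffer holds the marker the FPGA is assumed to append.
+import numpy as np
+import pyopencl as cl
+
+ctx   = cl.create_some_context()
+queue = cl.CommandQueue(ctx)
+
+BUF_SIZE = 1024   # transfer size used in the latency test
+# Plain buffer standing in for the DirectGMA-exposed GPU buffer.
+gpu_buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=BUF_SIZE)
+
+def transfer_complete(expected_marker):
+    data = np.empty(BUF_SIZE // 4, dtype=np.uint32)
+    cl.enqueue_copy(queue, data, gpu_buf).wait()  # blocking read-back
+    return int(data[-1]) == expected_marker
+\end{verbatim}
+
+Since the read-back itself goes through the OpenCL queue, such a check is
+suitable only for validation and debugging, not for the low-latency path
+itself.
+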
|
|
|
\section{Conclusion and outlook}
|
|
|
|
|
|
We developed a hardware and software solution that enables DMA transfers
|
|
@@ -387,14 +409,13 @@ system a cost-effective alternative to larger workstations.
|
|
|
We measured a round-trip latency of 1 \textmu s when transferring data between
|
|
|
the DMA engine and system main memory. We also assessed the applicability of
|
|
|
DirectGMA in low latency applications: preliminary results show that
|
|
|
-latencies as low as 2 \textmu s can by achieved when writing data in the GPU
|
|
|
+latencies as low as 2 \textmu s can be achieved during data transfers to GPU
|
|
|
memory. However, at the time of writing this paper, the latency introduced by
|
|
|
-OpenCL scheduling is in the range of hundreds of \textmu s. Therefore,
|
|
|
-optimization of the GPU- DMA interfacing OpenCL code is ongoing with the help
|
|
|
-of technical support by AMD, in order to lift the current limitation and
|
|
|
-enable the use of our implementation in low latency applications. Moreover,
|
|
|
-measurements show that for low latency applications dedicated hardware must be
|
|
|
-used in order to achieve the best performance.
|
|
|
+OpenCL scheduling is in the range of hundreds of \textmu s. Optimization of
|
|
|
+the OpenCL code interfacing the GPU with the DMA engine is ongoing with the help of technical
|
|
|
+support from AMD, in order to lift the current limitation and enable the use of
|
|
|
+our implementation in low latency applications. Moreover, measurements show
|
|
|
+that dedicated hardware must be employed in order to achieve the lowest latency.
|
|
|
|
|
|
In order to increase the total throughput, a custom FPGA evaluation board is
|
|
|
currently under development. The board mounts a Virtex-7 chip and features two
|
|
@@ -415,13 +436,9 @@ of hardware by different vendors. Further improvements are expected by
|
|
|
generalizing the transfer mechanism and including InfiniBand support besides the
|
|
|
existing PCIe connection.
|
|
|
|
|
|
-%% Where do we get this values? Any reference?
|
|
|
-%This allows
|
|
|
-%speeds of up to 290 Gb/s and latencies as low as 0.5 \textmu s.
|
|
|
-
|
|
|
-Our goal is to develop a unique hybrid solution, based on commercial standards,
|
|
|
-that includes fast data transmission protocols and a high performance GPU
|
|
|
-computing framework.
|
|
|
+Our goal is to develop a unique hybrid solution,
|
|
|
+based on commercial standards, that includes fast data transmission protocols
|
|
|
+and a high performance GPU computing framework.
|
|
|
|
|
|
|
|
|
\acknowledgments
|