@@ -249,13 +249,20 @@ the FPGA on successful or failed evaluation of the data. Using the
entire memory regions in DMA fashion to the FPGA. In this case, the GPU acts
as bus master and pushes data to the FPGA.

-%% Double Buffering strategy. Removed figure.
-Due to hardware restrictions the largest possible GPU buffer sizes are about
-95 MB but larger transfers can be achieved by using a double buffering
-mechanism. Because the GPU provides a flat memory address space and our DMA
-engine allows multiple destination addresses to be set in advance, we can
-determine all addresses before the actual transfers thus keeping the CPU out
-of the transfer loop for data sizes less than 95 MB.
+%% Double Buffering strategy.
+
+Due to hardware restrictions, the largest possible GPU buffer size is about
+95 MB, but larger transfers can be achieved with a double buffering mechanism:
+data are copied from the DirectGMA buffer exposed to the FPGA into a different
+GPU buffer. To verify that we can keep up with the incoming data throughput
+using this strategy, we measured the data throughput within a GPU by copying
+data from a smaller buffer representing the DMA buffer to a larger destination
+buffer. At a block size of about 384 KB the throughput surpasses the maximum
+possible PCIe bandwidth, and it reaches 40 GB/s for blocks larger than 5 MB.
+Double buffering is therefore a viable solution for very large data transfers,
+where throughput is favoured over latency. For data sizes of less than 95 MB,
+we can determine all addresses before the actual transfers, thus keeping the
+CPU out of the transfer loop.

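+As an illustration of the double-buffering copy, and not the implementation
+used in our software, an intra-GPU block copy with the OpenCL host API could
+look as follows (buffer and queue handles are assumed to exist elsewhere; the
+function and variable names are hypothetical):
+
+\begin{verbatim}
+/* Sketch: copy one block from the DirectGMA-exposed DMA buffer into a
+ * larger destination buffer.  The copy stays on the device, so no host
+ * round trip is involved. */
+#include <CL/cl.h>
+
+cl_int copy_block(cl_command_queue queue, cl_mem dma_buf, cl_mem dst_buf,
+                  size_t block_size, size_t block_index)
+{
+    size_t dst_offset = block_index * block_size;  /* position in dst_buf */
+
+    return clEnqueueCopyBuffer(queue, dma_buf, dst_buf,
+                               0,           /* source offset */
+                               dst_offset,  /* destination offset */
+                               block_size, 0, NULL, NULL);
+}
+\end{verbatim}
+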
%% Ufo Framework
To process the data, we encapsulated the DMA setup and memory mapping in a
@@ -286,37 +293,13 @@ Python.
We carried out performance measurements on two different setups, which are
described in table~\ref{table:setups}. A Xilinx VC709 evaluation board was
used in both setups. In Setup 1, the FPGA and the GPU were plugged into a PCIe
-3.0 slot.
-%LR: explain this root-complex shit here
- In Setup 2, a low-end Supermicro X7SPA-HF-D525 system was connected
-to a Netstor NA255A external PCIe enclosure. In case of FPGA-to-CPU data
-transfers, the software implementation is the one described
+3.0 slot, but the two devices were connected to different PCIe Root Complexes
+(RC). In Setup 2, a low-end Supermicro X7SPA-HF-D525 system was connected to a
+Netstor NA255A external PCIe enclosure, where, as opposed to Setup 1, both the
+FPGA board and the GPU were connected to the same RC. In the case of
+FPGA-to-CPU data transfers, the software implementation is the one described
in~\cite{rota2015dma}.

-% \begin{table}[]
-% \centering
-% \caption{Resource utilization on a Virtex7 device X240VT}
-% \label{table:utilization}
-% \tabcolsep=0.11cm
-% \small
-% \begin{tabular}{@{}llll@{}}
-% \toprule
-% Resource & Utilization & Utilization \% \\
-% \midrule
-% LUT & 5331 & 1.23 \\
-% LUTRAM & 56 & 0.03 \\
-% FF & 5437 & 0.63 \\
-% BRAM & 20.50 & 1.39 \\
-% % Resource & Utilization & Available & Utilization \% \\
-% % \midrule
-% % LUT & 5331 & 433200 & 1.23 \\
-% % LUTRAM & 56 & 174200 & 0.03 \\
-% % FF & 5437 & 866400 & 0.63 \\
-% % BRAM & 20.50 & 1470 & 1.39 \\
-% \bottomrule
-% \end{tabular}
-% \end{table}
-
\begin{table}[]
\centering
\small
@@ -334,14 +317,11 @@ PCIe slot: System memory & x8 Gen3 (same RC) & x4 Gen1 (different RC) \\
PCIe slot: FPGA \& GPU & x8 Gen3 (different RC) & x8 Gen3 (same RC) \\
\bottomrule
\end{tabular}
-
\end{table}

-
+%% --------------------------------------------------------------------------
\subsection{Throughput}

-
-
\begin{figure}[t]
\includegraphics[width=0.85\textwidth]{figures/throughput}
\caption{%
@@ -352,48 +332,20 @@ PCIe slot: FPGA \& GPU & x8 Gen3 (different RC) & x8 Gen3 (same RC) \\
\end{figure}

The measured results for the pure data throughput is shown in
-\figref{fig:throughput} for transfers from the FPGA to the system's main
-memory as well as to the global memory as explained in \ref{sec:host}. In the
-case of FPGA-to-GPU data transfers, the double buffering solution was used:
-data are copied from the buffer exposed to FPGA into a different buffer.
+\figref{fig:throughput} for transfers to the system's main
+memory as well as to the global memory. In the
+case of FPGA-to-GPU data transfers, the double buffering solution was used.
As one can see, in both cases the write performance is primarily limited by
the PCIe bus. Up until 2 MB data transfer size, the throughput to the GPU is
approaching slowly 100 MB/s. From there on, the throughput increases up to 6.4
GB/s when PCIe bus saturation sets in at about 1 GB data size. The CPU
throughput saturates earlier but the maximum throughput is 6.6 GB/s.
+The different slopes and saturation points are a direct consequence of the
+different handshaking implementations.
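+
+For reference, the theoretical limit of an x8 Gen3 link follows from the line
+rate and the 128b/130b encoding; this is a back-of-the-envelope figure, not a
+number taken from our measurements:
+\[
+  8~\mathrm{lanes} \times 8~\mathrm{GT/s} \times \tfrac{128}{130}
+  \times \tfrac{1}{8}~\mathrm{B/bit} \approx 7.9~\mathrm{GB/s},
+\]
+so the measured 6.4--6.6 GB/s corresponds to roughly 80--85\,\% of the raw
+link capacity.
+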
-
-% \begin{figure}
-% \includegraphics[width=\textwidth]{figures/intra-copy}
-% \caption{%
-% Throughput in MB/s for an intra-GPU data transfer of smaller block sizes
-% (4KB -- 24 MB) into a larger destination buffer (32 MB -- 128 MB). The lower
-% performance for smaller block sizes is caused by the larger amount of
-% transfers required to fill the destination buffer. The throughput has been
-% estimated using the host side wall clock time. The raw GPU data transfer as
-% measured per event profiling is about twice as fast.
-% }
-% \label{fig:intra-copy}
-% \end{figure}
-
-In order to write more than the maximum possible transfer size of 95 MB, we
-repeatedly wrote to the same sized buffer which is not possible in a real-
-world application. As a solution, we motivated the use of multiple copies in
-Section \ref{sec:host}. To verify that we can keep up with the incoming data
-throughput using this strategy, we measured the data throughput within a GPU
-by copying data from a smaller sized buffer representing the DMA buffer to a
-larger destination buffer. At a block size of about 384 KB the throughput
-surpasses the maximum possible PCIe bandwidth, and it reaches 40 GB/s for
-blocks bigger than 5 MB. Double buffering is therefore a viable solution for
-very large data transfers, where throughput performance is favoured over
-latency.
-
-% \figref{fig:intra-copy} shows the measured throughput for
-% three sizes and an increasing block size.
-
-
+%% --------------------------------------------------------------------------
\subsection{Latency}
+
\begin{figure}[t]
\centering
\begin{subfigure}[b]{.45\textwidth}
@@ -410,24 +362,25 @@ latency.
\label{fig:latency}
\end{figure}

-For HEP experiments, low latencies are necessary to react in a reasonable time
-frame. In order to measure the latency caused by the communication overhead we
-conducted the following protocol: 1) the host issues continuous data transfers
-of a 4 KB buffer that is initialized with a fixed value to the FPGA using the
-\texttt{cl\-Enqueue\-Copy\-Buffer} call. 2) when the FPGA receives data in its
-input FIFO it moves it directly to the output FIFO which feeds the outgoing DMA
-engine thus pushing back the data to the GPU. 3) At some point, the host enables
-generation of data different from initial value which also starts an internal
-FPGA counter with 4 ns resolution. 4) When the generated data is received again
-at the FPGA, the counter is stopped. 5) The host program reads out the counter
-values and computes the round-trip latency. The distribution of 10000
-measurements of the one-way latency is shown in \figref{fig:latency-hist}.
-[\textbf{REWRITE THIS PART}] The GPU latency has a mean value of 84.38 \textmu s
-and a standard variation of 6.34 \textmu s. This is 9.73 \% slower than the CPU
-latency of 76.89 \textmu s that was measured using the same driver and measuring
-procedure. The non-Gaussian distribution with two distinct peaks indicates a
-systemic influence that we cannot control and is most likely caused by the
-non-deterministic run-time behaviour of the operating system scheduler.
+
+We conducted the following test in order to measure the latency introduced by the DMA engine:
+1) the host starts a DMA transfer by issuing the \emph{start\_dma} command;
+2) the DMA engine transmits data into the system main memory;
+3) the DMA engine notifies the host that new data is present by writing into a specific address in the system main memory;
+4) the host acknowledges that data has been received by issuing the \emph{stop\_dma} command.
+
+A counter on the FPGA measures the time interval between the \emph{start\_dma}
+and \emph{stop\_dma} commands with a resolution of 4 ns, therefore measuring
+the round-trip latency of the system.
+
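+As an illustration only, and not the measurement software used here, the
+conversion from counter readings to a latency figure could be done as in the
+following C sketch; the 4 ns tick period is the one quoted above, while the
+input format and all names are assumptions:
+
+\begin{verbatim}
+/* Illustrative only: convert FPGA counter readings (4 ns per tick) into
+ * microseconds and accumulate mean and standard deviation over the
+ * repetitions. */
+#include <inttypes.h>
+#include <math.h>
+#include <stdio.h>
+
+#define TICK_NS 4.0   /* FPGA counter resolution */
+
+int main(void)
+{
+    uint64_t ticks;
+    double sum = 0.0, sum_sq = 0.0;
+    unsigned long n = 0;
+
+    /* One counter value per line, as read back from the FPGA. */
+    while (scanf("%" SCNu64, &ticks) == 1) {
+        double us = (double)ticks * TICK_NS / 1000.0;  /* round trip in us */
+        sum += us;
+        sum_sq += us * us;
+        n++;
+    }
+    if (n > 0) {
+        double mean = sum / n;
+        double var  = sum_sq / n - mean * mean;
+        printf("round trip: mean %.2f us, std %.2f us (n = %lu)\n",
+               mean, sqrt(var > 0.0 ? var : 0.0), n);
+    }
+    return 0;
+}
+\end{verbatim}
+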
+The distribution of 10000 measurements of the one-way latency is shown in
+\figref{fig:latency-hist}. The GPU latency has a mean value of 84.38 \textmu s
+and a standard deviation of 6.34 \textmu s. This is 9.73 \% slower than the
+CPU latency of 76.89 \textmu s that was measured using the same driver and
+measuring procedure. The non-Gaussian distribution with two distinct peaks
+indicates a systemic influence that we cannot control and is most likely
+caused by the non-deterministic run-time behaviour of the operating system
+scheduler.

\section{Conclusion and outlook}
@@ -444,11 +397,9 @@ system a cost-effective alternative to larger workstations.
We also evaluated the performance of DirectGMA technology for low latency
applications. Preliminary results indicate that latencies as low as 2 \textmu
s can be achieved in data transfer to GPU memory. As opposed to the previous
-case, for latency applications measurements show that dedicated hardware is
-required in order to achieve the best performance. Optimization of the GPU-DMA
-interfacing code is ongoing with the help of technical support by AMD. With a
-better understanding of the hardware and software aspects of DirectGMA, we
-expect a significant improvement in the latency performance.
+case, measurements show that, for latency applications, dedicated hardware
+must be used in order to achieve the best performance. Optimization of the
+GPU-DMA interfacing code is ongoing with technical support from AMD.

In order to increase the total throughput, a custom FPGA evaluation board is
currently under development. The board mounts a Virtex-7 chip and features two