@@ -249,13 +249,20 @@ the FPGA on successful or failed evaluation of the data. Using the
entire memory regions in DMA fashion to the FPGA. In this case, the GPU acts
as bus master and pushes data to the FPGA.

-%% Double Buffering strategy. Removed figure.
-Due to hardware restrictions the largest possible GPU buffer sizes are about
-95 MB but larger transfers can be achieved by using a double buffering
-mechanism. Because the GPU provides a flat memory address space and our DMA
-engine allows multiple destination addresses to be set in advance, we can
-determine all addresses before the actual transfers thus keeping the CPU out
-of the transfer loop for data sizes less than 95 MB.
+%% Double Buffering strategy.
+
+Due to hardware restrictions, the largest possible GPU buffer size is about
+95 MB, but larger transfers can be achieved with a double buffering mechanism:
+data are copied from the DirectGMA buffer exposed to the FPGA into a different
+GPU buffer. To verify that we can keep up with the incoming data throughput
+using this strategy, we measured the data throughput within a GPU by copying
+data from a smaller buffer representing the DMA buffer to a larger destination
+buffer. At a block size of about 384 KB the throughput surpasses the maximum
+possible PCIe bandwidth, and it reaches 40 GB/s for blocks larger than 5 MB.
+Double buffering is therefore a viable solution for very large data transfers,
+where throughput is favoured over latency. For data sizes of less than 95 MB,
+we can determine all addresses before the actual transfers, thus keeping the
+CPU out of the transfer loop.

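+As an illustration of the double-buffering copy, and not the implementation
+used in our software, an intra-GPU block copy with the OpenCL host API could
+look as follows (buffer and queue handles are assumed to exist elsewhere; the
+function and variable names are hypothetical):
+
+\begin{verbatim}
+/* Sketch: copy one block from the DirectGMA-exposed DMA buffer into a
+ * larger destination buffer.  The copy stays on the device, so no host
+ * round trip is involved. */
+#include <CL/cl.h>
+
+cl_int copy_block(cl_command_queue queue, cl_mem dma_buf, cl_mem dst_buf,
+                  size_t block_size, size_t block_index)
+{
+    size_t dst_offset = block_index * block_size;  /* position in dst_buf */
+
+    return clEnqueueCopyBuffer(queue, dma_buf, dst_buf,
+                               0,           /* source offset */
+                               dst_offset,  /* destination offset */
+                               block_size, 0, NULL, NULL);
+}
+\end{verbatim}
+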
%% Ufo Framework
To process the data, we encapsulated the DMA setup and memory mapping in a
@@ -286,37 +293,13 @@ Python.
We carried out performance measurements on two different setups, which are
described in table~\ref{table:setups}. A Xilinx VC709 evaluation board was
used in both setups. In Setup 1, the FPGA and the GPU were plugged into a PCIe
-3.0 slot.
-%LR: explain this root-complex shit here
- In Setup 2, a low-end Supermicro X7SPA-HF-D525 system was connected
-to a Netstor NA255A external PCIe enclosure. In case of FPGA-to-CPU data
-transfers, the software implementation is the one described
+3.0 slot, but the two devices were connected to different PCIe Root Complexes
+(RC). In Setup 2, a low-end Supermicro X7SPA-HF-D525 system was connected to a
+Netstor NA255A external PCIe enclosure, where, as opposed to Setup 1, both the
+FPGA board and the GPU were connected to the same RC. In the case of
+FPGA-to-CPU data transfers, the software implementation is the one described
in~\cite{rota2015dma}.

-% \begin{table}[]
-% \centering
-% \caption{Resource utilization on a Virtex7 device X240VT}
-% \label{table:utilization}
-% \tabcolsep=0.11cm
-% \small
-% \begin{tabular}{@{}llll@{}}
-% \toprule
-% Resource & Utilization & Utilization \% \\
-% \midrule
-% LUT & 5331 & 1.23 \\
-% LUTRAM & 56 & 0.03 \\
-% FF & 5437 & 0.63 \\
-% BRAM & 20.50 & 1.39 \\
-% % Resource & Utilization & Available & Utilization \% \\
-% % \midrule
-% % LUT & 5331 & 433200 & 1.23 \\
-% % LUTRAM & 56 & 174200 & 0.03 \\
-% % FF & 5437 & 866400 & 0.63 \\
-% % BRAM & 20.50 & 1470 & 1.39 \\
-% \bottomrule
-% \end{tabular}
-% \end{table}
-
\begin{table}[]
\centering
\small
@@ -334,14 +317,11 @@ PCIe slot: System memory & x8 Gen3 (same RC) & x4 Gen1 (different RC) \\
PCIe slot: FPGA \& GPU & x8 Gen3 (different RC) & x8 Gen3 (same RC) \\
\bottomrule
\end{tabular}
-
\end{table}

-
+%% --------------------------------------------------------------------------
\subsection{Throughput}

-
-
\begin{figure}[t]
\includegraphics[width=0.85\textwidth]{figures/throughput}
\caption{%
@@ -352,48 +332,20 @@ PCIe slot: FPGA \& GPU & x8 Gen3 (different RC) & x8 Gen3 (same RC) \\
\end{figure}

The measured results for the pure data throughput is shown in
-\figref{fig:throughput} for transfers from the FPGA to the system's main
-memory as well as to the global memory as explained in \ref{sec:host}. In the
-case of FPGA-to-GPU data transfers, the double buffering solution was used:
-data are copied from the buffer exposed to FPGA into a different buffer.
+\figref{fig:throughput} for transfers to the system's main
+memory as well as to the global memory. In the
+case of FPGA-to-GPU data transfers, the double buffering solution was used.
As one can see, in both cases the write performance is primarily limited by
the PCIe bus. Up until 2 MB data transfer size, the throughput to the GPU is
approaching slowly 100 MB/s. From there on, the throughput increases up to 6.4
GB/s when PCIe bus saturation sets in at about 1 GB data size. The CPU
throughput saturates earlier but the maximum throughput is 6.6 GB/s.
+The different slopes and saturation points are a direct consequence of the
+different handshaking implementations.
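+
+For reference, the theoretical limit of an x8 Gen3 link follows from the line
+rate and the 128b/130b encoding; this is a back-of-the-envelope figure, not a
+number taken from our measurements:
+\[
+  8~\mathrm{lanes} \times 8~\mathrm{GT/s} \times \tfrac{128}{130}
+  \times \tfrac{1}{8}~\mathrm{B/bit} \approx 7.9~\mathrm{GB/s},
+\]
+so the measured 6.4--6.6 GB/s corresponds to roughly 80--85\,\% of the raw
+link capacity.
+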
-
-% \begin{figure}
-% \includegraphics[width=\textwidth]{figures/intra-copy}
-% \caption{%
-% Throughput in MB/s for an intra-GPU data transfer of smaller block sizes
-% (4KB -- 24 MB) into a larger destination buffer (32 MB -- 128 MB). The lower
-% performance for smaller block sizes is caused by the larger amount of
-% transfers required to fill the destination buffer. The throughput has been
-% estimated using the host side wall clock time. The raw GPU data transfer as
-% measured per event profiling is about twice as fast.
-% }
-% \label{fig:intra-copy}
-% \end{figure}
-
-In order to write more than the maximum possible transfer size of 95 MB, we
-repeatedly wrote to the same sized buffer which is not possible in a real-
-world application. As a solution, we motivated the use of multiple copies in
-Section \ref{sec:host}. To verify that we can keep up with the incoming data
-throughput using this strategy, we measured the data throughput within a GPU
-by copying data from a smaller sized buffer representing the DMA buffer to a
-larger destination buffer. At a block size of about 384 KB the throughput
-surpasses the maximum possible PCIe bandwidth, and it reaches 40 GB/s for
-blocks bigger than 5 MB. Double buffering is therefore a viable solution for
-very large data transfers, where throughput performance is favoured over
-latency.
-
-% \figref{fig:intra-copy} shows the measured throughput for
-% three sizes and an increasing block size.
-
-
+%% --------------------------------------------------------------------------
\subsection{Latency}
+
\begin{figure}[t]
\centering
\begin{subfigure}[b]{.45\textwidth}
@@ -410,24 +362,25 @@ latency.
\label{fig:latency}
\end{figure}

-For HEP experiments, low latencies are necessary to react in a reasonable time
-frame. In order to measure the latency caused by the communication overhead we
-conducted the following protocol: 1) the host issues continuous data transfers
-of a 4 KB buffer that is initialized with a fixed value to the FPGA using the
-\texttt{cl\-Enqueue\-Copy\-Buffer} call. 2) when the FPGA receives data in its
-input FIFO it moves it directly to the output FIFO which feeds the outgoing DMA
-engine thus pushing back the data to the GPU. 3) At some point, the host enables
-generation of data different from initial value which also starts an internal
-FPGA counter with 4 ns resolution. 4) When the generated data is received again
-at the FPGA, the counter is stopped. 5) The host program reads out the counter
-values and computes the round-trip latency. The distribution of 10000
-measurements of the one-way latency is shown in \figref{fig:latency-hist}.
-[\textbf{REWRITE THIS PART}] The GPU latency has a mean value of 84.38 \textmu s
-and a standard variation of 6.34 \textmu s. This is 9.73 \% slower than the CPU
-latency of 76.89 \textmu s that was measured using the same driver and measuring
-procedure. The non-Gaussian distribution with two distinct peaks indicates a
-systemic influence that we cannot control and is most likely caused by the
-non-deterministic run-time behaviour of the operating system scheduler.
+
+We conducted the following test in order to measure the latency introduced by the DMA engine:
+1) the host starts a DMA transfer by issuing the \emph{start\_dma} command;
+2) the DMA engine transmits data into the system main memory;
+3) the DMA engine notifies the host that new data is present by writing into a specific address in the system main memory;
+4) the host acknowledges that data has been received by issuing the \emph{stop\_dma} command.
+
+A counter on the FPGA measures the time interval between the \emph{start\_dma}
+and \emph{stop\_dma} commands with a resolution of 4 ns, therefore measuring
+the round-trip latency of the system.
+
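+As an illustration only, and not the measurement software used here, the
+conversion from counter readings to a latency figure could be done as in the
+following C sketch; the 4 ns tick period is the one quoted above, while the
+input format and all names are assumptions:
+
+\begin{verbatim}
+/* Illustrative only: convert FPGA counter readings (4 ns per tick) into
+ * microseconds and accumulate mean and standard deviation over the
+ * repetitions. */
+#include <inttypes.h>
+#include <math.h>
+#include <stdio.h>
+
+#define TICK_NS 4.0   /* FPGA counter resolution */
+
+int main(void)
+{
+    uint64_t ticks;
+    double sum = 0.0, sum_sq = 0.0;
+    unsigned long n = 0;
+
+    /* One counter value per line, as read back from the FPGA. */
+    while (scanf("%" SCNu64, &ticks) == 1) {
+        double us = (double)ticks * TICK_NS / 1000.0;  /* round trip in us */
+        sum += us;
+        sum_sq += us * us;
+        n++;
+    }
+    if (n > 0) {
+        double mean = sum / n;
+        double var  = sum_sq / n - mean * mean;
+        printf("round trip: mean %.2f us, std %.2f us (n = %lu)\n",
+               mean, sqrt(var > 0.0 ? var : 0.0), n);
+    }
+    return 0;
+}
+\end{verbatim}
+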
+The distribution of 10000 measurements of the one-way latency is shown in
+\figref{fig:latency-hist}. The GPU latency has a mean value of 84.38 \textmu s
+and a standard deviation of 6.34 \textmu s. This is 9.73 \% slower than the
+CPU latency of 76.89 \textmu s that was measured using the same driver and
+measuring procedure. The non-Gaussian distribution with two distinct peaks
+indicates a systemic influence that we cannot control and is most likely
+caused by the non-deterministic run-time behaviour of the operating system
+scheduler.

\section{Conclusion and outlook}
@@ -444,11 +397,9 @@ system a cost-effective alternative to larger workstations.
We also evaluated the performance of DirectGMA technology for low latency
applications. Preliminary results indicate that latencies as low as 2 \textmu
s can be achieved in data transfer to GPU memory. As opposed to the previous
-case, for latency applications measurements show that dedicated hardware is
-required in order to achieve the best performance. Optimization of the GPU-DMA
-interfacing code is ongoing with the help of technical support by AMD. With a
-better understanding of the hardware and software aspects of DirectGMA, we
-expect a significant improvement in the latency performance.
+case, measurements show that, for latency applications, dedicated hardware
+must be used in order to achieve the best performance. Optimization of the
+GPU-DMA interfacing code is ongoing with technical support from AMD.

In order to increase the total throughput, a custom FPGA evaluation board is
currently under development. The board mounts a Virtex-7 chip and features two