@@ -5,6 +5,7 @@
\usepackage{ifthen}
\usepackage{caption}
\usepackage{subcaption}
+\usepackage{textcomp}

\newboolean{draft}
\setboolean{draft}{true}
@@ -93,7 +94,7 @@ integration~\cite{lonardo2015nanet}. Due to its design, the bandwidth saturates
at 120 MB/s for a 1472 byte large UDP datagram. Moreover, the system is based on
a commercial PCIe engine. Other solutions achieve higher throughput based on
Xilinx (CITE TWEPP DMA WURTT??) or Altera devices (CITENICHOLASPAPER TNS), but
-they do not provide support for direct FPGA-GPU communication.
+do not support direct FPGA-to-GPU communication.

\section{Architecture}
@@ -155,7 +156,7 @@ the FPGA to GPU memory and from the GPU to the FPGA's control registers.
the GPU, the physical bus addresses of the GPU buffers are determined with a call to
\texttt{clEnqueue\-Make\-Buffers\-Resident\-AMD} and set by the host CPU in a
control register of the FPGA (1). The FPGA then writes data blocks autonomously
-in DMA fashion (2).
+in DMA fashion (2).
% BUDDHA: This part is not true. We need to always do the handshaking if we transfer
% more than 95 MBs, which I assume is always the case, otherwise we owuld not need a DMA...
% MV: stop assuming ... I am not saying that there is no handshaking involved.
@@ -251,17 +252,13 @@ there on, the throughput increases up to 6.4 GB/s when PCIe bus saturation sets
in at about 1 GB data size. The CPU throughput saturates earlier at about 30 MB
but the maximum throughput is limited to about 6 GB/s losing about 6\% write
performance.
-
-In order to write more than the maximum possible transfer size of 95 MB, we
-repeatedly wrote to the same sized buffer which is not possible in a real-world
-application. As a solution, we motivated the use of multiple copies in Section
-\ref{sec:host}. To verify that we can keep up with the incoming data throughput
-using this strategy, we measured intra-GPU data throughput by copying data from
-a smaller sized buffer representing the DMA buffer to a larger destination
-buffer. \figref{fig:intra-copy} shows the measured throughput for three sizes
-and an increasing block size. At a block size of about 384 KB, the throughput
-surpasses the maximum possible PCIe bandwidth, thus making a double buffering
-strategy a viable solution for very large data transfers.
+%% Change the specs for the small crate
+% MV: who has these specs?
+We repeated the FPGA-to-GPU measurements on a low-end system based on XXX and
+Intel Nano XXXX; the results show no significant difference compared to the
+previous setup. Depending on the application and computing requirements, this
+result makes smaller acquisition systems a cost-effective alternative to
+larger workstations.

\begin{figure}
\includegraphics[width=\textwidth]{figures/intra-copy}
@@ -276,28 +273,40 @@ strategy a viable solution for very large data transfers.
\label{fig:intra-copy}
\end{figure}

+In order to write more than the maximum possible transfer size of 95 MB, we
+repeatedly wrote to the same buffer, which is not possible in a real-world
+application. As a solution, we motivated the use of multiple copies in Section
+\ref{sec:host}. To verify that we can keep up with the incoming data throughput
+using this strategy, we measured the data throughput within a GPU by copying
+data from a smaller buffer representing the DMA buffer to a larger destination
+buffer. \figref{fig:intra-copy} shows the measured throughput for three sizes
+and an increasing block size. At a block size of about 384 KB, the throughput
+surpasses the maximum possible PCIe bandwidth, thus making a double buffering
+strategy a viable solution for very large data transfers.
+
For HEP experiments, low latencies are necessary to react in a reasonable time
-frame. he distribution of latency is shown in Figure \ref{fig:latency}.
-\figref{fig:latency} shows the one-way latency for 4 KB data transfers from FPGA
-to system and GPU memory. % Explain experiment setup in detail.
+frame. In order to measure the latency caused by the communication overhead, we
+used the following protocol: 1) the host issues continuous transfers of a 4 KB
+buffer initialized with a fixed value to the FPGA using the
+\texttt{cl\-Enqueue\-Copy\-Buffer} call. 2) when the FPGA receives data in its
+input FIFO, it moves the data directly to the output FIFO, which feeds the
+outgoing DMA engine, thus pushing the data back to the GPU. 3) at some point,
+the host enables generation of data that differs from the initial value, which
+also starts an internal FPGA counter with 4 ns resolution. 4) when the
+generated data is received again at the FPGA, the counter is stopped. 5) the
+host program reads out the counter values and computes the round-trip latency.
+The distribution of 10000 measurements of the one-way latency is shown in
+\figref{fig:latency}. The GPU latency has a mean value of 168.76 \textmu s and
+a standard deviation of 12.68 \textmu s. This is 9.73\% slower than the CPU
+latency of 153.79 \textmu s measured with the same driver and procedure. The
+non-Gaussian distribution with two distinct peaks indicates a systematic
+influence that we cannot control; it is most likely caused by the
+non-deterministic run-time behaviour of the operating system scheduler.

% \textbf{LR: We should measure the slope for different page sizes, I expect the
% saturation point to change for different page sizes, MV: if you want to do it
% you are more than welcome ...}

-
-
-%% Change the specs for the small crate
-% MV: we never did anything in that regard
-% LR: Nicholas did, and he said there was no difference in FPGA-GPU
-
-% For FPGA-to-GPU transfers, we also repeated the measurements using a low-end system
-% based on XXX and Intel Nano XXXX. The results does not show any significant difference
-% compared to the previous setup, making it a more cost-effective solution.
-
-
-%% Latency distribution plot: perfect! we should also add as legend avg=168.x us, sigma=2 us, max=180 us
-
%% Here: instead of this useless plot, we can plot the latency vs different data
%% sizes transmitted (from FPGA). It should reach 50% less for large data
%% transfers, even with our current limitation... Maybe we can also try on a normal
@@ -308,7 +317,7 @@ to system and GPU memory. % Explain experiment setup in detail.
% \centering
% \includegraphics[width=0.6\textwidth]{figures/latency}
% \caption{%
-% For data transfers larger than XX MB, latency is decreased by XXX percent with respect to the traditional approach (a) by using our implementation (b).
+% For data transfers larger than XX MB, latency is decreased by XXX percent with respect to the traditional approach (a) by using our implementation (b).
% }
% \label{fig:latency}
% \end{figure}
@@ -320,10 +329,10 @@ to system and GPU memory. % Explain experiment setup in detail.
% less then 1 $\mu$s. Therefore, the current performance bottleneck lies in the
% execution of DirectGMA functions.
% LA: Is this from a reference or we meassure it?
-
+
% LA: This time that you are showing here does not correlate with the measurements we were taking.
% this 1 us is the round time inside the FPGA for the memory read, totally dependent on the FPGA,
-%
+%
% The times we are plotting in FIG 5 are the round trips inside the GPU not the FPGA
% Yesterday I took the same measurement with the System Memory (CPU) and the values are not that different,
% you can see the file out_cpu.txt values clustering around 150 us (CPU) instead of 170 us (GPU)
@@ -355,14 +364,15 @@ possible.
A custom FPGA evaluation board is currently under development in order to
increase the total throughput. The board mounts a Virtex-7 chip and features 2
fully populated FMC connectors, a 119 Gb/s DDR memory interface and a PCIe Gen3
-x16 connection. Two PCIe x8 cores, instantiated on the board, will be mapped as a
-single x16 device by using an external PCIe switch. With two cores operating in parallel,
-we foresee an increase in the data throughput by a factor of 2 (as demonstrated in~\cite{rota2015dma}).
+x16 connection. Two PCIe x8 cores, instantiated on the board, will be mapped as
+a single x16 device by using an external PCIe switch. With two cores operating
+in parallel, we foresee an increase in the data throughput by a factor of 2 (as
+demonstrated in~\cite{rota2015dma}).

\textbf{LR: Instead of swapping PCIe-infinib, I would say include it in the architecture.
A big house for all these love-lacking protocols.}

-It is our intention to add Infiniband support.
+It is our intention to add Infiniband support.
% Could you stop screaming? And for starters, you could have just done the research
% yourself ...
% \textbf{I NEED TO READ
|