
Improve latency paragraph

Matthias Vogelgesang 8 years ago
parent
commit
31932012e8
1 changed file with 46 additions and 36 deletions

+ 46 - 36
paper.tex

@@ -5,6 +5,7 @@
 \usepackage{ifthen}
 \usepackage{caption}
 \usepackage{subcaption}
+\usepackage{textcomp}
 
 \newboolean{draft}
 \setboolean{draft}{true}
@@ -93,7 +94,7 @@ integration~\cite{lonardo2015nanet}.  Due to its design, the bandwidth saturates
 at 120 MB/s for a 1472 byte large UDP datagram. Moreover, the system is based on
 a commercial PCIe engine.  Other solutions achieve higher throughput based on
 Xilinx (CITE TWEPP DMA WURTT??) or Altera devices (CITENICHOLASPAPER TNS), but
-they do not provide support for direct FPGA-GPU communication.
+do not support direct FPGA-to-GPU communication.
 
 
 \section{Architecture}
@@ -155,7 +156,7 @@ the FPGA to GPU memory and from the GPU to the FPGA's control registers.
 the GPU, the physical bus addresses of the GPU buffers are determined with a call to
 \texttt{clEnqueue\-Make\-Buffers\-Resident\-AMD} and set by the host CPU in a
 control register of the FPGA (1). The FPGA then writes data blocks autonomously
-in DMA fashion (2). 
+in DMA fashion (2).
 % BUDDHA: This part is not true. We need to always do the handshaking if we transfer
 % more than 95 MBs, which I assume is always the case, otherwise we owuld not need a DMA...
 % MV: stop assuming ... I am not saying that there is no handshaking involved.
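
For reference, a minimal host-side sketch of step (1) is given below. It assumes the AMD cl_amd_bus_addressable_memory OpenCL extension (CL_MEM_BUS_ADDRESSABLE_AMD, cl_bus_address_amd, and clEnqueueMakeBuffersResidentAMD resolved at run time); the register offset and the write_fpga_register() helper are hypothetical stand-ins for whatever register interface the FPGA driver exposes, and error handling is omitted.

/* Sketch of DirectGMA setup step (1): pin a GPU buffer, obtain its physical
 * bus address and hand it to the FPGA. write_fpga_register() and the register
 * offset are hypothetical; the extension types are assumed to come from the
 * AMD cl_ext.h header. */
#include <CL/cl.h>
#include <CL/cl_ext.h>
#include <stdint.h>

#define FPGA_REG_GPU_DST_ADDR 0x10  /* hypothetical FPGA control register */

/* Assumed helper, e.g. a write to the FPGA's mmap'ed PCIe BAR. */
extern void write_fpga_register(uint32_t offset, uint64_t value);

typedef cl_int (*MakeBuffersResidentAMD_fn)(cl_command_queue, cl_uint, cl_mem *,
                                            cl_bool, cl_bus_address_amd *,
                                            cl_uint, const cl_event *, cl_event *);

static cl_mem setup_directgma_buffer(cl_platform_id platform, cl_context ctx,
                                     cl_command_queue queue, size_t size)
{
    cl_int err;

    /* Allocate a GPU buffer that can be exposed on the PCIe bus. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_BUS_ADDRESSABLE_AMD, size, NULL, &err);

    /* The extension entry point is resolved at run time. */
    MakeBuffersResidentAMD_fn make_resident = (MakeBuffersResidentAMD_fn)
        clGetExtensionFunctionAddressForPlatform(platform,
                                                 "clEnqueueMakeBuffersResidentAMD");

    /* Pin the buffer and query its physical bus address. */
    cl_bus_address_amd bus_addr;
    err = make_resident(queue, 1, &buf, CL_TRUE, &bus_addr, 0, NULL, NULL);
    (void) err;  /* error handling omitted in this sketch */

    /* Write the address into the FPGA control register so that the FPGA can
     * push data into the buffer autonomously, step (2). */
    write_fpga_register(FPGA_REG_GPU_DST_ADDR, bus_addr.surface_bus_address);

    return buf;
}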
@@ -251,17 +252,13 @@ there on, the throughput increases up to 6.4 GB/s when  PCIe bus saturation sets
 in at about 1 GB data size. The CPU throughput saturates earlier at about 30 MB
 but the maximum throughput is limited to about 6 GB/s losing about 6\% write
 performance.
-
-In order to write more than the maximum possible transfer size of 95 MB, we
-repeatedly wrote to the same sized buffer which is not possible in a real-world
-application. As a solution, we motivated the use of multiple copies in Section
-\ref{sec:host}. To verify that we can keep up with the incoming data throughput
-using this strategy, we measured intra-GPU data throughput by copying data from
-a smaller sized buffer representing the DMA buffer to a larger destination
-buffer.  \figref{fig:intra-copy} shows the measured throughput for three sizes
-and an increasing block size. At a block size of about 384 KB, the throughput
-surpasses the maximum possible PCIe bandwidth, thus making a double buffering
-strategy a viable solution for very large data transfers.
+%% Change the specs for the small crate
+% MV: who has these specs?
+We repeated the FPGA-to-GPU measurements on a low-end system based on XXX and
+Intel Nano XXXX; the results show no significant difference compared to the
+previous setup. Depending on the application and computing requirements, this
+result makes smaller acquisition systems a cost-effective alternative to larger
+workstations.
 
 \begin{figure}
   \includegraphics[width=\textwidth]{figures/intra-copy}
@@ -276,28 +273,40 @@ strategy a viable solution for very large data transfers.
   \label{fig:intra-copy}
 \end{figure}
 
+In order to write more than the maximum possible transfer size of 95 MB, we
+repeatedly wrote to the same buffer, which is not possible in a real-world
+application. As a solution, we motivated the use of multiple copies in Section
+\ref{sec:host}. To verify that we can keep up with the incoming data throughput
+using this strategy, we measured the data throughput within a GPU by copying
+data from a smaller buffer representing the DMA buffer to a larger destination
+buffer. \figref{fig:intra-copy} shows the measured throughput for three buffer
+sizes and an increasing block size. At a block size of about 384 KB, the
+throughput surpasses the maximum possible PCIe bandwidth, thus making a double
+buffering strategy a viable solution for very large data transfers.
+
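A minimal sketch of how such an intra-GPU copy measurement could look on the host side is given below; it is an illustration, not the paper's actual benchmark code. The wall_clock_seconds() timer is a hypothetical helper, and the DMA buffer size is assumed to be a multiple of the block size.

/* Sketch of the intra-GPU throughput measurement: copy data from a small
 * buffer (standing in for the DMA buffer) into a larger destination buffer
 * in blocks. wall_clock_seconds() is a hypothetical timing helper; dma_size
 * is assumed to be a multiple of block_size. */
#include <CL/cl.h>

extern double wall_clock_seconds(void);  /* assumed monotonic timer */

static double intra_gpu_copy_throughput(cl_command_queue queue,
                                        cl_mem dma_buf, size_t dma_size,
                                        cl_mem dst_buf, size_t dst_size,
                                        size_t block_size)
{
    size_t copied = 0;
    double start = wall_clock_seconds();

    for (size_t dst_off = 0; dst_off + block_size <= dst_size; dst_off += block_size) {
        /* Wrap around the small buffer, as if the FPGA kept refilling it. */
        size_t src_off = dst_off % dma_size;
        clEnqueueCopyBuffer(queue, dma_buf, dst_buf, src_off, dst_off,
                            block_size, 0, NULL, NULL);
        copied += block_size;
    }

    /* Wait for all copies to finish before stopping the clock. */
    clFinish(queue);

    return (double) copied / (wall_clock_seconds() - start);  /* bytes per second */
}

Comparing the rate returned by such a routine against the PCIe bandwidth is what motivates the double-buffering conclusion drawn from \figref{fig:intra-copy}.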
 For HEP experiments, low latencies are necessary to react in a reasonable time
-frame.  he distribution of latency is shown in Figure \ref{fig:latency}.
-\figref{fig:latency} shows the one-way latency for 4 KB data transfers from FPGA
-to system and GPU memory. % Explain experiment setup in detail.
+frame. To measure the latency caused by the communication overhead, we used the
+following protocol: 1) The host issues continuous transfers of a 4 KB buffer,
+initialized with a fixed value, to the FPGA using the
+\texttt{cl\-Enqueue\-Copy\-Buffer} call. 2) When the FPGA receives data in its
+input FIFO, it moves it directly to the output FIFO that feeds the outgoing DMA
+engine, thus pushing the data back to the GPU. 3) At some point, the host
+enables generation of data different from the initial value, which also starts
+an internal FPGA counter with 4 ns resolution. 4) When the generated data is
+received again at the FPGA, the counter is stopped. 5) The host program reads
+out the counter value and computes the round-trip latency. The distribution of
+10000 measurements of the one-way latency is shown in \figref{fig:latency}. The
+GPU latency has a mean value of 168.76 \textmu s and a standard deviation of
+12.68 \textmu s. This is 9.73\% slower than the CPU latency of 153.79 \textmu s
+that was measured using the same driver and measurement procedure. The
+non-Gaussian distribution with two distinct peaks indicates a systematic
+influence that we cannot control, most likely caused by the non-deterministic
+run-time behaviour of the operating system scheduler.
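
A rough host-side sketch of steps 1), 3) and 5) follows; it is an illustration rather than the authors' measurement code. The fpga_enable_pattern() and fpga_read_counter() helpers and the number of enqueued copies are assumptions; only clEnqueueCopyBuffer, the 4 KB transfer size and the 4 ns counter resolution are taken from the text. Steps 2) and 4) run in the FPGA firmware and are not shown.

/* Sketch of the host side of the latency protocol (steps 1, 3 and 5).
 * fpga_enable_pattern() and fpga_read_counter() are hypothetical register
 * accesses; fpga_in is assumed to be an OpenCL buffer backed by the FPGA's
 * input region. */
#include <CL/cl.h>
#include <stdint.h>

#define XFER_SIZE       4096   /* 4 KB transfers, as described above */
#define COUNTER_TICK_NS 4.0    /* resolution of the internal FPGA counter */
#define NUM_XFERS       1000   /* arbitrary number of copies per phase */

extern void     fpga_enable_pattern(void);  /* step 3: new data + start counter */
extern uint64_t fpga_read_counter(void);    /* step 5: counter value in ticks */

static double one_way_latency_us(cl_command_queue queue,
                                 cl_mem gpu_buf, cl_mem fpga_in)
{
    /* Step 1: continuous transfers of the buffer holding the fixed value. */
    for (int i = 0; i < NUM_XFERS; i++)
        clEnqueueCopyBuffer(queue, gpu_buf, fpga_in, 0, 0, XFER_SIZE,
                            0, NULL, NULL);
    clFlush(queue);

    /* Step 3: at some point, enable generation of different data on the
     * FPGA; this also starts its 4 ns counter. The FPGA keeps looping the
     * data back to the GPU (step 2) and stops the counter once the new data
     * reaches its input FIFO again (step 4). */
    fpga_enable_pattern();

    for (int i = 0; i < NUM_XFERS; i++)
        clEnqueueCopyBuffer(queue, gpu_buf, fpga_in, 0, 0, XFER_SIZE,
                            0, NULL, NULL);
    clFinish(queue);

    /* Step 5: read the counter and convert the round trip to a one-way time. */
    double round_trip_us = fpga_read_counter() * COUNTER_TICK_NS / 1000.0;
    return round_trip_us / 2.0;
}

Halving the measured round trip presumes both directions contribute equally, which is how the one-way figure reported above is presumably obtained.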
 
 % \textbf{LR: We should measure the slope for different page sizes, I expect the
 % saturation point to change for different page sizes, MV: if you want to do it
 % you are more than welcome ...}
 
-
-
-%% Change the specs for the small crate
-% MV: we never did anything in that regard
-% LR: Nicholas did, and he said there was no difference in FPGA-GPU
-
-% For FPGA-to-GPU transfers, we also repeated the measurements using a low-end system
-% based on XXX and Intel Nano XXXX. The results does not show any significant difference
-% compared to the previous setup, making it a more cost-effective solution.
-
-
-%% Latency distribution plot: perfect! we should also add as legend avg=168.x us, sigma=2 us, max=180 us
-
 %% Here: instead of this useless plot, we can plot the latency vs different data
 %% sizes transmitted (from FPGA). It should reach 50% less for large data
 %% transfers, even with our current limitation... Maybe we can also try on a normal
@@ -308,7 +317,7 @@ to system and GPU memory. % Explain experiment setup in detail.
 %   \centering
 %   \includegraphics[width=0.6\textwidth]{figures/latency}
 %   \caption{%
-%     For data transfers larger than XX MB, latency is decreased by XXX percent with respect to the traditional approach (a) by using our implementation (b). 
+%     For data transfers larger than XX MB, latency is decreased by XXX percent with respect to the traditional approach (a) by using our implementation (b).
 %   }
 %   \label{fig:latency}
 % \end{figure}
@@ -320,10 +329,10 @@ to system and GPU memory. % Explain experiment setup in detail.
 % less then 1 $\mu$s. Therefore, the current performance bottleneck lies in the
 % execution of DirectGMA functions.
 % LA: Is this from a reference or we meassure it?
- 
+
 % LA: This time that you are showing here does not correlate with the measurements we were taking.
 % this 1 us is the round time inside the FPGA for the memory read, totally dependent on the FPGA,
-% 
+%
 % The times we are plotting in FIG 5 are the round trips inside the GPU not the FPGA
 % Yesterday I took the same measurement with the System Memory (CPU) and the values are not that different,
 % you can see the file out_cpu.txt values clustering around 150 us (CPU) instead of 170 us (GPU)
@@ -355,14 +364,15 @@ possible.
 A custom FPGA evaluation board is currently under development in order to
 increase the total throughput. The board mounts a Virtex-7 chip and features 2
 fully populated FMC connectors, a 119 Gb/s DDR memory interface and a PCIe Gen3
-x16 connection. Two PCIe x8 cores, instantiated on the board, will be mapped as a
-single x16 device by using an external PCIe switch. With two cores operating in parallel,
-we foresee an increase in the data throughput by a factor of 2 (as demonstrated in~\cite{rota2015dma}).
+x16 connection. Two PCIe x8 cores, instantiated on the board, will be mapped as
+a single x16 device by using an external PCIe switch. With two cores operating
+in parallel, we foresee an increase in the data throughput by a factor of 2 (as
+demonstrated in~\cite{rota2015dma}).
 
 \textbf{LR: Instead of swapping PCIe-infinib, I would say include it in the architecture.
 A big house for all these love-lacking protocols.}
 
-It is our intention to add Infiniband support. 
+It is our intention to add InfiniBand support.
 % Could you stop screaming? And for starters, you could have just done the research
 % yourself ...
 % \textbf{I NEED TO READ