@@ -5,6 +5,7 @@
\usepackage{ifthen}
\usepackage{caption}
\usepackage{subcaption}
+\usepackage{textcomp}

\newboolean{draft}
\setboolean{draft}{true}
@@ -93,7 +94,7 @@ integration~\cite{lonardo2015nanet}. Due to its design, the bandwidth saturates
at 120 MB/s for a 1472 byte large UDP datagram. Moreover, the system is based on
a commercial PCIe engine. Other solutions achieve higher throughput based on
Xilinx (CITE TWEPP DMA WURTT??) or Altera devices (CITENICHOLASPAPER TNS), but
-they do not provide support for direct FPGA-GPU communication.
+do not support direct FPGA-to-GPU communication.

\section{Architecture}
@@ -155,7 +156,7 @@ the FPGA to GPU memory and from the GPU to the FPGA's control registers.
the GPU, the physical bus addresses of the GPU buffers are determined with a call to
\texttt{clEnqueue\-Make\-Buffers\-Resident\-AMD} and set by the host CPU in a
control register of the FPGA (1). The FPGA then writes data blocks autonomously
-in DMA fashion (2).
+in DMA fashion (2).
% BUDDHA: This part is not true. We need to always do the handshaking if we transfer
% more than 95 MBs, which I assume is always the case, otherwise we owuld not need a DMA...
% MV: stop assuming ... I am not saying that there is no handshaking involved.
@@ -251,17 +252,13 @@ there on, the throughput increases up to 6.4 GB/s when PCIe bus saturation sets
in at about 1 GB data size. The CPU throughput saturates earlier at about 30 MB
but the maximum throughput is limited to about 6 GB/s losing about 6\% write
performance.
-
-In order to write more than the maximum possible transfer size of 95 MB, we
-repeatedly wrote to the same sized buffer which is not possible in a real-world
-application. As a solution, we motivated the use of multiple copies in Section
-\ref{sec:host}. To verify that we can keep up with the incoming data throughput
-using this strategy, we measured intra-GPU data throughput by copying data from
-a smaller sized buffer representing the DMA buffer to a larger destination
-buffer. \figref{fig:intra-copy} shows the measured throughput for three sizes
-and an increasing block size. At a block size of about 384 KB, the throughput
-surpasses the maximum possible PCIe bandwidth, thus making a double buffering
-strategy a viable solution for very large data transfers.
+%% Change the specs for the small crate
+% MV: who has these specs?
+We repeated the FPGA-to-GPU measurements on a low-end system based on XXX and
+Intel Nano XXXX; the results show no significant difference compared to the
+previous setup. Depending on the application and computing requirements, this
+result makes smaller acquisition systems a cost-effective alternative to
+larger workstations.

\begin{figure}
\includegraphics[width=\textwidth]{figures/intra-copy}
@@ -276,28 +273,40 @@ strategy a viable solution for very large data transfers.
\label{fig:intra-copy}
\end{figure}

+In order to write more than the maximum possible transfer size of 95 MB, we
+repeatedly wrote to the same buffer, which is not possible in a real-world
+application. As a solution, we motivated the use of multiple copies in Section
+\ref{sec:host}. To verify that we can keep up with the incoming data throughput
+using this strategy, we measured the data throughput within a GPU by copying
+data from a smaller buffer representing the DMA buffer to a larger destination
+buffer. \figref{fig:intra-copy} shows the measured throughput for three sizes
+and an increasing block size. At a block size of about 384 KB, the throughput
+surpasses the maximum possible PCIe bandwidth, thus making a double buffering
+strategy a viable solution for very large data transfers.
+
For HEP experiments, low latencies are necessary to react in a reasonable time
-frame. he distribution of latency is shown in Figure \ref{fig:latency}.
-\figref{fig:latency} shows the one-way latency for 4 KB data transfers from FPGA
-to system and GPU memory. % Explain experiment setup in detail.
+frame. In order to measure the latency caused by the communication overhead, we
+used the following protocol: 1) the host issues continuous transfers of a 4 KB
+buffer initialized with a fixed value to the FPGA using the
+\texttt{cl\-Enqueue\-Copy\-Buffer} call. 2) when the FPGA receives data in its
+input FIFO, it moves the data directly to the output FIFO, which feeds the
+outgoing DMA engine, thus pushing the data back to the GPU. 3) at some point,
+the host enables generation of data that differs from the initial value, which
+also starts an internal FPGA counter with 4 ns resolution. 4) when the
+generated data is received again at the FPGA, the counter is stopped. 5) the
+host program reads out the counter values and computes the round-trip latency.
+The distribution of 10000 measurements of the one-way latency is shown in
+\figref{fig:latency}. The GPU latency has a mean value of 168.76 \textmu s and
+a standard deviation of 12.68 \textmu s. This is 9.73\% slower than the CPU
+latency of 153.79 \textmu s measured with the same driver and procedure. The
+non-Gaussian distribution with two distinct peaks indicates a systematic
+influence that we cannot control; it is most likely caused by the
+non-deterministic run-time behaviour of the operating system scheduler.

% \textbf{LR: We should measure the slope for different page sizes, I expect the
% saturation point to change for different page sizes, MV: if you want to do it
% you are more than welcome ...}

-
-
-%% Change the specs for the small crate
-% MV: we never did anything in that regard
-% LR: Nicholas did, and he said there was no difference in FPGA-GPU
-
-% For FPGA-to-GPU transfers, we also repeated the measurements using a low-end system
-% based on XXX and Intel Nano XXXX. The results does not show any significant difference
-% compared to the previous setup, making it a more cost-effective solution.
-
-
-%% Latency distribution plot: perfect! we should also add as legend avg=168.x us, sigma=2 us, max=180 us
-
%% Here: instead of this useless plot, we can plot the latency vs different data
%% sizes transmitted (from FPGA). It should reach 50% less for large data
%% transfers, even with our current limitation... Maybe we can also try on a normal
@@ -308,7 +317,7 @@ to system and GPU memory. % Explain experiment setup in detail.
% \centering
% \includegraphics[width=0.6\textwidth]{figures/latency}
% \caption{%
-% For data transfers larger than XX MB, latency is decreased by XXX percent with respect to the traditional approach (a) by using our implementation (b).
+% For data transfers larger than XX MB, latency is decreased by XXX percent with respect to the traditional approach (a) by using our implementation (b).
% }
% \label{fig:latency}
% \end{figure}
@@ -320,10 +329,10 @@ to system and GPU memory. % Explain experiment setup in detail.
% less then 1 $\mu$s. Therefore, the current performance bottleneck lies in the
% execution of DirectGMA functions.
% LA: Is this from a reference or we meassure it?
-
+
% LA: This time that you are showing here does not correlate with the measurements we were taking.
% this 1 us is the round time inside the FPGA for the memory read, totally dependent on the FPGA,
-%
+%
% The times we are plotting in FIG 5 are the round trips inside the GPU not the FPGA
% Yesterday I took the same measurement with the System Memory (CPU) and the values are not that different,
% you can see the file out_cpu.txt values clustering around 150 us (CPU) instead of 170 us (GPU)
@@ -355,14 +364,15 @@ possible.
A custom FPGA evaluation board is currently under development in order to
increase the total throughput. The board mounts a Virtex-7 chip and features 2
fully populated FMC connectors, a 119 Gb/s DDR memory interface and a PCIe Gen3
-x16 connection. Two PCIe x8 cores, instantiated on the board, will be mapped as a
-single x16 device by using an external PCIe switch. With two cores operating in parallel,
-we foresee an increase in the data throughput by a factor of 2 (as demonstrated in~\cite{rota2015dma}).
+x16 connection. Two PCIe x8 cores, instantiated on the board, will be mapped as
+a single x16 device by using an external PCIe switch. With two cores operating
+in parallel, we foresee an increase in the data throughput by a factor of 2 (as
+demonstrated in~\cite{rota2015dma}).

\textbf{LR: Instead of swapping PCIe-infinib, I would say include it in the architecture.
A big house for all these love-lacking protocols.}

-It is our intention to add Infiniband support.
+It is our intention to add Infiniband support.
% Could you stop screaming? And for starters, you could have just done the research
% yourself ...
% \textbf{I NEED TO READ
|