@@ -20,13 +20,11 @@
}

\abstract{%
- %% Old
- % \emph{A growing number of physics experiments requires DAQ systems with multi-GB/s
- % data links.}
- %proposal for new abstract, including why do we need GPUs
Modern physics experiments have reached multi-GB/s data rates. Fast data
links and high performance computing stages are required for continuous
acquisition and processing. Because of their intrinsic parallelism and
computational power, GPUs emerged as an ideal solution for high
performance computing applications. To connect a fast data acquisition stage
with a GPU's processing power, we developed an architecture consisting of a
@@ -75,6 +73,7 @@ host computer. Due to its high bandwidth and modularity, PCIe quickly became the
commercial standard for connecting high-throughput peripherals such as GPUs or
solid state disks. Optical PCIe networks have been demonstrated
% JESUS: time span -> for, point in time -> since ...
+% BUDDHA: Ok boss. I wanted to say "since 10 years ago...", is for ok?
for nearly a decade~\cite{optical_pcie}, opening the possibility of using PCIe
as a communication bus over long distances. In particular, in HEP DAQ systems,
optical links are preferred over electrical ones because of their superior
@@ -96,7 +95,6 @@ DMA data transfers are handled by dedicated hardware, which compared with
Programmed Input Output (PIO) access, offers lower latency and higher throughput
at the cost of higher system complexity.

-
\begin{figure}[t]
\centering
\includegraphics[width=1.0\textwidth]{figures/transf}
@@ -146,16 +144,21 @@ address is 2 GB.
On the host side, AMD's DirectGMA technology, an implementation of the
bus-addressable memory extension for OpenCL 1.1 and later, is used to write from
the FPGA to GPU memory and from the GPU to the FPGA's control registers.
-\figref{fig:opencl-setup} illustrates the main mode of operation: To write into
-the GPU, the physical bus address of the GPU buffer is determined with a call to
+\figref{fig:opencl-setup} illustrates the main mode of operation: to write into
+the GPU, the physical bus addresses of the GPU buffers are determined with a call to
\texttt{clEnqueue\-Make\-Buffers\-Resident\-AMD} and set by the host CPU in a
control register of the FPGA (1). The FPGA then writes data blocks autonomously
-in DMA fashion (2). Due to hardware restrictions the largest possible GPU buffer
+in DMA fashion (2).
+% BUDDHA: This part is not true. We need to always do the handshaking if we transfer
+% more than 95 MB, which I assume is always the case, otherwise we would not need a DMA...
+Due to hardware restrictions, the largest possible GPU buffer
sizes are about 95 MB but larger transfers can be achieved using a double
buffering mechanism. Because the GPU provides a flat memory address space and
our DMA engine allows multiple destination addresses to be set in advance, we
can determine all addresses before the actual transfers, thus keeping the
CPU out of the transfer loop.
+%% BUDDHA: the CPU is still involved in the loop at the moment. We didn't manage
+% to move the handshaking completely to the GPU, did we?
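
For illustration, a minimal host-side sketch of steps (1) and (2) is shown below. It
assumes the \texttt{cl\_amd\_bus\_addressable\_memory} extension; the flag and structure
names should be checked against \texttt{CL/cl\_ext.h}, and \texttt{fpga\_write\_register}
as well as the register offset are placeholders for our FPGA driver interface rather
than its actual API.
\begin{verbatim}
/* Host-side sketch: pin a GPU buffer, obtain its physical bus address and
 * pass it to the FPGA DMA engine.  Extension names are assumptions taken
 * from the cl_amd_bus_addressable_memory specification. */
#include <CL/cl.h>
#include <CL/cl_ext.h>

#define DMA_BUFFER_SIZE  (64 << 20)   /* stay below the ~95 MB limit  */
#define REG_DMA_DST_ADDR 0x10         /* illustrative register offset */

void fpga_write_register(unsigned reg, cl_ulong value);   /* placeholder */

void setup_fpga_to_gpu(cl_context ctx, cl_command_queue queue)
{
    cl_int err;
    cl_bus_address_amd bus_addr;

    /* Buffer that the FPGA fills autonomously via DMA (2). */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_BUS_ADDRESSABLE_AMD,
                                DMA_BUFFER_SIZE, NULL, &err);

    /* Make the buffer resident and query its physical bus address. */
    err = clEnqueueMakeBuffersResidentAMD(queue, 1, &buf, CL_TRUE,
                                          &bus_addr, 0, NULL, NULL);

    /* (1) Hand the destination address to the FPGA; with double buffering
     * the addresses of both buffers are set in advance. */
    fpga_write_register(REG_DMA_DST_ADDR, bus_addr.surface_bus_address);
}
\end{verbatim}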

To signal events to the FPGA (4), the control registers can be mapped into the
GPU's address space passing a special AMD-specific flag and passing the physical
@@ -180,12 +183,14 @@ plugin for our scalable GPU processing framework~\cite{vogelgesang2012ufo}. This
framework allows for an easy construction of streamed data processing on
heterogeneous multi-GPU systems. For example, to read data from the FPGA, decode
its specific format and run a Fourier transform on the GPU as well as writing
-back the results to disk, one can run \texttt{ufo-launch direct-gma ! decode !
-fft ! write filename=out.raw} on the command line. The framework will take care
-of scheduling the tasks and distribute the data items according. A
-complementary application programming interface allows users to develop custom
-applications written in C or high-level languages such as Python. High
-throughput is achieved by the combination of fine- and coarse-grained data
+back the results to disk, one can run on the command line:
+% BUDDHA: I like this point very very much, formatting helps to make it stand out
+\begin{verbatim}
+ufo-launch direct-gma ! decode ! fft ! write filename=out.raw
+\end{verbatim}
+The framework will take care of scheduling the tasks and distributing the data items
+accordingly. A complementary application programming interface allows users to
+develop custom applications written in C or high-level languages such as Python.
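
As an illustration of the C interface, the same pipeline could be assembled
programmatically along the following lines (task and function names follow the
framework's GObject conventions and have not been checked against the current
\texttt{ufo-core} headers):
\begin{verbatim}
/* Sketch only: build the direct-gma -> decode -> fft -> write pipeline in C.
 * Names follow ufo-core's GObject API conventions and should be verified
 * against ufo.h; error handling is omitted for brevity. */
#include <ufo/ufo.h>

int main (void)
{
    GError *error = NULL;
    UfoPluginManager *pm = ufo_plugin_manager_new ();
    UfoTaskGraph *graph = UFO_TASK_GRAPH (ufo_task_graph_new ());

    UfoTaskNode *dma    = ufo_plugin_manager_get_task (pm, "direct-gma", &error);
    UfoTaskNode *decode = ufo_plugin_manager_get_task (pm, "decode", &error);
    UfoTaskNode *fft    = ufo_plugin_manager_get_task (pm, "fft", &error);
    UfoTaskNode *write  = ufo_plugin_manager_get_task (pm, "write", &error);

    g_object_set (write, "filename", "out.raw", NULL);

    ufo_task_graph_connect_nodes (graph, dma, decode);
    ufo_task_graph_connect_nodes (graph, decode, fft);
    ufo_task_graph_connect_nodes (graph, fft, write);

    UfoBaseScheduler *sched = UFO_BASE_SCHEDULER (ufo_scheduler_new ());
    ufo_base_scheduler_run (sched, graph, &error);
    return 0;
}
\end{verbatim}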
+High throughput is achieved by the combination of fine- and coarse-grained data
parallelism, \emph{i.e.} processing a single data item on a GPU using thousands
of threads and by splitting the data stream and feeding individual data items to
separate GPUs. None of this requires any user intervention and is solely
@@ -215,6 +220,7 @@ strategy a viable solution for very large data transfers.
transfers required to fill the destination buffer. The throughput has been
estimated using the host-side wall clock time. On-GPU data transfer is about
twice as fast.
+ %% BUDDHA: forgive my ignorance: what does it mean "on-gpu"?
}
\label{fig:intra-copy}
\end{figure}
@@ -222,6 +228,9 @@ strategy a viable solution for very large data transfers.

\subsection{Throughput}

+%% BUDDHA: why do we need to state this thing? High throughput affects also the
+%% total latency. One can optimize for one or the other probably, but at the moment
+%% we use the same approach, so I would not write this.
A high throughput is desired for applications in which the FPGA outputs large
amounts of data and timing is not an issue. This includes fast, high-resolution
photon detectors as used in synchrotron facilities.
@@ -231,7 +240,7 @@ For both system and GPU memory, the write performance is primarily limited by
the PCIe bus. Higher payloads introduce less overhead, thus increasing the net
bandwidth. Up until 2 MB transfer size, the performance is almost the same;
after that, the GPU transfer shows a slightly better slope. Data transfers larger
-than 1 GB saturate the PCIe bus. \bf{LR: We should measure the slope for
+than 1 GB saturate the PCIe bus. \textbf{LR: We should measure the slope for
different page sizes, I expect the saturation point to change for different
page sizes}
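
A simple model illustrates this behaviour: assuming every transfer of size $P$ incurs
an approximately constant setup overhead $t_0$ (DMA descriptor setup and handshaking)
in addition to the wire time $P/B_\mathrm{raw}$, the effective bandwidth is
\begin{equation}
  B_\mathrm{eff}(P) = \frac{P}{t_0 + P/B_\mathrm{raw}},
\end{equation}
which approaches the raw link bandwidth $B_\mathrm{raw}$ once $P \gg t_0 B_\mathrm{raw}$,
consistent with the saturation observed for transfers in the GB range.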

@@ -250,6 +259,8 @@ page sizes}

%% Change the specs for the small crate
% MV: we never did anything in that regard
+% LR: Nicholas did, and he said there was no difference in FPGA-GPU
+
% For FPGA-to-GPU transfers, we also repeated the measurements using a low-end system
% based on XXX and Intel Nano XXXX. The results do not show any significant difference
% compared to the previous setup, making it a more cost-effective solution.
@@ -263,20 +274,27 @@ page sizes}
\label{fig:intra-copy}
\end{figure}

-%% Latency here? What do we do?
-%% We should add an histogram with 1000+ measurements of the latency to see if it's time-deterministic
-%% Also: add a plot of latency vs different data sizes transmitted (from FPGA)
+%% Latency distribution plot: perfect! we should also add as legend avg=168.x us, sigma=2 us, max=180 us
+
+%% Here: instead of this useless plot, we can plot the latency vs different data sizes transmitted (from FPGA). It should reach 50% less for large data transfers, even with our current limitation... Maybe we can also try on a normal desktop?
\begin{figure}
\centering
\includegraphics[width=0.6\textwidth]{figures/latency}
\caption{%
- The data transmission latency is decreased by XXX percent with respect to the traditional
- approach (a) by using DirectGMA (b). The latency has been measured by taking the round-trip time
- for a 4k pac
+ For data transfers larger than XX MB, latency is decreased by XXX percent with respect to the traditional approach (a) by using our implementation (b).
}
\label{fig:latency}
\end{figure}

+% In case everything is fine.
+\figref{fig:latency} shows the comparison between the traditional approach and the
+GPU DMA data transfer, together with the measured latency distribution. The total
+latency is decreased with respect to the traditional approach.
+
+%% EMERGENCY TEXT if we don't manage to fix the latency problem
+The round-trip time of a memory read request issued from the CPU to the FPGA is less
+than 1 $\mu$s. Therefore, the current performance bottleneck lies in the execution of
+DirectGMA functions.
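
For reference, such a round-trip time can be measured with a plain host-side loop of
the following form (a minimal sketch: the BAR resource path and register offset are
illustrative placeholders and not necessarily the exact procedure used to obtain the
number quoted above):
\begin{verbatim}
/* Sketch: time CPU-to-FPGA read round trips through a mmap'ed PCIe BAR.
 * The device path and register offset are illustrative placeholders. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

int main (void)
{
    int fd = open ("/sys/bus/pci/devices/0000:01:00.0/resource0",
                   O_RDWR | O_SYNC);
    volatile uint32_t *regs = mmap (NULL, 4096, PROT_READ | PROT_WRITE,
                                    MAP_SHARED, fd, 0);
    struct timespec t0, t1;
    const int n = 1000;

    clock_gettime (CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < n; i++)
        (void) regs[0];        /* non-posted read forces a full round trip */
    clock_gettime (CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf ("average round-trip time: %.0f ns\n", ns / n);

    munmap ((void *) regs, 4096);
    close (fd);
    return 0;
}
\end{verbatim}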
+
+

\section{Conclusion}

@@ -290,10 +308,14 @@ integration of the DMA transfer setup in our streamed computing framework.
Integration with different DAQ systems and custom algorithms is therefore
immediate.

-
\subsection{Outlook}

-Support for NVIDIA's GPUDirect technology is also foreseen in the next months to
+%Add if we cannot fix latency
+An optimization of the OpenCL code is ongoing, with the help of AMD technical support.
+With a better understanding of the hardware and software aspects of DirectGMA, we expect
+a significant improvement in latency performance.
+
+Support for NVIDIA's GPUDirect technology is foreseen in the next months to
lift the limitation to one specific GPU vendor and to make a direct performance comparison
possible.

A custom FPGA evaluation board is currently under development in order to
@@ -306,10 +328,11 @@ we foresee an increase in the data throughput by a factor of 2 (as demonstrated
\textbf{LR: Instead of swapping PCIe-infinib, I would say include it in the architecture.
A big house for all these love-lacking protocols.}

-It is our intention to add Infiniband support. I NEED TO READ
-WHAT ARE THE ADVANTAGES VS PCIe.
+It is our intention to add InfiniBand support.
+\textbf{I NEED TO READ
+WHAT THE ADVANTAGES VS PCIe ARE. Update: internet sucks in China.}

-\textbf{LR:Here comes the visionary Luigi...}
+%LR:Here comes the visionary Luigi
Our goal is to develop a unique hybrid solution, based
on commercial standards, that includes fast data transmission protocols and a high performance
GPU computing framework.