8 years ago · 9898382e98
--- a/paper.tex
+++ b/paper.tex
@@ -44,11 +44,13 @@ architecture consists of a   Direct Memory Access (DMA) engine compatible with
 
				 the Xilinx PCI-Express core,   a Linux driver for register access, and high-
			
 
				 level software to manage direct   memory transfers using AMD's DirectGMA
			
 
				 technology. Measurements with a Gen3\,x8 link show a throughput of 6.4~GB/s
			
 
				-for transfers to GPU memory and 6.6~GB/s to system memory.  We
			
 
				-also evaluated DirectGMA performance for low latency applications: preliminary measurements show a round-trip latency of 2 \textmu s for data transfers up to 4 kB. However, the latency introduced by the OpenCL scheduling is in the order of 100 \textmu s. 
			
 
				-Our implementation is suitable for real- time DAQ system applications ranging
			
 
				-from photon science and medical imaging to High Energy Physics (HEP) trigger
			
 
				-systems. }
			
 
				+for transfers to GPU memory and 6.6~GB/s to system memory.  We also assessed
			
 
				+the possibility of using DirectGMA in low latency systems: preliminary
			
 
				+measurements show a latency as low as 1 \textmu s for data transfers to GPU
			
 
				+memory. The additional latency introduced by OpenCL scheduling is the current
			
 
				+performance bottleneck.  Our implementation is suitable for real-time DAQ
			
 
				+system applications ranging from photon science and medical imaging to High
			
 
				+Energy Physics (HEP) systems.}
			
 
				 
			
 
				 \keywords{FPGA; GPU; PCI-Express; OpenCL; DirectGMA}
			
 
				 
			
@@ -75,11 +77,12 @@ continuous streaming mode to a computing stage. In order to collect data over
 
				 long observation times, the readout architecture and the computing stages must
			
 
				 be able to sustain high data rates.
			
 
				 
			
 
				-Recent years have also seen an increasing
			
 
				-interest in GPU-based systems for High Energy Physics (HEP)  (\emph{e.g.}
			
 
				-ATLAS~\cite{atlas_gpu}, ALICE~\cite{alice_gpu}, Mu3e~\cite{mu3e_gpu},
			
 
				-PANDA~\cite{panda_gpu}) and photon science experiments. In time-deterministic
			
 
				-applications, latency becomes the most stringent requirement for , \emph{e.g.} in Low/High-level trigger systems.  
			
 
				+Recent years have also seen an increasing interest in GPU-based systems for
			
 
				+High Energy Physics (HEP)  (\emph{e.g.} ATLAS~\cite{atlas_gpu},
			
 
				+ALICE~\cite{alice_gpu}, Mu3e~\cite{mu3e_gpu}, PANDA~\cite{panda_gpu}) and
			
 
				+photon science experiments. In time-deterministic applications, \emph{e.g.} in
			
 
				+Low/High-level trigger systems, latency becomes the most stringent
			
 
				+requirement.
			
 
				 
			
 
				 Due to its high bandwidth and modularity, PCIe quickly became the commercial
			
 
				 standard for connecting high-throughput peripherals such as GPUs or solid
			
@@ -96,6 +99,7 @@ s, respectively.
 
				 
			
 
				 %LR: FPGA^2 it's the name of their thing... 
			
 
				 %MV: best idea in the world :)
			
 
				+%LR: Let's call ours FPGA^2_GPU
			
 
				 
			
 
				 When the FPGA is used as a master, a higher throughput can be achieved.  An
			
 
				 example of this approach is the \emph{FPGA\textsuperscript{2}} framework by Thoma
			
@@ -183,12 +187,6 @@ utilization on a Virtex 7 device is reported in Table~\ref{table:utilization}.
 
				   LUTRAM   & 56    & (0.03)           \\
			
 
				   FF       & 5437  & (0.63)           \\
			
 
				   BRAM     & 21    & (1.39)           \\
			
 
				-  % Resource & Utilization & Available & Utilization \% \\
			
 
				-  %   \midrule
			
 
				-  % LUT      & 5331        & 433200    & 1.23           \\
			
 
				-  % LUTRAM   & 56          & 174200    & 0.03           \\
			
 
				-  % FF       & 5437        & 866400    & 0.63           \\
			
 
				-  % BRAM     & 20.50       & 1470      & 1.39           \\
			
 
				     \bottomrule
			
 
				   \end{tabular}
			
 
				 }{%
			
@@ -198,16 +196,6 @@ utilization on a Virtex 7 device is reported in Table~\ref{table:utilization}.
 
				 \end{floatrow}
			
 
				 \end{figure}
			
 
				 
			
 
				-
			
 
				-% \begin{figure}[tb]
			
 
				-%   \centering
			
 
				-%   \includegraphics[width=0.6\textwidth]{figures/fpga-arch}
			
 
				-%   \caption{%
			
 
				-%     Architecture of the DMA engine.
			
 
				-%   }
			
 
				-%   \label{fig:fpga-arch}
			
 
				-% \end{figure}
			
 
				-
			
 
				 The physical addresses of the host's memory buffers are stored into an internal
			
 
				 memory and are dynamically updated by the driver or user, allowing highly
			
 
				 efficient zero-copy data transfers. The maximum size associated with each
			
@@ -287,6 +275,8 @@ fashion. A complementary application programming interface allows users to
 
				 develop custom applications written in C or high-level languages such as
			
 
				 Python.
			
 
				 
			
 
				+
			
 
				+%% --------------------------------------------------------------------------
			
 
				 \section{Results}
			
 
				 
			
 
				 We carried out performance measurements on two different setups, which are