8 years ago · 9898382e98
--- a/paper.tex
+++ b/paper.tex
@@ -44,11 +44,13 @@ architecture consists of a   Direct Memory Access (DMA) engine compatible with
 
															 the Xilinx PCI-Express core,   a Linux driver for register access, and high-
														
 
															 level software to manage direct   memory transfers using AMD's DirectGMA
														
 
															 technology. Measurements with a Gen3\,x8 link show a throughput of 6.4~GB/s
														
 
															-for transfers to GPU memory and 6.6~GB/s to system memory.  We
														
 
															-also evaluated DirectGMA performance for low latency applications: preliminary measurements show a round-trip latency of 2 \textmu s for data transfers up to 4 kB. However, the latency introduced by the OpenCL scheduling is in the order of 100 \textmu s. 
														
 
															-Our implementation is suitable for real- time DAQ system applications ranging
														
 
															-from photon science and medical imaging to High Energy Physics (HEP) trigger
														
 
															-systems. }
														
 
															+for transfers to GPU memory and 6.6~GB/s to system memory.  We also assessed
														
 
															+the possibility of using DirectGMA in low latency systems: preliminary
														
 
															+measurements show a latency as low as 1 \textmu s for data transfers to GPU
														
 
															+memory. The additional latency introduced by OpenCL scheduling is the current
														
 
															+performance bottleneck.  Our implementation is suitable for real-time DAQ
														
 
															+system applications ranging from photon science and medical imaging to High
														
 
															+Energy Physics (HEP) systems.}
														
 
															 \keywords{FPGA; GPU; PCI-Express; OpenCL; DirectGMA}
														
@@ -75,11 +77,12 @@ continuous streaming mode to a computing stage. In order to collect data over
 
															 long observation times, the readout architecture and the computing stages must
														
 
															 be able to sustain high data rates.
														
 
															-Recent years have also seen an increasing
														
 
															-interest in GPU-based systems for High Energy Physics (HEP)  (\emph{e.g.}
														
 
															-ATLAS~\cite{atlas_gpu}, ALICE~\cite{alice_gpu}, Mu3e~\cite{mu3e_gpu},
														
 
															-PANDA~\cite{panda_gpu}) and photon science experiments. In time-deterministic
														
 
															-applications, latency becomes the most stringent requirement for , \emph{e.g.} in Low/High-level trigger systems.  
														
 
															+Recent years have also seen an increasing interest in GPU-based systems for
														
 
															+High Energy Physics (HEP)  (\emph{e.g.} ATLAS~\cite{atlas_gpu},
														
 
															+ALICE~\cite{alice_gpu}, Mu3e~\cite{mu3e_gpu}, PANDA~\cite{panda_gpu}) and
														
 
															+photon science experiments. In time-deterministic applications, \emph{e.g.} in
														
 
															+Low/High-level trigger systems, latency becomes the most stringent
														
 
															+requirement.
														
 
															 Due to its high bandwidth and modularity, PCIe quickly became the commercial
														
 
															 standard for connecting high-throughput peripherals such as GPUs or solid
														
@@ -96,6 +99,7 @@ s, respectively.
 
															 %LR: FPGA^2 it's the name of their thing... 
														
 
															 %MV: best idea in the world :)
														
 
															+%LR: Let's call ours FPGA^2_GPU
														
 
															 When the FPGA is used as a master, a higher throughput can be achieved.  An
														
 
															 example of this approach is the \emph{FPGA\textsuperscript{2}} framework by Thoma
														
@@ -183,12 +187,6 @@ utilization on a Virtex 7 device is reported in Table~\ref{table:utilization}.
 
															   LUTRAM   & 56    & (0.03)           \\
														
 
															   FF       & 5437  & (0.63)           \\
														
 
															   BRAM     & 21    & (1.39)           \\
														
 
															-  % Resource & Utilization & Available & Utilization \% \\
														
 
															-  %   \midrule
														
 
															-  % LUT      & 5331        & 433200    & 1.23           \\
														
 
															-  % LUTRAM   & 56          & 174200    & 0.03           \\
														
 
															-  % FF       & 5437        & 866400    & 0.63           \\
														
 
															-  % BRAM     & 20.50       & 1470      & 1.39           \\
														
 
															     \bottomrule
														
 
															   \end{tabular}
														
 
															 }{%
														
@@ -198,16 +196,6 @@ utilization on a Virtex 7 device is reported in Table~\ref{table:utilization}.
 
															 \end{floatrow}
														
 
															 \end{figure}
														
 
															-
														
 
															-% \begin{figure}[tb]
														
 
															-%   \centering
														
 
															-%   \includegraphics[width=0.6\textwidth]{figures/fpga-arch}
														
 
															-%   \caption{%
														
 
															-%     Architecture of the DMA engine.
														
 
															-%   }
														
 
															-%   \label{fig:fpga-arch}
														
 
															-% \end{figure}
														
 
															-
														
 
															 The physical addresses of the host's memory buffers are stored into an internal
														
 
															 memory and are dynamically updated by the driver or user, allowing highly
														
 
															 efficient zero-copy data transfers. The maximum size associated with each
														
@@ -287,6 +275,8 @@ fashion. A complementary application programming interface allows users to
 
															 develop custom applications written in C or high-level languages such as
														
 
															 Python.
														
 
															+
														
 
															+%% --------------------------------------------------------------------------
														
 
															 \section{Results}
														
 
															 We carried out performance measurements on two different setups, which are