lorenzo 8 years ago
parent
commit
9898382e98
1 changed file, 16 additions and 26 deletions

+ 16 - 26
paper.tex

@@ -44,11 +44,13 @@ architecture consists of a   Direct Memory Access (DMA) engine compatible with
 the Xilinx PCI-Express core,   a Linux driver for register access, and high-
 level software to manage direct   memory transfers using AMD's DirectGMA
 technology. Measurements with a Gen3\,x8 link show a throughput of 6.4~GB/s
-for transfers to GPU memory and 6.6~GB/s to system memory.  We
-also evaluated DirectGMA performance for low latency applications: preliminary measurements show a round-trip latency of 2 \textmu s for data transfers up to 4 kB. However, the latency introduced by the OpenCL scheduling is in the order of 100 \textmu s. 
-Our implementation is suitable for real- time DAQ system applications ranging
-from photon science and medical imaging to High Energy Physics (HEP) trigger
-systems. }
+for transfers to GPU memory and 6.6~GB/s to system memory.  We also assessed
+the possibility of using DirectGMA in low latency systems: preliminary
+measurements show a latency as low as 1 \textmu s for data transfers to GPU
+memory. The additional latency introduced by OpenCL scheduling is the current
+performance bottleneck.  Our implementation is suitable for real-time DAQ
+system applications ranging from photon science and medical imaging to High
+Energy Physics (HEP) systems.}
 
 
 \keywords{FPGA; GPU; PCI-Express; OpenCL; DirectGMA}
 
 
@@ -75,11 +77,12 @@ continuous streaming mode to a computing stage. In order to collect data over
 long observation times, the readout architecture and the computing stages must
 be able to sustain high data rates.
 
 
-Recent years have also seen an increasing
-interest in GPU-based systems for High Energy Physics (HEP)  (\emph{e.g.}
-ATLAS~\cite{atlas_gpu}, ALICE~\cite{alice_gpu}, Mu3e~\cite{mu3e_gpu},
-PANDA~\cite{panda_gpu}) and photon science experiments. In time-deterministic
-applications, latency becomes the most stringent requirement for , \emph{e.g.} in Low/High-level trigger systems.  
+Recent years have also seen an increasing interest in GPU-based systems for
+High Energy Physics (HEP)  (\emph{e.g.} ATLAS~\cite{atlas_gpu},
+ALICE~\cite{alice_gpu}, Mu3e~\cite{mu3e_gpu}, PANDA~\cite{panda_gpu}) and
+photon science experiments. In time-deterministic applications, \emph{e.g.} in
+Low/High-level trigger systems, latency becomes the most stringent
+requirement.
 
 
 Due to its high bandwidth and modularity, PCIe quickly became the commercial
 standard for connecting high-throughput peripherals such as GPUs or solid
@@ -96,6 +99,7 @@ s, respectively.
 
 
 %LR: FPGA^2 it's the name of their thing... 
 %MV: best idea in the world :)
+%LR: Let's call ours FPGA^2_GPU
 
 
 When the FPGA is used as a master, a higher throughput can be achieved.  An
 example of this approach is the \emph{FPGA\textsuperscript{2}} framework by Thoma
@@ -183,12 +187,6 @@ utilization on a Virtex 7 device is reported in Table~\ref{table:utilization}.
   LUTRAM   & 56    & (0.03)           \\
   FF       & 5437  & (0.63)           \\
   BRAM     & 21    & (1.39)           \\
-  % Resource & Utilization & Available & Utilization \% \\
-  %   \midrule
-  % LUT      & 5331        & 433200    & 1.23           \\
-  % LUTRAM   & 56          & 174200    & 0.03           \\
-  % FF       & 5437        & 866400    & 0.63           \\
-  % BRAM     & 20.50       & 1470      & 1.39           \\
     \bottomrule
   \end{tabular}
 }{%
@@ -198,16 +196,6 @@ utilization on a Virtex 7 device is reported in Table~\ref{table:utilization}.
 \end{floatrow}
 \end{figure}
 
 
-
-% \begin{figure}[tb]
-%   \centering
-%   \includegraphics[width=0.6\textwidth]{figures/fpga-arch}
-%   \caption{%
-%     Architecture of the DMA engine.
-%   }
-%   \label{fig:fpga-arch}
-% \end{figure}
-
 The physical addresses of the host's memory buffers are stored into an internal
 memory and are dynamically updated by the driver or user, allowing highly
 efficient zero-copy data transfers. The maximum size associated with each
@@ -287,6 +275,8 @@ fashion. A complementary application programming interface allows users to
 develop custom applications written in C or high-level languages such as
 Python.
 
 
+
+%% --------------------------------------------------------------------------
 \section{Results}
 
 
 We carried out performance measurements on two different setups, which are