@@ -44,11 +44,13 @@ architecture consists of a Direct Memory Access (DMA) engine compatible with
the Xilinx PCI-Express core, a Linux driver for register access, and
high-level software to manage direct memory transfers using AMD's DirectGMA
technology. Measurements with a Gen3\,x8 link show a throughput of 6.4~GB/s
-for transfers to GPU memory and 6.6~GB/s to system memory. We
-also evaluated DirectGMA performance for low latency applications: preliminary measurements show a round-trip latency of 2 \textmu s for data transfers up to 4 kB. However, the latency introduced by the OpenCL scheduling is in the order of 100 \textmu s.
-Our implementation is suitable for real- time DAQ system applications ranging
-from photon science and medical imaging to High Energy Physics (HEP) trigger
-systems. }
+for transfers to GPU memory and 6.6~GB/s to system memory. We also assessed
+the possibility of using DirectGMA in low-latency systems: preliminary
+measurements show a latency as low as 1 \textmu s for data transfers to GPU
+memory. The additional latency introduced by OpenCL scheduling is the current
+performance bottleneck. Our implementation is suitable for real-time DAQ
+system applications ranging from photon science and medical imaging to High
+Energy Physics (HEP) systems.}

\keywords{FPGA; GPU; PCI-Express; OpenCL; DirectGMA}

@@ -75,11 +77,12 @@ continuous streaming mode to a computing stage. In order to collect data over
long observation times, the readout architecture and the computing stages must
be able to sustain high data rates.

-Recent years have also seen an increasing
-interest in GPU-based systems for High Energy Physics (HEP) (\emph{e.g.}
-ATLAS~\cite{atlas_gpu}, ALICE~\cite{alice_gpu}, Mu3e~\cite{mu3e_gpu},
-PANDA~\cite{panda_gpu}) and photon science experiments. In time-deterministic
-applications, latency becomes the most stringent requirement for , \emph{e.g.} in Low/High-level trigger systems.
+Recent years have also seen an increasing interest in GPU-based systems for
+High Energy Physics (HEP) (\emph{e.g.} ATLAS~\cite{atlas_gpu},
+ALICE~\cite{alice_gpu}, Mu3e~\cite{mu3e_gpu}, PANDA~\cite{panda_gpu}) and
+photon science experiments. In time-deterministic applications, \emph{e.g.} in
+Low/High-level trigger systems, latency becomes the most stringent
+requirement.

Due to its high bandwidth and modularity, PCIe quickly became the commercial
standard for connecting high-throughput peripherals such as GPUs or solid
@@ -96,6 +99,7 @@ s, respectively.

%LR: FPGA^2 it's the name of their thing...
%MV: best idea in the world :)
+%LR: Let's call ours FPGA^2_GPU

When the FPGA is used as a master, a higher throughput can be achieved. An
example of this approach is the \emph{FPGA\textsuperscript{2}} framework by Thoma
@@ -183,12 +187,6 @@ utilization on a Virtex 7 device is reported in Table~\ref{table:utilization}.
LUTRAM & 56 & (0.03) \\
FF & 5437 & (0.63) \\
BRAM & 21 & (1.39) \\
- % Resource & Utilization & Available & Utilization \% \\
- % \midrule
- % LUT & 5331 & 433200 & 1.23 \\
- % LUTRAM & 56 & 174200 & 0.03 \\
- % FF & 5437 & 866400 & 0.63 \\
- % BRAM & 20.50 & 1470 & 1.39 \\
\bottomrule
\end{tabular}
}{%
@@ -198,16 +196,6 @@ utilization on a Virtex 7 device is reported in Table~\ref{table:utilization}.
\end{floatrow}
\end{figure}

-
-% \begin{figure}[tb]
-% \centering
-% \includegraphics[width=0.6\textwidth]{figures/fpga-arch}
-% \caption{%
-% Architecture of the DMA engine.
-% }
-% \label{fig:fpga-arch}
-% \end{figure}
-
The physical addresses of the host's memory buffers are stored into an internal
memory and are dynamically updated by the driver or user, allowing highly
efficient zero-copy data transfers. The maximum size associated with each
@@ -287,6 +275,8 @@ fashion. A complementary application programming interface allows users to
develop custom applications written in C or high-level languages such as
Python.

+
+%% --------------------------------------------------------------------------
\section{Results}

We carried out performance measurements on two different setups, which are