@@ -43,15 +43,15 @@ HEP experiments.}
GPU computing has become the main driving force for high-performance computing
due to an unprecedented parallelism and a low cost-benefit factor. GPU
acceleration has found its way into numerous applications, ranging from
-simulation to image processing. Recent years have also seen an increasing interest in GPU-based systems for HEP applications, which require a combination of high data rates, high computational power and low latency (e.g.: ATLAS[cite], CMS[cite], ALICE[\cite{alice_gpu}], Mu3e[cite] and PANDA[cite] high/low-level triggers). Moreover, the volumes of data produced in recent photon science facilities have become comparable to those traditionally associated with HEP.
+simulation to image processing. Recent years have also seen an increasing interest in GPU-based systems for HEP applications, which require a combination of high data rates, high computational power and low latency (e.g.\ ATLAS[cite], CMS[cite], ALICE~\cite{alice_gpu}, Mu3e~\cite{mu3e_gpu} and PANDA[cite]). Moreover, the data volumes produced at recent photon science facilities have become comparable to those traditionally associated with HEP.
 
-In such experiments data is acquired by one or more read-out boards and then transmitted to GPUs in short bursts or in a continuous streaming mode. With expected data rates of several Gbytes/s, the data transmission link between the read-out boards and the host system can constitute the performance bottleneck. In case of high-level trigger, low-latency become the most stringent constraint.
+In such experiments, data is acquired by one or more read-out boards and then transmitted to GPUs in short bursts or in a continuous streaming mode. With expected data rates of several GB/s, the data transmission link between the read-out boards and the host system can constitute the performance bottleneck. In the case of High-Level Trigger applications, low latency becomes the most stringent constraint.
 
-To address these problems we have developed a high-throuhgput and low-latency architecture that connects FPGA-based devices and external GPUs by PCIe data links.
+To address these problems, we propose a complete hardware-software stack based on our own DMA design and on the integration of AMD's DirectGMA technology into our processing pipeline.
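+
+As an illustration of the host-side integration, the listing below sketches how a GPU buffer can be exposed on the PCIe bus with DirectGMA so that the FPGA can write into it directly. It is a minimal sketch only: the flag and function names are those of AMD's \texttt{cl\_amd\_bus\_addressable\_memory} OpenCL extension, the symbols may have to be resolved at run time through \texttt{clGetExtensionFunctionAddressForPlatform}, and all error handling is omitted.
+
+\begin{verbatim}
+/* Sketch: expose a GPU buffer on the PCIe bus with DirectGMA.     */
+/* Names follow the cl_amd_bus_addressable_memory extension and    */
+/* should be checked against the vendor header (CL/cl_ext.h).      */
+#include <CL/cl.h>
+#include <CL/cl_ext.h>
+
+cl_mem create_fpga_target(cl_context ctx, cl_command_queue queue,
+                          size_t size, cl_bus_address_amd *addr)
+{
+    cl_int err;
+
+    /* Allocate the buffer in GPU memory and mark it bus-addressable. */
+    cl_mem buf = clCreateBuffer(ctx, CL_MEM_BUS_ADDRESSABLE_AMD,
+                                size, NULL, &err);
+
+    /* Pin the buffer and query its physical bus address; the address
+       is then written into the DMA descriptors on the FPGA.          */
+    err = clEnqueueMakeBuffersResidentAMD(queue, 1, &buf, CL_TRUE,
+                                          addr, 0, NULL, NULL);
+    return buf;
+}
+\end{verbatim}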
 
In order to fully saturate the PCIe bus bandwidth\footnote{Net
-bandwidth of 6.7 GB/s for PCIe 3.0 x8.} and decrease the total latency, we propose complete hardware-software stack architecture based on our own DMA design and integration of AMD's
-DirectGMA technology into our processing pipeline.
+bandwidth of 6.7 GB/s for PCIe 3.0 x8.} and to minimize the overall latency, both the FPGA logic and the host-side data handling have to be optimized, as described in the following sections.
\section{Basic concepts}
@@ -66,11 +66,15 @@ datagram.
\section{Architecture}
-\subsection{FPGA side}
+\subsection{FPGA readout board}
-Our implementation has been optimized in order to achieve the maximum data throughput and to minimize the FPGA resource utilization, while still maintaining the flexibility of a Scatter-Gather memory policy. The architecture of the DMA engine described in \cite{rota2015dma} has been
-extended to support the PCI-Express Gen3 Core \cite{xilinxgen3}. The implementation features two
-separate engines to handle large data transfers from/to the host.
+In a typical HEP read-out chain, FPGA boards connect the front-end detectors with the high-level computing stage. Optical links are preferred over electrical solutions because of their high radiation hardness, low power consumption and high density.
+
+In our solution, PCI-Express has been chosen as the data link between the FPGA boards and the external computing stage. Thanks to its high bandwidth and modularity, PCI-Express has become the commercial standard for connecting high-performance peripherals, in particular CPUs and GPUs. Optical PCI-Express networks have also been demonstrated~\cite{optical_pcie}, opening the possibility of using this protocol in HEP experiments.
+
+\subsubsection{DMA engine}
+A Direct Memory Access (DMA) engine is needed in order to maximize the data throughput. We have developed a DMA architecture~\cite{rota2015dma} that achieves maximum throughput while minimizing FPGA resource utilization and maintaining the flexibility of a Scatter-Gather memory policy. The engine is now compatible with the Xilinx PCI-Express Gen3 IP-Core~\cite{xilinxgen3} and supports DMA data transfers from/to the host.
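+
+To illustrate the Scatter-Gather mechanism, the sketch below shows how a host driver could describe a physically fragmented buffer to a generic DMA engine. The descriptor layout is hypothetical and is not the format implemented in our engine; it only serves to clarify the concept.
+
+\begin{verbatim}
+/* Hypothetical scatter-gather descriptor list, for illustration only. */
+#include <stdint.h>
+#include <stddef.h>
+
+struct sg_descriptor {
+    uint64_t bus_address;  /* physical address of one memory page */
+    uint32_t length;       /* bytes to transfer from/to this page  */
+    uint32_t last;         /* non-zero marks the final descriptor  */
+};
+
+/* Describe a buffer scattered over n_pages physical pages. */
+static size_t build_sg_list(struct sg_descriptor *list,
+                            const uint64_t *page_addr, size_t n_pages,
+                            size_t page_size)
+{
+    for (size_t i = 0; i < n_pages; i++) {
+        list[i].bus_address = page_addr[i];
+        list[i].length      = (uint32_t)page_size;
+        list[i].last        = (i == n_pages - 1);
+    }
+    return n_pages;
+}
+\end{verbatim}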
 
With respect to the previous version based on PCI-Express Gen2, the PCI-Express IP-Core provided by Xilinx has undergone some modifications, which are reflected in our logic implementation:
@@ -134,10 +138,19 @@ the host side wall clock time. On-GPU data transfer is about twice as fast.}
\label{fig:intra-copy}
\end{figure}
-
\section{Conclusion}
-% Outlook
+
+\section{Outlook}
+
+\subsection{Hi-flex Board}
+
+A custom FPGA readout board, the Hi-flex board, is currently under development.
+% TODO: add a description and a picture of the board here.
+The board features a PCI-Express Gen3 x16 connection, with two PCI-Express x8 cores instantiated in the FPGA. The board will be used in conjunction with a PEX chip, which maps the two cores to a single x16 device as seen from the host. By using the dual-core architecture already employed with the Gen2 version of the IP-Core~\cite{rota2015dma}, we expect to achieve data throughputs of up to 13 GB/s.
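+
+The expected figure corresponds to roughly twice the net bandwidth of a single Gen3 x8 endpoint quoted in the introduction:
+\[
+2 \times 6.7\,\mathrm{GB/s} \approx 13.4\,\mathrm{GB/s}.
+\]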
\begin{itemize}
\item PCIe might be changed for InfiniBand which offers such and such
@@ -146,7 +159,7 @@ the host side wall clock time. On-GPU data transfer is about twice as fast.}
\acknowledgments
-UFO? KSETA?
+% TODO: acknowledgments still to be written (UFO? KSETA?).
\bibliographystyle{JHEP}