@@ -75,14 +75,12 @@ GB/s~\cite{ufo_camera, caselle}. In a typical scenario, data are acquired by
back-end readout systems and then transmitted in short bursts or in a
continuous streaming mode to a computing stage. In order to collect data over
long observation times, the readout architecture and the computing stages must
-be able to sustain high data rates.
-
-Recent years have also seen an increasing interest in GPU-based systems for
-High Energy Physics (HEP) (\emph{e.g.} ATLAS~\cite{atlas_gpu},
-ALICE~\cite{alice_gpu}, Mu3e~\cite{mu3e_gpu}, PANDA~\cite{panda_gpu}) and
-photon science experiments. In time-deterministic applications,\emph{e.g.} in
-Low/High-level trigger systems, latency becomes the most stringent
-requirement.
+be able to sustain high data rates. Recent years have also seen an increasing
+interest in GPU-based systems for High Energy Physics (HEP) (\emph{e.g.}
+ATLAS~\cite{atlas_gpu}, ALICE~\cite{alice_gpu}, Mu3e~\cite{mu3e_gpu},
+PANDA~\cite{panda_gpu}) and photon science experiments. In time-deterministic
+applications, such as Low/High-level trigger systems, latency becomes
+the most stringent requirement.

Due to its high bandwidth and modularity, PCIe quickly became the commercial
standard for connecting high-throughput peripherals such as GPUs or solid
@@ -90,29 +88,23 @@ state disks. Moreover, optical PCIe networks have been demonstrated a decade
ago~\cite{optical_pcie}, opening the possibility of using PCIe as a
communication link over long distances.

-Several solutions for direct FPGA/GPU communication based on PCIe are reported
+Several solutions for direct FPGA-GPU communication based on PCIe are reported
in literature, and all of them are based on NVIDIA's GPUdirect technology. In
-the implementation of bittnerner and Ruf ~\cite{bittner} the GPU acts as
+the implementation of Bittner and Ruf~\cite{bittner}, the GPU acts as
master during an FPGA-to-GPU data transfer, reading data from the FPGA. This
solution limits the reported bandwidth and latency to 514 MB/s and 40~\textmu
-s, respectively.
-
-%LR: FPGA^2 it's the name of their thing...
-%MV: best idea in the world :)
-%LR: Let's call ours FPGA^2_GPU
-
-When the FPGA is used as a master, a higher throughput can be achieved. An
-example of this approach is the \emph{FPGA\textsuperscript{2}} framework by Thoma
-et~al.\cite{thoma}, which reaches 2454 MB/s using a 8x Gen2.0 data link.
-Lonardo et~al.\ achieved low latencies with their NaNet design, an FPGA-based
-PCIe network interface card~\cite{lonardo2015nanet}. The Gbe link however
-limits the latency performance of the system to a few tens of \textmu s. If
-only the FPGA-to-GPU latency is considered, the measured values span between
-1~\textmu s and 6~\textmu s, depending on the datagram size. Moreover, the
-bandwidth saturates at 120 MB/s. Nieto et~al.\ presented a system based on a
-PXIexpress data link that makes use of four PCIe 1.0
-links~\cite{nieto2015high}. Their system (as limited by the interconnect)
-achieves an average throughput of 870 MB/s with 1 KB block transfers.
+s, respectively. When the FPGA is used as a master, a higher throughput can be
+achieved. An example of this approach is the \emph{FPGA\textsuperscript{2}}
+framework by Thoma et~al.~\cite{thoma}, which reaches 2454 MB/s using an 8x
+Gen2.0 data link. Lonardo et~al.\ achieved low latencies with their NaNet
+design, an FPGA-based PCIe network interface card~\cite{lonardo2015nanet}. The
+GbE link, however, limits the latency performance of the system to a few tens
+of \textmu s. If only the FPGA-to-GPU latency is considered, the measured
+values span between 1~\textmu s and 6~\textmu s, depending on the datagram
+size. Nieto et~al.\ presented a system based on a PXIexpress data link that
+makes use of four PCIe 1.0 links~\cite{nieto2015high}. Their system (as
+limited by the interconnect) achieves an average throughput of 870 MB/s with
+1 KB block transfers.

In order to achieve the best performance in terms of latency and bandwidth, we
developed a high-performance DMA engine based on Xilinx's PCIe Gen3 Core. To
@@ -121,11 +113,11 @@ for our scalable GPU processing framework~\cite{vogelgesang2012ufo}. This
framework allows for an easy construction of streamed data processing on
heterogeneous multi-GPU systems. Because the framework is based on OpenCL,
integration with NVIDIA's CUDA functions for GPUDirect technology is not
-possible at the moment. Thus, we used AMD's DirectGMA technology to integrate
-direct FPGA-to-GPU communication into our processing pipeline. In this paper
-we report the performance of our DMA engine for FPGA-to-CPU communication and
-some preliminary measurements about DirectGMA's performance in low-latency
-applications.
+possible at the moment. We therefore used AMD's DirectGMA technology to
+integrate direct FPGA-to-GPU communication into our processing pipeline. In
+this paper we report the throughput performance of our architecture together
+with some preliminary measurements about DirectGMA's applicability in
+low-latency applications.
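+
+As an illustration of this integration, the following host-side sketch is
+based on the publicly documented cl\_amd\_bus\_addressable\_memory OpenCL
+extension rather than on our production code; the buffer size, the final
+register write and all variable names are purely illustrative:
+
+\begin{verbatim}
+/* Sketch: expose a GPU buffer to the FPGA with DirectGMA.
+   Assumes an existing OpenCL context, queue and platform. */
+cl_int err;
+cl_bus_address_amd bus_addr = {0};
+cl_mem dma_buf = clCreateBuffer(ctx, CL_MEM_BUS_ADDRESSABLE_AMD,
+                                64 << 20, NULL, &err);
+
+/* Extension entry points are resolved at run time. */
+clEnqueueMakeBuffersResidentAMD_fn make_resident =
+    (clEnqueueMakeBuffersResidentAMD_fn)
+    clGetExtensionFunctionAddressForPlatform(
+        platform, "clEnqueueMakeBuffersResidentAMD");
+
+/* Pin the buffer and obtain its PCIe bus address ... */
+make_resident(queue, 1, &dma_buf, CL_TRUE, &bus_addr, 0, NULL, NULL);
+
+/* ... which the host then writes into the descriptor registers of
+   the FPGA DMA engine (illustrative helper), so that data is pushed
+   directly into GPU memory without passing through main memory. */
+write_fpga_register(DESC_ADDR_REG, bus_addr.surface_bus_address);
+\end{verbatim}
+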
%% LR: this part -> OK
\section{Architecture}
@@ -238,18 +230,19 @@ as bus master and pushes data to the FPGA.

%% Double Buffering strategy.

-Due to hardware restrictions the largest possible GPU buffer sizes are about 95
-MB but larger transfers can be achieved by using a double buffering mechanism.
-data are copied from the DirectGMA buffer exposed to the FPGA into a different
-GPU buffer. To verify that we can keep up with the incoming data throughput
-using this strategy, we measured the data throughput within a GPU by copying
-data from a smaller sized buffer representing the DMA buffer to a larger
-destination buffer. At a block size of about 384 KB the throughput surpasses the
-maximum possible PCIe bandwidth, and it reaches 40 GB/s for blocks bigger than 5
-MB. Double buffering is therefore a viable solution for very large data
-transfers, where throughput performance is favoured over latency. For data sizes
-less than 95 MB, we can determine all addresses before the actual transfers thus
-keeping the CPU out of the transfer loop.
+Due to hardware restrictions, the largest possible GPU buffer sizes are about
+95 MB, but larger transfers can be achieved by using a double buffering
+mechanism: data are copied from the buffer exposed to the FPGA into a
+different location in GPU memory. To verify that we can keep up with the
+incoming data throughput using this strategy, we measured the data throughput
+within a GPU by copying data from a smaller buffer representing the DMA
+buffer to a larger destination buffer. At a block size of about 384 KB the
+throughput surpasses the maximum possible PCIe bandwidth, and it reaches 40
+GB/s for blocks bigger than 5 MB. Double buffering is therefore a viable
+solution for very large data transfers, where throughput performance is
+favoured over latency. For data sizes less than 95 MB, we can determine all
+addresses before the actual transfers, thus keeping the CPU out of the
+transfer loop.
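+
+The copy step of this scheme is an ordinary device-to-device copy. A
+simplified sketch is given below; buffer and variable names are illustrative
+and error handling is omitted:
+
+\begin{verbatim}
+/* Double buffering copy step: move the block that the FPGA has just
+   written into the small DMA window (dma_buf) to its slot in a large
+   destination buffer (dst_buf) residing in GPU memory. */
+size_t block_size = 32 << 20;                 /* example: 32 MB      */
+size_t dst_offset = blocks_done * block_size; /* next free slot      */
+
+clEnqueueCopyBuffer(queue, dma_buf, dst_buf,
+                    0,           /* offset inside the DMA window     */
+                    dst_offset,  /* offset in the destination buffer */
+                    block_size, 0, NULL, NULL);
+clFinish(queue);   /* after this, the DMA window can be reused       */
+\end{verbatim}
+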
%% Ufo Framework
To process the data, we encapsulated the DMA setup and memory mapping in a
@@ -318,22 +311,24 @@ in~\cite{rota2015dma}.
\includegraphics[width=0.85\textwidth]{figures/throughput}
\caption{%
Measured throughput for data transfers from FPGA to main memory
- (CPU) and from FPGA to the global GPU memory (GPU).
+ (CPU) and from FPGA to the global GPU memory (GPU) using Setup 1.
}
\label{fig:throughput}
\end{figure}

-In order to evaluate the maximum performance of the DMA engine, measurements of pure
-data throughput were carried out using Setup 1. The results are shown in
-\figref{fig:throughput} for transfers to the system's main memory as well as
-to the global memory. For FPGA-to-GPU data transfers bigger than 95 MB, the
+In order to evaluate the maximum performance of the DMA engine, measurements
+of pure data throughput were carried out using Setup 1. The results are shown
+in \figref{fig:throughput} for transfers to the system's main memory as well
+as to the global memory. For FPGA-to-GPU data transfers bigger than 95 MB, the
double buffering mechanism was used. As one can see, in both cases the write
performance is primarily limited by the PCIe bus. Up until 2 MB data transfer
size, the throughput to the GPU slowly approaches 100 MB/s. From there on,
the throughput increases up to 6.4 GB/s at about 1 GB data size. The CPU
throughput saturates earlier and the maximum throughput is 6.6 GB/s. The slope
and maximum performance depend on the different implementations of the
-handshaking sequence between DMA engine and the hosts.
+handshaking sequence between the DMA engine and the hosts. With Setup 2, the
+PCIe Gen1 link limits the throughput to system main memory to around 700 MB/s.
+However, transfers to GPU memory yielded the same results as Setup 1.

%% --------------------------------------------------------------------------
\subsection{Latency}
@@ -367,32 +362,31 @@ We conducted the following test in order to measure the latency introduced by th
3) when all the data has been transferred, the DMA engine notifies the host that new data is present by writing into a specific address in the system main memory.
4) the host acknowledges that data has been received by issuing the \emph{stop\_dma} command.

-The correct ordering of the packets is assured by the PCIe protocol.
A counter on the FPGA measures the time interval between the \emph{start\_dma}
and \emph{stop\_dma} commands with a resolution of 4 ns, therefore measuring
-the round-trip latency of the system. The round-trip latencies for data
-transfers to system main memory and GPU memory are shown in
-\figref{fig:latency}.
-
-When system main memory is used,
-latencies as low as 1.1 \textmu s are achieved with Setup 1 for a packet size
-of 1024 B. The higher latency and the dependance on size measured with Setup 2
-are caused by the slower PCIe x4 Gen1 link connecting the FPGA board to the system main memory.
-
-The same test was performed when transferring data inside GPU memory, but also
-in this case the notification is written to systen main memory. This approach
-was used because the latency introduced by OpenCL scheduling (\~ 100 \textmu
-s) does not allow for a direct measurement based only on DirectGMA
-communication. When connecting the devices to the same RC, as in Setup 2, a
-latency of 2 \textmu is achieved (limited by the latency to system main
-memory, as seen in \figref{fig:latency}.a. On the contrary, if the FPGA board
-and the GPU are connected to different RC as in Setup 1, the latency increases
-significantly. It must be noted that the low latencies measured with Setup 1
-for packet sizes below 1 kB seem to be due to a caching mechanism inside the
-PCIe switch, and it is not clear whether data has been successfully written
-into GPU memory when the notification is delivered to the CPU. This effect
-must be taken into account for future implementations as it could potentially
-lead to data corruption.
+the round-trip latency of the system. The correct ordering of the packets is
+assured by the PCIe protocol. The measured round-trip latencies for data transfers to
+system main memory and GPU memory are reported in \figref{fig:latency}.
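+
+For reference, the host-side part of this handshake can be sketched as
+follows; the mapping helpers and register indices are illustrative
+placeholders, only the \emph{start\_dma}/\emph{stop\_dma} sequence follows
+the procedure described above:
+
+\begin{verbatim}
+/* Host-side sketch of the round-trip latency test; map_fpga_bar(),
+   alloc_dma_word() and the register indices are placeholders. */
+volatile uint32_t *bar = map_fpga_bar();      /* FPGA control registers   */
+volatile uint32_t *notify = alloc_dma_word(); /* word written by the FPGA */
+
+*notify = 0;
+bar[REG_START_DMA] = 1;  /* issue start_dma: FPGA counter starts        */
+while (*notify == 0)     /* 3) wait for the notification write          */
+    ;                    /* busy-wait to avoid scheduler jitter         */
+bar[REG_STOP_DMA] = 1;   /* 4) acknowledge with stop_dma: counter stops */
+\end{verbatim}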
+
+When system main memory is used, latencies as low as 1.1 \textmu s are
+achieved with Setup 1 for a packet size of 1024 B. The higher latency and the
+dependence on size measured with Setup 2 are caused by the slower PCIe x4 Gen1
+link connecting the FPGA board to the system main memory.
+
+The same test was performed when transferring data inside GPU memory. Like in
+the previous case, the notification was written into system main memory. This
+approach was used because the latency introduced by OpenCL scheduling in our
+implementation ($\sim$100--200~\textmu s) did not allow a precise measurement
+based only on FPGA-GPU communication. When connecting the devices to the same
+RC, as in Setup 2, a latency of 2~\textmu s is achieved (limited by the latency
+to system main memory, as seen in \figref{fig:latency}.a). On the contrary, if
+the FPGA board and the GPU are connected to different RCs, as in Setup 1, the
+latency increases significantly with packet size. It must be noted that the
+low latencies measured with Setup 1 for packet sizes below 1 kB seem to be due
+to a caching mechanism inside the PCIe switch, and it is not clear whether
+data has been successfully written into GPU memory when the notification is
+delivered to the CPU. This effect must be taken into account in future
+implementations as it could potentially lead to data corruption.

\section{Conclusion and outlook}