|
@@ -34,23 +34,25 @@
|
|
|
E-mail: \email{lorenzo.rota@kit.edu}, \email{matthias.vogelgesang@kit.edu}
|
|
|
}
|
|
|
|
|
|
-\abstract{ Modern physics experiments have reached multi-GB/s data rates. Fast
|
|
|
-data links and high performance computing stages are required for continuous
|
|
|
-data acquisition and processing. Because of their intrinsic parallelism and
|
|
|
-computational power, GPUs emerged as an ideal solution to process this data in
|
|
|
-high performance computing applications. In this paper we present a high-
|
|
|
-throughput platform based on direct FPGA-GPU communication. The
|
|
|
-architecture consists of a Direct Memory Access (DMA) engine compatible with
|
|
|
-the Xilinx PCI-Express core, a Linux driver for register access, and high-
|
|
|
-level software to manage direct memory transfers using AMD's DirectGMA
|
|
|
-technology. Measurements with a Gen3\,x8 link show a throughput of 6.4~GB/s
|
|
|
-for transfers to GPU memory and 6.6~GB/s to system memory. We also assesed
|
|
|
-the possibility of using our architecture in low latency systems: preliminary
|
|
|
-measurements show a round-trip latency as low as 1 \textmu s for data
|
|
|
-transfers to system memory, while the additional latency introduced by OpenCL
|
|
|
-scheduling is the current limitation for GPU based systems. Our
|
|
|
-implementation is suitable for real- time DAQ system applications ranging from
|
|
|
-photon science and medical imaging to High Energy Physics (HEP) systems.}
|
|
|
+\abstract{%
|
|
|
+ Modern physics experiments have reached multi-GB/s data rates. Fast data links
|
|
|
+ and high performance computing stages are required for continuous data
|
|
|
+ acquisition and processing. Because of their intrinsic parallelism and
|
|
|
+ computational power, GPUs emerged as an ideal solution to process this data in
|
|
|
+ high performance computing applications. In this paper we present a
|
|
|
+ high-throughput platform based on direct FPGA-GPU communication. The architecture
|
|
|
+ consists of a Direct Memory Access (DMA) engine compatible with the Xilinx
|
|
|
+ PCI-Express core, a Linux driver for register access, and high-level software
|
|
|
+ to manage direct memory transfers using AMD's DirectGMA technology.
|
|
|
+ Measurements with a PCIe 3.0 x8 link show a throughput of 6.4~GB/s for transfers
|
|
|
+ to GPU memory and 6.6~GB/s to system memory. We also assessed the possibility
|
|
|
+ of using the architecture in low-latency systems: preliminary measurements
|
|
|
+ show a round-trip latency as low as 1 \textmu s for data transfers to system
|
|
|
+ memory, while the additional latency introduced by OpenCL scheduling is the
|
|
|
+ current limitation for GPU-based systems. Our implementation is suitable for
|
|
|
+ real-time DAQ applications ranging from photon science and medical
|
|
|
+ imaging to High Energy Physics (HEP) systems.
|
|
|
+}
|
|
|
|
|
|
\keywords{FPGA; GPU; PCI-Express; OpenCL; DirectGMA}
|
|
|
|
|
@@ -69,69 +71,67 @@ due to an unprecedented parallelism and a low cost-benefit factor. GPU
|
|
|
acceleration has found its way into numerous applications, ranging from
|
|
|
simulation to image processing.
|
|
|
|
|
|
-The data rates of bio-imaging or beam-monitoring experiments running in
|
|
|
-current generation photon science facilities have reached tens of
|
|
|
-GB/s~\cite{ufo_camera, caselle}. In a typical scenario, data are acquired by
|
|
|
-back-end readout systems and then transmitted in short bursts or in a
|
|
|
-continuous streaming mode to a computing stage. In order to collect data over
|
|
|
-long observation times, the readout architecture and the computing stages must
|
|
|
-be able to sustain high data rates. Recent years have also seen an increasing
|
|
|
-interest in GPU-based systems for High Energy Physics (HEP) (\emph{e.g.}
|
|
|
-ATLAS~\cite{atlas_gpu}, ALICE~\cite{alice_gpu}, Mu3e~\cite{mu3e_gpu},
|
|
|
-PANDA~\cite{panda_gpu}) and photon science experiments. In time-deterministic
|
|
|
-applications, such as Low/High-level trigger systems, latency becomes
|
|
|
-the most stringent requirement.
|
|
|
-
|
|
|
-Due to its high bandwidth and modularity, PCIe quickly became the commercial
|
|
|
-standard for connecting high-throughput peripherals such as GPUs or solid
|
|
|
-state disks. Moreover, optical PCIe networks have been demonstrated a decade
|
|
|
+The data rates of bio-imaging or beam-monitoring experiments running in current
|
|
|
+generation photon science facilities have reached tens of GB/s~\cite{ufo_camera,
|
|
|
+caselle}. In a typical scenario, data are acquired by back-end readout systems
|
|
|
+and then transmitted in short bursts or continuously streamed to a computing
|
|
|
+stage. In order to collect data over long observation times, the readout
|
|
|
+architecture and the computing stages must be able to sustain high data rates.
|
|
|
+Recent years have also seen an increasing interest in GPU-based systems for High
|
|
|
+Energy Physics (HEP) (\emph{e.g.} ATLAS~\cite{atlas_gpu},
|
|
|
+ALICE~\cite{alice_gpu}, Mu3e~\cite{mu3e_gpu}, PANDA~\cite{panda_gpu}) and photon
|
|
|
+science experiments. In time-deterministic applications, such as low- and high-level
|
|
|
+trigger systems, latency becomes the most stringent requirement.
|
|
|
+
|
|
|
+Due to its high bandwidth and modularity, PCIe is the \emph{de facto} commercial
|
|
|
+standard for connecting high-throughput peripherals such as GPUs or solid state
|
|
|
+disks. Moreover, optical PCIe networks have been demonstrated a decade
|
|
|
ago~\cite{optical_pcie}, opening the possibility of using PCIe as a
|
|
|
communication link over long distances.
|
|
|
|
|
|
-Several solutions for direct FPGA-GPU communication based on PCIe are reported
|
|
|
-in literature, and all of them are based on NVIDIA's GPUdirect technology. In
|
|
|
-the implementation of Bittnerner and Ruf ~\cite{bittner} the GPU acts as
|
|
|
-master during an FPGA-to-GPU data transfer, reading data from the FPGA. This
|
|
|
-solution limits the reported bandwidth and latency to 514 MB/s and 40~\textmu
|
|
|
-s, respectively. When the FPGA is used as a master, a higher throughput can be
|
|
|
-achieved. An example of this approach is the \emph{FPGA\textsuperscript{2}}
|
|
|
-framework by Thoma et~al.\cite{thoma}, which reaches 2454 MB/s using a 8x
|
|
|
-Gen2.0 data link. Lonardo et~al.\ achieved low latencies with their NaNet
|
|
|
-design, an FPGA-based PCIe network interface card~\cite{lonardo2015nanet}. The
|
|
|
-Gbe link however limits the latency performance of the system to a few tens of
|
|
|
-\textmu s. If only the FPGA-to-GPU latency is considered, the measured values
|
|
|
-span between 1~\textmu s and 6~\textmu s, depending on the datagram size.
|
|
|
-Nieto et~al.\ presented a system based on a PXIexpress data link that makes
|
|
|
-use of four PCIe 1.0 links~\cite{nieto2015high}. Their system (as limited by
|
|
|
-the interconnect) achieves an average throughput of 870 MB/s with 1 KB block
|
|
|
-transfers.
|
|
|
+Several solutions for direct FPGA-GPU communication based on PCIe and NVIDIA's
|
|
|
+proprietary GPUDirect technology are reported in the literature. In the
|
|
|
+implementation of Bittner and Ruf~\cite{bittner}, the GPU acts as master during an
|
|
|
+FPGA-to-GPU data transfer, reading data from the FPGA. This solution limits the reported bandwidth
|
|
|
+and latency to 514 MB/s and 40~\textmu s, respectively. When the FPGA is used
|
|
|
+as a master, a higher throughput can be achieved. An example of this approach
|
|
|
+is the \emph{FPGA\textsuperscript{2}} framework by Thoma et~al.\cite{thoma},
|
|
|
+which reaches 2454 MB/s using a PCIe 2.0 x8 data link. Lonardo et~al.\ achieved
|
|
|
+low latencies with their NaNet design, an FPGA-based PCIe network interface
|
|
|
+card~\cite{lonardo2015nanet}. The GbE link, however, limits the latency
|
|
|
+performance of the system to a few tens of \textmu s. If only the FPGA-to-GPU
|
|
|
+latency is considered, the measured values span between 1~\textmu s and
|
|
|
+6~\textmu s, depending on the datagram size. Nieto et~al.\ presented a system
|
|
|
+based on a PXI Express data link that makes use of four PCIe 1.0
|
|
|
+links~\cite{nieto2015high}. Their system, as limited by the interconnect,
|
|
|
+achieves an average throughput of 870 MB/s with 1 kB block transfers.
|
|
|
|
|
|
In order to achieve the best performance in terms of latency and bandwidth, we
|
|
|
-developed a high-performance DMA engine based on Xilinx's PCIe Gen3 Core. To
|
|
|
+developed a high-performance DMA engine based on Xilinx's PCIe 3.0 Core. To
|
|
|
process the data, we encapsulated the DMA setup and memory mapping in a plugin
|
|
|
for our scalable GPU processing framework~\cite{vogelgesang2012ufo}. This
|
|
|
framework allows for an easy construction of streamed data processing on
|
|
|
heterogeneous multi-GPU systems. Because the framework is based on OpenCL,
|
|
|
integration with NVIDIA's CUDA functions for GPUDirect technology is not
|
|
|
possible at the moment. We therefore used AMD's DirectGMA technology to
|
|
|
-integrate direct FPGA-to-GPU communication into our processing pipeline. In
|
|
|
-this paper we report the throughput performance of our architecture together
|
|
|
-with some preliminary measurements about DirectGMA's applicability in low-
|
|
|
-latency applications.
|
|
|
+integrate direct FPGA-to-GPU communication into our processing pipeline. In this
|
|
|
+paper we present the hardware/software interface and report the throughput
|
|
|
+performance of our architecture together with preliminary measurements of
|
|
|
+DirectGMA's applicability in low-latency applications.
|
|
|
|
|
|
%% LR: this part -> OK
|
|
|
\section{Architecture}
|
|
|
|
|
|
As shown in \figref{fig:trad-vs-dgpu} (a), traditional FPGA-GPU systems route
|
|
|
-data through system main memory by copying data from the FPGA into
|
|
|
-intermediate buffers and then finally into the GPU's main memory. Thus, the
|
|
|
-total throughput and latency of the system is limited by the main memory
|
|
|
-bandwidth. NVIDIA's GPUDirect and AMD's DirectGMA technologies allow direct
|
|
|
-communication between GPUs and auxiliary devices over PCIe. By combining this
|
|
|
-technology with DMA data transfers (see \figref{fig:trad-vs-dgpu} (b)), the
|
|
|
-overall latency of the system is reduced and total throughput increased.
|
|
|
-Moreover, the CPU and main system memory are relieved from processing because
|
|
|
-they are not directly involved in the data transfer anymore.
|
|
|
+data through system main memory by copying data from the FPGA into intermediate
|
|
|
+buffers and then finally into the GPU's main memory. Thus, the total throughput
|
|
|
+and latency of the system are limited by the main memory bandwidth. NVIDIA's
|
|
|
+GPUDirect and AMD's DirectGMA technologies allow direct communication between
|
|
|
+GPUs and auxiliary devices over PCIe. By combining this technology with DMA data
|
|
|
+transfers as shown in \figref{fig:trad-vs-dgpu} (b), the overall latency of the
|
|
|
+system is reduced and total throughput increased. Moreover, the CPU and main
|
|
|
+system memory are relieved from processing because they are not directly
|
|
|
+involved in the data transfer anymore.
|
|
|
|
|
|
\begin{figure}[t]
|
|
|
\centering
|
|
@@ -150,16 +150,14 @@ they are not directly involved in the data transfer anymore.
|
|
|
|
|
|
We have developed a DMA engine that minimizes resource utilization while
|
|
|
maintaining the flexibility of a Scatter-Gather memory
|
|
|
-policy~\cite{rota2015dma}. The main blocks are shown in \figref{fig:fpga-arch}. The engine is compatible with the Xilinx PCIe
|
|
|
-Gen2/3 IP- Core~\cite{xilinxgen3} for Xilinx FPGA families 6 and 7. DMA data
|
|
|
-transfers to/from main system memory and GPU memory are supported. Two FIFOs,
|
|
|
-with a data width of 256 bits and operating at 250 MHz, act as user- friendly
|
|
|
-interfaces with the custom logic with an input bandwidth of 7.45 GB/s. The
|
|
|
-user logic and the DMA engine are configured by the host through PIO
|
|
|
-registers. The resource
|
|
|
-utilization on a Virtex 7 device is reported in Table~\ref{table:utilization}.
|
|
|
-
|
|
|
-
|
|
|
+policy~\cite{rota2015dma}. The main blocks are shown in \figref{fig:fpga-arch}.
|
|
|
+The engine is compatible with the Xilinx PCIe 2.0/3.0 IP-Core~\cite{xilinxgen3}
|
|
|
+for Xilinx FPGA families 6 and 7. DMA data transfers to and from main system memory
|
|
|
+and GPU memory are supported. Two FIFOs, operating at 250 MHz with a data width of
|
|
|
+256 bits, act as user-friendly interfaces to the custom logic, with an input
|
|
|
+bandwidth of 7.45 GB/s. The user logic and the DMA engine are configured by the
|
|
|
+host system through PIO registers. The resource utilization on a Virtex 7 device
|
|
|
+is reported in Table~\ref{table:utilization}.
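+
+In practice, the host reaches these PIO registers by mapping the FPGA's BAR
+into user space through the Linux driver. The sketch below illustrates the
+general pattern in C; the device node and the register offsets are
+hypothetical examples rather than the actual register map.
+
+\begin{verbatim}
+#include <fcntl.h>
+#include <stdint.h>
+#include <sys/mman.h>
+
+/* Hypothetical 32-bit register offsets (in words). */
+enum { REG_DMA_ADDR = 0, REG_DMA_SIZE = 1, REG_DMA_START = 2 };
+
+static void start_transfer(uint32_t bus_addr, uint32_t size)
+{
+    int fd = open("/dev/fpga0", O_RDWR);
+    volatile uint32_t *regs = mmap(NULL, 4096,
+        PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+
+    regs[REG_DMA_ADDR]  = bus_addr;  /* destination bus address  */
+    regs[REG_DMA_SIZE]  = size;      /* transfer length in bytes */
+    regs[REG_DMA_START] = 1;         /* trigger the DMA engine   */
+}
+\end{verbatim}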
|
|
|
|
|
|
\begin{figure}[t]
|
|
|
\small
|
|
@@ -182,7 +180,7 @@ utilization on a Virtex 7 device is reported in Table~\ref{table:utilization}.
|
|
|
\bottomrule
|
|
|
\end{tabular}
|
|
|
}{%
|
|
|
- \caption{Resource utilization on a xc7vx690t-ffg1761 device}%
|
|
|
+ \caption{Resource utilization on a xc7vx690t-ffg1761 device.}%
|
|
|
\label{table:utilization}
|
|
|
}
|
|
|
\end{floatrow}
|
|
@@ -210,63 +208,59 @@ address is 2 GB.
|
|
|
\end{figure}
|
|
|
|
|
|
%% Description of figure
|
|
|
-On the host side, AMD's DirectGMA technology, an implementation of the bus-
|
|
|
-addressable memory extension for OpenCL 1.1 and later, is used to write from
|
|
|
+On the host side, AMD's DirectGMA technology, an implementation of the
|
|
|
+bus-addressable memory extension for OpenCL 1.1 and later, is used to write from
|
|
|
the FPGA to GPU memory and from the GPU to the FPGA's control registers.
|
|
|
-\figref{fig:opencl-setup} illustrates the main mode of operation: to write
|
|
|
-into the GPU, the physical bus addresses of the GPU buffers are determined
|
|
|
-with a call to \texttt{clEnqueue\-Make\-Buffers\-Resident\-AMD} and set by the
|
|
|
-host CPU in a control register of the FPGA (1). The FPGA then writes data
|
|
|
-blocks autonomously in DMA fashion (2). To signal events to the FPGA (4), the
|
|
|
-control registers can be mapped into the GPU's address space passing a special
|
|
|
-AMD-specific flag and passing the physical BAR address of the FPGA
|
|
|
-configuration memory to the \texttt{cl\-Create\-Buffer} function. From the
|
|
|
-GPU, this memory is seen transparently as regular GPU memory and can be
|
|
|
-written accordingly (3). In our setup, trigger registers are used to notify
|
|
|
-the FPGA on successful or failed evaluation of the data. Using the
|
|
|
-\texttt{cl\-Enqueue\-Copy\-Buffer} function call it is possible to write
|
|
|
-entire memory regions in DMA fashion to the FPGA. In this case, the GPU acts
|
|
|
-as bus master and pushes data to the FPGA.
|
|
|
+\figref{fig:opencl-setup} illustrates the main mode of operation: to write into
|
|
|
+the GPU, the physical bus addresses of the GPU buffers are determined with a
|
|
|
+call to \texttt{clEnqueue\-Make\-Buffers\-Resident\-AMD} and set by the host CPU
|
|
|
+in a control register of the FPGA (1). The FPGA then writes data blocks
|
|
|
+autonomously in DMA fashion (2). To signal events to the FPGA (4), the control
|
|
|
+registers can be mapped into the GPU's address space by passing a special
|
|
|
+AMD-specific flag and the physical BAR address of the FPGA configuration
|
|
|
+memory to the \texttt{cl\-Create\-Buffer} function. From the GPU, this memory is
|
|
|
+seen transparently as regular GPU memory and can be written accordingly (3). In
|
|
|
+our setup, trigger registers are used to notify the FPGA of successful or failed
|
|
|
+evaluation of the data. Using the \texttt{cl\-Enqueue\-Copy\-Buffer} function
|
|
|
+call it is possible to write entire memory regions in DMA fashion to the FPGA.
|
|
|
+In this case, the GPU acts as bus master and pushes data to the FPGA.
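+
+The host-side setup can be sketched with the
+\texttt{cl\_amd\_bus\_addressable\_memory} extension as follows. This is a
+minimal illustration rather than the full implementation: \texttt{ctx} and
+\texttt{queue} are assumed to be initialized, \texttt{FPGA\_BAR\_ADDR} is a
+placeholder for the BAR address obtained from the driver, error handling is
+omitted, and the extension function must be resolved at run time with
+\texttt{clGetExtensionFunctionAddressForPlatform}.
+
+\begin{verbatim}
+/* (1) Create a GPU buffer the FPGA can write to and pin it to
+ * obtain its physical bus address. */
+cl_mem gpu_buf = clCreateBuffer(ctx, CL_MEM_BUS_ADDRESSABLE_AMD,
+                                size, NULL, &err);
+cl_bus_address_amd gpu_addr;
+clEnqueueMakeBuffersResidentAMD(queue, 1, &gpu_buf, CL_TRUE,
+                                &gpu_addr, 0, NULL, NULL);
+/* gpu_addr.surface_bus_address is then written into an FPGA
+ * control register, after which the FPGA pushes data (2). */
+
+/* (3), (4) Map the FPGA control registers into GPU address
+ * space by passing the physical BAR address; the GPU then
+ * writes them like regular memory. */
+cl_bus_address_amd bar = { .surface_bus_address = FPGA_BAR_ADDR,
+                           .marker_bus_address  = FPGA_BAR_ADDR };
+cl_mem fpga_regs = clCreateBuffer(ctx, CL_MEM_EXTERNAL_PHYSICAL_AMD,
+                                  reg_size, &bar, &err);
+\end{verbatim}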
|
|
|
|
|
|
%% Double Buffering strategy.
|
|
|
|
|
|
-Due to hardware restrictions the largest possible GPU buffer sizes are about
|
|
|
-95 MB but larger transfers can be achieved by using a double buffering
|
|
|
-mechanism: data are copied from the buffer exposed to the FPGA into a
|
|
|
-different location in GPU memory. To verify that we can keep up with the
|
|
|
-incoming data throughput using this strategy, we measured the data throughput
|
|
|
-within a GPU by copying data from a smaller sized buffer representing the DMA
|
|
|
-buffer to a larger destination buffer. At a block size of about 384 KB the
|
|
|
-throughput surpasses the maximum possible PCIe bandwidth, and it reaches 40
|
|
|
-GB/s for blocks bigger than 5 MB. Double buffering is therefore a viable
|
|
|
-solution for very large data transfers, where throughput performance is
|
|
|
-favoured over latency. For data sizes less than 95 MB, we can determine all
|
|
|
-addresses before the actual transfers thus keeping the CPU out of the transfer
|
|
|
-loop.
|
|
|
+Due to hardware restrictions with AMD FirePro W9100 cards, the largest possible
|
|
|
+GPU buffer sizes are about 95 MB. However, larger transfers can be achieved by
|
|
|
+using a double buffering mechanism: data are copied from the buffer exposed to
|
|
|
+the FPGA into a different location in GPU memory. To verify that we can keep up
|
|
|
+with the incoming data throughput using this strategy, we measured the data
|
|
|
+throughput within a GPU by copying data from a smaller buffer representing
|
|
|
+the DMA buffer to a larger destination buffer. At a block size of about 384 kB
|
|
|
+the throughput surpasses the maximum possible PCIe bandwidth. Block transfers
|
|
|
+larger than 5 MB saturate at 40 GB/s. Double buffering is
|
|
|
+therefore a viable solution for very large data transfers, where throughput
|
|
|
+performance is favoured over latency. For data sizes less than 95 MB, we can
|
|
|
+determine all addresses before the actual transfers, thus keeping the CPU out of
|
|
|
+the transfer loop.
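+
+A minimal sketch of this double-buffering loop is given below. The
+hand-shaking helpers stand in for the trigger-register mechanism described
+above; their names, like the buffer names, are hypothetical.
+
+\begin{verbatim}
+/* dma_buf: small buffer exposed to the FPGA (below 95 MB)
+ * dst_buf: large destination buffer in GPU memory */
+size_t offset = 0;
+while (offset < total_size) {
+    wait_for_fpga_block();           /* block arrived via DMA      */
+    clEnqueueCopyBuffer(queue, dma_buf, dst_buf,
+                        0, offset, block_size, 0, NULL, NULL);
+    clFinish(queue);                 /* copy finished on the GPU   */
+    release_fpga_block();            /* FPGA may overwrite dma_buf */
+    offset += block_size;
+}
+\end{verbatim}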
|
|
|
|
|
|
%% Ufo Framework
|
|
|
To process the data, we encapsulated the DMA setup and memory mapping in a
|
|
|
plugin for our scalable GPU processing framework~\cite{vogelgesang2012ufo}.
|
|
|
This framework allows for an easy construction of streamed data processing on
|
|
|
-heterogeneous multi-GPU systems. For example, to read data from the FPGA,
|
|
|
-decode from its specific data format and run a Fourier transform on the GPU as
|
|
|
-well as writing back the results to disk, one can run the following on the
|
|
|
-command line:
|
|
|
-
|
|
|
+heterogeneous multi-GPU systems. For example, to read data from the FPGA, decode
|
|
|
+from its specific data format, run a Fourier transform on the GPU and
|
|
|
+write the results back to disk, one can run the following on the command line:
|
|
|
|
|
|
\begin{verbatim}
|
|
|
ufo-launch direct-gma ! decode ! fft ! write filename=out.raw
|
|
|
\end{verbatim}
|
|
|
|
|
|
-The framework takes care of scheduling the tasks and distributing the data
|
|
|
-items to one or more GPUs. High throughput is achieved by the combination of
|
|
|
-fine- and coarse-grained data parallelism, \emph{i.e.} processing a single
|
|
|
-data item on a GPU using thousands of threads and by splitting the data stream
|
|
|
-and feeding individual data items to separate GPUs. None of this requires any
|
|
|
-user intervention and is solely determined by the framework in an automatized
|
|
|
+The framework takes care of scheduling the tasks and distributing the data items
|
|
|
+to one or more GPUs. High throughput is achieved by the combination of fine- and
|
|
|
+coarse-grained data parallelism, \emph{i.e.} processing a single data item on a
|
|
|
+GPU using thousands of threads and splitting the data stream to feed
|
|
|
+individual data items to separate GPUs. None of this requires any user
|
|
|
+intervention and is solely determined by the framework in an automated
|
|
|
fashion. A complementary application programming interface allows users to
|
|
|
-develop custom applications written in C or high-level languages such as
|
|
|
-Python.
|
|
|
+develop custom applications in C or high-level languages such as Python.
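+
+The same pipeline can also be constructed programmatically. The following
+sketch assumes the framework's GObject-based C interface and reuses the
+plugin names from the \texttt{ufo-launch} example above; it is indicative
+rather than a verbatim excerpt of the API.
+
+\begin{verbatim}
+#include <ufo/ufo.h>
+
+GError *error = NULL;
+UfoPluginManager *pm = ufo_plugin_manager_new ();
+UfoTaskGraph *graph = UFO_TASK_GRAPH (ufo_task_graph_new ());
+
+/* Instantiate the same tasks as in the command line example. */
+UfoTaskNode *dma    = ufo_plugin_manager_get_task (pm, "direct-gma", &error);
+UfoTaskNode *decode = ufo_plugin_manager_get_task (pm, "decode", &error);
+UfoTaskNode *fft    = ufo_plugin_manager_get_task (pm, "fft", &error);
+UfoTaskNode *write  = ufo_plugin_manager_get_task (pm, "write", &error);
+g_object_set (write, "filename", "out.raw", NULL);
+
+/* Wire the processing chain and run it on the available GPUs. */
+ufo_task_graph_connect_nodes (graph, dma, decode);
+ufo_task_graph_connect_nodes (graph, decode, fft);
+ufo_task_graph_connect_nodes (graph, fft, write);
+
+UfoBaseScheduler *sched = ufo_scheduler_new ();
+ufo_base_scheduler_run (sched, graph, &error);
+\end{verbatim}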
|
|
|
|
|
|
|
|
|
%% --------------------------------------------------------------------------
|
|
@@ -276,7 +270,7 @@ Python.
|
|
|
\begin{table}[b]
|
|
|
\centering
|
|
|
\small
|
|
|
-\caption{Setups used for throughput and latency measurements}
|
|
|
+\caption{Setups used for throughput and latency measurements.}
|
|
|
\label{table:setups}
|
|
|
\tabcolsep=0.11cm
|
|
|
\begin{tabular}{@{}llll@{}}
|
|
@@ -293,18 +287,17 @@ PCIe slot: FPGA \& GPU & x8 Gen3 (different RC) & x8 Gen3 (same RC) \\
|
|
|
\end{table}
|
|
|
|
|
|
We carried out performance measurements on two different setups, which are
|
|
|
-described in table~\ref{table:setups}. In both setups, a Xilinx VC709
|
|
|
-evaluation board was used. In Setup 1, the FPGA board and the GPU were plugged
|
|
|
-into a PCIe 3.0 slot, but they were connected to different PCIe Root Complexes
|
|
|
-(RC). In Setup 2, a low-end Supermicro X7SPA-HF-D525 system was connected to a
|
|
|
-Netstor NA255A xeternal PCIe enclosure, where both the FPGA board and the GPU
|
|
|
-were connected to the same RC, as opposed to Setup 1. As stated in the
|
|
|
-NVIDIA's GPUDirect documentation, the devices must share the same RC to
|
|
|
-achieve the best performance~\cite{cuda_doc}. In case of FPGA-to-CPU data
|
|
|
-transfers, the software implementation is the one described
|
|
|
-in~\cite{rota2015dma}.
|
|
|
+described in table~\ref{table:setups}. In both setups, a Xilinx VC709 evaluation
|
|
|
+board was used. In Setup 1, the FPGA board and the GPU were plugged into a PCIe
|
|
|
+3.0 slot, but they were connected to different PCIe Root Complexes (RC). In
|
|
|
+Setup 2, a low-end Supermicro X7SPA-HF-D525 system was connected to a Netstor
|
|
|
+NA255A external PCIe enclosure. As opposed to Setup 1, both the FPGA board and
|
|
|
+the GPU were connected to the same RC. As stated in NVIDIA's GPUDirect
|
|
|
+documentation, the devices must share the same RC to achieve the best
|
|
|
+performance~\cite{cuda_doc}. In case of FPGA-to-CPU data transfers, the software
|
|
|
+implementation is the one described in~\cite{rota2015dma}.
|
|
|
+
|
|
|
|
|
|
-%% --------------------------------------------------------------------------
|
|
|
\subsection{Throughput}
|
|
|
|
|
|
\begin{figure}[t]
|
|
@@ -316,23 +309,22 @@ in~\cite{rota2015dma}.
|
|
|
\label{fig:throughput}
|
|
|
\end{figure}
|
|
|
|
|
|
-In order to evaluate the maximum performance of the DMA engine, measurements
|
|
|
-of pure data throughput were carried out using Setup 1. The results are shown
|
|
|
-in \figref{fig:throughput} for transfers to the system's main memory as well
|
|
|
-as to the global memory. For FPGA-to-GPU data transfers bigger than 95 MB, the
|
|
|
-double buffering mechanism was used. As one can see, in both cases the write
|
|
|
+In order to evaluate the maximum performance of the DMA engine, measurements of
|
|
|
+pure data throughput were carried out using Setup 1. The results are shown in
|
|
|
+\figref{fig:throughput} for transfers to the system's main memory as well as to
|
|
|
+the global memory. For FPGA-to-GPU data transfers bigger than 95 MB, the double
|
|
|
+buffering mechanism was used. As one can see, in both cases the write
|
|
|
performance is primarily limited by the PCIe bus. Up until 2 MB data transfer
|
|
|
-size, the throughput to the GPU is approaching slowly 100 MB/s. From there on,
|
|
|
+size, the throughput to the GPU is slowly approaching 100 MB/s. From there on,
|
|
|
the throughput increases up to 6.4 GB/s at about 1 GB data size. The CPU
|
|
|
-throughput saturates earlier and the maximum throughput is 6.6 GB/s. The slope
|
|
|
-and maximum performance depend on the different implementation of the
|
|
|
-handshaking sequence between DMA engine and the hosts. With Setup 2, the PCIe
|
|
|
-Gen1 link limits the throughput to system main memory to around 700 MB/s.
|
|
|
-However, transfers to GPU memory yielded the same results as Setup 1.
|
|
|
+throughput saturates earlier at a maximum throughput of 6.6 GB/s. The slope and
|
|
|
+maximum performance depend on the different implementation of the handshaking
|
|
|
+sequence between DMA engine and the hosts. With Setup 2, the PCIe 1.0 link
|
|
|
+limits the throughput to system main memory to around 700 MB/s. However,
|
|
|
+transfers to GPU memory yielded the same results as Setup 1.
|
|
|
|
|
|
-%% --------------------------------------------------------------------------
|
|
|
-\subsection{Latency}
|
|
|
|
|
|
+\subsection{Latency}
|
|
|
|
|
|
\begin{figure}[t]
|
|
|
\centering
|
|
@@ -342,52 +334,57 @@ However, transfers to GPU memory yielded the same results as Setup 1.
|
|
|
|
|
|
\label{fig:latency-cpu}
|
|
|
\vspace{-0.4\baselineskip}
|
|
|
- \caption{}
|
|
|
+ \caption{System memory transfer latency}
|
|
|
\end{subfigure}
|
|
|
\begin{subfigure}[b]{.49\textwidth}
|
|
|
\includegraphics[width=\textwidth]{figures/latency-gpu}
|
|
|
|
|
|
\label{fig:latency-gpu}
|
|
|
\vspace{-0.4\baselineskip}
|
|
|
- \caption{}
|
|
|
+ \caption{GPU memory transfer latency}
|
|
|
\end{subfigure}
|
|
|
- \caption{Measured round-trip latency for data transfers to system main memory (a) and GPU memory (b).}
|
|
|
+ \caption{%
|
|
|
+ Measured round-trip latency for data transfers to system main memory (a) and GPU memory (b).
|
|
|
+ }
|
|
|
\label{fig:latency}
|
|
|
\end{figure}
|
|
|
|
|
|
-
|
|
|
-We conducted the following test in order to measure the latency introduced by the DMA engine :
|
|
|
-1) the host starts a DMA transfer by issuing the \emph{start\_dma} command.
|
|
|
-2) the DMA engine transmits data into the system main memory.
|
|
|
-3) when all the data has been transferred, the DMA engine notifies the host that new data is present by writing into a specific address in the system main memory.
|
|
|
-4) the host acknowledges that data has been received by issuing the the \emph{stop\_dma} command.
|
|
|
+We conducted the following test in order to measure the latency introduced by the DMA engine:
|
|
|
+1) the host starts a DMA transfer by issuing the \emph{start\_dma} command,
|
|
|
+2) the DMA engine transmits data into the system main memory,
|
|
|
+3) when all the data has been transferred, the DMA engine notifies the host that
|
|
|
+new data is present by writing into a specific address in the system main
|
|
|
+memory,
|
|
|
+4) the host acknowledges that data has been received by issuing the \emph{stop\_dma} command.
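+
+On the host, this sequence reduces to a register write, a poll on the
+notification word and a second register write, as sketched below; the
+register offsets and the notification protocol are hypothetical.
+
+\begin{verbatim}
+volatile uint32_t *regs;    /* FPGA registers, mapped via driver */
+volatile uint32_t *notify;  /* DMA-visible word in main memory   */
+
+*notify = 0;
+regs[REG_START_DMA] = 1;    /* 1) issue start_dma                */
+while (*notify == 0)        /* 2)-3) wait for the FPGA's write   */
+    ;
+regs[REG_STOP_DMA] = 1;     /* 4) acknowledge with stop_dma      */
+\end{verbatim}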
|
|
|
|
|
|
A counter on the FPGA measures the time interval between the \emph{start\_dma}
|
|
|
-and \emph{stop\_dma} commands with a resolution of 4 ns, therefore measuring
|
|
|
-the round-trip latency of the system. The correct ordering of the packets is
|
|
|
-assured by the PCIe protocol. The measured round-trip latencies for data transfers to
|
|
|
-system main memory and GPU memory are reported in \figref{fig:latency}.
|
|
|
+and \emph{stop\_dma} commands with a resolution of 4 ns, therefore measuring the
|
|
|
+round-trip latency of the system. The correct ordering of the packets is
|
|
|
+guaranteed by the PCIe protocol. The measured round-trip latencies for data
|
|
|
+transfers to system main memory and GPU memory are shown in
|
|
|
+\figref{fig:latency}.
|
|
|
|
|
|
-When system main memory is used, latencies as low as 1.1 \textmu s are
|
|
|
-achieved with Setup 1 for a packet size of 1024 B. The higher latency and the
|
|
|
-dependance on size measured with Setup 2 are caused by the slower PCIe x4 Gen1
|
|
|
-link connecting the FPGA board to the system main memory.
|
|
|
+With Setup 1 and system memory, latencies as low as 1.1 \textmu s can be
|
|
|
+achieved for a packet size of 1024 B. The higher latencies and the dependency on packet size
|
|
|
+measured with Setup 2 are caused by the slower PCIe x4 1.0 link connecting the
|
|
|
+FPGA board to the system main memory.
|
|
|
|
|
|
The same test was performed when transferring data inside GPU memory. As in
|
|
|
-the previous case, the notification was written into systen main memory. This
|
|
|
-approach was used because the latency introduced by OpenCL scheduling in our
|
|
|
-implementation (\~ 100-200 \textmu s) did not allow a precise measurement
|
|
|
-based only on FPGA-GPU communication. When connecting the devices to the same
|
|
|
-RC, as in Setup 2, a latency of 2 \textmu is achieved (limited by the latency
|
|
|
-to system main memory, as seen in \figref{fig:latency}.a). On the contrary, if
|
|
|
-the FPGA board and the GPU are connected to different RC as in Setup 1, the
|
|
|
-latency increases significantly with packet size. It must be noted that the
|
|
|
-low latencies measured with Setup 1 for packet sizes below 1 kB seem to be due
|
|
|
-to a caching mechanism inside the PCIe switch, and it is not clear whether
|
|
|
-data has been successfully written into GPU memory when the notification is
|
|
|
-delivered to the CPU. This effect must be taken into account in future
|
|
|
-implementations as it could potentially lead to data corruption.
|
|
|
+the previous case, the notification was written into main memory. This approach
|
|
|
+was used because a latency of 100 to 200 \textmu s introduced by OpenCL
|
|
|
+scheduling did not allow a precise measurement based only on FPGA-to-GPU
|
|
|
+communication. When connecting the devices to the same RC, as in Setup 2, a
|
|
|
+latency of 2 \textmu s is achieved, limited by the latency to system main
|
|
|
+memory, as seen in \figref{fig:latency} (a). On the contrary, if the FPGA board
|
|
|
+and the GPU are connected to different RCs, as in Setup 1, the latency increases
|
|
|
+significantly with packet size. It must be noted that the low latencies measured
|
|
|
+with Setup 1 for packet sizes below 1 kB seem to be due to a caching mechanism
|
|
|
+inside the PCIe switch, and it is not clear whether data has been successfully
|
|
|
+written into GPU memory when the notification is delivered to the CPU. This
|
|
|
+effect must be taken into account in future implementations as it could
|
|
|
+potentially lead to data corruption.
|
|
|
|
|
|
+
|
|
|
\section{Conclusion and outlook}
|
|
|
|
|
|
We developed a hardware and software solution that enables DMA transfers
|
|
@@ -400,39 +397,37 @@ showed no significant difference in throughput performance. Depending on the
|
|
|
application and computing requirements, this result makes smaller acquisition
|
|
|
systems a cost-effective alternative to larger workstations.
|
|
|
|
|
|
-We measured a round-trip latency of 1 \textmu s when transfering data between
|
|
|
-the DMA engine with system main memory. We also assessed the applicability of
|
|
|
-DirectGMA in low latency applications: preliminary results shows that
|
|
|
-latencies as low as 2 \textmu s can by achieved during data transfers to GPU
|
|
|
-memory. However, at the time of writing this paper, the latency introduced by
|
|
|
-OpenCL scheduling is in the range of hundreds of \textmu s. Optimization of
|
|
|
-the GPU-DMA interfacing OpenCL code is ongoing with the help of technical
|
|
|
-support by AMD, in order to lift the current limitation and enable the use of
|
|
|
-our implementation in low latency applications. Moreover, measurements show
|
|
|
-that dedicated hardware must be employed in low latency applications.
|
|
|
+We measured a round-trip latency of 1 \textmu s when transferring data between
|
|
|
+the DMA engine and system memory. We also assessed the applicability of
|
|
|
+DirectGMA in low-latency applications: preliminary results show that latencies
|
|
|
+as low as 2 \textmu s can be achieved during data transfers to GPU memory.
|
|
|
+However, at the time of writing this paper, the latency introduced by OpenCL
|
|
|
+scheduling is in the range of hundreds of \textmu s. In order to lift this
|
|
|
+limitation and make our implementation useful in low-latency applications, we
|
|
|
+are currently optimizing the GPU-DMA interfacing OpenCL code with the help
|
|
|
+of AMD's technical support. Moreover, measurements show that dedicated
|
|
|
+connecting hardware must be employed in low-latency applications.
|
|
|
|
|
|
In order to increase the total throughput, a custom FPGA evaluation board is
|
|
|
currently under development. The board mounts a Virtex-7 chip and features two
|
|
|
fully populated FMC connectors, a 119 Gb/s DDR memory interface and a PCIe x16
|
|
|
-Gen3 connection. Two x8 Gen3 cores, instantiated on the board, will be mapped
|
|
|
+3.0 connection. Two PCIe x8 3.0 cores, instantiated on the board, will be mapped
|
|
|
as a single x16 device by using an external PCIe switch. With two cores
|
|
|
operating in parallel, we foresee an increase in the data throughput by a
|
|
|
-factor of 2 (as demonstrated in~\cite{rota2015dma}).
|
|
|
-
|
|
|
-The proposed software solution allows seamless multi-GPU processing of
|
|
|
-the incoming data, due to the integration in our streamed computing framework.
|
|
|
-This allows straightforward integration with different DAQ systems and
|
|
|
-introduction of custom data processing algorithms.
|
|
|
-
|
|
|
-Support for NVIDIA's GPUDirect technology is also foreseen in the next months
|
|
|
-to lift the limitation of one specific GPU vendor and compare the performance
|
|
|
-of hardware by different vendors. Further improvements are expected by
|
|
|
-generalizing the transfer mechanism and include Infiniband support besides the
|
|
|
-existing PCIe connection.
|
|
|
-
|
|
|
-Our goal is to develop a unique hybrid solution,
|
|
|
-based on commercial standards, that includes fast data transmission protocols
|
|
|
-and a high performance GPU computing framework.
|
|
|
+factor of two as demonstrated in~\cite{rota2015dma}.
|
|
|
+
|
|
|
+The proposed software solution allows seamless multi-GPU processing of the
|
|
|
+incoming data, due to the integration in our streamed computing framework. This
|
|
|
+allows straightforward integration with different DAQ systems and introduction
|
|
|
+of custom data processing algorithms. Support for NVIDIA's GPUDirect technology
|
|
|
+is also foreseen in the next months to lift the limitation of one specific GPU
|
|
|
+vendor and compare the performance of hardware by different vendors. Further
|
|
|
+improvements are expected from generalizing the transfer mechanism and including
|
|
|
+InfiniBand support alongside the existing PCIe connection.
|
|
|
+
|
|
|
+Our goal is to develop a unique hybrid solution, based on commercial standards,
|
|
|
+that includes fast data transmission protocols and a high performance GPU
|
|
|
+computing framework.
|
|
|
|
|
|
|
|
|
\acknowledgments
|