@@ -112,16 +112,16 @@ links~\cite{nieto2015high}. Their system (as limited by the interconnect)
achieves an average throughput of 870 MB/s with 1 KB block transfers.
In order to achieve the best performance in terms of latency and bandwidth, we
-developed a high-performance DMA engine based on Xilinx's PCIe Gen3 Core.To
+developed a high-performance DMA engine based on Xilinx's PCIe Gen3 Core. To
process the data, we encapsulated the DMA setup and memory mapping in a plugin
for our scalable GPU processing framework~\cite{vogelgesang2012ufo}. This
-framework allows for an easy construction of streamed data processing on
+framework allows for easy construction of streamed data processing on
heterogeneous multi-GPU systems. Because the framework is based on OpenCL,
integration with NVIDIA's CUDA functions for GPUDirect technology is not
possible at the moment. Thus, we used AMD's DirectGMA technology to integrate
-direct FPGA-to-GPU communication into our processing pipeline. In this paper we
-report the performance of our DMA engine for FPGA-to-CPU communication and some
-preliminary measurements about DirectGMA's performance in low-latency
+direct FPGA-to-GPU communication into our processing pipeline. In this paper
+we report the performance of our DMA engine for FPGA-to-CPU communication and
+some preliminary measurements of DirectGMA's performance in low-latency
applications.
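+
+To illustrate how DirectGMA enters an OpenCL pipeline, the following C
+sketch creates a GPU buffer that a PCIe bus master such as our FPGA can
+write to directly, using AMD's \texttt{cl\_amd\_bus\_addressable\_memory}
+extension. This is a minimal sketch rather than the actual plugin code,
+and error handling is omitted:
+\begin{verbatim}
+#include <CL/cl.h>
+#include <CL/cl_ext.h>   /* AMD bus-addressable memory extension */
+
+/* Create a GPU buffer the FPGA can target directly over PCIe and
+   query its bus address. */
+cl_bus_address_amd make_gpu_target(cl_context ctx, cl_command_queue queue,
+                                   size_t size, cl_mem *out)
+{
+    cl_int err;
+    cl_bus_address_amd addr;
+
+    /* Allocate a buffer that the GPU exposes on the PCIe bus. */
+    *out = clCreateBuffer(ctx, CL_MEM_BUS_ADDRESSABLE_AMD,
+                          size, NULL, &err);
+
+    /* Pin the buffer and obtain its bus address; if the symbol is not
+       exported directly, it has to be resolved with
+       clGetExtensionFunctionAddressForPlatform. */
+    clEnqueueMakeBuffersResidentAMD(queue, 1, out, CL_TRUE, &addr,
+                                    0, NULL, NULL);
+
+    /* addr.surface_bus_address: where the DMA engine writes;
+       addr.marker_bus_address: written by the device to signal
+       completed transfers. */
+    return addr;
+}
+\end{verbatim}
+The returned bus address is handed to the FPGA, which then writes into
+GPU memory without a round trip through system memory.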
%% LR: this part -> OK
@@ -254,7 +254,7 @@ Due to hardware restrictions the largest possible GPU buffer sizes are about
mechanism. Because the GPU provides a flat memory address space and our DMA
engine allows multiple destination addresses to be set in advance, we can
-determine all addresses before the actual transfers thus keeping the CPU out
+determine all addresses before the actual transfers, thus keeping the CPU out
-of the transfer loop for data sizes less than 95 MB.
+of the transfer loop for data sizes less than 95 MB.
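+
+The following C sketch illustrates this mechanism; the register names,
+offsets and memory-mapped BAR pointer are hypothetical, and the GPU bus
+addresses are assumed to have been obtained via DirectGMA beforehand:
+\begin{verbatim}
+#include <stdint.h>
+
+#define REG_ADDR_FIFO_LO  0x40   /* hypothetical BAR offsets */
+#define REG_ADDR_FIFO_HI  0x44
+#define REG_DMA_START     0x00
+
+/* Pre-load the destination-address FIFO with GPU bus addresses, then
+   start the engine; the CPU is not involved in the transfers. */
+void preload_addresses(volatile uint32_t *bar, uint64_t gpu_bus_base,
+                       uint64_t block_size, unsigned nblocks)
+{
+    for (unsigned i = 0; i < nblocks; i++) {
+        uint64_t dst = gpu_bus_base + i * block_size;
+        bar[REG_ADDR_FIFO_LO / 4] = (uint32_t)dst;
+        bar[REG_ADDR_FIFO_HI / 4] = (uint32_t)(dst >> 32);
+    }
+    /* From here on the FPGA walks the address list on its own. */
+    bar[REG_DMA_START / 4] = 1;
+}
+\end{verbatim}
+Since all addresses are loaded up front, the CPU is only involved again
+once the complete data set has arrived.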
%% Ufo Framework
To process the data, we encapsulated the DMA setup and memory mapping in a
@@ -318,7 +318,7 @@ in~\cite{rota2015dma}.
\begin{table}[]
\centering
-\caption{Hardware used for throughput and latency measurements}
+\caption{Description of the measurement setup}
\label{table:setups}
\tabcolsep=0.11cm
\begin{tabular}{@{}llll@{}}
@@ -328,8 +328,8 @@ in~\cite{rota2015dma}.
CPU & Intel Xeon E5-1630 & Intel Atom D525 \\
Chipset & Intel C612 & Intel ICH9R Express \\
GPU & AMD FirePro W9100 & AMD FirePro W9100 \\
-PCIe link (FPGA-System memory) & x8 Gen3 & x4 Gen1 \\
-PCIe Link (FPGA-GPU) & x8 Gen3 & x8 Gen3 \\
+PCIe link: FPGA-System memory & x8 Gen3 & x4 Gen1 \\
+PCIe link: FPGA-GPU & x8 Gen3 & x8 Gen3 \\
\bottomrule
\end{tabular}
@@ -338,11 +338,7 @@ PCIe Link (FPGA-GPU) & x8 Gen3 & x8 Gen3 \\
\subsection{Throughput}
-% We repeated the FPGA-to-GPU measurements on a low-end Supermicro X7SPA-HF-D525
-% system based on an Intel Atom CPU. The results showed no significant difference
-% compared to the previous setup. Depending on the application and computing
-% requirements, this result makes smaller acquisition system a cost-effective
-% alternative to larger workstations.
+
\begin{figure}[t]
\includegraphics[width=0.85\textwidth]{figures/throughput}
@@ -358,7 +354,7 @@ The measured results for the pure data throughput is shown in
memory as well as to the global memory as explained in \ref{sec:host}.
% Must ask Suren about this
-In the case of FPGA-to-GPU data transfers, the double buffering solution was
+In the case of FPGA-to-GPU data transfers, the double buffering solution was
used. As one can see, in both cases the write performance is primarily limited
-by the PCIe bus. Up until 2 MB data transfer size, the throughput to the GPU
-is approaching slowly 100 MB/s. From there on, the throughput increases up to
+by the PCIe bus. Up to a transfer size of 2 MB, the throughput to the GPU
+slowly approaches 100 MB/s. From there on, the throughput increases up to
@@ -438,14 +434,20 @@ We developed a hardware and software solution that enables DMA transfers
between FPGA-based readout systems and GPU computing clusters.
The net throughput is primarily limited by the PCIe link, reaching 6.4 GB/s
-for a FPGA-to-GPU data transfer and 6.6 GB/s for a FPGA-to-CPU data transfer.
-
-We evaluated the performance of DirectGMA technology for low latency
-applications. Measurements done with different setups show that dedicated
-hardware is required in order to achieve the best performance. Moreover, it is possible to transfer up to 4kB of Optimization of
-the GPU DMA interfacing code is ongoing with the help of technical support by
-AMD. With a better understanding of the hardware and software aspects of
-DirectGMA, we expect a significant improvement in the latency performance.
+for an FPGA-to-GPU data transfer and 6.6 GB/s for a transfer from the FPGA
+to the CPU's main memory. Measurements on a low-end system based on an
+Intel Atom CPU showed no significant difference in throughput performance.
+Depending on the application and computing requirements, this result makes
+smaller acquisition systems a cost-effective alternative to larger workstations.
+
+We also evaluated the performance of DirectGMA technology for low-latency
+applications. Preliminary results indicate that latencies as low as
+2~\textmu s can be achieved for data transfers into GPU memory. In contrast
+to the throughput results, the latency measurements show that dedicated
+hardware is required to achieve the best performance. Optimization of the
+GPU-DMA interfacing code is ongoing with technical support from AMD. With a
+better understanding of the hardware and software aspects of DirectGMA, we
+expect a significant improvement in latency performance.
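+
+The signaling path under optimization can be sketched as follows. The
+helper is hypothetical; it uses the wait-signal call of the same
+\texttt{cl\_amd\_bus\_addressable\_memory} extension that exposes GPU
+memory on the bus, and the kernel arguments are simplified:
+\begin{verbatim}
+#include <CL/cl.h>
+#include <CL/cl_ext.h>   /* clEnqueueWaitSignalAMD */
+
+/* Hypothetical helper: make the GPU wait for frame n from the FPGA
+   and process it in place, without CPU polling in the data path. */
+void process_frame(cl_command_queue queue, cl_mem buf, cl_kernel kernel,
+                   size_t gws, cl_uint n)
+{
+    /* Block the queue until the FPGA writes the marker value n to
+       the buffer's marker bus address. */
+    clEnqueueWaitSignalAMD(queue, buf, n, 0, NULL, NULL);
+
+    /* Data now resides in GPU global memory; consume it directly. */
+    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
+    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL,
+                           0, NULL, NULL);
+    clFlush(queue);
+}
+\end{verbatim}
+Because the wait is enqueued on the GPU, the CPU never enters the
+latency-critical path.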
In order to increase the total throughput, a custom FPGA evaluation board is
currently under development. The board mounts a Virtex-7 chip and features two
|