@@ -112,16 +112,16 @@ links~\cite{nieto2015high}. Their system (as limited by the interconnect)
achieves an average throughput of 870 MB/s with 1 KB block transfers.
In order to achieve the best performance in terms of latency and bandwidth, we
-developed a high-performance DMA engine based on Xilinx's PCIe Gen3 Core.To
+developed a high-performance DMA engine based on Xilinx's PCIe Gen3 Core. To
process the data, we encapsulated the DMA setup and memory mapping in a plugin
for our scalable GPU processing framework~\cite{vogelgesang2012ufo}. This
-framework allows for an easy construction of streamed data processing on
+framework allows for easy construction of streamed data processing on
heterogeneous multi-GPU systems. Because the framework is based on OpenCL,
integration with NVIDIA's CUDA functions for GPUDirect technology is not
possible at the moment. Thus, we used AMD's DirectGMA technology to integrate
-direct FPGA-to-GPU communication into our processing pipeline. In this paper we
-report the performance of our DMA engine for FPGA-to-CPU communication and some
-preliminary measurements about DirectGMA's performance in low-latency
+direct FPGA-to-GPU communication into our processing pipeline. In this paper
+we report the performance of our DMA engine for FPGA-to-CPU communication and
+some preliminary measurements of DirectGMA's performance in low-latency
applications.
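+
+To illustrate how DirectGMA enters an OpenCL pipeline, the following C
+sketch creates a GPU buffer that a PCIe bus master such as our FPGA can
+write to directly, using AMD's \texttt{cl\_amd\_bus\_addressable\_memory}
+extension. This is a minimal sketch rather than the actual plugin code,
+and error handling is omitted:
+\begin{verbatim}
+#include <CL/cl.h>
+#include <CL/cl_ext.h>   /* AMD bus-addressable memory extension */
+
+/* Create a GPU buffer the FPGA can target directly over PCIe and
+   query its bus address. */
+cl_bus_address_amd make_gpu_target(cl_context ctx, cl_command_queue queue,
+                                   size_t size, cl_mem *out)
+{
+    cl_int err;
+    cl_bus_address_amd addr;
+
+    /* Allocate a buffer that the GPU exposes on the PCIe bus. */
+    *out = clCreateBuffer(ctx, CL_MEM_BUS_ADDRESSABLE_AMD,
+                          size, NULL, &err);
+
+    /* Pin the buffer and obtain its bus address; if the symbol is not
+       exported directly, it has to be resolved with
+       clGetExtensionFunctionAddressForPlatform. */
+    clEnqueueMakeBuffersResidentAMD(queue, 1, out, CL_TRUE, &addr,
+                                    0, NULL, NULL);
+
+    /* addr.surface_bus_address: where the DMA engine writes;
+       addr.marker_bus_address: written by the device to signal
+       completed transfers. */
+    return addr;
+}
+\end{verbatim}
+The returned bus address is handed to the FPGA, which then writes into
+GPU memory without a round trip through system memory.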
%% LR: this part -> OK
@@ -254,7 +254,7 @@ Due to hardware restrictions the largest possible GPU buffer sizes are about
mechanism. Because the GPU provides a flat memory address space and our DMA
engine allows multiple destination addresses to be set in advance, we can
-determine all addresses before the actual transfers thus keeping the CPU out
+determine all addresses before the actual transfers, thus keeping the CPU out
-of the transfer loop for data sizes less than 95 MB.
+of the transfer loop for data sizes less than 95 MB.
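+
+The following C sketch illustrates this mechanism; the register names,
+offsets and memory-mapped BAR pointer are hypothetical, and the GPU bus
+addresses are assumed to have been obtained via DirectGMA beforehand:
+\begin{verbatim}
+#include <stdint.h>
+
+#define REG_ADDR_FIFO_LO  0x40   /* hypothetical BAR offsets */
+#define REG_ADDR_FIFO_HI  0x44
+#define REG_DMA_START     0x00
+
+/* Pre-load the destination-address FIFO with GPU bus addresses, then
+   start the engine; the CPU is not involved in the transfers. */
+void preload_addresses(volatile uint32_t *bar, uint64_t gpu_bus_base,
+                       uint64_t block_size, unsigned nblocks)
+{
+    for (unsigned i = 0; i < nblocks; i++) {
+        uint64_t dst = gpu_bus_base + i * block_size;
+        bar[REG_ADDR_FIFO_LO / 4] = (uint32_t)dst;
+        bar[REG_ADDR_FIFO_HI / 4] = (uint32_t)(dst >> 32);
+    }
+    /* From here on the FPGA walks the address list on its own. */
+    bar[REG_DMA_START / 4] = 1;
+}
+\end{verbatim}
+Since all addresses are loaded up front, the CPU is only involved again
+once the complete data set has arrived.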
%% Ufo Framework
To process the data, we encapsulated the DMA setup and memory mapping in a
@@ -318,7 +318,7 @@ in~\cite{rota2015dma}.
\begin{table}[]
\centering
-\caption{Hardware used for throughput and latency measurements}
+\caption{Description of the measurement setup}
\label{table:setups}
\tabcolsep=0.11cm
\begin{tabular}{@{}llll@{}}
@@ -328,8 +328,8 @@ in~\cite{rota2015dma}.
CPU & Intel Xeon E5-1630 & Intel Atom D525 \\
Chipset & Intel C612 & Intel ICH9R Express \\
GPU & AMD FirePro W9100 & AMD FirePro W9100 \\
-PCIe link (FPGA-System memory) & x8 Gen3 & x4 Gen1 \\
-PCIe Link (FPGA-GPU) & x8 Gen3 & x8 Gen3 \\
+PCIe link: FPGA-System memory & x8 Gen3 & x4 Gen1 \\
+PCIe link: FPGA-GPU & x8 Gen3 & x8 Gen3 \\
\bottomrule
\end{tabular}
@@ -338,11 +338,7 @@ PCIe Link (FPGA-GPU) & x8 Gen3 & x8 Gen3 \\
\subsection{Throughput}
-% We repeated the FPGA-to-GPU measurements on a low-end Supermicro X7SPA-HF-D525
-% system based on an Intel Atom CPU. The results showed no significant difference
-% compared to the previous setup. Depending on the application and computing
-% requirements, this result makes smaller acquisition system a cost-effective
-% alternative to larger workstations.
+
\begin{figure}[t]
\includegraphics[width=0.85\textwidth]{figures/throughput}
@@ -358,7 +354,7 @@ The measured results for the pure data throughput is shown in
memory as well as to the global memory as explained in \ref{sec:host}.
% Must ask Suren about this
-In the case of FPGA-to-GPU data transfers, the double buffering solution was
+In the case of FPGA-to-GPU data transfers, the double buffering solution was
used. As one can see, in both cases the write performance is primarily limited
-by the PCIe bus. Up until 2 MB data transfer size, the throughput to the GPU
-is approaching slowly 100 MB/s. From there on, the throughput increases up to
+by the PCIe bus. Up to a transfer size of 2 MB, the throughput to the GPU
+slowly approaches 100 MB/s. From there on, the throughput increases up to
@@ -438,14 +434,20 @@ We developed a hardware and software solution that enables DMA transfers
between FPGA-based readout systems and GPU computing clusters.
The net throughput is primarily limited by the PCIe link, reaching 6.4 GB/s
-for a FPGA-to-GPU data transfer and 6.6 GB/s for a FPGA-to-CPU data transfer.
-
-We evaluated the performance of DirectGMA technology for low latency
-applications. Measurements done with different setups show that dedicated
-hardware is required in order to achieve the best performance. Moreover, it is possible to transfer up to 4kB of Optimization of
-the GPU DMA interfacing code is ongoing with the help of technical support by
-AMD. With a better understanding of the hardware and software aspects of
-DirectGMA, we expect a significant improvement in the latency performance.
+for an FPGA-to-GPU data transfer and 6.6 GB/s for a transfer from the FPGA
+to the CPU's main memory. Measurements on a low-end system based on an
+Intel Atom CPU showed no significant difference in throughput performance.
+Depending on the application and computing requirements, this result makes
+smaller acquisition systems a cost-effective alternative to larger workstations.
+
+We also evaluated the performance of DirectGMA technology for low-latency
+applications. Preliminary results indicate that latencies as low as
+2~\textmu s can be achieved for data transfers into GPU memory. In contrast
+to the throughput results, the latency measurements show that dedicated
+hardware is required to achieve the best performance. Optimization of the
+GPU-DMA interfacing code is ongoing with technical support from AMD. With a
+better understanding of the hardware and software aspects of DirectGMA, we
+expect a significant improvement in latency performance.
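+
+The signaling path under optimization can be sketched as follows. The
+helper is hypothetical; it uses the wait-signal call of the same
+\texttt{cl\_amd\_bus\_addressable\_memory} extension that exposes GPU
+memory on the bus, and the kernel arguments are simplified:
+\begin{verbatim}
+#include <CL/cl.h>
+#include <CL/cl_ext.h>   /* clEnqueueWaitSignalAMD */
+
+/* Hypothetical helper: make the GPU wait for frame n from the FPGA
+   and process it in place, without CPU polling in the data path. */
+void process_frame(cl_command_queue queue, cl_mem buf, cl_kernel kernel,
+                   size_t gws, cl_uint n)
+{
+    /* Block the queue until the FPGA writes the marker value n to
+       the buffer's marker bus address. */
+    clEnqueueWaitSignalAMD(queue, buf, n, 0, NULL, NULL);
+
+    /* Data now resides in GPU global memory; consume it directly. */
+    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
+    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL,
+                           0, NULL, NULL);
+    clFlush(queue);
+}
+\end{verbatim}
+Because the wait is enqueued on the GPU, the CPU never enters the
+latency-critical path.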
In order to increase the total throughput, a custom FPGA evaluation board is
currently under development. The board mounts a Virtex-7 chip and features two
|