
Fixed conclusion

lorenzo 8 years ago
parent
commit
4ca6c256dd
2 changed files with 51 additions and 42 deletions
  1. literature.bib (+6 -2)
  2. paper.tex (+45 -40)

+ 6 - 2
literature.bib

@@ -1,11 +1,15 @@
+@misc{cuda_doc,
+    title = {GPUDirect - CUDA Toolkit documentation},
+    url = {http://docs.nvidia.com/cuda/gpudirect-rdma/#axzz3rU8P2Jwg},
+}
+
 @CONFERENCE{caselle,
 author={Caselle, M. and al.},
-title={Commissioning of an ultra-fast data acquisition system for coherent synchrotron radiation detection},
+title={Commissioning of an ultra fast data acquisition system for coherent synchrotron radiation detection},
 journal={IPAC 2014: Proceedings of the 5th International Particle Accelerator Conference},
 year={2014},
 pages={3497-3499},
 url={http://www.scopus.com/inward/record.url?eid=2-s2.0-84928346423&partnerID=40&md5=5775c1bc623215f734e33d5e5f9b4a9a},
-document_type={Conference Paper},
 }
 
 @ARTICLE{ufo_camera, 

+ 45 - 40
paper.tex

@@ -45,12 +45,12 @@ the Xilinx PCI-Express core,   a Linux driver for register access, and high-
 level software to manage direct   memory transfers using AMD's DirectGMA
 technology. Measurements with a Gen3\,x8 link show a throughput of 6.4~GB/s
 for transfers to GPU memory and 6.6~GB/s to system memory.  We also assessed
-the possibility of using DirectGMA in low latency systems: preliminary
-measurements show a latency as low as 1 \textmu s for data transfers to GPU
-memory. The additional latency introduced by OpenCL scheduling is the current
-performance bottleneck.  Our implementation is suitable for real- time DAQ
-system applications ranging from photon science and medical imaging to High
-Energy Physics (HEP) systems.}
+the possibility of using our architecture in low latency systems: preliminary
+measurements show a round-trip latency as low as 1 \textmu s for data
+transfers to system memory, while the additional latency introduced by OpenCL
+scheduling is the current limitation for GPU-based systems. Our
+implementation is suitable for real-time DAQ system applications ranging from
+photon science and medical imaging to High Energy Physics (HEP) systems.}
 
 \keywords{FPGA; GPU; PCI-Express; OpenCL; DirectGMA}
 
@@ -280,13 +280,15 @@ Python.
 \section{Results}
 
 We carried out performance measurements on two different setups, which are
-described in table~\ref{table:setups}. A Xilinx VC709 evaluation board was
-used in both setups. In Setup 1, the FPGA and the GPU were plugged into a PCIe
-3.0 slot, but the two devices were connected to different PCIe Root Complexes
+described in table~\ref{table:setups}. In both setups, a Xilinx VC709
+evaluation board was used. In Setup 1, the FPGA board and the GPU were each
+plugged into a PCIe 3.0 slot, but connected to different PCIe Root Complexes
 (RC). In Setup 2, a low-end Supermicro X7SPA-HF-D525 system was connected to a
-Netstor NA255A external PCIe enclosure, where both the FPGA board and the GPU
-were connected to the same RC, as opposed to Setup 1. In case of FPGA-to-CPU
-data transfers, the software implementation is the one described
+Netstor NA255A external PCIe enclosure, where both the FPGA board and the GPU
+were connected to the same RC, as opposed to Setup 1. As stated in NVIDIA's
+GPUDirect documentation, the devices must share the same RC to achieve the
+best performance~\cite{cuda_doc}. In the case of FPGA-to-CPU data
+transfers, the software implementation is the one described
 in~\cite{rota2015dma}.
 
 \begin{table}[]
@@ -320,21 +322,22 @@ PCIe slot: FPGA \& GPU    & x8 Gen3 (different RC) & x8 Gen3 (same RC)    \\
 \label{fig:throughput}
 \end{figure}
 
-The measured results for the pure data throughput is shown in
-\figref{fig:throughput} for transfers to the system's main
-memory as well as to the global memory. In the
-case of FPGA-to-GPU data transfers, the double buffering solution was used.
-As one can see, in both cases the write performance is primarily limited by
-the PCIe bus. Up until 2 MB data transfer size, the throughput to the GPU is
-approaching slowly 100 MB/s. From there on, the throughput increases up to 6.4
-GB/s when PCIe bus saturation sets in at about 1 GB data size. The CPU
-throughput saturates earlier but the maximum throughput is 6.6 GB/s.
-The different slope and saturation point is a direct consequence of the different handshaking implementation. 
-
+In order to evaluate the performance of the DMA engine, measurements of pure
+data throughput were carried out using Setup 1. The results are shown in
+\figref{fig:throughput} for transfers to the system's main memory as well as
+to the GPU's global memory. For FPGA-to-GPU data transfers larger than 95 MB,
+the double buffering mechanism was used. As one can see, in both cases the
+write performance is primarily limited by the PCIe bus. Up to a transfer size
+of 2 MB, the throughput to the GPU slowly approaches 100 MB/s. From there on,
+the throughput increases up to 6.4 GB/s at a data size of about 1 GB. The
+throughput to system memory saturates earlier, with a maximum of 6.6 GB/s.
+The slope and the maximum performance depend on the implementation of the
+handshaking sequence between the DMA engine and each host.
 
 %% --------------------------------------------------------------------------
 \subsection{Latency}
 
+
 \begin{figure}[t]
   \centering
   \begin{subfigure}[b]{.49\textwidth}
@@ -361,11 +364,12 @@ A counter on the FPGA measures the time interval between the \emph{start\_dma}
 and \emph{stop\_dma} commands with a resolution of 4 ns, therefore measuring
 the round-trip latency of the system.
 
-The results of 1000 measurements of the round-trip latency are shown in
-\figref{fig:latency-hist}. The latency has a mean value of XXX \textmu s and a
-standard variation of XXX \textmu s. The non-Gaussian distribution indicates a
-systemic influence that we cannot control and is most likely caused by the
-non-deterministic run-time behaviour of the operating system scheduler.
+The results of 1000 measurements of the round-trip latency using system memory
+are shown in \figref{fig:latency-hist}. The latencies for Setup 1 and Setup 2
+are, respectively, XXX \textmu s and XXX \textmu s. The non-Gaussian
+distribution indicates a systematic influence that we cannot control, most
+likely caused by the non-deterministic run-time behaviour of the operating
+system scheduler.
 
 
 \section{Conclusion and outlook}
@@ -379,17 +383,18 @@ data transfer. The measurements on a low-end system based on an Intel Atom CPU
 showed no significant difference in throughput performance. Depending on the
 application and computing requirements, this result makes smaller acquisition
 system a cost-effective alternative to larger workstations.
- 
-We also evaluated the performance of DirectGMA technology for low latency
-applications. Preliminary results indicate that latencies as low as 2 \textmu
-s can be achieved in data transfer to system main memory. However, at the time
-of writing this paper, the latency introduced by DirectGMA OpenCL functions is
-in the range of hundreds of \textmu s.  Optimization of the GPU-DMA
-interfacing code is ongoing with the help of technical support by AMD, in
-order to lift the limitation introduced by OpenCL scheduling and make our
-implementation suitable for low latency applications. Moreover, as opposed to
-the previous case, measurements show that for low latency applications
-dedicated hardware must be used in order to achieve the best performance.
+
+We measured a round-trip latency of 1 \textmu s when transferring data
+between the DMA engine and the system main memory. We also assessed the
+applicability of DirectGMA in low latency applications: preliminary results
+show that latencies as low as 2 \textmu s can be achieved when writing data
+into GPU memory. However, at the time of writing this paper, the latency
+introduced by OpenCL scheduling is in the range of hundreds of \textmu s.
+Therefore, optimization of the GPU-DMA interfacing OpenCL code is ongoing
+with the help of AMD's technical support, in order to lift the current
+limitation and enable the use of our implementation in low latency
+applications. Moreover, our measurements show that dedicated hardware must be
+used to achieve the best performance in low latency applications.
 
 In order to increase the total throughput, a custom FPGA evaluation board is
 currently under development. The board mounts a Virtex-7 chip and features two
@@ -399,7 +404,7 @@ as a single x16 device by using an external PCIe switch. With two cores
 operating in parallel, we foresee an increase in the data throughput by a
 factor of 2 (as demonstrated in~\cite{rota2015dma}).
 
-The software solution that we proposed allows seamless multi-GPU processing of
+The proposed software solution allows seamless multi-GPU processing of
 the incoming data, due to the integration in our streamed computing framework.
 This allows straightforward integration with different DAQ systems and
 introduction of custom data processing algorithms.
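For illustration, the double-buffering mechanism mentioned in the Results hunk above (used for FPGA-to-GPU transfers larger than 95 MB) can be sketched on the OpenCL host side as follows. This is a minimal sketch under stated assumptions, not the implementation from the paper: the chunk size, the two command queues and the processing kernel are placeholders, and the actual data path fills the buffers from the FPGA via DirectGMA rather than clEnqueueWriteBuffer.

/* Minimal double-buffering sketch (illustrative only, error checks omitted).
 * Two buffers alternate: while the kernel processes one chunk, the next
 * chunk is copied into the other buffer. "copyq", "execq", "kernel" and
 * CHUNK are placeholder assumptions. */
#include <CL/cl.h>
#include <stddef.h>

#define CHUNK ((size_t)(32 << 20))   /* 32 MB per buffer (arbitrary) */

void stream_chunks(cl_context ctx, cl_command_queue copyq,
                   cl_command_queue execq, cl_kernel kernel,
                   const char *src, size_t total)
{
    cl_int err;
    cl_mem buf[2];
    cl_event filled[2] = {NULL, NULL};   /* copy into buffer i done      */
    cl_event consumed[2] = {NULL, NULL}; /* kernel reading buffer i done */

    for (int i = 0; i < 2; ++i)
        buf[i] = clCreateBuffer(ctx, CL_MEM_READ_ONLY, CHUNK, NULL, &err);
    (void)err;  /* error handling omitted for brevity */

    int cur = 0;
    for (size_t off = 0; off < total; off += CHUNK) {
        size_t n = (total - off < CHUNK) ? (total - off) : CHUNK;

        /* Refill buf[cur] only once the kernel that last read it is done. */
        clEnqueueWriteBuffer(copyq, buf[cur], CL_FALSE, 0, n, src + off,
                             consumed[cur] ? 1 : 0,
                             consumed[cur] ? &consumed[cur] : NULL,
                             &filled[cur]);
        if (consumed[cur])
            clReleaseEvent(consumed[cur]);

        /* Process the chunk on a separate queue; it waits only for this
         * buffer's copy, so the next copy overlaps with the kernel. */
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf[cur]);
        clEnqueueNDRangeKernel(execq, kernel, 1, NULL, &n, NULL,
                               1, &filled[cur], &consumed[cur]);
        clReleaseEvent(filled[cur]);
        filled[cur] = NULL;

        cur = 1 - cur;    /* switch to the other buffer */
    }

    clFinish(copyq);
    clFinish(execq);
    for (int i = 0; i < 2; ++i) {
        if (consumed[i]) clReleaseEvent(consumed[i]);
        clReleaseMemObject(buf[i]);
    }
}

Two queues are assumed because a single in-order queue would serialize the copy and the kernel, which would defeat the purpose of the second buffer.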