
Fixed conclusion

lorenzo 8 years ago
parent commit 4ca6c256dd
2 changed files with 51 additions and 42 deletions
  1. literature.bib  (+6, -2)
  2. paper.tex  (+45, -40)

literature.bib  (+6, -2)

@@ -1,11 +1,15 @@
+@misc{cuda_doc,
+    title = {GPUDirect - CUDA Toolkit documentation},
+    url = {http://docs.nvidia.com/cuda/gpudirect-rdma/#axzz3rU8P2Jwg},
+}
+
 @CONFERENCE{caselle,
 author={Caselle, M. and al.},
-title={Commissioning of an ultra-fast data acquisition system for coherent synchrotron radiation detection},
+title={Commissioning of an ultra fast data acquisition system for coherent synchrotron radiation detection},
 journal={IPAC 2014: Proceedings of the 5th International Particle Accelerator Conference},
 year={2014},
 pages={3497-3499},
 url={http://www.scopus.com/inward/record.url?eid=2-s2.0-84928346423&partnerID=40&md5=5775c1bc623215f734e33d5e5f9b4a9a},
-document_type={Conference Paper},
 }
 
 @ARTICLE{ufo_camera, 

paper.tex  (+45, -40)

@@ -45,12 +45,12 @@ the Xilinx PCI-Express core,   a Linux driver for register access, and high-
 level software to manage direct   memory transfers using AMD's DirectGMA
 technology. Measurements with a Gen3\,x8 link show a throughput of 6.4~GB/s
 for transfers to GPU memory and 6.6~GB/s to system memory.  We also assesed
-the possibility of using DirectGMA in low latency systems: preliminary
-measurements show a latency as low as 1 \textmu s for data transfers to GPU
-memory. The additional latency introduced by OpenCL scheduling is the current
-performance bottleneck.  Our implementation is suitable for real- time DAQ
-system applications ranging from photon science and medical imaging to High
-Energy Physics (HEP) systems.}
+the possibility of using our architecture in low latency systems: preliminary
+measurements show a round-trip latency as low as 1 \textmu s for data
+transfers to system memory, while the additional latency introduced by OpenCL
+scheduling is the current limitation for GPU-based systems.  Our
+implementation is suitable for real-time DAQ system applications ranging from
+photon science and medical imaging to High Energy Physics (HEP) systems.}
 
 \keywords{FPGA; GPU; PCI-Express; OpenCL; DirectGMA}
 
@@ -280,13 +280,15 @@ Python.
 \section{Results}
 
 We carried out performance measurements on two different setups, which are
-described in table~\ref{table:setups}. A Xilinx VC709 evaluation board was
-used in both setups. In Setup 1, the FPGA and the GPU were plugged into a PCIe
-3.0 slot, but the two devices were connected to different PCIe Root Complexes
+described in table~\ref{table:setups}. In both setups, a Xilinx VC709
+evaluation board was used. In Setup 1, the FPGA board and the GPU were plugged
+into PCIe 3.0 slots, but they were connected to different PCIe Root Complexes
 (RC). In Setup 2, a low-end Supermicro X7SPA-HF-D525 system was connected to a
-Netstor NA255A external PCIe enclosure, where both the FPGA board and the GPU
-were connected to the same RC, as opposed to Setup 1. In case of FPGA-to-CPU
-data transfers, the software implementation is the one described
+Netstor NA255A external PCIe enclosure, where both the FPGA board and the GPU
+were connected to the same RC, as opposed to Setup 1. As stated in NVIDIA's
+GPUDirect documentation, the devices must share the same RC to achieve the
+best performance~\cite{cuda_doc}.  In the case of FPGA-to-CPU data transfers,
+the software implementation is the one described
 in~\cite{rota2015dma}.
 
 \begin{table}[]
@@ -320,21 +322,22 @@ PCIe slot: FPGA \& GPU    & x8 Gen3 (different RC) & x8 Gen3 (same RC)    \\
 \label{fig:throughput}
 \end{figure}
 
-The measured results for the pure data throughput is shown in
-\figref{fig:throughput} for transfers to the system's main
-memory as well as to the global memory. In the
-case of FPGA-to-GPU data transfers, the double buffering solution was used.
-As one can see, in both cases the write performance is primarily limited by
-the PCIe bus. Up until 2 MB data transfer size, the throughput to the GPU is
-approaching slowly 100 MB/s. From there on, the throughput increases up to 6.4
-GB/s when PCIe bus saturation sets in at about 1 GB data size. The CPU
-throughput saturates earlier but the maximum throughput is 6.6 GB/s.
-The different slope and saturation point is a direct consequence of the different handshaking implementation. 
-
+In order to evaluate the performance of the DMA engine, measurements of pure
+data throughput were carried out using Setup 1. The results are shown in
+\figref{fig:throughput} for transfers to the system's main memory as well as
+to the GPU's global memory. For FPGA-to-GPU data transfers larger than 95 MB,
+the double buffering mechanism was used. As one can see, in both cases the
+write performance is primarily limited by the PCIe bus. Up to a data transfer
+size of 2 MB, the throughput to the GPU slowly approaches 100 MB/s. From there
+on, the throughput increases up to 6.4 GB/s at about 1 GB data size. The CPU
+throughput saturates earlier and the maximum throughput is 6.6 GB/s. The slope
+and maximum performance depend on the different implementations of the
+handshaking sequence between the DMA engine and the hosts.
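
Editor's note: the double-buffering scheme referenced in this paragraph is not spelled out in the diff. As a rough host-side sketch only, in plain C with standard OpenCL calls, the ping-pong logic could look like the code below; the chunk size and the start_fpga_dma/wait_fpga_dma helpers are hypothetical placeholders, not the paper's actual driver interface.

    /* Hypothetical host-side double-buffering ("ping-pong") loop: while the
     * FPGA DMA engine fills one buffer, the GPU processes the other one.
     * start_fpga_dma()/wait_fpga_dma() are placeholders for the driver calls
     * that trigger and await the FPGA transfer; they are not from the paper. */
    #include <CL/cl.h>
    #include <stddef.h>

    #define CHUNK_SIZE (64 * 1024 * 1024)       /* assumed chunk size in bytes */

    extern void start_fpga_dma(cl_mem dst);     /* placeholder: kick off FPGA DMA */
    extern void wait_fpga_dma(void);            /* placeholder: block until done  */

    void stream_chunks(cl_context ctx, cl_command_queue queue, cl_kernel kernel,
                       size_t n_chunks)
    {
        cl_int err;
        cl_mem buf[2];
        buf[0] = clCreateBuffer(ctx, CL_MEM_READ_WRITE, CHUNK_SIZE, NULL, &err);
        buf[1] = clCreateBuffer(ctx, CL_MEM_READ_WRITE, CHUNK_SIZE, NULL, &err);

        start_fpga_dma(buf[0]);                 /* prime the first buffer */

        for (size_t i = 0; i < n_chunks; i++) {
            size_t cur = i % 2;

            wait_fpga_dma();                    /* buffer 'cur' is now full */
            if (i + 1 < n_chunks)
                start_fpga_dma(buf[1 - cur]);   /* refill the other buffer  */

            /* process the filled buffer on the GPU */
            size_t global_size = CHUNK_SIZE / sizeof(float);
            clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf[cur]);
            clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                                   0, NULL, NULL);
            clFinish(queue);
        }

        clReleaseMemObject(buf[0]);
        clReleaseMemObject(buf[1]);
    }

The point of the scheme is that the FPGA fills one buffer while the GPU consumes the other, so the PCIe link stays busy during large transfers.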
 
 %% --------------------------------------------------------------------------
 \subsection{Latency}
 
+
 \begin{figure}[t]
   \centering
   \begin{subfigure}[b]{.49\textwidth}
@@ -361,11 +364,12 @@ A counter on the FPGA measures the time interval between the \emph{start\_dma}
 and \emph{stop\_dma} commands with a resolution of 4 ns, therefore measuring
 the round-trip latency of the system.
 
-The results of 1000 measurements of the round-trip latency are shown in
-\figref{fig:latency-hist}. The latency has a mean value of XXX \textmu s and a
-standard variation of XXX \textmu s. The non-Gaussian distribution indicates a
-systemic influence that we cannot control and is most likely caused by the
-non-deterministic run-time behaviour of the operating system scheduler.
+The results of 1000 measurements of the round-trip latency using system memory
+are shown in \figref{fig:latency-hist}. The latencies for Setup 1 and Setup 2
+are, respectively, XXX \textmu s and XXX \textmu s. The non-Gaussian
+distribution indicates a systemic influence that we cannot control and is most
+likely caused by the non-deterministic run-time behaviour of the operating
+system scheduler.
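
Editor's note: as a small aside on how such figures follow from the raw measurement, the counter values can be converted to microseconds and summarised as below (plain C; the 4 ns tick and the 1000 repetitions come from the text, while the ticks array is assumed to be already read back from the board).

    /* Convert raw FPGA counter readings (4 ns per tick, as stated above) into
     * round-trip latencies in microseconds and report mean and standard
     * deviation over the 1000 repetitions.  Reading the counter register from
     * the board is outside the scope of this sketch. */
    #include <math.h>
    #include <stdio.h>

    #define N_MEAS  1000            /* number of latency measurements   */
    #define TICK_NS 4.0             /* counter resolution in nanoseconds */

    void latency_stats(const unsigned int ticks[N_MEAS])
    {
        double sum = 0.0, sum_sq = 0.0;

        for (int i = 0; i < N_MEAS; i++) {
            double us = ticks[i] * TICK_NS / 1000.0;    /* ns -> us */
            sum    += us;
            sum_sq += us * us;
        }

        double mean = sum / N_MEAS;
        double std  = sqrt(sum_sq / N_MEAS - mean * mean);

        printf("round-trip latency: mean %.3f us, std %.3f us\n", mean, std);
    }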
 
 
 \section{Conclusion and outlook}
@@ -379,17 +383,18 @@ data transfer. The measurements on a low-end system based on an Intel Atom CPU
 showed no significant difference in throughput performance. Depending on the
 application and computing requirements, this result makes smaller acquisition
 system a cost-effective alternative to larger workstations.
- 
-We also evaluated the performance of DirectGMA technology for low latency
-applications. Preliminary results indicate that latencies as low as 2 \textmu
-s can be achieved in data transfer to system main memory. However, at the time
-of writing this paper, the latency introduced by DirectGMA OpenCL functions is
-in the range of hundreds of \textmu s.  Optimization of the GPU-DMA
-interfacing code is ongoing with the help of technical support by AMD, in
-order to lift the limitation introduced by OpenCL scheduling and make our
-implementation suitable for low latency applications. Moreover, as opposed to
-the previous case, measurements show that for low latency applications
-dedicated hardware must be used in order to achieve the best performance.
+
+We measured a round-trip latency of 1 \textmu s when transferring data between
+the DMA engine and system main memory. We also assessed the applicability of
+DirectGMA in low latency applications: preliminary results show that latencies
+as low as 2 \textmu s can be achieved when writing data into GPU memory.
+However, at the time of writing this paper, the latency introduced by OpenCL
+scheduling is in the range of hundreds of \textmu s. Therefore, optimization
+of the GPU-DMA interfacing OpenCL code is ongoing with the help of technical
+support from AMD, in order to lift the current limitation and enable the use
+of our implementation in low latency applications. Moreover, measurements show
+that for low latency applications dedicated hardware must be used in order to
+achieve the best performance.
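
Editor's note: the OpenCL scheduling overhead discussed here can be quantified with the standard event-profiling API. A minimal sketch follows (plain C; illustrative only, not the authors' measurement code, and it assumes the command queue was created with CL_QUEUE_PROFILING_ENABLE).

    /* Estimate OpenCL scheduling overhead by comparing when a command was
     * queued with when it actually started executing on the device.
     * The buffer write is just a convenient example command. */
    #include <CL/cl.h>
    #include <stdio.h>

    void print_scheduling_overhead(cl_command_queue queue, cl_mem buf,
                                   const void *host_ptr, size_t size)
    {
        cl_event ev;
        cl_ulong queued, start;

        clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, size, host_ptr,
                             0, NULL, &ev);

        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_QUEUED,
                                sizeof(queued), &queued, NULL);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                                sizeof(start), &start, NULL);

        /* profiling timestamps are reported in nanoseconds */
        printf("scheduling overhead: %.1f us\n", (start - queued) / 1000.0);

        clReleaseEvent(ev);
    }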
 
 In order to increase the total throughput, a custom FPGA evaluation board is
 currently under development. The board mounts a Virtex-7 chip and features two
@@ -399,7 +404,7 @@ as a single x16 device by using an external PCIe switch. With two cores
 operating in parallel, we foresee an increase in the data throughput by a
 factor of 2 (as demonstrated in~\cite{rota2015dma}).
 
-The software solution that we proposed allows seamless multi-GPU processing of
+The proposed software solution allows seamless multi-GPU processing of
 the incoming data, due to the integration in our streamed computing framework.
 This allows straightforward integration with different DAQ systems and
 introduction of custom data processing algorithms.