
Fixed conclusion

lorenzo 8 years ago
parent
commit
4ca6c256dd
2 changed files with 51 additions and 42 deletions
  1. literature.bib (+6 -2)
  2. paper.tex (+45 -40)

+ 6 - 2
literature.bib

@@ -1,11 +1,15 @@
+@misc{cuda_doc,
+    title = {GPUDirect - CUDA Toolkit documentation},
+    url = {http://docs.nvidia.com/cuda/gpudirect-rdma/#axzz3rU8P2Jwg},
+}
+
 @CONFERENCE{caselle,
 author={Caselle, M. and al.},
-title={Commissioning of an ultra-fast data acquisition system for coherent synchrotron radiation detection},
+title={Commissioning of an ultra fast data acquisition system for coherent synchrotron radiation detection},
 journal={IPAC 2014: Proceedings of the 5th International Particle Accelerator Conference},
 year={2014},
 pages={3497-3499},
 url={http://www.scopus.com/inward/record.url?eid=2-s2.0-84928346423&partnerID=40&md5=5775c1bc623215f734e33d5e5f9b4a9a},
-document_type={Conference Paper},
 }
 
 @ARTICLE{ufo_camera, 

+ 45 - 40
paper.tex

@@ -45,12 +45,12 @@ the Xilinx PCI-Express core,   a Linux driver for register access, and high-
 level software to manage direct   memory transfers using AMD's DirectGMA
 technology. Measurements with a Gen3\,x8 link show a throughput of 6.4~GB/s
 for transfers to GPU memory and 6.6~GB/s to system memory.  We also assessed
-the possibility of using DirectGMA in low latency systems: preliminary
-measurements show a latency as low as 1 \textmu s for data transfers to GPU
-memory. The additional latency introduced by OpenCL scheduling is the current
-performance bottleneck.  Our implementation is suitable for real- time DAQ
-system applications ranging from photon science and medical imaging to High
-Energy Physics (HEP) systems.}
+the possibility of using our architecture in low latency systems: preliminary
+measurements show a round-trip latency as low as 1 \textmu s for data
+transfers to system memory, while the additional latency introduced by OpenCL
+scheduling is the current limitation for GPU-based systems. Our
+implementation is suitable for real-time DAQ system applications ranging from
+photon science and medical imaging to High Energy Physics (HEP) systems.}
 
 \keywords{FPGA; GPU; PCI-Express; OpenCL; DirectGMA}
 
@@ -280,13 +280,15 @@ Python.
 \section{Results}
 
 We carried out performance measurements on two different setups, which are
-described in table~\ref{table:setups}. A Xilinx VC709 evaluation board was
-used in both setups. In Setup 1, the FPGA and the GPU were plugged into a PCIe
-3.0 slot, but the two devices were connected to different PCIe Root Complexes
+described in table~\ref{table:setups}. In both setups, a Xilinx VC709
+evaluation board was used. In Setup 1, the FPGA board and the GPU were each
+plugged into a PCIe 3.0 slot, but connected to different PCIe Root Complexes
 (RC). In Setup 2, a low-end Supermicro X7SPA-HF-D525 system was connected to a
-Netstor NA255A external PCIe enclosure, where both the FPGA board and the GPU
-were connected to the same RC, as opposed to Setup 1. In case of FPGA-to-CPU
-data transfers, the software implementation is the one described
+Netstor NA255A external PCIe enclosure, where both the FPGA board and the GPU
+were connected to the same RC, as opposed to Setup 1. As stated in NVIDIA's
+GPUDirect documentation, the devices must share the same RC to achieve the
+best performance~\cite{cuda_doc}. In the case of FPGA-to-CPU data
+transfers, the software implementation is the one described
 in~\cite{rota2015dma}.
 
 \begin{table}[]
@@ -320,21 +322,22 @@ PCIe slot: FPGA \& GPU    & x8 Gen3 (different RC) & x8 Gen3 (same RC)    \\
 \label{fig:throughput}
 \end{figure}
 
-The measured results for the pure data throughput is shown in
-\figref{fig:throughput} for transfers to the system's main
-memory as well as to the global memory. In the
-case of FPGA-to-GPU data transfers, the double buffering solution was used.
-As one can see, in both cases the write performance is primarily limited by
-the PCIe bus. Up until 2 MB data transfer size, the throughput to the GPU is
-approaching slowly 100 MB/s. From there on, the throughput increases up to 6.4
-GB/s when PCIe bus saturation sets in at about 1 GB data size. The CPU
-throughput saturates earlier but the maximum throughput is 6.6 GB/s.
-The different slope and saturation point is a direct consequence of the different handshaking implementation. 
-
+In order to evaluate the performance of the DMA engine, measurements of pure
+data throughput were carried out using Setup 1. The results are shown in
+\figref{fig:throughput} for transfers to the system's main memory as well as
+to the GPU's global memory. For FPGA-to-GPU data transfers larger than 95 MB,
+the double buffering mechanism was used. As one can see, in both cases the
+write performance is primarily limited by the PCIe bus. Up to a transfer size
+of 2 MB, the throughput to the GPU slowly approaches 100 MB/s. From there on,
+the throughput increases up to 6.4 GB/s at a data size of about 1 GB. The
+throughput to system memory saturates earlier, with a maximum of 6.6 GB/s.
+The slope and the maximum performance depend on the implementation of the
+handshaking sequence between the DMA engine and each host.
 
 %% --------------------------------------------------------------------------
 \subsection{Latency}
 
+
 \begin{figure}[t]
   \centering
   \begin{subfigure}[b]{.49\textwidth}
@@ -361,11 +364,12 @@ A counter on the FPGA measures the time interval between the \emph{start\_dma}
 and \emph{stop\_dma} commands with a resolution of 4 ns, therefore measuring
 the round-trip latency of the system.
 
-The results of 1000 measurements of the round-trip latency are shown in
-\figref{fig:latency-hist}. The latency has a mean value of XXX \textmu s and a
-standard variation of XXX \textmu s. The non-Gaussian distribution indicates a
-systemic influence that we cannot control and is most likely caused by the
-non-deterministic run-time behaviour of the operating system scheduler.
+The results of 1000 measurements of the round-trip latency using system memory
+are shown in \figref{fig:latency-hist}. The latencies for Setup 1 and Setup 2
+are, respectively, XXX \textmu s and XXX \textmu s. The non-Gaussian
+distribution indicates a systematic influence that we cannot control, most
+likely caused by the non-deterministic run-time behaviour of the operating
+system scheduler.
 
 
 \section{Conclusion and outlook}
@@ -379,17 +383,18 @@ data transfer. The measurements on a low-end system based on an Intel Atom CPU
 showed no significant difference in throughput performance. Depending on the
 application and computing requirements, this result makes smaller acquisition
 system a cost-effective alternative to larger workstations.
- 
-We also evaluated the performance of DirectGMA technology for low latency
-applications. Preliminary results indicate that latencies as low as 2 \textmu
-s can be achieved in data transfer to system main memory. However, at the time
-of writing this paper, the latency introduced by DirectGMA OpenCL functions is
-in the range of hundreds of \textmu s.  Optimization of the GPU-DMA
-interfacing code is ongoing with the help of technical support by AMD, in
-order to lift the limitation introduced by OpenCL scheduling and make our
-implementation suitable for low latency applications. Moreover, as opposed to
-the previous case, measurements show that for low latency applications
-dedicated hardware must be used in order to achieve the best performance.
+
+We measured a round-trip latency of 1 \textmu s when transferring data
+between the DMA engine and the system main memory. We also assessed the
+applicability of DirectGMA in low latency applications: preliminary results
+show that latencies as low as 2 \textmu s can be achieved when writing data
+into GPU memory. However, at the time of writing this paper, the latency
+introduced by OpenCL scheduling is in the range of hundreds of \textmu s.
+Therefore, optimization of the GPU-DMA interfacing OpenCL code is ongoing
+with the help of AMD's technical support, in order to lift the current
+limitation and enable the use of our implementation in low latency
+applications. Moreover, our measurements show that dedicated hardware must be
+used to achieve the best performance in low latency applications.
 
 In order to increase the total throughput, a custom FPGA evaluation board is
 currently under development. The board mounts a Virtex-7 chip and features two
@@ -399,7 +404,7 @@ as a single x16 device by using an external PCIe switch. With two cores
 operating in parallel, we foresee an increase in the data throughput by a
 factor of 2 (as demonstrated in~\cite{rota2015dma}).
 
-The software solution that we proposed allows seamless multi-GPU processing of
+The proposed software solution allows seamless multi-GPU processing of
 the incoming data, due to the integration in our streamed computing framework.
 This allows straightforward integration with different DAQ systems and
 introduction of custom data processing algorithms.
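For illustration, the double-buffering mechanism mentioned in the Results hunk above (used for FPGA-to-GPU transfers larger than 95 MB) can be sketched on the OpenCL host side as follows. This is a minimal sketch under stated assumptions, not the implementation from the paper: the chunk size, the two command queues and the processing kernel are placeholders, and the actual data path fills the buffers from the FPGA via DirectGMA rather than clEnqueueWriteBuffer.

/* Minimal double-buffering sketch (illustrative only, error checks omitted).
 * Two buffers alternate: while the kernel processes one chunk, the next
 * chunk is copied into the other buffer. "copyq", "execq", "kernel" and
 * CHUNK are placeholder assumptions. */
#include <CL/cl.h>
#include <stddef.h>

#define CHUNK ((size_t)(32 << 20))   /* 32 MB per buffer (arbitrary) */

void stream_chunks(cl_context ctx, cl_command_queue copyq,
                   cl_command_queue execq, cl_kernel kernel,
                   const char *src, size_t total)
{
    cl_int err;
    cl_mem buf[2];
    cl_event filled[2] = {NULL, NULL};   /* copy into buffer i done      */
    cl_event consumed[2] = {NULL, NULL}; /* kernel reading buffer i done */

    for (int i = 0; i < 2; ++i)
        buf[i] = clCreateBuffer(ctx, CL_MEM_READ_ONLY, CHUNK, NULL, &err);
    (void)err;  /* error handling omitted for brevity */

    int cur = 0;
    for (size_t off = 0; off < total; off += CHUNK) {
        size_t n = (total - off < CHUNK) ? (total - off) : CHUNK;

        /* Refill buf[cur] only once the kernel that last read it is done. */
        clEnqueueWriteBuffer(copyq, buf[cur], CL_FALSE, 0, n, src + off,
                             consumed[cur] ? 1 : 0,
                             consumed[cur] ? &consumed[cur] : NULL,
                             &filled[cur]);
        if (consumed[cur])
            clReleaseEvent(consumed[cur]);

        /* Process the chunk on a separate queue; it waits only for this
         * buffer's copy, so the next copy overlaps with the kernel. */
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf[cur]);
        clEnqueueNDRangeKernel(execq, kernel, 1, NULL, &n, NULL,
                               1, &filled[cur], &consumed[cur]);
        clReleaseEvent(filled[cur]);
        filled[cur] = NULL;

        cur = 1 - cur;    /* switch to the other buffer */
    }

    clFinish(copyq);
    clFinish(execq);
    for (int i = 0; i < 2; ++i) {
        if (consumed[i]) clReleaseEvent(consumed[i]);
        clReleaseMemObject(buf[i]);
    }
}

Two queues are assumed because a single in-order queue would serialize the copy and the kernel, which would defeat the purpose of the second buffer.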