
Fixed conclusion

lorenzo 8 years ago
parent commit 4ca6c256dd
2 changed files with 51 additions and 42 deletions
  1. literature.bib  (+6, -2)
  2. paper.tex  (+45, -40)

literature.bib  (+6, -2)

@@ -1,11 +1,15 @@
+@misc{cuda_doc,
+    title = {GPUDirect - CUDA Toolkit documentation},
+    url = {http://docs.nvidia.com/cuda/gpudirect-rdma/#axzz3rU8P2Jwg},
+}
+
 @CONFERENCE{caselle,
 author={Caselle, M. and al.},
-title={Commissioning of an ultra-fast data acquisition system for coherent synchrotron radiation detection},
+title={Commissioning of an ultra fast data acquisition system for coherent synchrotron radiation detection},
 journal={IPAC 2014: Proceedings of the 5th International Particle Accelerator Conference},
 year={2014},
 pages={3497-3499},
 url={http://www.scopus.com/inward/record.url?eid=2-s2.0-84928346423&partnerID=40&md5=5775c1bc623215f734e33d5e5f9b4a9a},
-document_type={Conference Paper},
 }
 
 @ARTICLE{ufo_camera, 

paper.tex  (+45, -40)

@@ -45,12 +45,12 @@ the Xilinx PCI-Express core,   a Linux driver for register access, and high-
 level software to manage direct   memory transfers using AMD's DirectGMA
 technology. Measurements with a Gen3\,x8 link show a throughput of 6.4~GB/s
 for transfers to GPU memory and 6.6~GB/s to system memory.  We also assesed
-the possibility of using DirectGMA in low latency systems: preliminary
-measurements show a latency as low as 1 \textmu s for data transfers to GPU
-memory. The additional latency introduced by OpenCL scheduling is the current
-performance bottleneck.  Our implementation is suitable for real- time DAQ
-system applications ranging from photon science and medical imaging to High
-Energy Physics (HEP) systems.}
+the possibility of using our architecture in low latency systems: preliminary
+measurements show a round-trip latency as low as 1 \textmu s for data
+transfers to system memory, while the additional latency introduced by OpenCL
+scheduling is the current limitation for GPU-based systems.  Our
+implementation is suitable for real-time DAQ system applications ranging from
+photon science and medical imaging to High Energy Physics (HEP) systems.}
 
 \keywords{FPGA; GPU; PCI-Express; OpenCL; DirectGMA}
 
@@ -280,13 +280,15 @@ Python.
 \section{Results}
 
 We carried out performance measurements on two different setups, which are
-described in table~\ref{table:setups}. A Xilinx VC709 evaluation board was
-used in both setups. In Setup 1, the FPGA and the GPU were plugged into a PCIe
-3.0 slot, but the two devices were connected to different PCIe Root Complexes
+described in table~\ref{table:setups}. In both setups, a Xilinx VC709
+evaluation board was used. In Setup 1, the FPGA board and the GPU were plugged
+into PCIe 3.0 slots, but they were connected to different PCIe Root Complexes
 (RC). In Setup 2, a low-end Supermicro X7SPA-HF-D525 system was connected to a
-Netstor NA255A external PCIe enclosure, where both the FPGA board and the GPU
-were connected to the same RC, as opposed to Setup 1. In case of FPGA-to-CPU
-data transfers, the software implementation is the one described
+Netstor NA255A external PCIe enclosure, where both the FPGA board and the GPU
+were connected to the same RC, as opposed to Setup 1. As stated in NVIDIA's
+GPUDirect documentation, the devices must share the same RC to achieve the
+best performance~\cite{cuda_doc}.  In the case of FPGA-to-CPU data transfers,
+the software implementation is the one described
 in~\cite{rota2015dma}.
 
 \begin{table}[]
@@ -320,21 +322,22 @@ PCIe slot: FPGA \& GPU    & x8 Gen3 (different RC) & x8 Gen3 (same RC)    \\
 \label{fig:throughput}
 \end{figure}
 
-The measured results for the pure data throughput is shown in
-\figref{fig:throughput} for transfers to the system's main
-memory as well as to the global memory. In the
-case of FPGA-to-GPU data transfers, the double buffering solution was used.
-As one can see, in both cases the write performance is primarily limited by
-the PCIe bus. Up until 2 MB data transfer size, the throughput to the GPU is
-approaching slowly 100 MB/s. From there on, the throughput increases up to 6.4
-GB/s when PCIe bus saturation sets in at about 1 GB data size. The CPU
-throughput saturates earlier but the maximum throughput is 6.6 GB/s.
-The different slope and saturation point is a direct consequence of the different handshaking implementation. 
-
+In order to evaluate the performance of the DMA engine, measurements of pure
+data throughput were carried out using Setup 1. The results are shown in
+\figref{fig:throughput} for transfers to the system's main memory as well as
+to the GPU's global memory. For FPGA-to-GPU data transfers larger than 95 MB,
+the double buffering mechanism was used. As one can see, in both cases the
+write performance is primarily limited by the PCIe bus. Up to a data transfer
+size of 2 MB, the throughput to the GPU slowly approaches 100 MB/s. From there
+on, the throughput increases up to 6.4 GB/s at about 1 GB data size. The CPU
+throughput saturates earlier and the maximum throughput is 6.6 GB/s. The slope
+and maximum performance depend on the different implementations of the
+handshaking sequence between the DMA engine and the hosts.
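
Editor's note: the double-buffering scheme referenced in this paragraph is not spelled out in the diff. As a rough host-side sketch only, in plain C with standard OpenCL calls, the ping-pong logic could look like the code below; the chunk size and the start_fpga_dma/wait_fpga_dma helpers are hypothetical placeholders, not the paper's actual driver interface.

    /* Hypothetical host-side double-buffering ("ping-pong") loop: while the
     * FPGA DMA engine fills one buffer, the GPU processes the other one.
     * start_fpga_dma()/wait_fpga_dma() are placeholders for the driver calls
     * that trigger and await the FPGA transfer; they are not from the paper. */
    #include <CL/cl.h>
    #include <stddef.h>

    #define CHUNK_SIZE (64 * 1024 * 1024)       /* assumed chunk size in bytes */

    extern void start_fpga_dma(cl_mem dst);     /* placeholder: kick off FPGA DMA */
    extern void wait_fpga_dma(void);            /* placeholder: block until done  */

    void stream_chunks(cl_context ctx, cl_command_queue queue, cl_kernel kernel,
                       size_t n_chunks)
    {
        cl_int err;
        cl_mem buf[2];
        buf[0] = clCreateBuffer(ctx, CL_MEM_READ_WRITE, CHUNK_SIZE, NULL, &err);
        buf[1] = clCreateBuffer(ctx, CL_MEM_READ_WRITE, CHUNK_SIZE, NULL, &err);

        start_fpga_dma(buf[0]);                 /* prime the first buffer */

        for (size_t i = 0; i < n_chunks; i++) {
            size_t cur = i % 2;

            wait_fpga_dma();                    /* buffer 'cur' is now full */
            if (i + 1 < n_chunks)
                start_fpga_dma(buf[1 - cur]);   /* refill the other buffer  */

            /* process the filled buffer on the GPU */
            size_t global_size = CHUNK_SIZE / sizeof(float);
            clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf[cur]);
            clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                                   0, NULL, NULL);
            clFinish(queue);
        }

        clReleaseMemObject(buf[0]);
        clReleaseMemObject(buf[1]);
    }

The point of the scheme is that the FPGA fills one buffer while the GPU consumes the other, so the PCIe link stays busy during large transfers.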
 
 %% --------------------------------------------------------------------------
 \subsection{Latency}
 
+
 \begin{figure}[t]
   \centering
   \begin{subfigure}[b]{.49\textwidth}
@@ -361,11 +364,12 @@ A counter on the FPGA measures the time interval between the \emph{start\_dma}
 and \emph{stop\_dma} commands with a resolution of 4 ns, therefore measuring
 the round-trip latency of the system.
 
-The results of 1000 measurements of the round-trip latency are shown in
-\figref{fig:latency-hist}. The latency has a mean value of XXX \textmu s and a
-standard variation of XXX \textmu s. The non-Gaussian distribution indicates a
-systemic influence that we cannot control and is most likely caused by the
-non-deterministic run-time behaviour of the operating system scheduler.
+The results of 1000 measurements of the round-trip latency using system memory
+are shown in \figref{fig:latency-hist}. The latencies for Setup 1 and Setup 2
+are, respectively, XXX \textmu s and XXX \textmu s. The non-Gaussian
+distribution indicates a systemic influence that we cannot control and is most
+likely caused by the non-deterministic run-time behaviour of the operating
+system scheduler.
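
Editor's note: as a small aside on how such figures follow from the raw measurement, the counter values can be converted to microseconds and summarised as below (plain C; the 4 ns tick and the 1000 repetitions come from the text, while the ticks array is assumed to be already read back from the board).

    /* Convert raw FPGA counter readings (4 ns per tick, as stated above) into
     * round-trip latencies in microseconds and report mean and standard
     * deviation over the 1000 repetitions.  Reading the counter register from
     * the board is outside the scope of this sketch. */
    #include <math.h>
    #include <stdio.h>

    #define N_MEAS  1000            /* number of latency measurements   */
    #define TICK_NS 4.0             /* counter resolution in nanoseconds */

    void latency_stats(const unsigned int ticks[N_MEAS])
    {
        double sum = 0.0, sum_sq = 0.0;

        for (int i = 0; i < N_MEAS; i++) {
            double us = ticks[i] * TICK_NS / 1000.0;    /* ns -> us */
            sum    += us;
            sum_sq += us * us;
        }

        double mean = sum / N_MEAS;
        double std  = sqrt(sum_sq / N_MEAS - mean * mean);

        printf("round-trip latency: mean %.3f us, std %.3f us\n", mean, std);
    }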
 
 
 \section{Conclusion and outlook}
@@ -379,17 +383,18 @@ data transfer. The measurements on a low-end system based on an Intel Atom CPU
 showed no significant difference in throughput performance. Depending on the
 application and computing requirements, this result makes smaller acquisition
 system a cost-effective alternative to larger workstations.
- 
-We also evaluated the performance of DirectGMA technology for low latency
-applications. Preliminary results indicate that latencies as low as 2 \textmu
-s can be achieved in data transfer to system main memory. However, at the time
-of writing this paper, the latency introduced by DirectGMA OpenCL functions is
-in the range of hundreds of \textmu s.  Optimization of the GPU-DMA
-interfacing code is ongoing with the help of technical support by AMD, in
-order to lift the limitation introduced by OpenCL scheduling and make our
-implementation suitable for low latency applications. Moreover, as opposed to
-the previous case, measurements show that for low latency applications
-dedicated hardware must be used in order to achieve the best performance.
+
+We measured a round-trip latency of 1 \textmu s when transferring data between
+the DMA engine and system main memory. We also assessed the applicability of
+DirectGMA in low latency applications: preliminary results show that latencies
+as low as 2 \textmu s can be achieved when writing data into GPU memory.
+However, at the time of writing this paper, the latency introduced by OpenCL
+scheduling is in the range of hundreds of \textmu s. Therefore, optimization
+of the GPU-DMA interfacing OpenCL code is ongoing with the help of technical
+support from AMD, in order to lift the current limitation and enable the use
+of our implementation in low latency applications. Moreover, measurements show
+that for low latency applications dedicated hardware must be used in order to
+achieve the best performance.
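
Editor's note: the OpenCL scheduling overhead discussed here can be quantified with the standard event-profiling API. A minimal sketch follows (plain C; illustrative only, not the authors' measurement code, and it assumes the command queue was created with CL_QUEUE_PROFILING_ENABLE).

    /* Estimate OpenCL scheduling overhead by comparing when a command was
     * queued with when it actually started executing on the device.
     * The buffer write is just a convenient example command. */
    #include <CL/cl.h>
    #include <stdio.h>

    void print_scheduling_overhead(cl_command_queue queue, cl_mem buf,
                                   const void *host_ptr, size_t size)
    {
        cl_event ev;
        cl_ulong queued, start;

        clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, size, host_ptr,
                             0, NULL, &ev);

        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_QUEUED,
                                sizeof(queued), &queued, NULL);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                                sizeof(start), &start, NULL);

        /* profiling timestamps are reported in nanoseconds */
        printf("scheduling overhead: %.1f us\n", (start - queued) / 1000.0);

        clReleaseEvent(ev);
    }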
 
 In order to increase the total throughput, a custom FPGA evaluation board is
 currently under development. The board mounts a Virtex-7 chip and features two
@@ -399,7 +404,7 @@ as a single x16 device by using an external PCIe switch. With two cores
 operating in parallel, we foresee an increase in the data throughput by a
 factor of 2 (as demonstrated in~\cite{rota2015dma}).
 
-The software solution that we proposed allows seamless multi-GPU processing of
+The proposed software solution allows seamless multi-GPU processing of
 the incoming data, due to the integration in our streamed computing framework.
 This allows straightforward integration with different DAQ systems and
 introduction of custom data processing algorithms.