@@ -20,13 +20,11 @@
}

\abstract{%
- %% Old
- % \emph{A growing number of physics experiments requires DAQ systems with multi-GB/s
- % data links.}
- %proposal for new abstract, including why do we need GPUs
Modern physics experiments have reached multi-GB/s data rates. Fast data
links and high performance computing stages are required for continuous
acquisition and processing. Because of their intrinsic parallelism and
computational power, GPUs emerged as an ideal solution for high
performance computing applications. To connect a fast data acquisition stage
with a GPU's processing power, we developed an architecture consisting of a
@@ -75,6 +73,7 @@ host computer. Due to its high bandwidth and modularity, PCIe quickly became the
commercial standard for connecting high-throughput peripherals such as GPUs or
solid state disks. Optical PCIe networks have been demonstrated
% JESUS: time span -> for, point in time -> since ...
+% BUDDHA: Ok boss. I wanted to say "since 10 years ago...", is for ok?
for nearly a decade~\cite{optical_pcie}, opening the possibility of using PCIe
as a communication bus over long distances. In particular, in HEP DAQ systems,
optical links are preferred over electrical ones because of their superior
@@ -96,7 +95,6 @@ DMA data transfers are handled by dedicated hardware, which compared with
Programmed Input Output (PIO) access, offers lower latency and higher throughput
at the cost of higher system complexity.

-
\begin{figure}[t]
\centering
\includegraphics[width=1.0\textwidth]{figures/transf}
@@ -146,16 +144,21 @@ address is 2 GB.
On the host side, AMD's DirectGMA technology, an implementation of the
bus-addressable memory extension for OpenCL 1.1 and later, is used to write from
the FPGA to GPU memory and from the GPU to the FPGA's control registers.
-\figref{fig:opencl-setup} illustrates the main mode of operation: To write into
-the GPU, the physical bus address of the GPU buffer is determined with a call to
+\figref{fig:opencl-setup} illustrates the main mode of operation: to write into
+the GPU, the physical bus addresses of the GPU buffers are determined with a call to
\texttt{clEnqueue\-Make\-Buffers\-Resident\-AMD} and set by the host CPU in a
control register of the FPGA (1). The FPGA then writes data blocks autonomously
-in DMA fashion (2). Due to hardware restrictions the largest possible GPU buffer
+in DMA fashion (2).
+% BUDDHA: This part is not true. We need to always do the handshaking if we transfer
+% more than 95 MB, which I assume is always the case, otherwise we would not need a DMA...
+Due to hardware restrictions, the largest possible GPU buffer
sizes are about 95 MB but larger transfers can be achieved using a double
buffering mechanism. Because the GPU provides a flat memory address space and
our DMA engine allows multiple destination addresses to be set in advance, we
can determine all addresses before the actual transfers, thus keeping the
CPU out of the transfer loop.
+%% BUDDHA: the CPU is still involved in the loop at the moment. We didn't manage
+% to move the handshaking completely to the GPU, did we?
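
For illustration, a minimal host-side sketch of steps (1) and (2) is shown below. It
assumes the \texttt{cl\_amd\_bus\_addressable\_memory} extension; the flag and structure
names should be checked against \texttt{CL/cl\_ext.h}, and \texttt{fpga\_write\_register}
as well as the register offset are placeholders for our FPGA driver interface rather
than its actual API.
\begin{verbatim}
/* Host-side sketch: pin a GPU buffer, obtain its physical bus address and
 * pass it to the FPGA DMA engine.  Extension names are assumptions taken
 * from the cl_amd_bus_addressable_memory specification. */
#include <CL/cl.h>
#include <CL/cl_ext.h>

#define DMA_BUFFER_SIZE  (64 << 20)   /* stay below the ~95 MB limit  */
#define REG_DMA_DST_ADDR 0x10         /* illustrative register offset */

void fpga_write_register(unsigned reg, cl_ulong value);   /* placeholder */

void setup_fpga_to_gpu(cl_context ctx, cl_command_queue queue)
{
    cl_int err;
    cl_bus_address_amd bus_addr;

    /* Buffer that the FPGA fills autonomously via DMA (2). */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_BUS_ADDRESSABLE_AMD,
                                DMA_BUFFER_SIZE, NULL, &err);

    /* Make the buffer resident and query its physical bus address. */
    err = clEnqueueMakeBuffersResidentAMD(queue, 1, &buf, CL_TRUE,
                                          &bus_addr, 0, NULL, NULL);

    /* (1) Hand the destination address to the FPGA; with double buffering
     * the addresses of both buffers are set in advance. */
    fpga_write_register(REG_DMA_DST_ADDR, bus_addr.surface_bus_address);
}
\end{verbatim}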

To signal events to the FPGA (4), the control registers can be mapped into the
GPU's address space passing a special AMD-specific flag and passing the physical
@@ -180,12 +183,14 @@ plugin for our scalable GPU processing framework~\cite{vogelgesang2012ufo}. This
framework allows for an easy construction of streamed data processing on
heterogeneous multi-GPU systems. For example, to read data from the FPGA, decode
its specific format and run a Fourier transform on the GPU as well as writing
-back the results to disk, one can run \texttt{ufo-launch direct-gma ! decode !
-fft ! write filename=out.raw} on the command line. The framework will take care
-of scheduling the tasks and distribute the data items according. A
-complementary application programming interface allows users to develop custom
-applications written in C or high-level languages such as Python. High
-throughput is achieved by the combination of fine- and coarse-grained data
+back the results to disk, one can run on the command line:
+% BUDDHA: I like this point very very much, formatting helps to make it stand out
+\begin{verbatim}
+ufo-launch direct-gma ! decode ! fft ! write filename=out.raw
+\end{verbatim}
+The framework will take care of scheduling the tasks and distributing the data items
+accordingly. A complementary application programming interface allows users to
+develop custom applications written in C or high-level languages such as Python.
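
As an illustration of the C interface, the same pipeline could be assembled
programmatically along the following lines (task and function names follow the
framework's GObject conventions and have not been checked against the current
\texttt{ufo-core} headers):
\begin{verbatim}
/* Sketch only: build the direct-gma -> decode -> fft -> write pipeline in C.
 * Names follow ufo-core's GObject API conventions and should be verified
 * against ufo.h; error handling is omitted for brevity. */
#include <ufo/ufo.h>

int main (void)
{
    GError *error = NULL;
    UfoPluginManager *pm = ufo_plugin_manager_new ();
    UfoTaskGraph *graph = UFO_TASK_GRAPH (ufo_task_graph_new ());

    UfoTaskNode *dma    = ufo_plugin_manager_get_task (pm, "direct-gma", &error);
    UfoTaskNode *decode = ufo_plugin_manager_get_task (pm, "decode", &error);
    UfoTaskNode *fft    = ufo_plugin_manager_get_task (pm, "fft", &error);
    UfoTaskNode *write  = ufo_plugin_manager_get_task (pm, "write", &error);

    g_object_set (write, "filename", "out.raw", NULL);

    ufo_task_graph_connect_nodes (graph, dma, decode);
    ufo_task_graph_connect_nodes (graph, decode, fft);
    ufo_task_graph_connect_nodes (graph, fft, write);

    UfoBaseScheduler *sched = UFO_BASE_SCHEDULER (ufo_scheduler_new ());
    ufo_base_scheduler_run (sched, graph, &error);
    return 0;
}
\end{verbatim}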
+High throughput is achieved by the combination of fine- and coarse-grained data
parallelism, \emph{i.e.} processing a single data item on a GPU using thousands
of threads and by splitting the data stream and feeding individual data items to
separate GPUs. None of this requires any user intervention and is solely
@@ -215,6 +220,7 @@ strategy a viable solution for very large data transfers.
transfers required to fill the destination buffer. The throughput has been
estimated using the host-side wall clock time. On-GPU data transfer is about
twice as fast.
+ %% BUDDHA: forgive my ignorance: what does it mean "on-gpu"?
}
\label{fig:intra-copy}
\end{figure}
@@ -222,6 +228,9 @@ strategy a viable solution for very large data transfers.

\subsection{Throughput}

+%% BUDDHA: why do we need to state this thing? High throughput affects also the
+%% total latency. One can optimize for one or the other probably, but at the moment
+%% we use the same approach, so I would not write this.
A high throughput is desired for applications in which the FPGA outputs large
amounts of data and timing is not an issue. This includes fast, high-resolution
photon detectors as used in synchrotron facilities.
@@ -231,7 +240,7 @@ For both system and GPU memory, the write performance is primarily limited by
the PCIe bus. Higher payloads introduce less overhead, thus increasing the net
bandwidth. Up until 2 MB transfer size, the performance is almost the same;
after that, the GPU transfer shows a slightly better slope. Data transfers larger
-than 1 GB saturate the PCIe bus. \bf{LR: We should measure the slope for
+than 1 GB saturate the PCIe bus. \textbf{LR: We should measure the slope for
different page sizes, I expect the saturation point to change for different
page sizes}
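
A simple model illustrates this behaviour: assuming every transfer of size $P$ incurs
an approximately constant setup overhead $t_0$ (DMA descriptor setup and handshaking)
in addition to the wire time $P/B_\mathrm{raw}$, the effective bandwidth is
\begin{equation}
  B_\mathrm{eff}(P) = \frac{P}{t_0 + P/B_\mathrm{raw}},
\end{equation}
which approaches the raw link bandwidth $B_\mathrm{raw}$ once $P \gg t_0 B_\mathrm{raw}$,
consistent with the saturation observed for transfers in the GB range.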

@@ -250,6 +259,8 @@ page sizes}

%% Change the specs for the small crate
% MV: we never did anything in that regard
+% LR: Nicholas did, and he said there was no difference in FPGA-GPU
+
% For FPGA-to-GPU transfers, we also repeated the measurements using a low-end system
% based on XXX and Intel Nano XXXX. The results do not show any significant difference
% compared to the previous setup, making it a more cost-effective solution.
@@ -263,20 +274,27 @@ page sizes}
\label{fig:intra-copy}
\end{figure}

-%% Latency here? What do we do?
-%% We should add an histogram with 1000+ measurements of the latency to see if it's time-deterministic
-%% Also: add a plot of latency vs different data sizes transmitted (from FPGA)
+%% Latency distribution plot: perfect! we should also add as legend avg=168.x us, sigma=2 us, max=180 us
+
+%% Here: instead of this useless plot, we can plot the latency vs different data sizes transmitted (from FPGA). It should reach 50% less for large data transfers, even with our current limitation... Maybe we can also try on a normal desktop?
\begin{figure}
\centering
\includegraphics[width=0.6\textwidth]{figures/latency}
\caption{%
- The data transmission latency is decreased by XXX percent with respect to the traditional
- approach (a) by using DirectGMA (b). The latency has been measured by taking the round-trip time
- for a 4k pac
+ For data transfers larger than XX MB, latency is decreased by XXX percent with respect to the traditional approach (a) by using our implementation (b).
}
\label{fig:latency}
\end{figure}

+% In case everything is fine.
+\figref{fig:latency} shows the comparison between the traditional approach and the
+GPU DMA data transfer, together with the measured latency distribution. The total
+latency is decreased with respect to the traditional approach.
+
+%% EMERGENCY TEXT if we don't manage to fix the latency problem
+The round-trip time of a memory read request issued from the CPU to the FPGA is less
+than 1 $\mu$s. Therefore, the current performance bottleneck lies in the execution of
+DirectGMA functions.
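
For reference, such a round-trip time can be measured with a plain host-side loop of
the following form (a minimal sketch: the BAR resource path and register offset are
illustrative placeholders and not necessarily the exact procedure used to obtain the
number quoted above):
\begin{verbatim}
/* Sketch: time CPU-to-FPGA read round trips through a mmap'ed PCIe BAR.
 * The device path and register offset are illustrative placeholders. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

int main (void)
{
    int fd = open ("/sys/bus/pci/devices/0000:01:00.0/resource0",
                   O_RDWR | O_SYNC);
    volatile uint32_t *regs = mmap (NULL, 4096, PROT_READ | PROT_WRITE,
                                    MAP_SHARED, fd, 0);
    struct timespec t0, t1;
    const int n = 1000;

    clock_gettime (CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < n; i++)
        (void) regs[0];        /* non-posted read forces a full round trip */
    clock_gettime (CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf ("average round-trip time: %.0f ns\n", ns / n);

    munmap ((void *) regs, 4096);
    close (fd);
    return 0;
}
\end{verbatim}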
+
+

\section{Conclusion}

@@ -290,10 +308,14 @@ integration of the DMA transfer setup in our streamed computing framework.
Integration with different DAQ systems and custom algorithms is therefore
immediate.

-
\subsection{Outlook}

-Support for NVIDIA's GPUDirect technology is also foreseen in the next months to
+%Add if we cannot fix latency
+An optimization of the OpenCL code is ongoing, with the help of AMD technical support.
+With a better understanding of the hardware and software aspects of DirectGMA, we expect
+a significant improvement in latency performance.
+
+Support for NVIDIA's GPUDirect technology is foreseen in the next months to
lift the limitation to one specific GPU vendor and to make a direct performance comparison
possible.

A custom FPGA evaluation board is currently under development in order to
@@ -306,10 +328,11 @@ we foresee an increase in the data throughput by a factor of 2 (as demonstrated
\textbf{LR: Instead of swapping PCIe-infinib, I would say include it in the architecture.
A big house for all these love-lacking protocols.}

-It is our intention to add Infiniband support. I NEED TO READ
-WHAT ARE THE ADVANTAGES VS PCIe.
+It is our intention to add InfiniBand support.
+\textbf{I NEED TO READ
+WHAT THE ADVANTAGES VS PCIe ARE. Update: internet sucks in China.}

-\textbf{LR:Here comes the visionary Luigi...}
+%LR:Here comes the visionary Luigi
Our goal is to develop a unique hybrid solution, based
on commercial standards, that includes fast data transmission protocols and a high performance
GPU computing framework.