@@ -75,14 +75,12 @@ GB/s~\cite{ufo_camera, caselle}. In a typical scenario, data are acquired by
back-end readout systems and then transmitted in short bursts or in a
continuous streaming mode to a computing stage. In order to collect data over
long observation times, the readout architecture and the computing stages must
-be able to sustain high data rates.
-
-Recent years have also seen an increasing interest in GPU-based systems for
-High Energy Physics (HEP) (\emph{e.g.} ATLAS~\cite{atlas_gpu},
-ALICE~\cite{alice_gpu}, Mu3e~\cite{mu3e_gpu}, PANDA~\cite{panda_gpu}) and
-photon science experiments. In time-deterministic applications,\emph{e.g.} in
-Low/High-level trigger systems, latency becomes the most stringent
-requirement.
+be able to sustain high data rates. Recent years have also seen an increasing
+interest in GPU-based systems for High Energy Physics (HEP) (\emph{e.g.}
+ATLAS~\cite{atlas_gpu}, ALICE~\cite{alice_gpu}, Mu3e~\cite{mu3e_gpu},
+PANDA~\cite{panda_gpu}) and photon science experiments. In time-deterministic
+applications, such as Low/High-level trigger systems, latency becomes
+the most stringent requirement.

Due to its high bandwidth and modularity, PCIe quickly became the commercial
standard for connecting high-throughput peripherals such as GPUs or solid
@@ -90,29 +88,23 @@ state disks. Moreover, optical PCIe networks have been demonstrated a decade
ago~\cite{optical_pcie}, opening the possibility of using PCIe as a
communication link over long distances.

-Several solutions for direct FPGA/GPU communication based on PCIe are reported
+Several solutions for direct FPGA-GPU communication based on PCIe are reported
in literature, and all of them are based on NVIDIA's GPUdirect technology. In
-the implementation of bittnerner and Ruf ~\cite{bittner} the GPU acts as
+the implementation of Bittner and Ruf~\cite{bittner}, the GPU acts as
master during an FPGA-to-GPU data transfer, reading data from the FPGA. This
solution limits the reported bandwidth and latency to 514 MB/s and 40~\textmu
-s, respectively.
-
-%LR: FPGA^2 it's the name of their thing...
-%MV: best idea in the world :)
-%LR: Let's call ours FPGA^2_GPU
-
-When the FPGA is used as a master, a higher throughput can be achieved. An
-example of this approach is the \emph{FPGA\textsuperscript{2}} framework by Thoma
-et~al.\cite{thoma}, which reaches 2454 MB/s using a 8x Gen2.0 data link.
-Lonardo et~al.\ achieved low latencies with their NaNet design, an FPGA-based
-PCIe network interface card~\cite{lonardo2015nanet}. The Gbe link however
-limits the latency performance of the system to a few tens of \textmu s. If
-only the FPGA-to-GPU latency is considered, the measured values span between
-1~\textmu s and 6~\textmu s, depending on the datagram size. Moreover, the
-bandwidth saturates at 120 MB/s. Nieto et~al.\ presented a system based on a
-PXIexpress data link that makes use of four PCIe 1.0
-links~\cite{nieto2015high}. Their system (as limited by the interconnect)
-achieves an average throughput of 870 MB/s with 1 KB block transfers.
+s, respectively. When the FPGA is used as a master, a higher throughput can be
+achieved. An example of this approach is the \emph{FPGA\textsuperscript{2}}
+framework by Thoma et~al.~\cite{thoma}, which reaches 2454 MB/s using an 8x
+Gen2.0 data link. Lonardo et~al.\ achieved low latencies with their NaNet
+design, an FPGA-based PCIe network interface card~\cite{lonardo2015nanet}. The
+GbE link, however, limits the latency performance of the system to a few tens
+of \textmu s. If only the FPGA-to-GPU latency is considered, the measured
+values span between 1~\textmu s and 6~\textmu s, depending on the datagram
+size. Nieto et~al.\ presented a system based on a PXIexpress data link that
+makes use of four PCIe 1.0 links~\cite{nieto2015high}. Their system (as
+limited by the interconnect) achieves an average throughput of 870 MB/s with
+1 KB block transfers.

In order to achieve the best performance in terms of latency and bandwidth, we
developed a high-performance DMA engine based on Xilinx's PCIe Gen3 Core. To
@@ -121,11 +113,11 @@ for our scalable GPU processing framework~\cite{vogelgesang2012ufo}. This
framework allows for an easy construction of streamed data processing on
heterogeneous multi-GPU systems. Because the framework is based on OpenCL,
integration with NVIDIA's CUDA functions for GPUDirect technology is not
-possible at the moment. Thus, we used AMD's DirectGMA technology to integrate
-direct FPGA-to-GPU communication into our processing pipeline. In this paper
-we report the performance of our DMA engine for FPGA-to-CPU communication and
-some preliminary measurements about DirectGMA's performance in low-latency
-applications.
+possible at the moment. We therefore used AMD's DirectGMA technology to
+integrate direct FPGA-to-GPU communication into our processing pipeline. In
+this paper we report the throughput performance of our architecture together
+with some preliminary measurements about DirectGMA's applicability in
+low-latency applications.
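+
+As an illustration of this integration, the following host-side sketch is
+based on the publicly documented cl\_amd\_bus\_addressable\_memory OpenCL
+extension rather than on our production code; the buffer size, the final
+register write and all variable names are purely illustrative:
+
+\begin{verbatim}
+/* Sketch: expose a GPU buffer to the FPGA with DirectGMA.
+   Assumes an existing OpenCL context, queue and platform. */
+cl_int err;
+cl_bus_address_amd bus_addr = {0};
+cl_mem dma_buf = clCreateBuffer(ctx, CL_MEM_BUS_ADDRESSABLE_AMD,
+                                64 << 20, NULL, &err);
+
+/* Extension entry points are resolved at run time. */
+clEnqueueMakeBuffersResidentAMD_fn make_resident =
+    (clEnqueueMakeBuffersResidentAMD_fn)
+    clGetExtensionFunctionAddressForPlatform(
+        platform, "clEnqueueMakeBuffersResidentAMD");
+
+/* Pin the buffer and obtain its PCIe bus address ... */
+make_resident(queue, 1, &dma_buf, CL_TRUE, &bus_addr, 0, NULL, NULL);
+
+/* ... which the host then writes into the descriptor registers of
+   the FPGA DMA engine (illustrative helper), so that data is pushed
+   directly into GPU memory without passing through main memory. */
+write_fpga_register(DESC_ADDR_REG, bus_addr.surface_bus_address);
+\end{verbatim}
+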
%% LR: this part -> OK
\section{Architecture}
@@ -238,18 +230,19 @@ as bus master and pushes data to the FPGA.

%% Double Buffering strategy.

-Due to hardware restrictions the largest possible GPU buffer sizes are about 95
-MB but larger transfers can be achieved by using a double buffering mechanism.
-data are copied from the DirectGMA buffer exposed to the FPGA into a different
-GPU buffer. To verify that we can keep up with the incoming data throughput
-using this strategy, we measured the data throughput within a GPU by copying
-data from a smaller sized buffer representing the DMA buffer to a larger
-destination buffer. At a block size of about 384 KB the throughput surpasses the
-maximum possible PCIe bandwidth, and it reaches 40 GB/s for blocks bigger than 5
-MB. Double buffering is therefore a viable solution for very large data
-transfers, where throughput performance is favoured over latency. For data sizes
-less than 95 MB, we can determine all addresses before the actual transfers thus
-keeping the CPU out of the transfer loop.
+Due to hardware restrictions, the largest possible GPU buffer sizes are about
+95 MB, but larger transfers can be achieved by using a double buffering
+mechanism: data are copied from the buffer exposed to the FPGA into a
+different location in GPU memory. To verify that we can keep up with the
+incoming data throughput using this strategy, we measured the data throughput
+within a GPU by copying data from a smaller buffer representing the DMA
+buffer to a larger destination buffer. At a block size of about 384 KB the
+throughput surpasses the maximum possible PCIe bandwidth, and it reaches 40
+GB/s for blocks bigger than 5 MB. Double buffering is therefore a viable
+solution for very large data transfers, where throughput performance is
+favoured over latency. For data sizes less than 95 MB, we can determine all
+addresses before the actual transfers, thus keeping the CPU out of the
+transfer loop.
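+
+The copy step of this scheme is an ordinary device-to-device copy. A
+simplified sketch is given below; buffer and variable names are illustrative
+and error handling is omitted:
+
+\begin{verbatim}
+/* Double buffering copy step: move the block that the FPGA has just
+   written into the small DMA window (dma_buf) to its slot in a large
+   destination buffer (dst_buf) residing in GPU memory. */
+size_t block_size = 32 << 20;                 /* example: 32 MB      */
+size_t dst_offset = blocks_done * block_size; /* next free slot      */
+
+clEnqueueCopyBuffer(queue, dma_buf, dst_buf,
+                    0,           /* offset inside the DMA window     */
+                    dst_offset,  /* offset in the destination buffer */
+                    block_size, 0, NULL, NULL);
+clFinish(queue);   /* after this, the DMA window can be reused       */
+\end{verbatim}
+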
%% Ufo Framework
To process the data, we encapsulated the DMA setup and memory mapping in a
@@ -318,22 +311,24 @@ in~\cite{rota2015dma}.
\includegraphics[width=0.85\textwidth]{figures/throughput}
\caption{%
Measured throughput for data transfers from FPGA to main memory
- (CPU) and from FPGA to the global GPU memory (GPU).
+ (CPU) and from FPGA to the global GPU memory (GPU) using Setup 1.
}
\label{fig:throughput}
\end{figure}

-In order to evaluate the maximum performance of the DMA engine, measurements of pure
-data throughput were carried out using Setup 1. The results are shown in
-\figref{fig:throughput} for transfers to the system's main memory as well as
-to the global memory. For FPGA-to-GPU data transfers bigger than 95 MB, the
+In order to evaluate the maximum performance of the DMA engine, measurements
+of pure data throughput were carried out using Setup 1. The results are shown
+in \figref{fig:throughput} for transfers to the system's main memory as well
+as to the global memory. For FPGA-to-GPU data transfers bigger than 95 MB, the
double buffering mechanism was used. As one can see, in both cases the write
performance is primarily limited by the PCIe bus. Up until 2 MB data transfer
size, the throughput to the GPU slowly approaches 100 MB/s. From there on,
the throughput increases up to 6.4 GB/s at about 1 GB data size. The CPU
throughput saturates earlier and the maximum throughput is 6.6 GB/s. The slope
and maximum performance depend on the different implementations of the
-handshaking sequence between DMA engine and the hosts.
+handshaking sequence between the DMA engine and the hosts. With Setup 2, the
+PCIe Gen1 link limits the throughput to system main memory to around 700 MB/s.
+However, transfers to GPU memory yielded the same results as Setup 1.

%% --------------------------------------------------------------------------
\subsection{Latency}
@@ -367,32 +362,31 @@ We conducted the following test in order to measure the latency introduced by th
3) when all the data has been transferred, the DMA engine notifies the host that new data is present by writing into a specific address in the system main memory.
4) the host acknowledges that data has been received by issuing the \emph{stop\_dma} command.

-The correct ordering of the packets is assured by the PCIe protocol.
A counter on the FPGA measures the time interval between the \emph{start\_dma}
and \emph{stop\_dma} commands with a resolution of 4 ns, therefore measuring
-the round-trip latency of the system. The round-trip latencies for data
-transfers to system main memory and GPU memory are shown in
-\figref{fig:latency}.
-
-When system main memory is used,
-latencies as low as 1.1 \textmu s are achieved with Setup 1 for a packet size
-of 1024 B. The higher latency and the dependance on size measured with Setup 2
-are caused by the slower PCIe x4 Gen1 link connecting the FPGA board to the system main memory.
-
-The same test was performed when transferring data inside GPU memory, but also
-in this case the notification is written to systen main memory. This approach
-was used because the latency introduced by OpenCL scheduling (\~ 100 \textmu
-s) does not allow for a direct measurement based only on DirectGMA
-communication. When connecting the devices to the same RC, as in Setup 2, a
-latency of 2 \textmu is achieved (limited by the latency to system main
-memory, as seen in \figref{fig:latency}.a. On the contrary, if the FPGA board
-and the GPU are connected to different RC as in Setup 1, the latency increases
-significantly. It must be noted that the low latencies measured with Setup 1
-for packet sizes below 1 kB seem to be due to a caching mechanism inside the
-PCIe switch, and it is not clear whether data has been successfully written
-into GPU memory when the notification is delivered to the CPU. This effect
-must be taken into account for future implementations as it could potentially
-lead to data corruption.
+the round-trip latency of the system. The correct ordering of the packets is
+assured by the PCIe protocol. The measured round-trip latencies for data transfers to
+system main memory and GPU memory are reported in \figref{fig:latency}.
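+
+For reference, the host-side part of this handshake can be sketched as
+follows; the mapping helpers and register indices are illustrative
+placeholders, only the \emph{start\_dma}/\emph{stop\_dma} sequence follows
+the procedure described above:
+
+\begin{verbatim}
+/* Host-side sketch of the round-trip latency test; map_fpga_bar(),
+   alloc_dma_word() and the register indices are placeholders. */
+volatile uint32_t *bar = map_fpga_bar();      /* FPGA control registers   */
+volatile uint32_t *notify = alloc_dma_word(); /* word written by the FPGA */
+
+*notify = 0;
+bar[REG_START_DMA] = 1;  /* issue start_dma: FPGA counter starts        */
+while (*notify == 0)     /* 3) wait for the notification write          */
+    ;                    /* busy-wait to avoid scheduler jitter         */
+bar[REG_STOP_DMA] = 1;   /* 4) acknowledge with stop_dma: counter stops */
+\end{verbatim}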
+
+When system main memory is used, latencies as low as 1.1 \textmu s are
+achieved with Setup 1 for a packet size of 1024 B. The higher latency and the
+dependence on size measured with Setup 2 are caused by the slower PCIe x4 Gen1
+link connecting the FPGA board to the system main memory.
+
+The same test was performed when transferring data inside GPU memory. Like in
+the previous case, the notification was written into system main memory. This
+approach was used because the latency introduced by OpenCL scheduling in our
+implementation ($\sim$100--200~\textmu s) did not allow a precise measurement
+based only on FPGA-GPU communication. When connecting the devices to the same
+RC, as in Setup 2, a latency of 2~\textmu s is achieved (limited by the latency
+to system main memory, as seen in \figref{fig:latency}.a). On the contrary, if
+the FPGA board and the GPU are connected to different RCs, as in Setup 1, the
+latency increases significantly with packet size. It must be noted that the
+low latencies measured with Setup 1 for packet sizes below 1 kB seem to be due
+to a caching mechanism inside the PCIe switch, and it is not clear whether
+data has been successfully written into GPU memory when the notification is
+delivered to the CPU. This effect must be taken into account in future
+implementations as it could potentially lead to data corruption.

\section{Conclusion and outlook}