
latency graph label correction, some commas and minor word changes

Luis Ardila, 8 years ago
commit 5739cb477d
2 changed files with 11 additions and 10 deletions
  1. data/latency-hist.py (+2 -2)
  2. paper.tex (+9 -8)

+ 2 - 2
data/latency-hist.py

@@ -17,8 +17,8 @@ plt.rc('font', **dict(family='serif'))
 plt.figure(figsize=(4, 3))
 
 # divide by 2 for one-way latency
-plt.hist(gpu_data / 2, bins=100, normed=False, label='CPU')
-plt.hist(cpu_data / 2, bins=100, normed=False, label='GPU')
+plt.hist(gpu_data / 2, bins=100, normed=False, label='GPU')
+plt.hist(cpu_data / 2, bins=100, normed=False, label='CPU')
 
 plt.xlabel(u'Latency in \u00b5s')
 plt.ylabel('Frequency')
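
As context for this label fix, a self-contained sketch of the corrected plot on a current matplotlib release might look as follows. The random input arrays, the seed, and the output filename are placeholders (the repository loads gpu_data and cpu_data from measurement files), and the deprecated normed keyword is replaced by its successor density; this is an illustrative sketch, not the repository's script.

import numpy as np
import matplotlib.pyplot as plt

# Placeholder latencies in microseconds; the real script loads measured
# round-trip values for the GPU and CPU paths from data files.
rng = np.random.default_rng(0)
gpu_data = rng.normal(4.0, 0.3, 10000)
cpu_data = rng.normal(8.0, 0.5, 10000)

plt.rc('font', **dict(family='serif'))
plt.figure(figsize=(4, 3))

# divide by 2 for one-way latency; labels now match the data they describe
plt.hist(gpu_data / 2, bins=100, density=False, label='GPU')
plt.hist(cpu_data / 2, bins=100, density=False, label='CPU')

plt.xlabel(u'Latency in \u00b5s')
plt.ylabel('Frequency')
plt.legend()
plt.tight_layout()
plt.savefig('latency-hist.pdf')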

+ 9 - 8
paper.tex

@@ -33,11 +33,12 @@
   performance computing applications. To connect a fast data acquisition stage
   with a GPU's processing power, we developed an architecture consisting of a
   FPGA that includes a Direct Memory Access (DMA) engine compatible with the
-  Xilinx PCI-Express core, a Linux driver for register access and high-level
+  Xilinx PCI-Express core, a Linux driver for register access, and high-level
   software to manage direct memory transfers using AMD's DirectGMA technology.
  Measurements with a Gen3 x8 link show a throughput of up to 6.4 GB/s. Our
   implementation is suitable for real-time DAQ system applications ranging
-  photon science and medical imaging to HEP experiment triggers.
+  from photon science and medical imaging to High Energy Physics (HEP) 
+  trigger systems.
 }
 
 
@@ -70,12 +71,12 @@ performance. In particular, latency becomes the most stringent specification if
 a time-deterministic feedback is required, \emph{e.g.} Low/High-level Triggers.
 
 To address these problems we propose a complete hardware/software stack
-architecture based on our own Direct Memory Access (DMA) design and integration
+architecture based on our own DMA design, and integration
 of AMD's DirectGMA technology into our processing pipeline. In our solution,
 PCI-express (PCIe) has been chosen as a data link between FPGA boards and the
 host computer. Due to its high bandwidth and modularity, PCIe quickly became the
 commercial standard for connecting high-throughput peripherals such as GPUs or
-solid state disks. Optical PCIe networks have been demonstrated
+solid state disks. Moreover, optical PCIe networks were demonstrated
 a decade ago~\cite{optical_pcie}, opening the possibility of using PCIe
 as a communication bus over long distances. In particular, in HEP DAQ systems,
 optical links are preferred over electrical ones because of their superior
@@ -274,7 +275,7 @@ conducted the following protocol: 1) the host issues continuous data transfers
 of a 4 KB buffer that is initialized with a fixed value to the FPGA using the
 \texttt{cl\-Enqueue\-Copy\-Buffer} call. 2) when the FPGA receives data in its
 input FIFO it moves it directly to the output FIFO which feeds the outgoing DMA
-engine thus pushing back the data back to the GPU. 3) At some point, the host
+engine thus pushing back the data to the GPU. 3) At some point, the host
 enables generation of data different from initial value which also starts an
 internal FPGA counter with 4 ns resolution. 4) When the generated data is
 received again at the FPGA, the counter is stopped. 5) The host program reads
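
To make the protocol above more concrete, a hedged Python/pyopencl sketch of the host side of steps 1 and 5 is given below. The buffer names, the iteration count, and the read_fpga_counter register read are hypothetical, and a plain device buffer stands in for the DirectGMA bus-addressable buffer that the real implementation exposes through the FPGA driver.

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# 4 KB payload initialized with a fixed value (step 1 of the protocol).
pattern = np.full(1024, 0xCAFEBABE, dtype=np.uint32)
src_buf = cl.Buffer(ctx, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR,
                    hostbuf=pattern)
# Stand-in for the bus-addressable buffer mapped to the FPGA input FIFO.
fpga_buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, pattern.nbytes)

# Step 1: continuously copy the buffer towards the FPGA; pyopencl issues a
# clEnqueueCopyBuffer call underneath, and the FPGA loops the data back
# through its outgoing DMA engine.
for _ in range(100000):
    cl.enqueue_copy(queue, fpga_buf, src_buf)
queue.finish()

# Step 5 (hypothetical): read the 4 ns resolution counter latched by the FPGA
# once the modified data pattern has completed the round trip.
# latency_us = read_fpga_counter() * 4e-3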
@@ -293,11 +294,11 @@ We developed a complete hardware and software solution that enables DMA
 transfers between FPGA-based readout boards and GPU computing clusters with
 reasonable performance characteristics. The net throughput is primarily limited
 by the PCIe bus, reaching 6.4 GB/s for a 256 B payload and surpassing our
-CPU-based data transfer. Moreover, by writing directly into GPU memory instead
+CPU-based data transfer. Furthermore, by writing directly into GPU memory instead
 of routing data through system main memory, the overall latency can be reduced
 by a factor of two allowing close massively parallel computation on GPUs.
 Moreover, the software solution that we proposed allows seamless multi-GPU
-processing of the incoming data due to the integration in our streamed computing
+processing of the incoming data, due to the integration in our streamed computing
 framework. This allows straightforward integration with different DAQ systems
 and introduction of custom data processing algorithms.
 
@@ -316,7 +317,7 @@ a single x16 device by using an external PCIe switch. With two cores operating
 in parallel, we foresee an increase in the data throughput by a factor of 2 (as
 demonstrated in~\cite{rota2015dma}). Further improvements are expected by
 generalizing the transfer mechanism and including Infiniband support besides the
-existing PCIe connect. This allows speeds of up to 290 Gbit/s and latencies as
+existing PCIe connection. This allows speeds of up to 290 Gbit/s and latencies as
 low as 0.5 \textmu s.
 
 Our goal is to develop a unique hybrid solution, based on commercial standards,