
latency graph label correction, some commas and minor word changes

Luis Ardila, 8 years ago
commit 5739cb477d
2 changed files with 11 additions and 10 deletions
  1. data/latency-hist.py (+2 -2)
  2. paper.tex (+9 -8)

+ 2 - 2
data/latency-hist.py

@@ -17,8 +17,8 @@ plt.rc('font', **dict(family='serif'))
 plt.figure(figsize=(4, 3))
 
 # divide by 2 for one-way latency
-plt.hist(gpu_data / 2, bins=100, normed=False, label='CPU')
-plt.hist(cpu_data / 2, bins=100, normed=False, label='GPU')
+plt.hist(gpu_data / 2, bins=100, normed=False, label='GPU')
+plt.hist(cpu_data / 2, bins=100, normed=False, label='CPU')
 
 plt.xlabel(u'Latency in \u00b5s')
 plt.ylabel('Frequency')
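
As context for this label fix, a self-contained sketch of the corrected plot on a current matplotlib release might look as follows. The random input arrays, the seed, and the output filename are placeholders (the repository loads gpu_data and cpu_data from measurement files), and the deprecated normed keyword is replaced by its successor density; this is an illustrative sketch, not the repository's script.

import numpy as np
import matplotlib.pyplot as plt

# Placeholder latencies in microseconds; the real script loads measured
# round-trip values for the GPU and CPU paths from data files.
rng = np.random.default_rng(0)
gpu_data = rng.normal(4.0, 0.3, 10000)
cpu_data = rng.normal(8.0, 0.5, 10000)

plt.rc('font', **dict(family='serif'))
plt.figure(figsize=(4, 3))

# divide by 2 for one-way latency; labels now match the data they describe
plt.hist(gpu_data / 2, bins=100, density=False, label='GPU')
plt.hist(cpu_data / 2, bins=100, density=False, label='CPU')

plt.xlabel(u'Latency in \u00b5s')
plt.ylabel('Frequency')
plt.legend()
plt.tight_layout()
plt.savefig('latency-hist.pdf')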

+ 9 - 8
paper.tex

@@ -33,11 +33,12 @@
   performance computing applications. To connect a fast data acquisition stage
   with a GPU's processing power, we developed an architecture consisting of a
   FPGA that includes a Direct Memory Access (DMA) engine compatible with the
-  Xilinx PCI-Express core, a Linux driver for register access and high-level
+  Xilinx PCI-Express core, a Linux driver for register access, and high-level
   software to manage direct memory transfers using AMD's DirectGMA technology.
  Measurements with a Gen3 x8 link show a throughput of up to 6.4 GB/s. Our
   implementation is suitable for real-time DAQ system applications ranging
-  photon science and medical imaging to HEP experiment triggers.
+  from photon science and medical imaging to High Energy Physics (HEP) 
+  trigger systems.
 }
 
 
@@ -70,12 +71,12 @@ performance. In particular, latency becomes the most stringent specification if
 a time-deterministic feedback is required, \emph{e.g.} Low/High-level Triggers.
 
 To address these problems we propose a complete hardware/software stack
-architecture based on our own Direct Memory Access (DMA) design and integration
+architecture based on our own DMA design, and integration
 of AMD's DirectGMA technology into our processing pipeline. In our solution,
 PCI-express (PCIe) has been chosen as a data link between FPGA boards and the
 host computer. Due to its high bandwidth and modularity, PCIe quickly became the
 commercial standard for connecting high-throughput peripherals such as GPUs or
-solid state disks. Optical PCIe networks have been demonstrated
+solid state disks. Moreover, optical PCIe networks were demonstrated
 a decade ago~\cite{optical_pcie}, opening the possibility of using PCIe
 as a communication bus over long distances. In particular, in HEP DAQ systems,
 optical links are preferred over electrical ones because of their superior
@@ -274,7 +275,7 @@ conducted the following protocol: 1) the host issues continuous data transfers
 of a 4 KB buffer that is initialized with a fixed value to the FPGA using the
 \texttt{cl\-Enqueue\-Copy\-Buffer} call. 2) when the FPGA receives data in its
 input FIFO it moves it directly to the output FIFO which feeds the outgoing DMA
-engine thus pushing back the data back to the GPU. 3) At some point, the host
+engine thus pushing back the data to the GPU. 3) At some point, the host
 enables generation of data different from initial value which also starts an
 internal FPGA counter with 4 ns resolution. 4) When the generated data is
 received again at the FPGA, the counter is stopped. 5) The host program reads
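
To make the protocol above more concrete, a hedged Python/pyopencl sketch of the host side of steps 1 and 5 is given below. The buffer names, the iteration count, and the read_fpga_counter register read are hypothetical, and a plain device buffer stands in for the DirectGMA bus-addressable buffer that the real implementation exposes through the FPGA driver.

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# 4 KB payload initialized with a fixed value (step 1 of the protocol).
pattern = np.full(1024, 0xCAFEBABE, dtype=np.uint32)
src_buf = cl.Buffer(ctx, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR,
                    hostbuf=pattern)
# Stand-in for the bus-addressable buffer mapped to the FPGA input FIFO.
fpga_buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, pattern.nbytes)

# Step 1: continuously copy the buffer towards the FPGA; pyopencl issues a
# clEnqueueCopyBuffer call underneath, and the FPGA loops the data back
# through its outgoing DMA engine.
for _ in range(100000):
    cl.enqueue_copy(queue, fpga_buf, src_buf)
queue.finish()

# Step 5 (hypothetical): read the 4 ns resolution counter latched by the FPGA
# once the modified data pattern has completed the round trip.
# latency_us = read_fpga_counter() * 4e-3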
@@ -293,11 +294,11 @@ We developed a complete hardware and software solution that enables DMA
 transfers between FPGA-based readout boards and GPU computing clusters with
 reasonable performance characteristics. The net throughput is primarily limited
 by the PCIe bus, reaching 6.4 GB/s for a 256 B payload and surpassing our
-CPU-based data transfer. Moreover, by writing directly into GPU memory instead
+CPU-based data transfer. Furthermore, by writing directly into GPU memory instead
 of routing data through system main memory, the overall latency can be reduced
 by a factor of two allowing close massively parallel computation on GPUs.
 Moreover, the software solution that we proposed allows seamless multi-GPU
-processing of the incoming data due to the integration in our streamed computing
+processing of the incoming data, due to the integration in our streamed computing
 framework. This allows straightforward integration with different DAQ systems
 and introduction of custom data processing algorithms.
 
@@ -316,7 +317,7 @@ a single x16 device by using an external PCIe switch. With two cores operating
 in parallel, we foresee an increase in the data throughput by a factor of 2 (as
 demonstrated in~\cite{rota2015dma}). Further improvements are expected by
 generalizing the transfer mechanism and including Infiniband support besides the
-existing PCIe connect. This allows speeds of up to 290 Gbit/s and latencies as
+existing PCIe connection. This allows speeds of up to 290 Gbit/s and latencies as
 low as 0.5 \textmu s.
 
 Our goal is to develop a unique hybrid solution, based on commercial standards,