@@ -162,6 +162,15 @@ friendly interfaces with the custom logic with an input bandwidth of 7.45
GB/s. The user logic and the DMA engine are configured by the host through PIO
registers.
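
+To make the register interface concrete, the sketch below shows how a host
+program might drive such PIO registers through a memory-mapped BAR. It is a
+minimal sketch assuming a Linux host that exposes the BAR through sysfs; the
+register offsets, PCI device address, and control bit are hypothetical
+placeholders, not the actual register map of our design.
+\begin{verbatim}
+/* Sketch: configure the DMA engine via memory-mapped PIO registers.
+ * All offsets and the PCI device address are hypothetical examples. */
+#include <fcntl.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <sys/mman.h>
+#include <unistd.h>
+
+#define REG_DMA_CTRL 0x0000  /* hypothetical control register        */
+#define REG_BUF_ADDR 0x0010  /* hypothetical buffer-address register */
+
+int main(void)
+{
+    int fd = open("/sys/bus/pci/devices/0000:03:00.0/resource0",
+                  O_RDWR | O_SYNC);
+    if (fd < 0) { perror("open"); return 1; }
+    volatile uint32_t *bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
+                                  MAP_SHARED, fd, 0);
+    if (bar == MAP_FAILED) { perror("mmap"); return 1; }
+    /* Publish the physical address of one host buffer, then start the engine. */
+    bar[REG_BUF_ADDR / 4] = 0x12345000u;  /* example bus address    */
+    bar[REG_DMA_CTRL / 4] = 1u;           /* hypothetical start bit */
+    munmap((void *)bar, 4096);
+    close(fd);
+    return 0;
+}
+\end{verbatim}
+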
+\begin{figure}[t]
+  \centering
+  \includegraphics[width=0.5\textwidth]{figures/fpga-arch}
+  \caption{%
+    Block diagram of the FPGA architecture.
+  }
+  \label{fig:fpga-arch}
+\end{figure}
+
The physical addresses of the host's memory buffers are stored in an internal
memory and are dynamically updated by the driver or user, allowing highly
efficient zero-copy data transfers. The maximum size associated with each
@@ -248,15 +257,38 @@ Python.

\section{Results}

-We carried out performance measurements on a machine with an Intel Xeon
-E5-1630 at 3.7 GHz, Intel C612 chipset running openSUSE 13.1 with Linux
-3.11.10. The Xilinx VC709 evaluation board was plugged into one of the PCIe
-3.0 x8 slots. In case of FPGA-to-CPU data transfers, the software
-implementation is the one described in~\cite{rota2015dma}.
+We carried out performance measurements on two different setups, described in
+Table~\ref{table:setups}. In Setup 2, a low-end Supermicro X7SPA-HF-D525
+system was connected to a Netstor NA255A external PCIe enclosure. In both
+cases, a Xilinx VC709 evaluation board was plugged into a PCIe 3.0 x8 slot.
+In the case of FPGA-to-CPU data transfers, the software implementation is the
+one described in~\cite{rota2015dma}.

-\subsection{Throughput}
+\begin{table}[b]
+\centering
+\caption{Hardware used for throughput and latency measurements}
+\label{table:setups}
+\begin{tabular}{@{}lll@{}}
+  \toprule
+Component & Setup 1 & Setup 2 \\
+  \midrule
+CPU & Intel Xeon E5-1630 at 3.7 GHz & Intel Atom D525 \\
+Chipset & Intel C612 & Intel ICH9R Express \\
+GPU & AMD FirePro W9100 & AMD FirePro W9100 \\
+PCIe link (FPGA--system memory) & x8 Gen3 & x4 Gen1 \\
+PCIe link (FPGA--GPU) & x8 Gen3 & x8 Gen3 \\
+  \bottomrule
+\end{tabular}
+\end{table}
+
+\subsection{Throughput}
+
+% We repeated the FPGA-to-GPU measurements on a low-end Supermicro X7SPA-HF-D525
+% system based on an Intel Atom CPU. The results showed no significant difference
+% compared to the previous setup. Depending on the application and computing
+% requirements, this result makes smaller acquisition systems a cost-effective
+% alternative to larger workstations.

\begin{figure}
  \includegraphics[width=\textwidth]{figures/throughput}
@@ -267,31 +299,6 @@ implementation is the one described in~\cite{rota2015dma}.
  \label{fig:throughput}
\end{figure}

-% \begin{figure}
-% \centering
-% \begin{subfigure}[b]{.49\textwidth}
-% \centering
-% \includegraphics[width=\textwidth]{figures/throughput}
-% \caption{%
-% DMA data transfer throughput.
-% }
-% \label{fig:throughput}
-% \end{subfigure}
-% \begin{subfigure}[b]{.49\textwidth}
-% \includegraphics[width=\textwidth]{figures/latency}
-% \caption{%
-% Latency distribution.
-% % for a single 4 KB packet transferred
-% % from FPGA-to-CPU and FPGA-to-GPU.
-% }
-% \label{fig:latency}
-% \end{subfigure}
-% \caption{%
-% Measured throuhput for data transfers from FPGA to main memory
-% (CPU) and from FPGA to the global GPU memory (GPU).
-% }
-% \end{figure}
-
The measured results for the pure data throughput are shown in
\figref{fig:throughput} for transfers from the FPGA to the system's main
memory as well as to the GPU's global memory, as explained in \ref{sec:host}.
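+
+The throughput numbers were obtained by timing complete DMA transfers. A
+minimal timing harness in the spirit of our measurements is sketched below;
+\texttt{dma\_read} is a stand-in stub for the actual driver call, an
+assumption for illustration rather than the interface of~\cite{rota2015dma}.
+\begin{verbatim}
+/* Sketch: measure transfer throughput by timing one large read.
+ * dma_read() is a placeholder for the real FPGA-to-host DMA call. */
+#include <stdio.h>
+#include <string.h>
+#include <time.h>
+
+#define SIZE (64UL * 1024 * 1024)  /* 64 MB transfer */
+static unsigned char buf[SIZE];
+
+static size_t dma_read(void *dst, size_t n)
+{
+    memset(dst, 0xAB, n);  /* placeholder for the actual DMA transfer */
+    return n;
+}
+
+int main(void)
+{
+    struct timespec t0, t1;
+    clock_gettime(CLOCK_MONOTONIC, &t0);
+    size_t n = dma_read(buf, SIZE);
+    clock_gettime(CLOCK_MONOTONIC, &t1);
+    double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
+    printf("%.2f GB/s\n", n / s / 1e9);
+    return 0;
+}
+\end{verbatim}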
@@ -304,11 +311,7 @@ is approaching slowly 100 MB/s. From there on, the throughput increases up to
6.4 GB/s when PCIe bus saturation sets in at a data size of about 1 GB. The CPU
throughput saturates earlier, but the maximum throughput is 6.6 GB/s.
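+
+These figures are close to the practical limit of the link. As a rough
+cross-check, the raw capacity of a PCIe 3.0 x8 link with its 128b/130b
+encoding is
+\[
+  8~\mathrm{lanes} \times 8~\mathrm{GT/s} \times \frac{128}{130} \times
+  \frac{1}{8}~\mathrm{B/bit} \approx 7.9~\mathrm{GB/s},
+\]
+of which TLP headers and flow-control traffic consume a fraction that depends
+on the maximum payload size, so the measured 6.4--6.6 GB/s is close to what
+the bus can deliver.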

-% We repeated the FPGA-to-GPU measurements on a low-end Supermicro X7SPA-HF-D525
-% system based on an Intel Atom CPU. The results showed no significant difference
-% compared to the previous setup. Depending on the application and computing
-% requirements, this result makes smaller acquisition system a cost-effective
-% alternative to larger workstations.
+

% \begin{figure}
% \includegraphics[width=\textwidth]{figures/intra-copy}
@@ -340,14 +343,20 @@ latency.

\subsection{Latency}
-
-\begin{figure}
- \includegraphics[width=\textwidth]{figures/latency-hist}
- \caption{%
- Latency distribution for a single 1024 B packet transferred from FPGA to
- GPU memory and to main memory.
- }
- \label{fig:latency-distribution}
+\begin{figure}[t]
+  \centering
+  \begin{subfigure}[b]{.8\textwidth}
+    \centering
+    \includegraphics[width=\textwidth]{figures/latency}
+    \caption{Latency as a function of transfer size.}
+    \label{fig:latency_vs_size}
+  \end{subfigure}
+  \begin{subfigure}[b]{.8\textwidth}
+    \centering
+    \includegraphics[width=\textwidth]{figures/latency-hist}
+    \caption{Latency distribution.}
+    \label{fig:latency_hist}
+  \end{subfigure}
+  \caption{%
+    Latency of DMA data transfers from FPGA to main memory (CPU) and to GPU
+    memory (GPU).
+  }
+  \label{fig:latency}
\end{figure}

For HEP experiments, low latencies are necessary to react in a reasonable time
|