Motivation/Problem

Current-generation GPUs deliver several TFLOP/s of compute throughput, which
shifts the bottleneck to I/O in applications with high bandwidth and low
computational requirements. Applications that process data from external
sources such as a frontend FPGA are affected twice by this problem: the data
must first be transferred into main system memory via the CPU and then moved to
the GPU in a second transfer before it can be processed.

Method/Solution

To remedy this problem, we designed and implemented a system architecture
comprising a custom FPGA board with a flexible DMA transfer policy and a
heterogeneous compute framework that receives data through AMD's DirectGMA
OpenCL extension, which lets the FPGA write directly into GPU memory and
bypass main system memory.

Results

Conclusion

With the proposed system architecture, we can sustain the bandwidth
requirements of applications such as real-time tomographic image
reconstruction and signal analysis, with a peak FPGA-GPU throughput of XXX GB/s.