Motivation/Problem: Current-generation GPUs deliver several TFLOP/s of compute throughput, so applications with high bandwidth but low computational requirements become I/O-bound. Applications that process data from external sources, such as a front-end FPGA, are affected twice by this problem: the data must first be transferred into main system memory via CPU transfers and only then moved to the GPU for final processing in a second transfer. Method/Solution: To remedy this problem, we designed and implemented a system architecture comprising a custom FPGA board with a flexible DMA transfer policy and a heterogeneous compute framework that receives data through AMD's DirectGMA OpenCL extension. Results/Conclusion: With the proposed system architecture, we are able to sustain the bandwidth requirements of various applications, such as real-time tomographic image reconstruction and signal analysis, with a peak FPGA-GPU throughput of XXX GB/s.