J Imaging. 2019 Jan 13;5(1):16.

doi: 10.3390/jimaging5010016.

FPGA-Based Processor Acceleration for Image Processing Applications

Fahad Siddiqui¹, Sam Amiri², Umar Ibrahim Minhas¹, Tiantai Deng¹, Roger Woods¹, Karen Rafferty¹, Daniel Crookes¹

Affiliations

¹ School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast, Belfast BT7 1NN, UK.
² School of Computing, Electronics and Maths, Coventry University, Coventry CV1 5FB, UK.

PMID: 34465705
PMCID: PMC8320866
DOI: 10.3390/jimaging5010016

Fahad Siddiqui et al. J Imaging. 2019.

Abstract

FPGA-based embedded image processing systems offer considerable computing resources but present programming challenges when compared to software systems. This paper describes an approach based on an FPGA-based soft processor called the Image Processing Processor (IPPro), which can operate at up to 337 MHz on a high-end Xilinx FPGA family, and gives details of the dataflow-based programming environment. The approach is demonstrated for a k-means clustering operation and a traffic sign recognition application, both of which have been prototyped on an Avnet Zedboard that has a Xilinx Zynq-7000 system-on-chip (SoC). A number of parallel dataflow mapping options were explored, giving a speed-up of 8 times for the k-means clustering using 16 IPPro cores, and a speed-up of 9.6 times for the morphology filter operation of the traffic sign recognition using 16 IPPro cores, compared to their equivalent ARM-based software implementations. We show that for k-means clustering, the 16 IPPro cores implementation is 57, 28 and 1.7 times more power efficient (fps/W) than the ARM Cortex-A7 CPU, NVIDIA GeForce GTX980 GPU and ARM Mali-T628 embedded GPU, respectively.
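
To make the k-means workload concrete, the following is a minimal Python sketch of one iteration: the per-pixel distance and nearest-centroid assignment, followed by the centroid update. It is an illustration only, not the authors' RVC-CAL or IPPro code. The function names, the use of squared distance on scalar pixel values, and the sample data are all assumptions made for this example.

```python
def squared_distance(pixel, centroid):
    """Squared distance between a scalar pixel value and a centroid (assumed metric)."""
    diff = pixel - centroid
    return diff * diff


def assign_clusters(pixels, centroids):
    """Label each pixel with the index of its nearest centroid.

    Every pixel is handled independently, which is what makes this
    step a natural candidate for data-parallel mapping across cores.
    """
    labels = []
    for p in pixels:
        nearest = min(range(len(centroids)),
                      key=lambda k: squared_distance(p, centroids[k]))
        labels.append(nearest)
    return labels


def update_centroids(pixels, labels, centroids):
    """Move each centroid to the mean of its members; keep it if it has none."""
    updated = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(pixels, labels) if label == k]
        updated.append(sum(members) / len(members) if members else old)
    return updated


# Hypothetical sample data: six grey-level pixels and three starting centroids.
pixels = [10, 12, 200, 210, 95, 105]
centroids = [0.0, 100.0, 255.0]

labels = assign_clusters(pixels, centroids)
centroids = update_centroids(pixels, labels, centroids)
```

In a full implementation the two steps would repeat until the labels stop changing or an iteration limit is reached.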

Keywords: FPGA; hardware acceleration; heterogeneous computing; image processing; processor architectures.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

**Figure 1**
Bandwidth/memory distribution in a Xilinx Virtex-7 FPGA, highlighting how bandwidth and computation improve as we near the datapath parts of the FPGA.

**Figure 2**
Illustration of possible data and task parallel decomposition of a dataflow algorithm found in image processing designs, where the number of rows indicates the level of parallelism.

**Figure 3**
A brief description of the design flow of a hardware and software heterogeneous system highlighting key features. More detail of the flow is contained in reference [11].

**Figure 4**
(a) Impact of DSP48E1 configurations on the maximum achievable clock frequency for different speed grades of Kintex-7 FPGAs: fully pipelined without (NOPATDET) and with (PATDET) the PATtern DETector; multiply with no MREG (MULT_NOMREG) and with pattern detector (MULT_NOMREG_PATDET); and multiply with pre-adder and no ADREG (PREADD_MULT_NOADREG). (b) Impact of BRAM configurations on the maximum achievable clock frequency of Artix-7, Kintex-7 and Virtex-7 FPGAs for single and true-dual port RAM configurations.

**Figure 5**
A range of dataflow models taken from [24,25]. (a) DFG node without internal storage, called configuration ①; (b) DFG actor with internal storage t1 and constant i, called configuration ②; (c) Programmable DFG actor with internal storage t1, t2 and t3 and constants i and j, called configuration ③.

**Figure 6**
FPGA datapath models resulting from Figure 5. (a) Programmable ALU corresponding to configuration ①; (b) Fine-grained processor corresponding to configuration ②; (c) Coarse-grained processor corresponding to configuration ③.

**Figure 7**
Impact of the various datapath models ①, ②, ③ on $f_{max}$ across Xilinx Artix-7, Kintex-7 and Virtex-7 FPGA families.

**Figure 8**
Block diagram of FPGA-based soft core Image Processing Processor (IPPro) datapath highlighting where relevant the fixed Xilinx FPGA resources utilised by the approach.

**Figure 9**
System architecture of IPPro-based hardware acceleration highlighting data distribution and control infrastructure, FIFO configuration and Finite-State-Machine control.

**Figure 10**
High-level implementation of k-means clustering algorithm: (a) Graphical view of Orcc dataflow network; (b) Part of dataflow network including the connections; (c) Part of Distance.cal file showing distance calculation in RVC-CAL where two pixels are received through an input FIFO channel, processed and sent to an output FIFO channel; (d) Compiled IPPro assembly code of Distance.cal.

**Figure 11**
IPPro-based hardware accelerator designs to explore and analyse the impact of parallelism on area and performance, based on single-core IPPro ①, eight-way parallel SIMD IPPro ②, parallel dual-core IPPro ③ and combined dual-core eight-way SIMD IPPro ④.

**Figure 12**
Section execution times and ratios for each stage of the traffic sign recognition algorithm.

**Figure 13**
(a) The simplified IPPro assembly code of 3 × 3 dilation operation. (b) The output result of implemented design.

**Figure 14**
Stage-wise comparison of traffic sign recognition acceleration using ARM-based and IPPro-based approaches.
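
The 3 × 3 dilation named in Figure 13 can be illustrated with a short Python sketch. It is not the IPPro assembly shown in the paper. It assumes a full 3 × 3 square structuring element, treats dilation as the maximum over each pixel's neighbourhood, and ignores neighbours that fall outside the image; the function name `dilate_3x3` and the sample image are hypothetical.

```python
def dilate_3x3(image):
    """Dilate a 2-D image (list of rows) with a 3x3 square structuring element.

    Each output pixel is the maximum of the input pixel and its in-bounds
    neighbours; out-of-bounds neighbours are skipped (an assumed border policy).
    """
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            best = image[r][c]
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        best = max(best, image[rr][cc])
            out[r][c] = best
    return out


# Hypothetical 5x5 binary image with a single foreground pixel in the centre.
image = [[0] * 5 for _ in range(5)]
image[2][2] = 1
dilated = dilate_3x3(image)
```

With this input, the single foreground pixel grows into a 3 × 3 block of ones centred on the same position.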

References

1. Conti F., Rossi D., Pullini A., Loi I., Benini L. PULP: A Ultra-Low Power Parallel Accelerator for Energy-Efficient and Flexible Embedded Vision. J. Signal Process. Syst. 2016;84:339–354. doi: 10.1007/s11265-015-1070-9.
2. Lamport L. The Parallel Execution of DO Loops. Commun. ACM. 1974;17:83–93. doi: 10.1145/360827.360844.
3. Markov I.L. Limits on Fundamental Limits to Computation. Nature. 2014;512:147–154. doi: 10.1038/nature13570.
4. Bacon D.F., Rabbah R., Shukla S. FPGA Programming for the Masses. ACM Queue Mag. 2013;11:40–52. doi: 10.1145/2436256.2436271.
5. Gort M., Anderson J. Design re-use for compile time reduction in FPGA high-level synthesis flows; Proceedings of the IEEE International Conference on Field-Programmable Technology (FPT); Shanghai, China. 10–12 December 2014; pp. 4–11.

Grants and funding

EP/K009583/1/Engineering and Physical Sciences Research Council
