Table of Contents

Cover

Title

Preface

1 Introduction of Real-time Image Processing

1.1. General image processing presentation
1.2. Real-time image processing

2 Hardware Architectures for Real-time Processing

2.1. History of image processing hardware platforms
2.2. General-purpose processors
2.3. Digital signal processors
2.4. Graphics processing units
2.5. Field programmable gate arrays
2.6. SW/HW codesign of real-time image processing
2.7. Image processing development environment description
2.8. Comparison and discussion

3 Rapid Prototyping of Parallel Reconfigurable Instruction Set Processor for Efficient Real-Time Image Processing

3.1. Context and problematic
3.2. Related works
3.3. Design exploration framework
3.4. Case study: RISP conception and synthesis for spatial transforms
3.5. Hardware implementation of spatial transforms on an FPGA-based platform
3.6. Discussion and conclusion

4 Exploration of High-Level Synthesis Technique

4.1. Introduction of HLS technique
4.2. Vivado_HLS process presentation
4.3. Case of HLS application: FPGA implementation of an improved skin lesion assessment method
4.4. Discussion

5 CDMS4HLS: A Novel Source-To-Source Compilation Strategy for HLS-Based FPGA Design

5.1. S2S compiler-based HLS design framework
5.2. CDMS4HLS compilation process description
5.3. CDMS4HLS compilation process evaluation
5.4. Discussion

6 Embedded Implementation of VHR Satellite Image Segmentation

6.1. LSM description
6.2. Implementation and optimization presentation
6.3. Experiment evaluation
6.4. Discussion and conclusion

7 Real-time Image Processing with Very High-level Synthesis

7.1. VHLS motivation
7.2. Image processing from Matlab to FPGA-RTL
7.3. VHLS process presentation
7.4. VHLS implementation issues
7.5. Future work for real-time image processing with VHLS

Bibliography

Index

End User License Agreement

List of Tables

2 Hardware Architectures for Real-time Processing

Table 2.1. Compatibility evaluation of fundamental development environments for different hardware devices

3 Rapid Prototyping of Parallel Reconfigurable Instruction Set Processor for Efficient Real-Time Image Processing

Table 3.1. IR assembly and parallelism extraction results: acceleration corresponds to the number of sequential cycles divided by the number of parallel cycles for one block of 2D-DCT processing
Table 3.2. Used hardware resources and operating frequency of each RISP processor
Table 3.3. Macroblock distribution of four video sequences using the HEVC decoding process
Table 3.4. IDCT processing time in seconds for one frame of different video formats: these results were obtained using five soft-core processors (four RISPs and one MicroBlaze). Bold values indicate that the processing speed is equal to or greater than 25 frames/s
Table 3.5. IDCT processing used hardware resources and operating frequency: these results were obtained by using five soft-core processors (four RISPs and one MicroBlaze)

4 Exploration of High-Level Synthesis Technique

Table 4.1. Characteristic evaluation of five HLS tools
Table 4.2. Operations-cores mapping of the scheduling schematic in Figure 4.5
Table 4.3. Size of search spaces for skin parameters (see [JOL 13])
Table 4.4. Necessary instruction number comparison between original and optimized KM functions
Table 4.5. Population parameter configuration for KMGA and HCR-KMGA
Table 4.6. Implementation acceleration ratio comparison: the clock frequencies of the CPU and FPGA are, respectively, 2.4 GHz and 50 MHz
Table 4.7. Used hardware resources estimation of FPGA-KMGA and FPGA-HCR-KMGA implementations on Virtex7-XC7VX1140T of Xilinx

5 CDMS4HLS: A Novel Source-To-Source Compilation Strategy for HLS-Based FPGA Design

Table 5.1. Symbolic expression manipulation strategies
Table 5.2. Performances and resources consumption evaluation (FI, function inline; LM, loop manipulation; SEM, symbolic expression manipulation; LU, loop unwinding; MM, memory manipulation)

6 Embedded Implementation of VHR Satellite Image Segmentation

Table 6.1. Optimization evaluation: original corresponds to fpga_hls_ori implementation while LU corresponds to fpga_hls_opt implementation
Table 6.2. Running time comparison (s/pixel)

List of Figures

1 Introduction of Real-time Image Processing

Figure 1.1. Overview of the typical image acquisition process (see [MOE 12])
Figure 1.2. Lena (gray-level image) and landscape (color image). For a color version of the figure, see www.iste.co.uk/li/image.zip
Figure 1.3. Two multispectral image cubes. For a color version of the figure, see www.iste.co.uk/li/image.zip
Figure 1.4. Diversity of operations in image processing: typical processing chain (top) and decrease in amount of data across processing chain (bottom)

2 Hardware Architectures for Real-time Processing

Figure 2.1. Typical architecture of GPP (left) and GPP’s processing core (right)
Figure 2.2. Typical Harvard architecture and block diagram of VLIW architecture
Figure 2.3. Typical architecture of GPU in comparison with CPU. DRAM, dynamic random access memory; ALU, arithmetic logic unit
Figure 2.4. Typical architecture of FPGA
Figure 2.5. Typical SW/HW codesign framework depending on task partition modes: horizontal partition flow (left) and vertical partition flow (right)
Figure 2.6. Design times versus application performance with Register Transfer Level (RTL) design entry (see [XIL 13a])

3 Rapid Prototyping of Parallel Reconfigurable Instruction Set Processor for Efficient Real-Time Image Processing

Figure 3.1. Overview of the rapid prototyping framework
Figure 3.2. Illustration of the generic kernel structure of RISP
Figure 3.3. Eight-point DCT partial butterfly flow graph
Figure 3.4. Parallel execution graph generated for the 1D-DCT partial butterfly operation: each operation is realized using two operands from two registers and memorizes the result in a register. For a color version of the figure, see www.iste.co.uk/li/image.zip
Figure 3.5. Necessary cycle number of each RISP to perform spatial transform of one block. Four RISPs correspond to MX, R4, R8 and PB. Block size is equal to 4 × 4, 8 × 8, 16 × 16 and 32 × 32. For a color version of the figure, see www.iste.co.uk/li/image.zip
Figure 3.6. Products of normalized slice numbers and normalized cycle numbers: a small product value means a better area/speed ratio. For a color version of the figure, see www.iste.co.uk/li/image.zip
Figure 3.7. Four classical test video sequences: Foreman (CIF), Race horse (WQVGA), Tennis (HDTV) and Akiyo (QCIF). For a color version of the figure, see www.iste.co.uk/li/image.zip
Figure 3.8. Block diagram of hardware implementations of spatial transform on the ML605 evaluation board

4 Exploration of High-Level Synthesis Technique

Figure 4.1. Comparison of RTL- and HLS-based design flows by using Gajski-Kuhn’s Y-chart: full lines indicate the automated cycles, while dotted lines indicate the manual cycles
Figure 4.2. Design time versus application performance with Vivado_HLS compiler
Figure 4.3. HLS control and datapath extraction example: a) input C code source, b) control extraction, c) operation extraction and d) generated control and datapath behavior. For a color version of the figure, see www.iste.co.uk/li/image.zip
Figure 4.4. HLS scheduling and binding flow
Figure 4.5. Scheduling of the example in Figure 4.3(a). For a color version of the figure, see www.iste.co.uk/li/image.zip
Figure 4.6. ASCLEPIOS system illustration. For a color version of the figure, see www.iste.co.uk/li/image.zip
Figure 4.7. Reflectance spectrums S_reflectance at a single pixel formed from the reflectance measured: blue and red represent the pixel’s position and green represents the different wavelength values. For a color version of the figure, see www.iste.co.uk/li/image.zip
Figure 4.8. Population data structure: the skin parameters are f_mel, D_epi, f_blood, C_oxy and D_dermis. For a color version of the figure, see www.iste.co.uk/li/image.zip
Figure 4.9. Overall genetic algorithm procedure for KM model inversion
Figure 4.10. Overall architecture of the prediction function optimization algorithm (PFOA)
Figure 4.11. Search space prediction of PFOA: x_n and f_n are the parameter and fitness value of the nth iteration’s best individual, (x₀’, f₀’) is the local optimum and (x₀, f₀) is the global optimum of the optimization function
Figure 4.12. Relationships of individual data
Figure 4.13. Development process for the HCR-KMGA’s FPGA implementation
Figure 4.14. Software implementation of HCR-KMGA algorithm for the FPGA device
Figure 4.15. N-threads KMGA architecture using POSIX threads: S_{reflectance_n} is the reflectance spectrum at (i_n, j_n) in the work area of the nth thread and P_{pixel_n} is its retrieved skin parameters. For a color version of the figure, see www.iste.co.uk/li/image.zip
Figure 4.16. Multispectral image measured by ASCLEPIOS from 450 to 780 nm with a step of 10 nm (top), and simulation results of five maps obtained by KMGA (middle) and by HCR-KMGA (bottom). These five maps (from left to right) consist of melanin concentration, epidermis thickness, volume blood fraction, oxygen saturation and dermis thickness. For a color version of the figure, see www.iste.co.uk/li/image.zip

5 CDMS4HLS: A Novel Source-To-Source Compilation Strategy for HLS-Based FPGA Design

Figure 5.1. Manual and source-to-source compiler-based HLS design framework
Figure 5.2. CDMS4HLS compilation process
Figure 5.3. Comparison between the source code before and after function inline: it is assumed that opt_1 and opt_2 consume one cycle and opt_3 two cycles, and the operations of func_A and func_B are shown in red and blue, respectively. For a color version of the figure, see www.iste.co.uk/li/image.zip
Figure 5.4. CDMS4HLS loop manipulation illustration. For a color version of the figure, see www.iste.co.uk/li/image.zip
Figure 5.5. Array transformation: [] refers to arrays and < > refers to the vector formed by multiple elements
Figure 5.6. Implementing flow with different code optimization methods, including PolyComp, manual directive configuration within Vivado_HLS and CDMS4HLS
Figure 5.7. Latency speedup comparison. For a color version of the figure, see www.iste.co.uk/li/image.zip

6 Embedded Implementation of VHR Satellite Image Segmentation

Figure 6.1. The D2Q5 lattice Boltzmann method (LBM) model
Figure 6.2. HLS-based design flow: “*” refers to the necessary manual cycles
Figure 6.3. Block hierarchy comparison: a) original version, b) debugged version and c) function inline version. For a color version of the figure, see www.iste.co.uk/li/image.zip
Figure 6.4. Finite state machine (FSM) transition within loop manipulation: “L*” refers to the operations covered in the present state, in which “*” indicates the line number in Algorithm 6.1, W × L = 948 × 450 is the image dimension and ITS is the maximum iteration number defined as five
Figure 6.5. Scheduling of pow4_hls: a) original-to-symbolic expression manipulation (SEM) code transformation; b) scheduling comparison
Figure 6.6. Impact of the parameter α on the accuracy of the segmentation results (see [BAL 14]). For a color version of the figure, see www.iste.co.uk/li/image.zip
Figure 6.7. Impact of the parameter β on the accuracy of the segmentation results (see [BAL 14]). For a color version of the figure, see www.iste.co.uk/li/image.zip
Figure 6.8. Original images taken by the IKONOS satellite and segmentation results: a) original image of Uxmal; b) original image of volcano; c) segmentation result of Uxmal and d) segmentation result of volcano. For a color version of the figure, see www.iste.co.uk/li/image.zip
Figure 6.9. Original images taken by the GeoEye-1 satellite and segmentation results: a) original image of ice sheet; b) original image of Santorin; c) segmentation result of ice sheet and d) segmentation result of Santorin. For a color version of the figure, see www.iste.co.uk/li/image.zip
Figure 6.10. Running time and latency acceleration (expressed in LOG) improvement of different optimized implementations. For a color version of the figure, see www.iste.co.uk/li/image.zip

7 Real-time Image Processing with Very High-level Synthesis

Figure 7.1. Flowchart of the VHLS design
Figure 7.2. Source-to-source compilation strategies for VHLS
Figure 7.3. VHLS-based work flow

First published 2017 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:

ISTE Ltd
27-37 St George’s Road
London SW19 4EU
UK
www.iste.co.uk

John Wiley & Sons, Inc.
111 River Street
Hoboken, NJ 07030
USA
www.wiley.com

The rights of Chao Li, Souleymane Balla-Arabe and Fan Yang-Song to be identified as the authors of this work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.

Library of Congress Control Number: 2017948985

British Library Cataloguing-in-Publication Data

A CIP record for this book is available from the British Library

ISBN 978-1-78630-094-2

Preface

In image processing, many applications in domains such as medical technology, robotics and transmission require real-time execution. Real-time image processing has recently made considerable progress. Technological developments allow engineers to integrate more complex algorithms with large data volumes onto embedded systems, and to produce a series of new, sophisticated electronic architectures at an affordable price. At the same time, industrial and academic researchers have proposed new methods and provided new tools to facilitate the realization of real-time image processing at different levels. An in-depth, educational and synthetic survey of this topic is therefore needed. To this end, we present electronic platforms, methods and strategies.

This book consists of seven chapters, ranging from a fundamental conceptual introduction to real-time image processing to future perspectives in this area. We describe hardware architectures and different optimization strategies for real-time purposes. The optimization strategies are covered through a survey of software and hardware co-design tools at different levels. Two real-time applications are presented in detail to illustrate the proposed approaches.

The main original contributions of this book are: (1) algorithm-architecture mapping: we select methods and tools that treat the application and its electronic platform simultaneously in order to perform fast and optimal design space exploration (DSE); (2) each approach is illustrated by concrete examples; and (3) two of the chosen algorithms have only recently been put forward in their domain.

This book is written primarily for readers who are familiar with the basics of image processing and want to implement a target image processing design on different electronic platforms to accelerate computation. It does so by presenting the techniques and approaches step by step, treating the algorithm and the architecture jointly, and combining descriptions of concepts with illustrative examples. It is intended for both software engineers and hardware engineers.

This book is also suitable for readers who are familiar with programming and with applying embedded systems to other problems, and who are considering image processing applications. Much of the focus and many of the examples are taken from image processing applications. Sufficient detail is given to make the algorithms and their implementation clear.

Chao LI
Souleymane BALLA-ARABE
Fan YANG
August 2017