An Optical Flow Estimation Processor With Image Tiling For Action Recognition In Mobile Devices
Abstract
A real-time optical flow estimation (OFE) processor is proposed for action recognition in mobile devices. The OFE requires a large number of external memory accesses (EMAs) and matrix computations, making it difficult to deploy on mobile devices with real-time constraints. For real-time OFE on mobile devices, this paper proposes two key features, both of which reduce the required memory bandwidth and the number of computations:
- Tile-based hierarchical OFE, enabling intermediate data to be processed within 303.8 KB of on-chip memory without external memory access,
- Early termination with a background detection unit to eliminate redundant matrix computations for zero-optical-flow regions.
Together, the proposed features reduce external memory bandwidth and computation by 99.7% and 50.7%, respectively. The proposed 13.6 mm² OFE processor is implemented in 65 nm CMOS technology, and it achieves real-time OFE with 99.4 frames-per-second (fps) throughput for QVGA (320×240) image resolution, whose result can be successfully used for action recognition.
Index Terms—Optical flow estimation, action recognition, tile-based processing, early termination, real-time mobile user interface, high-throughput ASIC.
INTRODUCTION
Recently, action recognition has been highlighted as a next-generation user interface in mobile devices such as pet robots for human-computer interaction (HCI). The most important requirement for action recognition on mobile devices is high recognition accuracy, so several previous works used deep neural networks with RGB images as their input. However, to make action recognition robust to illumination changes and background clutter, motion information is needed in addition to RGB images. For example, action recognition accuracy is enhanced by about 14% when motion information is used along with RGB images on the UCF-101 dataset.
To obtain this motion information, optical flow estimation (OFE) is broadly used. OFE computes a set of vectors describing the movement of objects or patterns across an image sequence. Fig. 1(a) shows the overall algorithm of Brox Flow, an OFE algorithm that has been shown to increase action recognition accuracy. The OFE consists of two stages:
- Scale space generation and
- Flow generation.
First, the scale space generation stage generates multiple images at different scales (a Gaussian pyramid) to achieve scale invariance. Second, in the flow generation stage, the conjugate gradient descent (CGD) algorithm estimates the final optical flow by minimizing a cost function that compares the previous frame with the current frame at every scale in the pyramid.
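As a software illustration of the scale space generation stage, the following Python sketch builds a small Gaussian pyramid by repeated smoothing and 2× subsampling. The kernel width, scale count, and function names are illustrative choices for this sketch, not the processor's actual parameters.

```python
import numpy as np

def gaussian_kernel(sigma=1.0, radius=2):
    """1-D Gaussian kernel, normalized to sum to 1."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    return k / k.sum()

def blur(img, kernel):
    """Separable 2-D Gaussian blur via two 1-D convolutions."""
    out = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode='same'), 1, img)
    out = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode='same'), 0, out)
    return out

def gaussian_pyramid(img, num_scales=4):
    """Progressively smaller, smoothed copies of the input image."""
    kernel = gaussian_kernel()
    pyramid = [img]
    for _ in range(num_scales - 1):
        smoothed = blur(pyramid[-1], kernel)
        pyramid.append(smoothed[::2, ::2])  # subsample every other row/column
    return pyramid

frame = np.random.rand(240, 320)            # QVGA-sized test frame
print([s.shape for s in gaussian_pyramid(frame)])
# [(240, 320), (120, 160), (60, 80), (30, 40)]
```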
However, it is difficult to implement real-time OFE in mobile devices due to the massive amount of external memory access and OFE's huge computational cost. As shown in Fig. 1(b), real-time (30 fps) OFE on QVGA (320×240) images requires 18.1 GB/s of external memory bandwidth and 35.7 GFLOPS of computational throughput. Over 99.3% of the external memory bandwidth and 85% of the computation are occupied by the CGD iterations in the flow generation stage, because the CGD performs many large matrix multiplications (over 320 × 240 elements for a QVGA image) in every iteration. In this paper, we propose a real-time OFE processor with two key features.
- A tile-based hierarchical optical flow estimation algorithm (THOFE) is proposed to mitigate the external memory bandwidth requirement. The THOFE divides input images into several tiles and reuses the intermediate data on-chip during the CGD as much as possible. Thus, it reduces the required external memory bandwidth to 175.8 MB/s and the required on-chip memory capacity from 4 MB to 303.8 KB.
- A background detection unit with an early termination algorithm is proposed to reduce the required computational cost by skipping image patches with zero optical flow, reducing the matrix computations by 50.7%. With these two features, the proposed hardware achieves 99.4 fps OFE with only 175.8 MB/s of external memory bandwidth. Compared to the previous work, the selection of an optimal tile size and increased parallelism improve the OFE throughput by 116%.
Proposed Tile-based Hierarchical OFE
Basic Concept & Description of the Proposed OFE
The basic concept of the proposed tile-based hierarchical OFE algorithm is to divide the input image into multiple smaller image tiles to reduce external memory access (EMA). The CGD iterations, which are performed in the flow generation stage, occupy the largest portion of EMA. This is because the CGD includes many iterations of matrix multiplication (over 320 × 240 × 26 dimensions), whose operands are too large to store entirely on-chip. In addition, the EMA requirement of the CGD is proportional to the size of the input image tile. The conventional OFE [8] performs the CGD for the entire image at once. However, it is not efficient to accelerate the CGD for the entire image on mobile devices, which have limited memory capacity. Even though integrating a larger on-chip memory might help reduce the EMA, over 4 MB of on-chip memory capacity (for a QVGA image) would be required, which is not feasible for low power consumption.
Therefore, if we reduce the input image size of the CGD, we can reduce the EMA without the need for a large on-chip memory. Fig. 2 shows the proposed tile-based OFE algorithm. It divides an input image into several tiles and performs the CGD on each tile separately. In other words, the only data that must stay in the on-chip memory throughout the CGD iterations are those of the currently processed tile. Therefore, the required memory size of the proposed tile-based hierarchical OFE is proportional to the tile size, which is much smaller than the whole input image size that the conventional method's memory requirement was proportional to. This tile division is performed both spatially (different tiles processed in different cores) and temporally (different tiles processed at different times). After CGD iterations are performed on each tile, the generated optical flow tiles are collected into a final optical flow image. In addition, the number of tile divisions is determined by the size of each scale space image; that is, tile division is utilized only for the lower layers of the scale space, which have larger image sizes. However, due to the loss of information near tile boundaries, accuracy degradation can occur in tile-based hierarchical OFE. For example, a target motion vector that spans different tiles cannot be estimated from a single tile.
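A minimal software sketch of this tile division is shown below. Because boundary pixels need context from neighboring tiles, an overlap border is included; the function names and the layout of the returned tuples are our own choices for this illustration.

```python
import numpy as np

def split_tiles(img, tile=20, overlap=5):
    """Split an image into tile-sized blocks, each padded with an
    `overlap`-pixel border clipped at the image boundary.
    Returns (core origin, padded data, core offset inside padded data)."""
    H, W = img.shape[:2]
    tiles = []
    for y in range(0, H, tile):
        for x in range(0, W, tile):
            y0, y1 = max(0, y - overlap), min(H, y + tile + overlap)
            x0, x1 = max(0, x - overlap), min(W, x + tile + overlap)
            tiles.append(((y, x), img[y0:y1, x0:x1], (y - y0, x - x0)))
    return tiles

def stitch(tiles, shape, tile=20):
    """Collect tiles into one image, writing back only the core
    (non-overlapping) region of each tile."""
    out = np.zeros(shape)
    for (y, x), data, (oy, ox) in tiles:
        h, w = min(tile, shape[0] - y), min(tile, shape[1] - x)
        out[y:y + h, x:x + w] = data[oy:oy + h, ox:ox + w]
    return out

img = np.arange(240 * 320, dtype=float).reshape(240, 320)
restored = stitch(split_tiles(img), img.shape)
print(np.array_equal(restored, img))  # True: split/stitch round-trips
```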
Therefore, tile overlapping is applied to prevent this degradation in the proposed tile-based hierarchical OFE. By setting the overlapping region length to include motion vectors near the tile boundaries, we can avoid accuracy degradation. An analysis of the proper overlapping length is given in Section II-B. The proposed method requires a smaller on-chip memory but a larger computation workload due to duplicated computations in the overlapping regions. This trade-off is directly affected by the tile size: the smaller the tile, the smaller the required memory but the larger the computation overhead. The analysis of this trade-off and the proper tile size is also given in Section II-B.
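The trade-off can be illustrated with a simplified model: with overlap length o, each T×T tile is processed as a padded (T+2o)×(T+2o) region, so per-tile memory grows with (T+2o)² while the duplicated-computation overhead is the ratio (T+2o)²/T². The numbers below come from this toy model only; the actual analysis (Fig. 3) uses measured workload and memory.

```python
# Toy model of the tile-size trade-off (overlap fixed at 5 pixels):
# smaller tiles need less per-tile memory but redo more overlap work.
overlap = 5
for T in (10, 20, 40, 80):
    padded = T + 2 * overlap
    per_tile_pixels = padded ** 2            # ~ per-tile buffer size
    workload_ratio = padded ** 2 / T ** 2    # computation blow-up vs. no tiling
    print(f"tile {T:3d}: per-tile pixels {per_tile_pixels:5d}, workload x{workload_ratio:.2f}")
```

For example, a 20-pixel tile with 5-pixel overlap processes 30×30 = 900 pixels per 400-pixel core, a 2.25× computation overhead, while an 80-pixel tile incurs only ~1.27× overhead but needs over ten times the per-tile memory.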
There are three advantages to the proposed method. First, the required memory size is reduced by a ratio of (tile size)/(image size) compared to the conventional algorithm [8]. Second, the proposed method is scalable because the tile size is not affected by the image size. Even if the input images become larger, the proposed method can accelerate the OFE using the same on-chip memory, whereas the conventional method requires a larger on-chip memory. Finally, parallel processing within the input image is possible because there is no data dependency among different tiles.
Analysis on Tile-based Hierarchical OFE
Fig. 3(a) shows the endpoint error difference between the conventional method and the proposed tile-based method over different overlapping lengths. The endpoint error (EPE) is measured around tile boundaries with the Middlebury dataset [10], and is defined by (1),
EPE = √((u − u_GT)² + (v − v_GT)²)    (1)
where (u, v) is the estimated optical flow vector and (u_GT, v_GT) is the ground-truth optical flow vector. For overlapping lengths larger than 5 pixels, the EPE of the tiling method converges to the EPE of the conventional method. Therefore, the overlapping length is set to 5 pixels. Fig. 3(b) shows the computation workload and the required memory over different tile sizes with the overlapping length set to 5 pixels, and Fig. 3(c) shows the (required memory × workload) product over different tile sizes. In these analyses, square tiles are used, and the overlapping region of length 5 is concatenated to all 4 sides of each tile. The results show that a tile size of 20 yields the lowest (required memory × workload). Fig. 4(a)–(f) show the proposed tile-based OFE results, with an overlapping length of 5 and a tile size of 20, in comparison with the conventional OFE method. The results show only a 0.05 EPE difference between the conventional method and the proposed tile-based approach.
Overall Architecture
Fig. 5 illustrates the overall architecture of the proposed processor. It consists of 16 tile CGD cores, a 1-D SIMD core, a scale space core, and a top RISC controller. The tile CGD cores accelerate the CGD while exploiting the parallelism of the proposed tile-based hierarchical OFE. The 1-D SIMD core accelerates the matrix computations that generate the parameters for the CGD. The scale space core accelerates the image filtering that generates the scale spaces. The top RISC controller handles communication among all components of the processor through a network-on-chip (NoC).
Each tile CGD core consists of a tile CGD unit, a background detection unit (BDU), a 20.3 KB tile CGD memory, and a core controller. The tile CGD unit includes 14 5-way parallel multipliers, 16 5-way parallel adders, and an adder tree. The BDU performs background detection for early termination. The 1-D SIMD core consists of an element-wise multiplication engine, an element-wise addition engine, and a 32 KB DMEM. The scale space core has a convolution engine for the filtering process of scale space generation and a 32 KB DMEM. The overall flow of the tile-based hierarchical OFE on the proposed processor is as follows:
- The scale space core generates multiple scales of the input image.
- The 1-D SIMD core generates parameters for the CGD operation using the generated images.
- Each tile CGD core operates in a parallel manner to generate an optical flow for each tile with the generated parameters.
- After the optical flow tiles are aggregated, they form the entire optical flow image.
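The four steps above can be sketched in software as follows. Every function here is a simplified stand-in (the stand-in CGD even returns zeros), intended only to show the dataflow and the independence of tiles, not the actual algorithm or hardware behavior.

```python
import numpy as np

def generate_scales(img, n=3):
    """Stand-in for the scale space core: pyramid by 2x subsampling, coarsest first."""
    return [img[::2 ** s, ::2 ** s] for s in range(n)][::-1]

def cgd_params(prev_s, cur_s):
    """Stand-in for the 1-D SIMD core: placeholder parameters (frame difference)."""
    return cur_s - prev_s

def tile_cgd(patch):
    """Stand-in for a tile CGD core: placeholder flow, (du, dv) per pixel."""
    return np.zeros(patch.shape + (2,))

def estimate_flow(prev_frame, cur_frame, tile=20):
    for prev_s, cur_s in zip(generate_scales(prev_frame), generate_scales(cur_frame)):
        params = cgd_params(prev_s, cur_s)              # step 2
        h, w = cur_s.shape
        flow = np.zeros((h, w, 2))
        for y in range(0, h, tile):                     # step 3: tiles are independent,
            for x in range(0, w, tile):                 # so the 16 cores run them in parallel
                flow[y:y + tile, x:x + tile] = tile_cgd(params[y:y + tile, x:x + tile])
        # step 4: `flow` now holds the aggregated image for this scale
    return flow

prev, cur = np.zeros((240, 320)), np.ones((240, 320))
print(estimate_flow(prev, cur).shape)  # (240, 320, 2)
```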
The tile size and parallelism are the two major differences between the proposed processor and the previous work. In [9], the processor was optimized for a 90 × 70 tile size and adopted a shared-memory architecture: data of one tile were stored in the shared memory and accessed by two CGD cores. As a result, 328 KB of CGD memory was required, with 9.12 GOPS of computation overhead [9]. In contrast, the proposed processor is optimized for a 20 × 20 tile size, which shows the lowest (required memory × workload). Since the memory requirement for processing a single tile is greatly reduced, several tiles can be processed simultaneously using a memory of similar size to [9]. Therefore, the memory for each tile is allocated locally to each core, and 16 cores are integrated to operate in parallel. As a result, only 1.74 MB/frame of external memory access is required for real-time OFE using 325 KB of CGD memory. Even though a computation overhead of 93.3 GOPS occurs, the throughput increases 2.16 times due to the parallelism of the 16 CGD cores.
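The 325 KB total is consistent with the per-core figure given in the architecture description:

```python
# 16 tile CGD cores, each with a 20.3 KB local tile CGD memory,
# give the ~325 KB total CGD memory quoted here.
per_core_kb = 20.3
cores = 16
total_kb = per_core_kb * cores
print(round(total_kb, 1))  # 324.8
```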
Proposed Background Detection Unit with Early Termination
In the optical flow results, pixels with zero flow carry no motion information and typically correspond to the background of the scene. Especially in action recognition, there are numerous pixels with a zero motion vector: the OFE result on the UCF-101 dataset [11] shows that, on average, 78.5% of the pixels in a scene have zero optical flow. Therefore, CGD throughput can be increased by terminating the computations early for zero-optical-flow pixels. Fig. 6 shows the flow of the proposed early termination scheme. The CGD processing with the proposed scheme is divided into 2 phases. In phase 1, CGD iterations are performed over the entire image.
Then, background detection is performed after phase 1 by thresholding the magnitude of the optical flow vector, and the pixels with small optical flow vectors are discarded. After the background detection, CGD iterations are performed without the discarded pixels (phase 2). Early termination of the CGD process achieves a 50.7% throughput improvement over the conventional method with only 0.7% accuracy degradation on the UCF-101 test dataset. After the background detection, the addresses of non-terminated pixels must be stored for the subsequent iterations. These addresses are stored efficiently using only the distances between non-terminated pixels. In the tile CGD cores, distance map registers totaling 320 bytes are integrated to store this map. Fig. 7(a) shows the block diagram of the background detector. It updates the distance map register by thresholding the magnitude of each region's optical flow vector (du, dv). If the optical flow of the selected region is smaller than the threshold, the background detector increments the stored distance value; if not, it stores the distance counter value into the distance map and resets the counter to 0 for the next region. Fig. 7(b) shows the operation of the tile CGD core for early termination using the stored distance map.
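As an illustration, this distance-based address storage can be sketched as a run-length encoding over a row of flow magnitudes. The threshold value and the exact register format here are simplified assumptions for the sketch, not the hardware's actual encoding.

```python
import numpy as np

def build_distance_map(mag, threshold):
    """Run-length encode the positions of non-terminated (foreground)
    pixels in a row: store only the gap before each pixel whose flow
    magnitude reaches the threshold."""
    distances, gap = [], 0
    for m in mag:
        if m < threshold:
            gap += 1               # background: extend the current run
        else:
            distances.append(gap)  # foreground: emit gap, reset counter
            gap = 0
    return distances

def foreground_indices(distances):
    """Recover the non-terminated pixel indices from the distance map."""
    idx, pos = [], -1
    for d in distances:
        pos += d + 1
        idx.append(pos)
    return idx

row_mag = np.array([0.0, 0.0, 2.1, 0.1, 3.0, 0.0])
dmap = build_distance_map(row_mag, threshold=0.5)
print(dmap, foreground_indices(dmap))  # [2, 1] [2, 4]
```

Only the gaps are stored, so a mostly-background row costs a few entries instead of one address per surviving pixel, which is how a 320-byte register file can cover a whole tile.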
The tile CGD core looks up the distance map register, fetches the non-terminated pixels' optical flow vectors from the tile CGD memory, and feeds them into the tile CGD unit. If the distance map value is 0, the operation of the corresponding row is terminated and processing proceeds to the next row.
Implementation Results
The proposed OFE processor shown in Fig. 8 is simulated in 65 nm CMOS technology and occupies a 12.8 mm² die area. The processor can operate at a maximum 200 MHz clock frequency with a 1.2 V supply, and it consumes 278.4 mW of power. It achieves 99.4 fps throughput for QVGA resolution (320×240) at the 200 MHz clock frequency. Table II compares the processor with previous OFE hardware implementations. The proposed tile-based OFE processor shows 116% higher throughput than [9], because it can exploit higher parallelism within the same memory size by selecting a more optimal tile size than [9]. The proposed tile-based hierarchical OFE processor satisfies the real-time constraint even though the OFE is performed for all pixels (100% density). [13] can also estimate a 100%-density optical flow map, but the proposed algorithm has a lower AAE than [13]. Fig. 9(a) shows the estimated optical flow images from the UCF-101 dataset [11]. Fig. 9(b) compares the action recognition accuracy of the LRCN network [6] using the results of the proposed optical flow estimation against the baseline.
1000 randomly selected videos from UCF-101 were used as the test dataset, and the proposed approach shows only a 0.8% accuracy drop.
Conclusion
In this paper, we implemented and simulated a tile-based OFE processor. It can support real-time OFE for action recognition UIs in mobile devices with 2 key features:
- The tile-based hierarchical OFE to reduce external memory access efficiently with a limited on-chip memory, and
- The background detection unit with early termination to increase the throughput by 50.7%.
With these features, the proposed OFE processor achieves 99.4 fps throughput for QVGA resolution (320×240) at a 200 MHz core frequency with 278.4 mW power consumption, satisfying the real-time constraint.