Hardware Design Automation of Convolutional Neural Networks
Convolutional Neural Networks (CNNs) are learning systems inspired by the biological neural systems of human and animal anatomy. Interest in CNNs grows day by day across a vast range of technologies, such as vision-based navigation and surveillance and machine translation of human speech, since these solutions provide a convenient way to detect critical events. These learning models have also driven hardware implementations on GPUs, FPGAs and ASIC-based designs, which meet real-time requirements on speed and power consumption with only a small trade-off. Such tasks typically target embedded systems, where performance, power efficiency, fast time to market and low cost have high priority. In this paper, we present a tool set that, starting from a CNN description, uses high level synthesis on FPGA to produce a precise hardware implementation of the design.
Introduction
CNNs currently represent the state of the art among visual learning algorithms, and many approaches search for optimal solutions to problems such as pattern recognition and language translation. The idea behind the CNN algorithm is to mimic the primary visual cortex of living organisms, in which cells are organized in receptive fields that capture data from a distinct part of the observed object. This idea is reflected in artificial neurons (known as perceptrons) that, unlike the fully connected layers of Artificial Neural Networks (ANNs), are connected only to local regions of the input. Thanks to their performance and stability, CNNs are widely used in many fields, such as computer vision, object recognition and data management systems. Parameters such as speed, size and power strongly affect these systems and can be optimized through different algorithms, although they tend to trade off against one another. A great deal of work has been done on these parameters, and researchers keep looking for better solutions. The main idea is that, by running the network with highly optimized methods and reducing control structures, the efficiency of hardware acceleration improves. This work covers Graphics Processing Units (GPUs), Field Programmable Gate Arrays (FPGAs) and Application Specific Integrated Circuits (ASICs), wherever the desired throughput is required.
In this paper, we target FPGAs for the hardware acceleration of CNNs. This choice stems from several factors. First, designing CNN hardware means exploring a broad design space with respect to the resources available; such an exploration is impractical on ASICs, given the many design and simulation attempts involved. Second, FPGAs offer reconfigurability, which makes it possible to test entire networks and to compare their consumption and efficiency in order to find the optimal solution. FPGAs also come with good performance per watt compared with GPUs, where, despite the higher raw performance, power consumption remains the main issue.
The main motivation behind this work is the steep learning curve required to produce a target design, together with the time needed to reach a good solution. To mitigate this issue, a wide range of Computer Aided Design (CAD) tools has been proposed over the years; here we focus on one tool flow, targeting specifically the generation and synthesis of CNNs for Xilinx FPGAs. Starting from a pre-trained network and its weights, the framework generates synthesizable C++ code together with the scripts for Vivado HLS and the Vivado Design Suite, automating the FPGA implementation from High Level Synthesis (HLS) to bitstream generation.
This paper also presents, as part of our design framework, a model that estimates the resource usage of the CNN under implementation. Such estimates allow the designer to iterate by trial and error and improve the design without running the synthesis step, thereby reducing implementation time. The model achieves good accuracy on the main FPGA resources, keeping the prediction error within 30 BRAMs, 800 LUTs and 1400 FFs in most test cases.
The paper is organized as follows. Section II discusses related work and Section III gives an overview of CNNs. Section IV presents the proposed framework and Section V describes the resource estimation model. Section VI reports the experimental results obtained with our model for the implementation of the network, and Section VII draws conclusions and outlines future work.
Related works
In the 1990s, Professor Yann LeCun proposed the idea of CNNs, originally for the recognition of handwritten digits and letters. His key contribution was the concept of extracting complex, distinctive features that identify specific patterns in the input images. More recently, CNNs have delivered high performance and have been used effectively in supervised learning methods for image classification.
Moreover, CNNs have proven effective in many fields: for instance, the authors of one work applied a 3D CNN model to human action recognition, while another work reports a CNN-based natural language processing engine. As the size of the network grows, however, the computational load of the CNN increases as well. To address this issue, many researchers have studied the CNN classification phase and its hardware acceleration on GPUs, FPGAs and also ASICs. While training is usually left to software for better control of the process, hardware acceleration of the classification phase on both GPUs and FPGAs has produced interesting results.
The work introduced in the literature represents the state of the art for accelerating CNN convolutional layers on FPGA. The authors base their design space exploration on the roofline model, i.e. a model intended to estimate performance from both the computation peak and the off-chip memory bandwidth. As a result, they presented an implementation of the popular AlexNet CNN that outperformed previous works by achieving 61.62 GFLOPS (peak performance) at an operating frequency below 100 MHz. Farabet et al. presented a programmable CNN processor implemented on a low-end DSP-oriented FPGA. The proposed processor supports a vector instruction set that matches the basic operations of a CNN. In addition, the authors developed a Lush-based compiler that, starting from a Lush description of the CNN, generates a sequence of instructions for the CNN processor. This work was demonstrated on real-time face detection, but it could also be used as a lightweight embedded vision framework for mobile robots.
Peemen et al. exploited computation reordering and local buffer usage to increase throughput and decrease the energy consumption of CNN hardware accelerators. At the same time, the authors introduced a new analytical methodology that relies on loop transformations to optimize nested loops for inter-tile data reuse. Experimental results demonstrated a reduction in data movement of up to 2.1x and a significant boost in MicroBlaze soft-core performance.
Differently from the works available in the literature, the proposed framework offers the possibility of easily generating a synthesizable CNN for FPGA acceleration, starting from the network weights. As a result, this work can considerably reduce design time and increase productivity, since it spares the designer the effort of refining the code into an HLS-compliant form.
Convolutional neural networks
CNNs have a particular structure partitioned into two primary blocks: the convolutional part and the linear part. The first distinguishes them from classic ANNs and is the key feature of these networks. Indeed, the convolutional layers are responsible for the extraction of features from the input images. The feature extractor is made of an arbitrary number of layers, usually alternated with sub-sampling layers.
The second block is a fully connected neural network (also known as a Multi-Layer Perceptron (MLP)) that uses the extracted information to classify the input image. In the following paragraphs we discuss in detail the different types of layers, focusing on the computation flow and the resizing of the data in the feed-forward process.
Convolution Layers: The key components of CNNs are the weighted filters (kernels) that make up the convolutional layers. The weights of the kernels are determined during the training phase, in which the prediction error with respect to each weight is computed and back-propagated using an algorithm such as Stochastic Gradient Descent (SGD). At a high level of abstraction, the feature maps, i.e. the outputs of each kernel, are obtained by sweeping the kernel over the image. The results of each layer then become the input of the following layer. The dimensions of a feature map are reduced with respect to the original image according to the dimensions of the applied filter:

x_new = x_old - x_kernel + 1    (1)
y_new = y_old - y_kernel + 1    (2)

Sub-Sampling Layers: Since CNNs have to process massive amounts of information, it is helpful to find a way to separate relevant information from the rest. For this reason, convolution layers are normally alternated with sub-sampling ones. Sub-sampling layers have the role of considerably decreasing the quantity of information forwarded through the network. The idea is the same as for convolutional layers: at a high level of abstraction, a filter is swept across the image, reducing clusters of pixels to one value. In mean-pooling, for example, the kernel computes the average value of the pixels, producing one point of the new image; another kind of sub-sampling is max-pooling, which returns the maximum value present within the filter window. The size of the output of the sub-sampling layers can be calculated, similarly to the convolution layers, by equations 3 and 4, where p_step is the amplitude of the shift of the pooling kernel:

x_new = (x_old - x_kernel) / p_step + 1    (3)
y_new = (y_old - y_kernel) / p_step + 1    (4)

Linear Layers: These layers are placed after the convolution step and are responsible for the classification process. Their functional unit is the perceptron, an artificial neuron that weights its inputs. Each linear layer computes its outputs as in equation 5:

o_j = Σ_i (w_ji · x_i) + b_j    (5)

The final layer of the linear part of a CNN contains as many neurons as the number of classes to be recognized, and it can be followed by a LogSoftMax operator (equation 6), which normalizes the output vector z into a set of values that reflect the probability of the image belonging to a certain class:

σ(z)_j = e^{z_j} / Σ_{k=1}^{K} e^{z_k},  for j = 1, ..., K    (6)
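To make the resizing rules and the linear step concrete, the following C++ sketch implements equations 1-6 with hypothetical helper names (this is an illustration, not the code emitted by the framework):

```cpp
#include <cmath>
#include <vector>

// Output size of a convolution layer along one axis (equations 1-2).
inline int convOutDim(int inDim, int kernelDim) {
    return inDim - kernelDim + 1;
}

// Output size of a sub-sampling (pooling) layer along one axis (equations 3-4).
inline int poolOutDim(int inDim, int kernelDim, int pStep) {
    return (inDim - kernelDim) / pStep + 1;
}

// Forward step of a linear (fully connected) layer, equation 5:
// o_j = sum_i(w_ji * x_i) + b_j.
std::vector<double> linearForward(const std::vector<std::vector<double>>& w,
                                  const std::vector<double>& b,
                                  const std::vector<double>& x) {
    std::vector<double> o(b.size());
    for (size_t j = 0; j < b.size(); ++j) {
        double acc = b[j];
        for (size_t i = 0; i < x.size(); ++i)
            acc += w[j][i] * x[i];
        o[j] = acc;
    }
    return o;
}

// LogSoftMax over the K outputs of the last linear layer (log of equation 6).
std::vector<double> logSoftMax(const std::vector<double>& z) {
    double sum = 0.0;
    for (double zk : z) sum += std::exp(zk);
    std::vector<double> o(z.size());
    for (size_t j = 0; j < z.size(); ++j)
        o[j] = std::log(std::exp(z[j]) / sum);
    return o;
}
```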
Framework overview
The whole work relies on a web-based framework that allows the user to design and manage the CNN through a web-based Graphical User Interface (GUI). The client side is based on HTML5 and JavaScript, while the back-end is based on Python. The framework takes as inputs a high-level specification of the network and a document containing the trained weights. The following subsections describe the different phases into which the framework is organized.
Network Configuration
The application introduces the user to a simple design process. In the first step, the user must define the main structure of the CNN by specifying the number of processing (convolutional and linear) layers and the size of the input data to be processed by the network. The other input essential to generate the CNN code is the set of weights of the different layers. The user can choose whether or not to upload a weights file, typically produced by a Machine Learning framework. Since the first version, it has also been possible to generate random values for the weights, so that the user can carry out the hardware design of the network and evaluate performance and resources even without a trained model. The network is then built through its graphical representation, layer by layer, by setting the relevant parameters. For convolutional layers, the user can specify the number of filters and their size; max- or mean-pooling can additionally be selected for the sub-sampling stage. For linear layers, the user can specify the number of neurons.
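Purely as an illustration (the actual GUI fields and back-end data structures are not detailed in the paper), a configuration collected from the interface could be represented along these lines; all names are hypothetical:

```cpp
#include <string>
#include <vector>

// Hypothetical description of the user-entered network configuration.
struct ConvLayerCfg   { int numFilters; int kernelSize; std::string pooling; /* "max" or "mean" */ };
struct LinearLayerCfg { int numNeurons; };

struct CnnConfig {
    int inputWidth, inputHeight, inputChannels;   // size of the input data
    std::vector<ConvLayerCfg>   convLayers;       // feature-extraction part
    std::vector<LinearLayerCfg> linearLayers;     // classification part
    std::string weightsFile;                      // trained weights, or empty to use random weights
};

// Example: two convolutional layers with max pooling and two linear layers.
CnnConfig exampleNetwork() {
    return CnnConfig{
        32, 32, 1,
        { {16, 5, "max"}, {32, 5, "max"} },
        { {120}, {10} },
        "trained_weights.bin"   // hypothetical file produced by an ML framework
    };
}
```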
Optimization phase
Once the network configuration is completed, the application provides a report of the hardware resources (i.e. DSPs, BRAMs, FFs and LUTs) that will be used by the CNN IP core on the selected target device. The technique used to estimate and optimize resource utilization is the major contribution of this paper and is explained in detail in the next sections.
In light of the occupied resources, the user can select how many cores are needed for the final design; each core implements the whole network. The cores are arranged in parallel to achieve maximum speed, and the images to be processed are divided among them. In this step, the application back-end produces the C++ source code implementing the CNN and the tcl scripts for the Xilinx tools as output, as sketched below.
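A minimal sketch of how the image set could be split across the instantiated cores, assuming a simple contiguous-block partition (the paper does not specify the exact scheduling policy):

```cpp
#include <utility>
#include <vector>

// Split numImages into numCores contiguous blocks, one per CNN core.
// Returns [start, end) index pairs; the partitioning policy is hypothetical.
std::vector<std::pair<int, int>> partitionImages(int numImages, int numCores) {
    std::vector<std::pair<int, int>> blocks;
    int base = numImages / numCores;
    int rem  = numImages % numCores;
    int start = 0;
    for (int c = 0; c < numCores; ++c) {
        int len = base + (c < rem ? 1 : 0);  // spread the remainder over the first cores
        blocks.emplace_back(start, start + len);
        start += len;
    }
    return blocks;
}
```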
Hardware design
At this stage, the back-end of the application generates the C++ source code implementing the CNN and the tcl scripts used by the Xilinx tools as output. The tcl scripts drive Vivado HLS and Vivado through the synthesis of the whole design up to the generation of the bitstream. Regarding the HLS description of the CNN core, the dataflow pattern and the structure of the C++ source code produced by the framework (such as the nested loops of the convolutional part) are written so that they can be pipelined effectively. Since networks of different sizes and configurations would each call for specific optimizations, we do not apply optimization constraints that would tie the design to a fixed network pattern; however, it is always possible for the user to re-synthesize a custom core using the generated code as a starting point.
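To give an idea of the kind of HLS-friendly C++ involved in the convolutional part, the sketch below shows a simplified convolution kernel with Vivado HLS pragmas; the dimensions and pragma choices are illustrative, not the code actually emitted by the framework:

```cpp
// Simplified convolution kernel in an HLS-friendly style.
// KSIZE, IN_DIM and OUT_DIM stand for compile-time network parameters
// (illustrative values). The pragmas ask Vivado HLS to pipeline the output
// loop and to partition the kernel buffer for parallel access.
const int KSIZE   = 5;
const int IN_DIM  = 32;
const int OUT_DIM = IN_DIM - KSIZE + 1;

void conv_layer(const float in[IN_DIM][IN_DIM],
                const float kernel[KSIZE][KSIZE],
                float bias,
                float out[OUT_DIM][OUT_DIM]) {
#pragma HLS ARRAY_PARTITION variable=kernel complete dim=0
    row: for (int r = 0; r < OUT_DIM; ++r) {
        col: for (int c = 0; c < OUT_DIM; ++c) {
#pragma HLS PIPELINE II=1
            float acc = bias;
            k_row: for (int kr = 0; kr < KSIZE; ++kr)
                k_col: for (int kc = 0; kc < KSIZE; ++kc)
                    acc += kernel[kr][kc] * in[r + kr][c + kc];
            out[r][c] = acc;
        }
    }
}
```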
Resource estimation
This section describes the approach used to let the framework estimate the resource usage of the CNN under development. As explained in Section IV, the availability of these estimates allows the designer to instantiate multiple CNN cores on the reconfigurable fabric and use them in parallel.
The translation of the CNN functionality into HDL is done using Vivado HLS, which already provides a resource estimation by itself. However, this estimation becomes available only after the synthesis phase, so waiting for it would greatly impact development time. Our idea is to offer the user an estimation of the resource utilization without the need to go through the HLS phase, thus speeding up the design phase. In other words, we want to identify a function F that, given the description of a CNN, provides an estimation of its resource usage in terms of Block RAMs (BRAMs), DSPs, FFs and LUTs.
The function F is defined as F: CNNSet → N^4, where CNNSet is the set of all possible CNNs. To find F there are two feasible options. The first one is to perform a deep analysis of the code given as input to the HLS phase and determine how each construct impacts the final resource usage. The second one instead tackles the problem from the opposite perspective, aiming at identifying the model by interpolating data coming from the synthesis of networks with well-known parameters.
The structure of the CNN code is regular enough to allow for the first hypothesis; however, such an approach is too tied to the particular HLS tool considered and to the optimizations that it introduces, automatically or upon request. For this reason, we decided to follow the second approach mentioned above and perform a data-driven exploration.
Performing a data-driven exploration means that we have to start by synthesizing a certain number of networks and then interpolate the data obtained. Ideally, the preliminary data should be a representative sample of the whole multidimensional space that can characterize a given CNN. As an example, if we look at the parameters needed to represent a CNN having two convolutional layers and two linear layers, we need to explore a 16-dimensional space: 3 for the input image, 4 for each of the convolutional layers, 2 for each of the linear layers, and 1 for the classification and prediction stage. Our F function is therefore a function N^16 → N^4 if we consider networks composed of two convolutional and two linear layers. Clearly, sub-sampling a 16-dimensional space is a cumbersome task and would require a large number of syntheses, possibly impairing the benefit of having the models available. We therefore decided to break the problem down into separate subproblems; more specifically, we started from the assumption that the convolutional, linear and classification layers can each be analyzed on their own, and that the resource utilization of each layer can then be summed to obtain the overall resource utilization of the network. A generic CNN ∈ CNNSet is then defined as

CNN = {CONV, LINEAR, Classes}

where CONV and LINEAR are the sets of all the convolutional and linear layers present in the CNN. Note that from this description we removed the characteristics of the input image, since their values are directly correlated to other parameters in the convolutional and linear layers and are therefore redundant in the description.
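A sketch of the per-layer decomposition described above: each layer type gets its own estimation model, and the per-layer estimates are summed to obtain the usage of the whole network. The model forms and coefficients below are placeholders, not the values fitted during our exploration:

```cpp
#include <vector>

// Estimated usage of the four FPGA resources, i.e. the codomain of F: CNNSet -> N^4.
struct Resources {
    long brams = 0, dsps = 0, ffs = 0, luts = 0;
    Resources& operator+=(const Resources& r) {
        brams += r.brams; dsps += r.dsps; ffs += r.ffs; luts += r.luts;
        return *this;
    }
};

// Per-layer parameters, matching the decomposition CNN = {CONV, LINEAR, Classes}.
struct ConvLayer   { int inMaps, outMaps, kernelSize; };
struct LinearLayer { int inNeurons, outNeurons; };

// Placeholder per-layer models: the real models are obtained by interpolating
// the synthesis data collected during the data-driven exploration, so the
// coefficients below are purely illustrative.
Resources estimateConv(const ConvLayer& l) {
    long mults = static_cast<long>(l.inMaps) * l.outMaps * l.kernelSize * l.kernelSize;
    return { l.outMaps, mults / 64, mults / 2, mults / 2 };
}
Resources estimateLinear(const LinearLayer& l) {
    long weights = static_cast<long>(l.inNeurons) * l.outNeurons;
    return { weights / 1024 + 1, l.outNeurons, weights / 8, weights / 8 };
}
Resources estimateClassifier(int numClasses) {
    return { 1, 0, 32L * numClasses, 64L * numClasses };
}

// Overall estimate: the per-layer estimates are summed, as assumed in the text.
Resources estimateCnn(const std::vector<ConvLayer>& conv,
                      const std::vector<LinearLayer>& lin,
                      int numClasses) {
    Resources total;
    for (const auto& l : conv) total += estimateConv(l);
    for (const auto& l : lin)  total += estimateLinear(l);
    total += estimateClassifier(numClasses);
    return total;
}
```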
Experimental evaluation
This section evaluates the resource estimation model presented in the previous sections. We synthesized 15 networks with different parameters:
- The input image size ranges from 16 × 16 to 64 × 64;
- The number of convolutional layers is either 1 or 2, the number of input and output feature maps ranges from 16 to 1536, and the number of neurons ranges from 10 up to 210;
- The classification layer uses 10 classes. None of the layer configurations used for the evaluation phase was employed to train the models described in the previous sections.
Conclusion and future work
In this paper we presented a web-based framework for the design of CNNs. The framework allows fast customization of the network and its realization on a reconfigurable device through the Vivado HLS tool flow. A major part of the work concerns the models developed to forecast the resource usage of the CNN under development, which allow the designer to instantiate more than one network and parallelize the computation over the dataset. The proposed models estimate resource utilization without requiring any synthesis or time-consuming implementation runs. The overall accuracy is good, with an absolute prediction error within 30 BRAMs in more than 80% of the CNNs tested. Future work will focus on improving the BRAM model to make the estimation more robust.