Challenges of Text Recognition in Natural Images
Text recognition in natural images remains a challenging task because of the complexity of the environment. The challenges that natural images pose for text extraction can be broadly viewed from three angles.
- Natural scene text varies widely because the surroundings are uncontrolled, producing very different font sizes, styles, colors, scales and orientations.
- Scene backgrounds are complex: roads, signs, grass, buildings, bricks, pavement and so on can resemble or occlude text.
- Intrusion factors such as noise, low image quality, distortion and inconsistent lighting also hinder the identification and extraction of text in natural scenes.
Problems such as blur and weakened or lost text features occur in the text region, for instance because of dust on the image. Since text recognition gives rise to numerous applications, the fundamental aim is to determine whether there is text in a given image and, if so, to detect, localize and recognize it. Pre-processing and post-processing are therefore mandatory for text detection. Text enhancement is mostly used to rectify distorted text or to improve resolution.
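To make the pre-processing step concrete, the binarization commonly applied before detection can be sketched as follows. This is a minimal NumPy implementation of Otsu's global threshold; the function name and interface are illustrative, not code from the system described here.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the gray level that maximizes between-class variance
    (Otsu's method) for an 8-bit grayscale image."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    omega = np.cumsum(hist) / total                  # class-0 probability
    mu = np.cumsum(hist * np.arange(256)) / total    # cumulative mean
    mu_t = mu[-1]                                    # global mean
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_t * omega - mu) ** 2 / (omega * (1 - omega))
    sigma_b = np.nan_to_num(sigma_b)                 # edges where omega is 0 or 1
    return int(np.argmax(sigma_b))
```

Pixels above the returned threshold would then be treated as candidate text (or background, depending on polarity).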
The challenges of text detection in a given image can be analyzed as follows. Scene entanglement: the surrounding scene makes it difficult to discriminate text from non-text. Many man-made objects in natural environments, such as buildings, symbols and paintings, have structures and appearances similar to text. For example, the character ‘Z’ can appear as a design on the gate of a house, and the character ‘O’ can likewise be found in an image of a house.
Improper lighting: uneven lighting when capturing images is mainly caused by uneven illumination and the sensing device’s uneven response. The resulting color distortion and deterioration of visual features lead to false detection, segmentation and recognition results.
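A simple way to compensate for such uneven lighting (an assumed illustration, not the method of the system above) is to divide the image by a coarse block-mean estimate of the background intensity, which flattens slow illumination gradients:

```python
import numpy as np

def normalize_illumination(gray, block=32):
    """Flatten slow lighting gradients by dividing out a coarse
    block-mean estimate of the background intensity."""
    h, w = gray.shape
    # pad so the image tiles evenly into block x block cells
    ph, pw = (-h) % block, (-w) % block
    padded = np.pad(gray.astype(float), ((0, ph), (0, pw)), mode="edge")
    H, W = padded.shape
    # coarse background: mean of each block, upsampled back to full size
    bg = padded.reshape(H // block, block, W // block, block).mean(axis=(1, 3))
    bg = np.repeat(np.repeat(bg, block, axis=0), block, axis=1)
    out = padded / np.maximum(bg, 1e-6) * 128.0      # renormalize to mid-gray
    return np.clip(out[:h, :w], 0, 255).astype(np.uint8)
```

A real pipeline would typically use a smoother background estimate (e.g. a large Gaussian blur), but the block version keeps the idea visible.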
Blurring and degradation: flexible working conditions, focus-free cameras and image compression-decompression procedures defocus, blur and degrade the quality of text. These factors reduce the sharpness of characters and introduce touching characters, which makes basic tasks such as segmentation difficult.
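A common countermeasure to mild blur is unsharp masking. The sketch below is a hypothetical helper written for illustration: it adds back the difference between the image and a box-blurred copy, which boosts edge contrast around character strokes:

```python
import numpy as np

def unsharp_mask(gray, amount=1.0):
    """Sharpen by adding back the difference between the image
    and a 3x3 box-blurred copy (a simple unsharp mask)."""
    g = gray.astype(float)
    p = np.pad(g, 1, mode="edge")
    # 3x3 box blur computed as the mean of nine shifted copies
    blur = sum(p[dy:dy + g.shape[0], dx:dx + g.shape[1]]
               for dy in range(3) for dx in range(3)) / 9.0
    return np.clip(g + amount * (g - blur), 0, 255).astype(np.uint8)
```

Stronger degradations (motion blur, compression artifacts) need proper deconvolution or learned restoration; unsharp masking only restores local contrast.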
Aspect ratios: text has varying aspect ratios. Text such as a traffic sign may be brief, while other text, such as a video caption, may be much longer.
Distortion: perspective distortion occurs when the optical axis of the camera is not perpendicular to the text plane. Text boundaries lose their rectangular shape and characters become distorted, decreasing the performance of recognition models trained on undistorted samples.
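Such perspective distortion can be undone with a homography once four corners of the text region are known. The following sketch (function names illustrative) solves for the 3x3 matrix with the direct linear transform, the same construction used by rectification routines in libraries such as OpenCV:

```python
import numpy as np

def homography_from_points(src, dst):
    """Solve for the 3x3 homography H with H @ [x, y, 1]^T ~ [u, v, 1]^T
    from four point correspondences (direct linear transform)."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    # the homography is the null vector of A, found via SVD
    _, _, Vt = np.linalg.svd(np.asarray(A, float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

def apply_h(H, pt):
    """Map a 2D point through a homography (homogeneous divide)."""
    x, y, w = H @ np.array([pt[0], pt[1], 1.0])
    return x / w, y / w
```

Warping every pixel of a skewed text quadrilateral to an axis-aligned rectangle with this H restores the rectangular layout the recognizer was trained on.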
Fonts: characters of italic and script fonts may overlap each other, making segmentation difficult. Characters of different fonts have large within-class variations and form many pattern sub-spaces, making accurate recognition difficult when the number of character classes is large.
Multilingual environments: many languages have large character sets; languages such as Chinese, Japanese and Korean (CJK) have thousands of character classes. Arabic has connected characters that change shape according to context. Hindi combines alphabetic letters into thousands of shapes representing syllables.
Because of these challenges, text detection datasets have to be generated in huge numbers; only then will the output be accurate. The most popular technique for this is the deep learning model, but model training has become a time-consuming process because of the availability of large datasets and the increasing complexity of deep learning models. Deep neural networks are the most successful technique in many applications such as computer vision, natural language processing and speech recognition. In the computer vision domain specifically, Convolutional Neural Networks (CNNs) have improved results on object detection and recognition and enabled industrial applications. Training must be done over a large dataset because such a model has millions of parameters, and so its complexity keeps increasing day by day.
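To see why parameter counts grow so quickly, a small character-recognition CNN can be tallied layer by layer. The architecture below (32x32 grayscale input, two conv layers, two 2x2 poolings, a dense layer, 62 output classes) is an illustrative assumption, not the network proposed here:

```python
def conv_params(in_ch, out_ch, k):
    """Weights plus biases of a k x k convolution layer."""
    return in_ch * out_ch * k * k + out_ch

def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# Hypothetical character recognizer: 32x32x1 input,
# conv 3x3 (1->32), pool, conv 3x3 (32->64), pool -> 8x8x64,
# dense 4096->128, dense 128->62 (digits + upper/lowercase letters)
total = (conv_params(1, 32, 3)          # 320
         + conv_params(32, 64, 3)       # 18,496
         + dense_params(8 * 8 * 64, 128)  # 524,416
         + dense_params(128, 62))       # 7,998
```

Even this toy network has over half a million parameters, and most of them sit in the first dense layer; scaling to thousands of CJK classes multiplies the output layer alone by a large factor.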
Training a CNN model is a time-consuming process, and three criteria should be considered to speed it up. First, specialized processors (Graphics Processing Units (GPUs), TPUs, etc.) and software libraries (cuDNN, fbfft) can be used. Many of the popular open-source Deep Learning (DL) frameworks now offer distributed versions that allow the user to train models utilizing multiple GPUs and even multiple nodes. These distributed versions can be compared against each other: quantitative and qualitative performance can be measured in terms of time, memory usage and accuracy for Caffe2, Chainer, CNTK, MXNet and TensorFlow as they scale across multiple GPUs and multiple nodes. Among the various deep learning frameworks, open-source packages that support distributed model training and development give the most suitable results. The five selected frameworks are as follows:
- Caffe2 is a lightweight and modular DL framework open-sourced by Facebook. It emphasizes model deployment for edge devices and model training at scale.
- Chainer is a flexible DL framework developed by Preferred Networks that provides an intuitive interface and a high-performance implementation. The distributed version is ChainerMN. Rather than separating the definition of a computational graph from its use, Chainer uses a strategy called "Define-by-Run", where the network is created as the forward computation takes place.
- Microsoft Cognitive Toolkit (CNTK) is a commercial-grade distributed deep learning toolkit developed at Microsoft. It also has advanced algorithms, but these are not under open-source licenses.
- MXNet is a flexible and efficient library for deep learning, featuring high-level APIs. It is sponsored by the Apache Incubator and was selected by Amazon as its framework of choice for DL.
- TensorFlow is a general numerical computation library using data flow graphs. It was developed by the Google Brain Team and is currently an open-source project for the machine learning and deep learning domains.
The most important classes of text identification approaches are texture-based, connected-component-based and hybrid approaches.
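The synchronous data parallelism these distributed frameworks implement can be sketched in plain NumPy: each worker computes a gradient on its shard of the batch, and the results are averaged (an "all-reduce"), which for equal-sized shards matches the full-batch gradient exactly. The linear-model loss here is only an illustrative stand-in for a CNN's loss:

```python
import numpy as np

def grad_mse(X, y, w):
    """Gradient of mean squared error for a linear model y ~ X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def data_parallel_grad(X, y, w, workers=4):
    """Synchronous data parallelism: each worker computes the gradient
    on its shard of the batch; the results are averaged (all-reduce)."""
    shards = zip(np.array_split(X, workers), np.array_split(y, workers))
    grads = [grad_mse(Xs, ys, w) for Xs, ys in shards]
    return np.mean(grads, axis=0)
```

The frameworks above differ mainly in how this averaging is scheduled and communicated (parameter servers vs. ring all-reduce), not in the underlying arithmetic.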
Recently, MSER, SWT and binarization techniques have become popular for extracting text from natural scenes. In the proposed methodology, a CNN architecture is proposed for character recognition and character labeling using deep learning. Serverless platforms promise new capabilities that make writing scalable microservices easier and more cost-effective, positioning themselves as the next step in the evolution of cloud computing architectures, and they have been utilized to support a wide range of applications. Recently there has been a surge in the use of machine learning and artificial intelligence (AI) by both cloud providers and enterprise companies as value-added, differentiating services. In particular, this refers to deep learning, a field of machine learning that uses neural networks to provide services ranging from text analytics and natural language processing to speech and image recognition. Training neural networks is a time-consuming job that typically requires specialized hardware, such as GPUs. Using a trained neural network for inference (a.k.a. serving), on the other hand, requires no training data and less computational power than training. Serverless computing is therefore well suited to implementing a wide range of event-based cloud applications.
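Returning to extraction: the connected-component class of approaches named earlier groups adjacent foreground pixels into candidate character regions. A minimal 4-connected labeling routine (a textbook flood fill, not the MSER or SWT implementations themselves) looks like this:

```python
import numpy as np
from collections import deque

def connected_components(binary):
    """Label 4-connected foreground regions of a boolean image.
    Returns (label image, number of components)."""
    labels = np.zeros(binary.shape, dtype=int)
    next_label = 0
    for sy, sx in zip(*np.nonzero(binary)):
        if labels[sy, sx]:
            continue                      # pixel already belongs to a region
        next_label += 1
        queue = deque([(sy, sx)])
        labels[sy, sx] = next_label
        while queue:                      # breadth-first flood fill
            y, x = queue.popleft()
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if (0 <= ny < binary.shape[0] and 0 <= nx < binary.shape[1]
                        and binary[ny, nx] and not labels[ny, nx]):
                    labels[ny, nx] = next_label
                    queue.append((ny, nx))
    return labels, next_label
```

Each resulting component would then be filtered by size, aspect ratio and stroke properties before being passed to the character recognizer.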
Finally, this system will be integrated with the AWS platform. Amazon Web Services (AWS) is a subsidiary of Amazon.com that offers on-demand cloud computing platforms. Cloud computing is the on-demand delivery of compute power, database storage, applications and other IT resources through a cloud services platform via the internet with pay-as-you-go pricing.
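A serverless deployment of the trained recognizer would look roughly like the following AWS Lambda-style handler. The event shape, the `predict` stub and the API Gateway proxy response format are assumptions for illustration, not the actual system; a real deployment would load serialized model weights once, outside the handler:

```python
import json

def predict(pixels):
    """Hypothetical stand-in for a trained character-recognition model."""
    return "A" if sum(pixels) > 0 else "?"

def handler(event, context):
    """AWS Lambda-style entry point: the event body carries the image
    payload; the response follows the API Gateway proxy format."""
    body = json.loads(event["body"])
    label = predict(body["pixels"])
    return {"statusCode": 200, "body": json.dumps({"label": label})}
```

Because inference needs no training data and far less compute than training, pay-per-invocation functions like this are a natural fit for the serving side of the pipeline.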