Investigation and Development of Handwritten Hindi and Gujarati Optical Character Recognition

Introduction

1.1 Introduction to Handwriting

1.2 Introduction and Brief History of Character Recognition

1.3 Hindi Script

1.4 Gujarati Script

1.5 Motivation

1.1 Introduction to Handwriting

The importance of paper in extending human memory and facilitating communication cannot be ignored. It is used for personal communication (letters, notes, addresses on envelopes, etc.), for business communication (bank cheques, tax forms, admission forms, etc.), and for notes written to ourselves (reminders, lists, diaries, etc.). Handwriting is the most common and natural means of communication for humans. The practice of handwriting is very old and has been shaped by many civilizations and cultural ages. However, its sole purpose remains to facilitate communication and extend human memory.

According to Plamondon and Srihari (2000), the terms 'Handwriting' and 'Handwriting Recognition' can be defined as: 'Handwriting is the task of transforming a language represented in its spatial form of graphical marks into its symbolic representation'.

'Handwriting Recognition is a process that allows computers to recognize written or printed characters, such as numbers or letters, and to change them into a form that the computer can use for editing and searching.'

In modern society, we rely heavily on computers to process large amounts of data. For economic reasons or business requirements, there is great demand for entering huge amounts of printed and handwritten information into computers. Much of this data exists on paper and has to be fed into the computer by human operators: for example, letters in the mail, cheques, income tax forms, payment slips and many other business-related documents. Such time-consuming and error-prone processes have been eased by the invention of OCR (optical character recognition) systems, which read handwritten or printed data at high speed by recognizing one character at a time. However, the capabilities of current OCR systems are still inadequate, and only a small fraction of this data can be entered into the computer by them. Much effort is therefore still needed to enable them to read printed and handwritten characters more accurately.

1.2 History of Character Recognition

The earliest work on handwriting recognition was carried out in the 1960s and 1970s. Because of the poor performance achieved by the systems of that time, less research on handwriting recognition took place during the 1980s. The problem was initially considered very easy to solve but later proved to be very difficult. Although some existing handwriting recognition systems run quite well for specific applications, they still have drawbacks: it is difficult to analyze how they work, and it is impossible to precisely locate the origin of the errors they make and correct them in order to improve general performance. They are also time-consuming and need very large databases for training.

At the dawn of the third millennium, Human Handwriting Processing (HHP) is emerging from its infancy and is set to become a mature technique. In the near future the author expects to see a number of mixed systems able to read both online and offline handwriting. The author would also like to see a second generation of handwriting reading systems that consume less memory and time but are fitted with perceptual faculties and the ability to interpret ambiguous data entries. More generally, there is a clear need for methods for designing perceptual and interpretative systems, which will lead to efficient and easy-to-use multimodal and multilingual interfaces.

The focus of this chapter is a survey of research in the handwriting recognition domain. The author organizes the research reported in the field into different categories, which are described one by one. This review aims to put forth a study and analysis of handwriting recognition systems developed in the late nineties. Table 1.1 describes the evolution of character recognition systems over the past decades.

Table 1.1: Evolution of character recognition systems

Period: 1950s
General methodology: The world was modeled as being composed of blocks defined by the coordinates of their vertices and the specification of the edge information.
Remarks: Quality was heavily dependent on the ability to segment the original intensity image.

Period: 1960s
General methodology: Integrated segmentation and interpretation systems.
Remarks: This era saw rapid improvement in image acquisition as equipment improved in quality.

Period: 1970s
General methodology: Development at the computational, algorithmic and implementation levels.
Remarks: The research consisted mainly of edge finding, region growing and segmentation, and higher-level processes such as shape recognition and reasoning.

Period: 1980s
General methodology: A new direction in computer vision emerged in the form of active vision. Visual perception was treated as an active process because the vision system constantly adapts to a changing environment, e.g. by exploring, looking and searching for information.
Remarks: Decision theory is the framework behind information fusion and the control of different sensors.

Period: 1990s
General methodology: More efficient algorithms: Dynamic Programming Matching, Hidden Markov Models (HMM), Neural Networks (NN), etc.
Remarks: Interest was renewed with the rise of postal and banking applications, portable computers, and new and more suitable acquisition systems such as scanners, pen-pads, electronic paper, etc.

Period: 2000 onward
General methodology: The combination or cooperation of several independent recognizers, and the use of lexicons or dictionaries and of language models.
Remarks: In this era, post-processing has been suggested to improve the overall efficiency of the system.

Optical Character Recognition is the most crucial part of Electronic Document Analysis Systems. The solution lies at the intersection of pattern recognition, image processing and natural language processing, and it is an area of research within computer vision, artificial intelligence (AI) and pattern recognition (PR). OCR provides the mechanism to convert a machine-printed or handwritten document into an editable text format. There are two ways to recognize optical characters: offline and online character recognition. In particular, an OCR system converts scanned text documents into editable text that can be saved in digital form.

The dream of building machines that can perform tasks such as reading text like a human is not new. Although there has been a tremendous research effort, the state of the art in OCR has only reached the point of partial use in recent years. The very first attempt was made in 1870, when C. R. Carey invented an image transmission system. In the early decades of the twentieth century, many researchers made further attempts, but the modern version of OCR came into the picture in the 1940s.

1.2.1 First Generation:

The first generation of commercial OCR systems appeared between 1960 and 1965. These systems read constrained letter shapes: the characters were specially designed for machine recognition, and up to 10 different fonts could be recognized.

1.2.2 Second Generation:

The second-generation reading machines appeared from the middle of the 1960s up to the early 1970s. These systems were capable of recognizing hand-printed characters along with regular machine-printed characters. The first system of this kind was the IBM 1287. Latin-script characters were standardized in this period with the OCR-A and OCR-B fonts, which were designed to be both readable and machine-recognizable.

1.2.3 Third Generation:

The third generation of OCR systems appeared in the middle of the 1970s. The challenge was to recognize poorly scanned documents and handwritten character sets. Low cost and higher accuracy were the main objectives, and advances in information technology made them achievable.

1.2.4 Present OCR:

Today many software packages capable of optical character recognition have been developed and are available at very low cost. For Latin script, omnifont OCR systems are also available on the market. Systems are also available for English, Latin, Chinese, Cyrillic, Far Eastern and many Middle Eastern scripts. For Hindi script, OCR systems are still in the research and development stage, mainly because of the lack of a commercial market for Hindi script.

1.3 Hindi Script

Hindi is a member of the Indo-Aryan group within the Indo-Iranian branch of the Indo-European language family. It is the preferred official language of India, although much national business is also done in English and the other languages recognized in the Indian constitution. In India, Hindi is spoken as a first language by nearly 425 million people and as a second language by some 120 million more. Significant Hindi speech communities are also found in South Africa, Mauritius, Bangladesh, Yemen, and Uganda.

Literary Hindi, written in the Devanagari script, has been strongly influenced by Sanskrit. Its standard form is based on the Khari Boli dialect, found to the north and east of Delhi. Braj Bhasha, which was an important literary medium from the 15th to the 19th century, is often treated as a dialect of Hindi, as are Awadhi, Bagheli, Bhojpuri, Bundeli, Chhattisgarhi, Garhwali, Haryanawi, Kanauji, Kumayuni, Magahi, and Marwari. However, these so-called dialects of Hindi are more accurately described as regional languages of the “Hindi zone” or “belt,” an area that approximates the region of northern India, south through the state of Madhya Pradesh.

The Devanagari script is the most ancient, important and widely used script in India. It is the script used to write many Indian languages such as Hindi, Marathi, Nepali and Sanskrit [4-8]. Several other languages, such as Kashmiri and Punjabi, use close variations of this script. Devanagari belongs to the Brahmi script family. An evolutionary transition can be traced from the Brahmi script to the Gupta script, from the Gupta script to the Nagari script, and from the Nagari script to the Devanagari script. Devanagari first appeared in historical documents in the early 7th century A.D. A more stable form of Devanagari can be seen from the 11th century onwards, and the script reached its current appearance around the 12th century. The word Devanagari is considered to be a combination of two words: ‘Deva’, meaning God, Brahma or the king, and ‘nagara’, meaning the city. The name Devanagari is more recent than the older term Nagari. The Devanagari script represents consistent sounds, which are grouped together. The letters of the English alphabet can be pronounced in different ways in different places, but the letters of the Devanagari script always have the same pronunciation. Some of the differences between the Latin and Devanagari scripts are as follows:

· In the Devanagari script each character has a horizontal bar at the top, called the Shirorekha. Within a word the Shirorekha of all characters is connected; it is broken before the next word in the sentence, which marks the boundary between two words.

· There are no distinct letter cases (upper case and lower case) in the Devanagari alphabet.

· Devanagari has modifiers, or matras, a concept that is not present in English or the Latin script. They can appear as standalone characters or be combined with other letters.
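The headline property in the first point is commonly exploited through a horizontal projection profile: in a binarized word image, the row holding the most ink pixels is a natural candidate for the Shirorekha. The following is a minimal sketch of that idea, not a method stated in this text. The function name `find_shirorekha_row` and the image format (a list of rows of 0/1 values) are assumptions made for illustration.

```python
def find_shirorekha_row(image):
    """Return the index of the row with the most ink pixels.

    `image` is a list of rows; each row is a list of 0 (background)
    and 1 (ink). This input format is an assumption of the sketch.
    """
    if not image:
        raise ValueError("image must contain at least one row")
    # Horizontal projection profile: number of ink pixels in each row.
    profile = [sum(row) for row in image]
    # max() returns the first row index that attains the largest count.
    return max(range(len(profile)), key=profile.__getitem__)


# Toy 5x8 word image: row 1 is a continuous headline.
word = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [0, 1, 1, 0, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 0],
]
print(find_shirorekha_row(word))  # prints 1
```

On real handwriting the headline is rarely a single straight row, so a practical system would probably look for a band of high-density rows after skew correction rather than a single maximum.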

1.3.1 Alphabets:

There are nearly 50 basic characters in the Hindi script. The group of vowels is called Swaras and the group of consonants is called Vyanjanas. Vowels and consonants are grouped on the basis of their phonetic point of articulation. Vowels often take modified shapes, called modifiers or matras, when used in a word. Consonant modifiers are also possible. Two to five consonants can combine to form a compound character called a conjunct. Apart from these, there is a set of signs, or diacritical marks, that indicate the nasalization of vowels.

1.3.1.1 Vowels:

Hindi contains nearly 18 vowels, of which 11 are frequently used. The other vowels are seen mostly in Vedic and non-Vedic Sanskrit texts.

Vowels in Hindi are transcribed in two forms: the independent form and the dependent (matra) form. The independent form is used when the vowel letter appears alone, at the beginning of a word, or immediately following another vowel letter. Matras are used when the vowel follows a consonant. Apart from these matras, there is another set of vowels that has been added to the traditional Devanagari script.
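The two vowel forms can be illustrated with the Unicode encoding of Devanagari, which assigns separate code points to the independent letters and the dependent signs. The sketch below is an illustration only. The list of 11 vowels is an assumption (the text gives the count but not the list), and the helper names `independent_form`, `matra_form` and `write_syllable` are hypothetical. Characters are looked up by their Unicode names with the standard `unicodedata` module.

```python
import unicodedata

# Assumed list of the 11 frequently used vowels, by Unicode short name.
VOWELS = ["A", "AA", "I", "II", "U", "UU", "VOCALIC R", "E", "AI", "O", "AU"]


def independent_form(vowel):
    """Vowel letter used alone, word-initially, or after another vowel."""
    return unicodedata.lookup(f"DEVANAGARI LETTER {vowel}")


def matra_form(vowel):
    """Dependent sign attached to a preceding consonant."""
    # The short vowel "a" is inherent in a consonant, so it has no sign.
    if vowel == "A":
        return ""
    return unicodedata.lookup(f"DEVANAGARI VOWEL SIGN {vowel}")


def write_syllable(vowel, consonant=None):
    """Spell one syllable, choosing the vowel form from its context."""
    if vowel not in VOWELS:
        raise ValueError(f"unknown vowel name: {vowel}")
    if consonant is None:
        return independent_form(vowel)
    base = unicodedata.lookup(f"DEVANAGARI LETTER {consonant}")
    return base + matra_form(vowel)


# Show each vowel alone and after the consonant KA.
for v in ("AA", "I", "U"):
    print(independent_form(v), write_syllable(v, "KA"))
```

For a recognizer this matters because the same vowel sound corresponds to two visually different shapes, and the matra may sit to the left, right, top or bottom of the consonant it modifies.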

1.3.1.2 Consonants:

There are around 33 consonants in the Hindi script, grouped phonetically. The first set of 25 consonants is called occlusive, and the remaining 8 are called non-occlusive. The occlusive consonants are further divided into five groups: gutturals, palatals, cerebrals (or retroflexes), dentals and labials. The first four consonants in each group are divided into two pairs, unvoiced plosives and voiced plosives, and the last consonant is the nasal. Each pair has an unaspirated and an aspirated version (one character each). The 8 non-occlusive consonants are divided into three groups, semivowels (or approximants), sibilants and the aspirate, with four, three and one character respectively.
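The grouping described above can be written down as a small lookup table. The sketch below is one possible representation, not part of the original work. The consonants are referred to by their Unicode character names, and the column labels in `COLUMNS` and the helper `describe` are assumptions chosen for illustration.

```python
import unicodedata


def letter(name):
    """Look up a Devanagari consonant by its Unicode short name."""
    return unicodedata.lookup(f"DEVANAGARI LETTER {name}")


# Position of a consonant inside each occlusive group (assumed labels).
COLUMNS = (
    "unvoiced unaspirated",
    "unvoiced aspirated",
    "voiced unaspirated",
    "voiced aspirated",
    "nasal",
)

# 5 groups x 5 consonants = 25 occlusive consonants.
OCCLUSIVE = {
    "guttural": ["KA", "KHA", "GA", "GHA", "NGA"],
    "palatal": ["CA", "CHA", "JA", "JHA", "NYA"],
    "cerebral": ["TTA", "TTHA", "DDA", "DDHA", "NNA"],
    "dental": ["TA", "THA", "DA", "DHA", "NA"],
    "labial": ["PA", "PHA", "BA", "BHA", "MA"],
}

# 4 + 3 + 1 = 8 non-occlusive consonants.
NON_OCCLUSIVE = {
    "semivowel": ["YA", "RA", "LA", "VA"],
    "sibilant": ["SHA", "SSA", "SA"],
    "aspirate": ["HA"],
}


def describe(name):
    """Return (group, column) for a consonant; column is None if non-occlusive."""
    for group, names in OCCLUSIVE.items():
        if name in names:
            return group, COLUMNS[names.index(name)]
    for group, names in NON_OCCLUSIVE.items():
        if name in names:
            return group, None
    raise KeyError(name)


occlusive_count = sum(len(names) for names in OCCLUSIVE.values())
non_occlusive_count = sum(len(names) for names in NON_OCCLUSIVE.values())
print(occlusive_count, non_occlusive_count)  # prints 25 8
print(letter("GHA"), describe("GHA"))
```

A table like this can double as the label set of a classifier, with one class per consonant.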

1.3.1.3 Conjunct:

A combination of two to five consonants is called a conjunct. There are nearly a thousand different conjuncts in the Hindi script. A few of these conjuncts partially retain the shape of the constituent consonants. Others take forms that are not derived from the letters making up their components.

Diacritics are glyphs added to a character, or basic glyph, to change the sound of the letter. Some of the most used diacritics in the Hindi script are Visarga, Chandra, Halant and Nukta. Visarga is an unvoiced variation of ha. Chandra is an open mid front rounded independent vowel; in its dependent form it is placed on top of the consonant. Chandrabindu is used to represent the nasalization of a vowel. Halant is used to represent a lone consonant without a vowel: it kills the inherent vowel “a” and reduces the consonant to its base form. Nukta is used to represent Persian sounds encountered in some borrowed Urdu words.
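In digital text, conjuncts and diacritics are usually not stored as separate glyphs. In the Unicode encoding, a conjunct is written as a sequence of consonants joined by the halant (named VIRAMA in Unicode), and the font renders the combined shape. The sketch below illustrates this under that assumption. The helper names `sign` and `conjunct` are hypothetical, and the Unicode encoding itself is not discussed in the original text.

```python
import unicodedata


def sign(name):
    """Look up a Devanagari sign by its Unicode short name."""
    return unicodedata.lookup(f"DEVANAGARI SIGN {name}")


HALANT = sign("VIRAMA")          # suppresses the inherent vowel
NUKTA = sign("NUKTA")            # marks borrowed sounds
VISARGA = sign("VISARGA")
CHANDRABINDU = sign("CANDRABINDU")  # nasalization mark


def conjunct(*consonants):
    """Join two to five consonants into one conjunct sequence."""
    if not 2 <= len(consonants) <= 5:
        raise ValueError("a conjunct combines two to five consonants")
    # Every consonant except the last loses its inherent vowel.
    return HALANT.join(consonants)


ka = unicodedata.lookup("DEVANAGARI LETTER KA")
ssa = unicodedata.lookup("DEVANAGARI LETTER SSA")
ja = unicodedata.lookup("DEVANAGARI LETTER JA")

print(conjunct(ka, ssa))   # KA + halant + SSA, rendered as a single conjunct
print(ka + HALANT)         # lone consonant without a vowel
print(ja + NUKTA)          # JA with nukta
```

For handwritten OCR the consequence is that one connected shape on paper may need to be mapped to a sequence of three or more code points.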

1.4 Gujarati Script

The Gujarati script is an adaptation of the ancient Nagari script. The earliest known document in the Gujarati script is a manuscript dating from 1592, and the script first appeared in print in a 1797 advertisement [1]. Until the 19th century it was used mainly for writing letters and keeping accounts. The Gujarati language belongs to the Indo-Aryan family of languages and is used by more than 50 million people in the Indian states of Gujarat, Maharashtra, Rajasthan, Karnataka and Madhya Pradesh. It is also used abroad in Bangladesh, Fiji, Kenya, Malawi, Mauritius, Oman, Pakistan, Réunion, Singapore, South Africa, Tanzania, Uganda, United Kingdom, USA, Zambia and Zimbabwe. A formal grammar of the precursor of this language was written by the Jain monk and eminent scholar Hemachandracharya. Gujarati has a rich oral culture and a literary tradition that dates back to the tenth century. The construction of Gujarati can be considered to lie somewhere between those of Hindi and Marathi.

The Gujarati script, which is used to write the Gujarati language, is a multilevel script like those of other Indian languages. The main difference is that it has no headline, or “shirorekha”, as Bangla and Hindi do. The presence of the shirorekha in other Indian scripts helps to identify words as well as to extract upper modifiers. Its absence in Gujarati makes it all the more difficult to segment a text line and then extract words from handwritten text. This paper describes the process of segmenting a handwritten Gujarati text line into words. The basic assumption is that the input is a preprocessed, digitized image of a Gujarati text line with sufficiently good handwriting.
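Without a headline to rely on, a common fallback for word extraction is the vertical projection profile: columns of the text-line image that contain no ink are treated as gaps, and sufficiently wide gaps are taken as word boundaries. The sketch below shows this generic technique, not necessarily the procedure used in this work. The function name `segment_words`, the `min_gap` parameter and the 0/1 image format are assumptions made for illustration.

```python
def segment_words(line, min_gap=2):
    """Split a binarized text line into word spans.

    `line` is a list of rows of 0 (background) and 1 (ink).
    Returns a list of (first_column, last_column) pairs, both inclusive.
    A run of at least `min_gap` empty columns separates two words;
    narrower gaps are treated as spacing inside a word.
    """
    if not line:
        return []
    width = len(line[0])
    # Vertical projection profile: number of ink pixels in each column.
    profile = [sum(row[col] for row in line) for col in range(width)]

    words = []
    start = None      # first ink column of the word being built
    last_ink = None   # most recent column that contained ink
    for col, ink in enumerate(profile):
        if ink == 0:
            continue
        if start is None:
            start = col
        elif col - last_ink - 1 >= min_gap:
            # The empty run before this column is wide enough: close the word.
            words.append((start, last_ink))
            start = col
        last_ink = col
    if start is not None:
        words.append((start, last_ink))
    return words


# Toy line: ink in columns 0, 1, 3 (one word) and 7, 8 (another word).
text_line = [
    [1, 1, 0, 1, 0, 0, 0, 1, 0],
    [0, 1, 0, 1, 0, 0, 0, 1, 1],
]
print(segment_words(text_line))  # prints [(0, 3), (7, 8)]
```

In handwriting the gap width varies from writer to writer, so `min_gap` would normally be estimated from the line itself (for example from the distribution of gap widths) rather than fixed in advance.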

1.4.1 Vowels

As in other Indian languages, the character set of Gujarati comprises 35 consonants, 13 vowels and 6 signs, 13 dependent vowel signs, 4 additional vowels for Sanskrit, 10 digits, and 1 currency sign. The consonants can be combined with the vowels to form compound characters.

The many possible conjunct consonants increase the difficulty of segmenting and identifying characters.

1.5 Motivation

OCR finds wide application as a telecommunication aid for the deaf, in postal address reading, direct processing of documents, foreign language recognition, etc. This problem has been explored in depth for the Latin script. This research work describes an OCR engine that processes Hindi and Gujarati numerals and characters. Modern optical character recognition can be divided into four main phases: pre-processing, segmentation, identification, and post-processing. The pre-processing phase prepares the image for the next phase and includes such tasks as noise reduction and skew correction. The segmentation phase deals with dividing an image into sub-images, where lines, words, and, ultimately, individual characters are separated. The identification phase accepts the output from the segmentation phase and attempts to recognize the sub-images passed to it as characters. The post-processing phase attempts to reassemble the characters into words and sentences. The main focus of this thesis is the recognition of Hindi and Gujarati numerals and characters.
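The four phases can be pictured as a chain of functions, each consuming the output of the previous one. The toy sketch below only illustrates that structure and is not the engine described in this work. Every function name, the threshold value and the exact-match `templates` dictionary are assumptions, and noise reduction, skew correction and any real classifier are deliberately left out.

```python
def preprocess(gray, threshold=128):
    """Pre-processing: binarize a grayscale image (dark pixels become ink)."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def segment(binary):
    """Segmentation: cut the image at empty columns into character pieces.

    Each piece is a list of columns; each column is a list of 0/1 values.
    """
    width = len(binary[0]) if binary else 0
    pieces, current = [], []
    for col in range(width):
        column = [row[col] for row in binary]
        if any(column):
            current.append(column)
        elif current:
            pieces.append(current)
            current = []
    if current:
        pieces.append(current)
    return pieces


def identify(piece, templates):
    """Identification: exact template lookup, '?' when nothing matches."""
    key = tuple(tuple(column) for column in piece)
    return templates.get(key, "?")


def postprocess(labels):
    """Post-processing: reassemble the recognized characters into text."""
    return "".join(labels)


def recognize(gray, templates):
    """Run the four phases in order."""
    binary = preprocess(gray)
    pieces = segment(binary)
    labels = [identify(piece, templates) for piece in pieces]
    return postprocess(labels)


# Toy 2x5 grayscale image holding two "characters" separated by blank columns.
gray = [
    [0, 255, 255, 0, 0],
    [0, 255, 255, 255, 0],
]
# Hypothetical templates, keyed by the column pattern of each character.
templates = {
    ((1, 1),): "1",
    ((1, 0), (1, 1)): "7",
}
print(recognize(gray, templates))  # prints 17
```

Keeping the phases as separate functions makes it possible to improve one stage, such as segmentation, without touching the others.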

Problems Identified:

· Limited research in skew detection and correction for Hindi and Gujarati documents

· Unavailability of a proper segmentation algorithm for improved accuracy

· Inability to separate the text part from the document image accurately

· Requirement/use of a large number of features for recognition

· Requirement of a separate recognition engine for handwritten and printed documents

· Need for separation of handwritten and printed documents in Hindi and Gujarati script

Problem Definition:

Handwritten text in a document needs special attention: printed text differs in font face and font size, whereas the characteristics of handwritten text vary from person to person. A document may also contain different symbols and multi-directional text (e.g. vertical or diagonal), which makes the problem of character recognition more difficult.

Some of these issues will be considered in order to propose new solutions to the problems mentioned. New techniques can also be employed to improve the speed and efficiency of document text recognition. The intention of this thesis is to investigate and develop an approach for the recognition of handwritten Hindi and Gujarati characters and numerals. Human-computer interaction is an essential part of today’s lifestyle and is used in search engines, social media, artificial intelligence, etc. There is scope for improvement in the preprocessing, segmentation, feature extraction, feature selection, clustering and classification stages of OCR to improve the overall performance of the system.

29 April 2022