Recognizing Urdu Language Speech Using Deep Learning Techniques
Speech recognition is one of the capabilities a machine needs to be truly intelligent. A great deal of research has been done on the topic, as the many voice-driven applications on the market attest. Voice-driven personal assistant tools are multiplying, and with them the need for efficient speech recognition systems grows. Many efficient speech recognition tools exist in the market, and some are also available online, but their major drawback is that they recognize only English or other high-resource languages. One way of dealing with a low-resource language is to pair it with a high-resource language, on the assumption that the two share linguistic roots. The major flaw of systems built this way is that the high-resource language used is usually English. English may share roots with many European languages, but the approach fails for languages widely spoken in the subcontinent. Pakistan, the sixth most populous country, holds a huge market for voice-driven personal assistant tools, yet most available tools either use English or support languages that share roots with English, and so fail to recognize Urdu. This shows the need for an automated system that can recognize the Urdu language. Urdu speech recognition can also serve as a base for training speech recognition systems for many regional languages. In this research, we will create a dataset with the help of a web and an Android application, and develop an automated speech recognition system by applying multiple deep learning techniques to the gathered data.
Introduction
Speech recognition is one of the few human skills we need to replicate to make a machine truly human-like. We humans convert sound waves to text in our minds automatically, but building a machine that can perform the same task is proving to be quite a challenge. After the introduction of personal assistant tools such as Amazon's Alexa, Google Assistant, Microsoft's Cortana, and Apple's Siri, considerable research resources have been allocated to this field. Even so, speech recognition still has much to overcome. Most state-of-the-art speech recognition software and tools work well on high-resource languages such as English and Mandarin but perform very poorly on low-resource languages such as Urdu and Persian. Low-resource languages are those that are not spoken in many parts of the world; because they are not widely spoken, few publicly available resources exist for working with them. The scarcity of resources is partly mitigated for languages that share graphemes and phonemes with English, but the same cannot be said of languages whose phonemes and graphemes differ widely from those of the high-resource languages.
Even though the use of personal assistant tools has increased in Pakistan, it has not reached its full potential. One reason is that most tools do not work well with Urdu, the national language of Pakistan. Urdu shares many of its phonemes with Arabic and Persian but differs considerably in its graphemes. To target the Pakistani market, a great deal of work needs to be done in speech recognition of the Urdu language. At present, few tools on the market can effectively convert Urdu speech into Urdu text; those that do exist are very costly. The significance of the chosen topic is that it directly targets our national language in terms of speech recognition. Urdu has around 60 million native speakers and currently ranks among the 25 most commonly spoken languages in the world. A working model for Urdu can serve as a base not only for many regional languages spoken in Pakistan but also for languages that share phonemes and graphemes with it.
Background and Literature Review
Neural networks have always played a vital role in the development of speech recognition systems. Earlier systems used combinations of Hidden Markov Models (HMMs) and neural networks to classify phonemes. In more recent years, large improvements have been made by using Recurrent Neural Networks (RNNs), typically in combination with Connectionist Temporal Classification (CTC), to produce the best possible results. Speech is a form of continuous data: the current phoneme depends not only on the previous phoneme but on the future one as well. Conventional objective functions, however, require pre-segmented training data, and once the data is segmented, each point is treated as independent of its past and future data points. The main idea behind CTC is to define a probability distribution over all possible output label sequences that can be formed for a given input, which removes the need for segmentation. CTC also introduced a special symbol called the blank, emitted whenever no conclusive output phoneme can be produced. The combination of RNN and CTC has been used in many papers.
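To make this setup concrete, the following is a minimal sketch, not the exact configuration of any cited system, of a bidirectional RNN acoustic model trained with CTC loss in PyTorch. The vocabulary size, feature dimension, and batch shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Minimal sketch: a bidirectional RNN acoustic model trained with CTC loss.
# The vocabulary size, feature dimension, and batch shapes are illustrative.
NUM_CLASSES = 45          # e.g. Urdu graphemes + 1 for the CTC blank symbol
FEAT_DIM = 80             # e.g. log-mel filterbank features per frame

class RNNAcousticModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(FEAT_DIM, 256, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * 256, NUM_CLASSES)

    def forward(self, x):
        out, _ = self.rnn(x)                  # (batch, time, 2*hidden)
        return self.fc(out).log_softmax(-1)   # per-frame log-probabilities

model = RNNAcousticModel()
ctc_loss = nn.CTCLoss(blank=0)                # index 0 reserved for the blank

# Dummy batch: 4 utterances, 120 frames each; targets are grapheme indices.
feats = torch.randn(4, 120, FEAT_DIM)
targets = torch.randint(1, NUM_CLASSES, (4, 30))
input_lengths = torch.full((4,), 120, dtype=torch.long)
target_lengths = torch.full((4,), 30, dtype=torch.long)

log_probs = model(feats).transpose(0, 1)      # CTCLoss expects (time, batch, classes)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```

Because CTC sums over all alignments between the frame-level outputs and the target sequence, no frame-by-frame segmentation of the training audio is needed.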
Many papers also use sequence-to-sequence models such as the Listen, Attend and Spell (LAS) model [6] and the RNN Transducer. The LAS model consists of an encoder (the Listener module), a decoder (the Speller module), and an attention mechanism. The encoder, a bidirectional RNN, takes an audio file as input and produces feature vectors. The attention mechanism takes these feature vectors as input and generates a context vector for each decoding step based on the current decoder state. The decoder then predicts the output character corresponding to the current input. The idea behind the LAS model was to learn the entire recognizer jointly, removing the need for a separately trained language model to produce the final output. LAS has served as the basis for many recent speech recognition models.
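As an illustration of the sequence-to-sequence approach, the sketch below implements a heavily simplified LAS-style listener and speller with dot-product attention in PyTorch. The published LAS model uses a pyramidal BLSTM listener, MLP-based attention, and a multi-layer speller, so every dimension and module choice here is an assumption made for clarity only.

```python
import torch
import torch.nn as nn

FEAT_DIM, HID, VOCAB = 80, 256, 45   # illustrative sizes

class Listener(nn.Module):            # encoder: audio frames -> feature vectors
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(FEAT_DIM, HID, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * HID, HID)

    def forward(self, x):
        h, _ = self.rnn(x)
        return self.proj(h)           # (batch, time, HID)

class Speller(nn.Module):             # decoder: attends over encoder output
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HID)
        self.rnn = nn.LSTMCell(2 * HID, HID)
        self.out = nn.Linear(2 * HID, VOCAB)

    def forward(self, enc, tokens):   # teacher-forced decoding over `tokens`
        batch, steps = tokens.shape
        h = enc.new_zeros(batch, HID)
        c = enc.new_zeros(batch, HID)
        context = enc.new_zeros(batch, HID)
        logits = []
        for t in range(steps):
            inp = torch.cat([self.embed(tokens[:, t]), context], dim=-1)
            h, c = self.rnn(inp, (h, c))
            # dot-product attention: weight encoder frames by similarity to h
            scores = torch.bmm(enc, h.unsqueeze(-1)).squeeze(-1)
            context = torch.bmm(scores.softmax(-1).unsqueeze(1), enc).squeeze(1)
            logits.append(self.out(torch.cat([h, context], dim=-1)))
        return torch.stack(logits, dim=1)   # (batch, steps, VOCAB)

enc = Listener()(torch.randn(2, 100, FEAT_DIM))
logits = Speller()(enc, torch.randint(0, VOCAB, (2, 12)))
```

The key design point is that the context vector is recomputed at every output step, so each predicted character can attend to a different region of the audio.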
A lot of work has also been done towards recognizing multiple languages with a single automated system. One such model had a WER of 22.91 but was able to recognize not only the text but the language as well. More recent research extends this line of work on speech recognition of multiple languages. Most languages in use today evolved from one language or another, which implies that many languages share some graphemes and phonemes. A universal character set includes all of the common phonemes and graphemes as well as the characters that are distinct to each language; the common characters are trained on all datasets, whereas language-specific characters are trained only on data from that specific language.
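The idea of a universal character set can be illustrated with a short sketch. The Urdu and Persian grapheme lists below are simplified and illustrative assumptions, not taken from the cited work.

```python
# Illustrative sketch (not from the cited work): building a "universal"
# grapheme vocabulary in which characters shared across languages map to a
# single index, while language-specific characters keep their own entries.
# Both grapheme sets are simplified approximations of the real alphabets.
urdu_graphemes = set("ابپتٹثجچحخدڈذرڑزژسشصضطظعغفقکگلمنوہیے")
persian_graphemes = set("ابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی")

shared = urdu_graphemes & persian_graphemes        # trained on all data
urdu_only = urdu_graphemes - persian_graphemes     # trained on Urdu data only
persian_only = persian_graphemes - urdu_graphemes  # trained on Persian data only

# One flat index space over the union; each shared symbol gets a single id.
universal = sorted(shared) + sorted(urdu_only) + sorted(persian_only)
char_to_id = {ch: i for i, ch in enumerate(universal)}
```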
Research Question and Objectives
The purpose of this paper is to develop an automated speech recognition system for the Urdu language using one or more deep learning techniques. Many speech recognition systems have been developed using different deep learning techniques, but they do not offer models for Urdu. In terms of population, Pakistan is the sixth largest country and thus holds a huge market for personal assistant tools; accurate and effective Urdu speech recognition is required to tackle that market. Developing such an automated system will also lay the base for developing systems for smaller regional languages spoken in Pakistan, such as Punjabi, Sindhi, Pashto, and Balochi. The reasons mentioned above show the need for developing an automated Urdu speech recognition system.
Research Methodology
The first step of the thesis will be to create our own corpus. The corpus will be collected by means of an Android application and a web application, each of which will display a sentence and ask the user to read it aloud. The user's voice will be recorded as a 16 kHz WAV audio file. A moderator or admin of both applications will verify whether each recorded audio file matches the given sentence; audio files that are not verified will not be included in the training set. The sentence corpus itself will be generated by web-scraping multiple Urdu news websites. Different deep learning techniques will then be applied to the collected corpus, and the technique that gives the best results will be used in the development of the automated system.
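As a concrete (and assumed, not final) example of how a verified recording could be turned into model input, the following sketch loads a 16 kHz WAV file with torchaudio and computes log-mel features, the input representation used in the sketches above. The file path and feature settings are illustrative.

```python
import torchaudio

# Minimal preprocessing sketch for one verified recording: load the 16 kHz
# WAV file and convert it to log-mel features. The file name and feature
# settings are illustrative assumptions.
waveform, sample_rate = torchaudio.load("recordings/utt_0001.wav")
assert sample_rate == 16000, "recordings are collected at 16 kHz"

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=80)
features = mel(waveform).clamp(min=1e-10).log()   # (channels, 80 mels, frames)
```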
Conclusion
In this thesis, we will describe an automated system that takes an audio file as input and converts its spoken content into text. This automated speech recognition system can then be used to achieve a variety of tasks.