Approaches and Components of Speech Recognition
Speech recognition is the process of converting a speech signal into a sequence of words by means of an algorithm implemented as a computer program. The speech recognition process aims to give a machine the ability to hear, understand, and act upon spoken information. Speech recognition concentrates on the recognition of spoken words, the recognition of speakers, the recognition of spoken language, and the recognition of emotion. Music information retrieval, by contrast, analyzes the structure of music and retrieves similar pieces of music, instruments, and musical genres, while environmental sound retrieval covers types of sounds that are neither music nor speech. There are four basic approaches to developing an automatic speech recognition system:

Template Matching: this is whole-word matching, the most widespread and most commercially available approach to automatic speech recognition. The method involves obtaining one or more utterances of every word that is to be recognized, from one or more speakers, and storing spectrographic representations of these utterances via a voice-coding (filtering and digitizing) process.
A whole-word template-matching recognizer can succeed only on words with which it has been trained, and with a reasonably large vocabulary it is obviously inefficient.
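As an illustration, the sketch below implements whole-word template matching with dynamic time warping (DTW), a common way to compare a stored template against an input utterance of a different length. The feature arrays and word names are hypothetical placeholders; a real system would use spectral feature vectors produced by the voice-coding step.

```python
import numpy as np

def dtw_distance(template, utterance):
    """Dynamic time warping distance between two feature sequences.

    Each argument is a 2-D array of shape (frames, features).
    """
    n, m = len(template), len(utterance)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(template[i - 1] - utterance[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

def recognize(utterance, templates):
    """Return the word whose stored template is closest under DTW."""
    return min(templates, key=lambda w: dtw_distance(templates[w], utterance))

# Hypothetical usage: templates = {"yes": feats_yes, "no": feats_no}
# word = recognize(input_feats, templates)
```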
Stochastic Approaches: stochastic processing makes a sequence of non-deterministic selections from among sets of choices. The selections are non-deterministic because they are not specified in advance but are governed by the characteristics of the input.
Speech recognition can be viewed as a situation where selections have to be made based on uncertain information, and stochastic modeling is a flexible and general method for handling situations where uncertainty prevails. Like template matching, stochastic processing requires the creation and storage of models of each of the items to be recognized. Unlike template matching, however, stochastic processing involves no direct matching between stored models and the input. Instead, it is based on complex statistical and probabilistic analyses that are best understood by examining the network-like structure in which those statistics are stored. The hidden Markov model (HMM) is one popular stochastic approach to speech recognition problems.
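A minimal sketch of the forward algorithm, which computes the likelihood of an observation sequence under a discrete-observation HMM, follows. The transition, observation, and initial probabilities here are invented for illustration and do not come from any real recognizer.

```python
import numpy as np

# Illustrative 2-state HMM over 3 possible discrete observations.
A = np.array([[0.7, 0.3],        # state-transition probabilities
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],   # observation probabilities per state
              [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])        # initial state distribution

def forward(obs):
    """Likelihood of a discrete observation sequence under the HMM."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # propagate and weight by emission
    return alpha.sum()

print(forward([0, 1, 2]))  # likelihood of the sequence 0, 1, 2
```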
Neural Networks: neural networks are described as connectionist systems because of the connections between their individual processing nodes. They are also known as adaptive systems, because the values of these connections can change so that the network performs more effectively, and as parallel distributed processing systems, after the way in which the many nodes, or neurons, of a neural network operate.
Neural networks are, above all, classification systems. They specialize in classifying noisy, patterned, variable data streams containing multiple, overlapping, interacting, and incomplete signals [40]. Speech recognition is a classification task with all of these characteristics, making neural networks a possible alternative approach to speech recognition.
Like stochastic and template-matching techniques, neural networks require training. Unlike them, however, neural networks do not require that a complete specification of the problem be created before a network-based solution is developed. Instead, networks learn patterns solely through exposure to large numbers of examples, making it possible to construct neural networks for auditory models and other poorly understood areas. Neural networks use manually segmented or labeled speech elements and often store speech items as whole units rather than as time sequences.
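As a rough illustration of such a classifier, the sketch below trains a small multilayer perceptron on randomly generated stand-ins for labeled feature vectors, assuming scikit-learn is available. The 13-dimensional features and four phone classes are placeholders, not values from any described system.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 13))     # toy 13-dim acoustic feature vectors
y = rng.integers(0, 4, size=200)   # toy labels: 4 hypothetical phone classes

# One hidden layer of 32 units; learns purely from labeled examples.
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:5]))          # predicted phone classes for 5 frames
```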
Knowledge-Based Approaches: The main idea is to compile and incorporate knowledge from a variety of knowledge sources so that the system imitates a human being.
The knowledge sources that are considered useful include:
Acoustic knowledge – evidence of which sounds are spoken on the basis of spectral measurements and presence or absence of features.
Lexical knowledge – the combination of acoustic evidence so as to nominate words, as specified by a lexicon that maps sounds into words or, equivalently, decomposes words into sounds.
Syntactic knowledge – the combination of words to form grammatically correct strings, according to a language model, such as sentences or phrases.
Semantic knowledge – understanding of the task domain so as to be able to validate sentences or phrases that are consistent with the task being performed, or which are consistent with previously decoded sentences.
Pragmatic knowledge – inference ability necessary in resolving ambiguity of meaning based on ways in which words are generally used.
All these knowledge sources interact in order to arrive at a decision about a speech sample. The crucial question with such a complex scheme is: how do these levels interact?
An automatic speech recognition (ASR) system has the following components.
Front End: The front end extracts data from the digital representation of the spoken words, casting it into a form the decoder can use. To make pattern recognition easier, the digital audio is transformed into the frequency domain, where the frequency components of a sound can be identified. From these frequency components, it is possible to approximate how the human ear perceives the sound. The transformation results in a graph of the amplitudes of the frequency components, describing the sound that was heard.
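A minimal sketch of this transformation, with a synthetic 25 ms tone standing in for a real speech frame: the frame is windowed and its amplitude spectrum is computed with a discrete Fourier transform.

```python
import numpy as np

fs = 16000                                   # sampling rate (Hz)
t = np.arange(0, 0.025, 1 / fs)              # one 25 ms frame
frame = np.sin(2 * np.pi * 440 * t)          # stand-in for a speech frame

windowed = frame * np.hamming(len(frame))    # taper to reduce spectral leakage
spectrum = np.abs(np.fft.rfft(windowed))     # amplitudes of frequency components
freqs = np.fft.rfftfreq(len(windowed), 1 / fs)
print(freqs[spectrum.argmax()])              # dominant frequency, ~440 Hz
```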
Acoustic Model: the acoustic model is a statistical model of meaningful speech units, built from feature vectors computed from the speech. The speech signal is first chunked into overlapping 20-30 ms windows, one every 10 ms, and a spectral representation is computed from each frame.
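The framing step can be sketched as follows; the 25 ms window and 10 ms shift are one common choice within the range just mentioned.

```python
import numpy as np

def frame_signal(signal, fs, win_ms=25, shift_ms=10):
    """Chunk a signal into overlapping windows (e.g. 25 ms every 10 ms)."""
    win = int(fs * win_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    n_frames = 1 + (len(signal) - win) // shift
    return np.stack([signal[i * shift : i * shift + win]
                     for i in range(n_frames)])

fs = 16000
frames = frame_signal(np.random.randn(fs), fs)  # 1 s of stand-in audio
print(frames.shape)                              # (98, 400)
```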
The meaningful units of speech that are most often used in ASR are phones. Phones are the minimal units of speech that are part of the sound system of a language, which serve to distinguish one word from another. The dominant approach to acoustic modeling in speech recognition is to use Hidden Markov Models (HMMs). An alternative to the standard HMM approach is a hybrid approach in which Artificial Neural Networks (ANN) and HMMs are employed. In order to recognize speech, the acoustic models should first be trained. During training, the parameters for the models are estimated from recorded speech material which has been orthographically transcribed (i.e., at word level). Moreover, a phonetic transcription of the words is needed. Transforming a word sequence to a phone sequence is accomplished by looking up the phonetic transcription for a word in the lexicon.
Pronunciation Model: the pronunciation model consists of the orthography of the words that occur in the training material and their corresponding phonetic transcriptions. It specifies the finite set of words that may be output by the speech recognizer. The transcriptions can be obtained either manually or through grapheme-to-phoneme conversion. A pronunciation dictionary can be classified as canonical or alternative on the basis of the pronunciations it includes. For each word, a canonical pronunciation dictionary includes only the most probable pronunciation, the one assumed to occur in read speech; it does not cover pronunciation variation due to speaker variability, dialect, or co-articulation in conversational speech. An alternative pronunciation dictionary, on the other hand, includes all the alternate pronunciations that are assumed to occur in speech.
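A toy illustration of canonical versus alternative lexicons follows; the words and ARPAbet-style transcriptions are made up for the example.

```python
# Toy canonical lexicon: one phonetic transcription per word.
canonical = {
    "speech": ["S", "P", "IY", "CH"],
    "either": ["IY", "DH", "ER"],
}

# Toy alternative lexicon: all plausible variants per word.
alternative = {
    "speech": [["S", "P", "IY", "CH"]],
    "either": [["IY", "DH", "ER"], ["AY", "DH", "ER"]],  # two common variants
}

def to_phones(words, lexicon):
    """Map a word sequence to a phone sequence via a canonical lexicon."""
    return [p for w in words for p in lexicon[w]]

print(to_phones(["speech"], canonical))  # ['S', 'P', 'IY', 'CH']
```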
Language Model: the language model, which includes the dictionary, is used by the decoder to determine the most likely suggestion. It describes to the decoder the relationship between words and the probability of words appearing in a particular order. Language models may be domain specific, and in context-free implementations the language model is referred to as the grammar. Its goal is to predict the likelihood of specific words occurring one after another in a given language. Typical recognizers use n-gram language models, most commonly trigrams. An n-gram model contains the prior probability of the occurrence of a word (unigram) or of a sequence of words (bigram, trigram, etc.):

unigram probability: P(wi)
bigram probability: P(wi | wi−1)
n-gram probability: P(wn | wn−1, wn−2, …, w1)
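A minimal sketch of estimating bigram probabilities by maximum likelihood from a toy corpus; real language models are trained on far larger text collections and use smoothing.

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

unigrams = Counter(corpus)                  # word counts
bigrams = Counter(zip(corpus, corpus[1:]))  # adjacent word-pair counts

def p_bigram(w, prev):
    """Maximum-likelihood estimate of P(w | prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

print(p_bigram("cat", "the"))  # 2 of the 3 occurrences of "the" precede "cat"
```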
Decoder: The acoustic model is produced by the ASR training tools. The trainer processes audio files together with their text transcriptions to extract common acoustic characteristics for each individual phoneme (context-independent phoneme) as well as for each phoneme with its context (context-dependent phoneme). Each phoneme is divided into five states in a time sequence, since its characteristics differ slightly from beginning to end; all of these phoneme-state characteristics together form the voice model. The decoder, then, is the speech engine that takes the acoustic model, the pronunciation model, the language model, and an observation sequence, and outputs the most likely sequence of words.
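Decoding over an HMM is typically done with the Viterbi algorithm, which finds the single most likely state sequence for an observation sequence. A sketch, reusing the illustrative two-state model from the stochastic-approaches section above:

```python
import numpy as np

# Same illustrative 2-state HMM as in the forward-algorithm sketch.
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])

def viterbi(obs):
    """Most likely state sequence for a discrete observation sequence."""
    delta = np.log(pi) + np.log(B[:, obs[0]])
    back = []
    for o in obs[1:]:
        scores = delta[:, None] + np.log(A)  # score of each transition
        back.append(scores.argmax(axis=0))   # best predecessor per state
        delta = scores.max(axis=0) + np.log(B[:, o])
    path = [int(delta.argmax())]             # best final state
    for ptr in reversed(back):               # trace predecessors backwards
        path.append(int(ptr[path[-1]]))
    return path[::-1]

print(viterbi([0, 1, 2]))  # [0, 0, 1] for these illustrative parameters
```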
Feature Extraction: feature extraction converts the speech signal into a sequence of acoustic feature vectors. The goal is to extract features that carry the maximum amount of information relevant for classification, that is, features that are robust to acoustic variation but sensitive to linguistic content.
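Mel-frequency cepstral coefficients (MFCCs) are one widely used feature set with these properties. A sketch using the librosa library (assuming it is installed), with a synthetic tone standing in for recorded speech:

```python
import numpy as np
import librosa

# A one-second synthetic signal stands in for recorded speech.
fs = 16000
y = np.sin(2 * np.pi * 440 * np.arange(fs) / fs).astype(np.float32)

# 13 MFCCs per frame: compact features that track linguistic content
# while smoothing over fine acoustic variation.
mfcc = librosa.feature.mfcc(y=y, sr=fs, n_mfcc=13)
print(mfcc.shape)  # (13, number_of_frames)
```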