Subword Based Language Modeling And Automatic Speech Recognition
Since an ASR system can recognize only the words in its recognition vocabulary, coverage is a critical issue in language modeling. Amharic, being a morphologically rich language, suffers from data sparseness and out-of-vocabulary (OOV) problems. In Amharic, many different words can be derived from a single stem by adding prefixes and suffixes. As a result, word-based models yield a high OOV rate in agglutinative languages; increasing the vocabulary size is one remedy, but it requires more memory and processing power. In addition, a large vocabulary makes the language model less robust because of data sparseness, since far more training data is needed to model a large number of words. A feasible solution is therefore to use subwords, rather than words, as the language modeling unit.
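To make this concrete, the sketch below shows how a data-driven subword segmenter might be trained and applied. It assumes the SentencePiece toolkit (not mentioned above); the corpus file name, vocabulary size, and the example word form are purely illustrative.

```python
import sentencepiece as spm

# Train an unsupervised subword model on a raw text corpus.
# "amharic_corpus.txt" and the vocabulary size are placeholders.
spm.SentencePieceTrainer.train(
    input="amharic_corpus.txt",
    model_prefix="am_subword",
    vocab_size=8000,
    model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="am_subword.model")

# A previously unseen surface form is decomposed into known subword units,
# so it no longer counts as out-of-vocabulary for the language model.
print(sp.encode("በቤታችንም", out_type=str))
```

Because every word can be broken down into units from a fixed subword inventory, the effective OOV rate of the language model drops without growing the vocabulary.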
Segmentation
Segmentation is accomplished by applying acoustic analysis to the signal and looking for transitions. The transitions define segments of the signal, which can then be treated like individual sounds. For example, a recording of a concert could be scanned automatically for applause to determine the boundaries between musical pieces. Similarly, after training the system to recognize a certain speaker, a recording could be segmented and scanned for all the sections in which that speaker was talking. Segmentation techniques create information about spoken content that is useful for IR. Segmentation of spoken content can take place either through direct analysis of the audio or by using the ASR transcripts. The importance of forming audio and topical segments has been recognized in the management of speech content. Segmenting the audio prior to recognition can help improve the quality of the ASR transcripts. In general, ASR systems do not process a continuous stream of speech; rather, they disassemble it into smaller pieces and generate a hypothesis string for each. Segments created for the purposes of ASR may be of fixed length or may be divided at pauses, where the energy of the speech signal drops off. Segments may approximately correspond to utterances, but whether they are also useful for SCR depends on the application. In general, the quality of the segmentation depends strongly on the segmentation algorithm used and on the nature of the audio signal being segmented.
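A minimal sketch of the pause-based segmentation mentioned above is given below. The frame size, hop size, energy threshold, and minimum pause length are assumptions chosen only for illustration; a real system would tune them to the recording conditions.

```python
import numpy as np

def segment_at_pauses(signal, sr, frame_ms=25, hop_ms=10,
                      energy_floor_db=-35.0, min_pause_frames=30):
    """Split a mono signal at low-energy regions (pauses)."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)

    # Short-time log energy per frame.
    energies = []
    for start in range(0, len(signal) - frame, hop):
        chunk = signal[start:start + frame]
        energies.append(10 * np.log10(np.sum(chunk ** 2) + 1e-10))
    energies = np.array(energies)

    # Frames within energy_floor_db of the loudest frame count as speech.
    voiced = energies > (energies.max() + energy_floor_db)

    # Place a boundary wherever enough consecutive frames are silent.
    segments, seg_start, silent_run = [], 0, 0
    for i, v in enumerate(voiced):
        silent_run = 0 if v else silent_run + 1
        if silent_run == min_pause_frames:
            end = (i - min_pause_frames + 1) * hop
            if end > seg_start:
                segments.append((seg_start, end))
            seg_start = i * hop
    segments.append((seg_start, len(signal)))
    return segments  # list of (start_sample, end_sample) pairs
```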
Using segmented words can increase precision and thus improve the quality of the added terms. Therefore, terms for extraction are indexed as segmented words within the side collection for selection purposes. The segmentation is based on lexicon knowledge derived from the audio data. However, for retrieval of the spoken documents, indexing is still based on overlapping characters: all segmented words are likewise converted into overlapping character units before being added back to the spoken document.
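The conversion of a segmented word into overlapping character units might look like the following sketch; the n-gram size of two and the example word are assumptions, since the text above only states that indexing uses overlapping characters.

```python
def char_ngrams(term, n=2):
    """Break a term into overlapping character n-grams for indexing."""
    return [term[i:i + n] for i in range(len(term) - n + 1)] or [term]

# A segmented word is added back to the index as its overlapping character units.
print(char_ngrams("ልጆች"))  # e.g. ['ልጆ', 'ጆች']
```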
The segmentation approaches used in the systems mentioned above, and in other retrieval systems, cover a wide range of methods. In general, the methods can be divided into two groups: audio classification and change detection.
Audio Classification
One part of a speech retrieval system concerns identifying different audio classes. The main classes considered are speech, music, noise, and silence, but depending on the application more specific classes, such as noisy speech, speech over music, and different types of noise, have also been considered. The task of segmenting or classifying audio into different classes has been implemented using a number of different schemes. Two aspects that must be considered are feature selection and classification model selection. Different features have been proposed, based on different observations about the characteristics that separate speech, music, and other possible classes of audio. Features are generally divided according to the time scale over which they are extracted. The simplest features proposed include time-domain and spectral features; time-domain features typically represent a measure of the energy. Cepstral coefficients have been used with great success in speech recognition systems and have subsequently proved quite successful in audio classification tasks as well, as discussed in a later section.
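As a small illustration of the time-domain features mentioned above, the sketch below computes short-time energy and zero-crossing rate per frame; the choice of exactly these two features is an assumption, though both are commonly used to separate speech from music, noise, and silence.

```python
import numpy as np

def frame_features(frames):
    """Time-domain features per frame: short-time energy and zero-crossing rate.

    `frames` is a (num_frames, frame_length) array of windowed samples.
    """
    energy = np.sum(frames ** 2, axis=1)
    # Fraction of adjacent sample pairs whose sign changes within the frame.
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return np.column_stack([energy, zcr])
```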
The other aspect to consider is the classification scheme. A number of classification approaches have been proposed, which can be divided into rule-based and model-based schemes. Rule-based approaches use simple rules deduced from the properties of the features. Because these methods depend on thresholds, they are not very robust to changing conditions, but they may be feasible for real-time implementations. Model-based approaches include Gaussian Mixture Models (GMM), k-nearest-neighbour (KNN) classifiers, and Hidden Markov Models (HMM), which model either the probability distribution of the features or their time sequence.
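A minimal sketch of GMM-based audio classification follows, using scikit-learn (an assumption, not named above). One GMM is fitted per class on labelled feature vectors, and a new segment receives the label of the best-scoring model; the feature data here is random stand-in data purely so the example runs.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Stand-in feature matrices (rows = frames, columns = e.g. MFCCs); in practice
# these come from labelled training audio for each class.
training = {name: rng.normal(loc=i, size=(500, 13))
            for i, name in enumerate(["speech", "music", "noise"])}

# One GMM per audio class.
models = {name: GaussianMixture(n_components=4, covariance_type="diag").fit(X)
          for name, X in training.items()}

def classify(segment_feats):
    # Label = class whose model gives the highest average log-likelihood.
    return max(models, key=lambda name: models[name].score(segment_feats))

print(classify(rng.normal(loc=1, size=(200, 13))))  # expected: "music"
```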
Speech Recognition
Speech recognition is the process of converting a speech signal into a sequence of words by means of an algorithm implemented as a computer program. The speech recognition process aims to give a machine the ability to hear, understand, and act upon spoken information. Speech recognition concentrates on the recognition of spoken words, speakers, spoken languages, and emotion. Music information retrieval works on analyzing the structure of music and retrieving similar pieces of music, instruments, and musical genres. Environmental sound retrieval covers sounds that are neither music nor speech. There are four basic approaches to developing an automatic speech recognition system:
Template Matching: this is a whole-word matching approach and the most widespread, commercially available approach to automatic speech recognition. The method involves obtaining one or more utterances of every word that is to be recognized from one or more speakers and storing the spectrographic representations of these utterances via a voice coding (filtering and digitizing) process. A whole-word template-matching recognizer can only succeed on words with which it has been trained, and with a reasonably large vocabulary it is obviously inefficient.
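The text above does not name a specific matching algorithm; dynamic time warping (DTW) is a common way to compare a stored whole-word template with an input utterance despite differences in speaking rate, and a minimal sketch of it is given here under that assumption.

```python
import numpy as np

def dtw_distance(template, utterance):
    """Dynamic time warping distance between two feature sequences (rows = frames).

    A template-matching recognizer would compute this distance against every
    stored word template and pick the word whose template is closest.
    """
    n, m = len(template), len(utterance)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(template[i - 1] - utterance[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]
```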
Stochastic Approaches: stochastic processing is the making of a sequence of non-deterministic selections from among sets of choices. The selections are non-deterministic because they are not specified in advance but are governed by the characteristics of the input.
Speech recognition can be viewed as a situation where selections have to be made on the basis of uncertain information, and stochastic modeling is a flexible and general method for handling situations dominated by uncertainty. Like template matching, stochastic processing requires the creation and storage of models of each of the items that will be recognized. Unlike template matching, however, stochastic processing involves no direct matching between stored models and the input. Instead, it is based upon statistical and probabilistic analyses that are best understood by examining the network-like structure in which the statistics are stored. The hidden Markov model (HMM) is one popular stochastic approach for dealing with speech recognition problems.
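To illustrate the probabilistic analysis an HMM performs, the sketch below implements the forward algorithm, which computes the probability of an observation sequence given a model; the two-state model and its parameters are made up purely for illustration.

```python
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """Forward algorithm: P(observation sequence | HMM).

    pi  -- initial state probabilities, shape (N,)
    A   -- state transition matrix, shape (N, N)
    B   -- emission probabilities per state for each discrete symbol, shape (N, M)
    obs -- sequence of observed symbol indices
    """
    alpha = pi * B[:, obs[0]]
    for t in obs[1:]:
        alpha = (alpha @ A) * B[:, t]
    return alpha.sum()

# Tiny two-state model with invented parameters.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(forward_likelihood(pi, A, B, [0, 1, 1]))
```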
Neural Networks: these are described as connectionist systems because of the connections between the individual processing nodes. They are also known as adaptive systems, because the values of these connections can change so that the network performs more effectively, and as parallel distributed processing systems, after the way in which the many nodes, or neurons, of a network operate. Neural networks are strong classification systems: they specialize in classifying noisy, patterned, and variable data streams containing multiple, overlapping, interacting, and incomplete signals. Speech recognition is a classification task with all of these characteristics, making neural networks a possible alternative for speech recognition.
Like stochastic and template-matching techniques, neural networks require training. However, neural networks do not require a complete specification of the problem before a network-based solution is developed. Instead, networks learn patterns solely through exposure to large numbers of examples, making it possible to construct neural networks for auditory models and other poorly understood areas.
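The sketch below shows a network learning a classification purely from labelled examples, using scikit-learn's MLPClassifier (an assumption, not named above); the feature vectors and labels are random stand-ins only so that the example runs.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
# Stand-in data: one feature vector per frame and a class label (e.g. a phone
# identity) for each; real systems train on transcribed speech.
X = rng.normal(size=(1000, 13))
y = rng.integers(0, 5, size=1000)

# A small feed-forward network learns the mapping purely from examples;
# the layer size and iteration count are arbitrary choices for this sketch.
net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300)
net.fit(X, y)
print(net.predict(X[:3]))
```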
Neural networks use manually segmented or labeled speech elements and often store speech items as whole units rather than as time sequences.

Knowledge-Based Approaches: the main idea is to compile and incorporate knowledge from a variety of knowledge sources so that the system imitates a human being.
The knowledge sources that are considered useful include:
Acoustic knowledge – evidence of which sounds are spoken on the basis of spectral measurements and presence or absence of features.
Lexical knowledge – the combination of acoustic evidence so as to nominate words as specified by a lexicon that maps sounds into words or equivalently decomposes words into sounds.
Syntactic knowledge – the combination of words to form grammatically correct strings, according to a language model, such as sentences or phrases.
Semantic knowledge – understanding of the task domain so as to be able to validate sentences or phrases that are consistent with the task being performed, or which are consistent with previously decoded sentences.
Pragmatic knowledge – the inference ability needed to resolve ambiguity of meaning based on the ways in which words are generally used.

All these knowledge sources interact in order to arrive at a decision about a speech sample. The crucial question with such a complex scheme is: how do these levels interact?
Automatic speech recognition (ASR) has the following components.
Front End: the front end extracts data from the digital representation of the spoken words and puts it into a form the decoder can use. To make pattern recognition easier, the digital audio is transformed into the frequency domain, where the frequency components of a sound can be identified. From these frequency components it is possible to approximate how the human ear perceives the sound. The transformation results in a graph of the amplitudes of the frequency components, describing the sound heard.
Acoustic Model: acoustic modeling builds statistical models for meaningful speech units based on feature vectors computed from the speech. The speech signal is first chunked into overlapping 20-30 ms windows, taken every 10 ms, and a spectral representation is computed from each frame. A commonly used feature vector consists of mel-frequency cepstral coefficients (MFCCs). The meaningful units of speech most often used in ASR are phones: the minimal units of speech, part of the sound system of a language, which serve to distinguish one word from another. The dominant approach to acoustic modeling in speech recognition is to use Hidden Markov Models (HMMs). An alternative to the standard HMM approach is a hybrid approach in which Artificial Neural Networks (ANNs) are combined with HMMs. Before speech can be recognized, the acoustic models must be trained. During training, the model parameters are estimated from recorded speech material that has been orthographically transcribed (i.e., at the word level). In addition, a phonetic transcription of the words is needed; transforming a word sequence into a phone sequence is accomplished by looking up the phonetic transcription of each word in the lexicon.
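A short sketch of the feature computation described above follows, using the librosa library (an assumption, not named in the text); the file name is a placeholder, and the 25 ms window with a 10 ms shift matches the framing described.

```python
import librosa

# Load audio (file name is a placeholder) and compute 13 MFCCs over
# 25 ms windows shifted every 10 ms.
y, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
print(mfcc.shape)  # (13, number_of_frames)
```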
Pronunciation Model: the pronunciation model consists of the orthography of the words that occur in the training material and their corresponding phonetic transcriptions. It specifies the finite set of words that may be output by the speech recognizer. The transcriptions can be obtained either manually or through grapheme-to-phoneme conversion. A pronunciation dictionary can be classified as canonical or alternative on the basis of the pronunciations it includes. For each word, a canonical pronunciation dictionary includes only the most probable pronunciation, the one assumed to be used in read speech; it does not account for pronunciation variation such as speaker variability, dialect, or co-articulation in conversational speech. An alternative pronunciation dictionary, on the other hand, includes all the alternate pronunciations assumed to occur in speech.
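A toy canonical lexicon and the word-to-phone lookup it supports might look as follows; the two entries are invented purely to show the mapping and are not taken from any real dictionary.

```python
# One (most probable) phone sequence per word, as in a canonical dictionary.
lexicon = {
    "speech": ["s", "p", "iy", "ch"],
    "model":  ["m", "aa", "d", "ah", "l"],
}

def words_to_phones(words):
    """Map a word sequence to a phone sequence via the lexicon; unknown words
    would normally go through grapheme-to-phoneme conversion instead."""
    phones = []
    for w in words:
        phones.extend(lexicon[w])
    return phones

print(words_to_phones(["speech", "model"]))
```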
Language Model: the language model, which includes the dictionary, is used by the decoder to determine the most likely hypothesis. It describes to the decoder the relationship between words and the probability of words appearing in a particular order. Language models may be domain specific; in context-free implementations the language model is referred to as the grammar. Its goal is to predict the likelihood of specific words occurring one after another in a given language. Typical recognizers use n-gram language models, most commonly trigrams. An n-gram gives the prior probability of the occurrence of a single word (unigram) or of a word given its predecessors (bigram, trigram, etc.):

unigram probability: P(wi)
bigram probability: P(wi | wi−1)
n-gram probability: P(wn | wn−1, wn−2, …, w1)
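The sketch below estimates bigram probabilities by maximum likelihood from a two-sentence toy corpus; real language models are trained on large corpora and smoothed, which is omitted here.

```python
from collections import Counter

# P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}), with no smoothing.
corpus = [["<s>", "the", "cat", "sat", "</s>"],
          ["<s>", "the", "dog", "sat", "</s>"]]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    unigrams.update(sent)
    bigrams.update(zip(sent, sent[1:]))

def p_bigram(prev, word):
    return bigrams[(prev, word)] / unigrams[prev]

print(p_bigram("the", "cat"))  # 0.5 in this toy corpus
```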
Decoder: the acoustic model is produced by the ASR training tools. The trainer processes audio files together with their text transcriptions to extract common acoustic characteristics for each individual phoneme (context-independent phones) as well as for each phoneme in context (context-dependent phones). Each phoneme is divided into five states in a time sequence, since its characteristics differ slightly from beginning to end; all the phoneme-state characteristics together form the voice model. The decoder, then, is the speech engine that takes the acoustic model, the pronunciation model, the language model, and the observation sequence, and outputs the most likely sequence of words.
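At the core of the search for the most likely sequence is the Viterbi algorithm, sketched below for a single discrete-observation HMM with the same toy parameters as the earlier forward-algorithm example; a real ASR decoder combines acoustic, pronunciation, and language model scores over a far larger search space.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely state sequence for an observation sequence under an HMM."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))
    back = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A          # (from_state, to_state)
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    # Trace back the best path from the best final state.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi(pi, A, B, [0, 1, 1]))
```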
Feature Extraction: feature extraction converts the speech signal into a sequence of acoustic feature vectors. The goal is to extract features that carry the maximum information relevant for classification, that is, features that are robust to acoustic variation but sensitive to linguistic content.