The Process Of Feature Extraction In Speech Recognition
Feature extraction is the process of obtaining a sequence of vectors that represents the acoustic data. The input sound is first digitized (sampled), typically at a sampling rate of 8 kHz for telephone speech and 16 kHz for direct microphone output. A signal is sampled by measuring its amplitude at a particular time; the sampling rate is the number of samples taken per second. To measure a wave accurately, it is necessary to have at least two samples in each cycle: one measuring the positive part of the wave and one measuring the negative part. Taking more than two samples per cycle increases the amplitude accuracy, but with fewer than two samples per cycle the frequency of the wave is missed completely. The highest frequency that can be captured is therefore half the sampling rate: 4 kHz for speech sampled at 8 kHz, and 8 kHz for speech sampled at 16 kHz.
Audio feature extraction deals with the analysis and extraction of meaningful information from audio signals in order to obtain a compact and expressive description. Before the speech signal is processed by the speech recognition system, the input waveform is transformed into a sequence of acoustic feature vectors, each representing the information in a small time window of the signal. The vectors contain spectral features, which encode how much energy is present at different frequencies in the speech signal. They are extracted from overlapping windows of the speech signal, each small enough to support the assumption that the speech signal is stationary (non-changing) over its duration. A typical window is 25 ms long, and for each successive vector the window is shifted forward by 10 ms. This overlap is intended to ensure that rapid changes in the input signal are captured in the feature vectors.
Acoustic vectors typically have 39 components. Commonly, the components are built from mel-frequency cepstral coefficients (MFCCs). Mel frequencies are frequency bands warped to approximate the sensitivity of the human ear.
Perceptual linear prediction (PLP) is also commonly used for spectral vectors. It is a linear prediction method that retains the information in the signal that is relevant to human perception.
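The windowing scheme described above (a 25 ms window shifted forward by 10 ms) can be sketched in a few lines of Python with NumPy. This is a minimal illustration and not the method of any particular recognizer; the function name frame_signal and its parameter names are hypothetical, and the one second of silence is placeholder input.

```python
import numpy as np

def frame_signal(signal, sample_rate, window_ms=25.0, shift_ms=10.0):
    """Split a 1-D signal into overlapping frames (one row per frame)."""
    window_len = int(round(sample_rate * window_ms / 1000.0))  # 400 samples at 16 kHz
    shift_len = int(round(sample_rate * shift_ms / 1000.0))    # 160 samples at 16 kHz
    if len(signal) < window_len:
        return np.empty((0, window_len))
    # Only complete windows are kept; trailing samples that do not fill one are dropped.
    num_frames = 1 + (len(signal) - window_len) // shift_len
    # Row i of `indices` holds the sample positions covered by frame i.
    starts = shift_len * np.arange(num_frames)[:, None]
    indices = starts + np.arange(window_len)[None, :]
    return np.asarray(signal)[indices]

sample_rate = 16000                   # direct microphone output
one_second = np.zeros(sample_rate)    # placeholder audio: one second of silence
frames = frame_signal(one_second, sample_rate)
print(frames.shape)                   # (98, 400)
```

With these settings, consecutive frames share 15 ms of signal, so one second of audio sampled at 16 kHz yields 98 frames of 400 samples each. Each frame would then be turned into one feature vector.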
Feature extraction can be used to characterize a segment of audio by computing a numerical representation. This numerical representation, called the feature vector, is a fundamental building block of many audio analysis and information extraction algorithms. The vector typically has a fixed dimension and can therefore be thought of as a point in a multi-dimensional feature space. When feature vectors are used to represent audio, two main approaches are common. In the first, the trajectory approach, the audio file is broken into small segments in time and a feature vector is computed for each segment. The resulting representation is a time series of feature vectors, which can be thought of as a path of points in the feature space. In the second, the single vector approach, one feature vector summarizes information for the whole file. The single vector approach is appropriate when overall information about the whole file is required, whereas the trajectory approach is appropriate when information needs to be updated in real time. For example, classification of radio signals might use the trajectory approach. Typically the signal is broken into small chunks called analysis windows, usually around 20 to 40 milliseconds long, so that the signal characteristics are relatively stable for the duration of each window.
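The difference between the two approaches can be illustrated with a short Python sketch. The two features used here (RMS energy and zero-crossing rate) and the mean/standard-deviation summary are assumptions chosen only because they are simple to compute; the text does not prescribe them, and the function names are hypothetical.

```python
import numpy as np

def feature_trajectory(signal, sample_rate, window_ms=25.0, shift_ms=10.0):
    """Return one [RMS energy, zero-crossing rate] vector per analysis window."""
    signal = np.asarray(signal, dtype=float)
    win = int(round(sample_rate * window_ms / 1000.0))
    hop = int(round(sample_rate * shift_ms / 1000.0))
    vectors = []
    for start in range(0, len(signal) - win + 1, hop):
        chunk = signal[start:start + win]
        rms = np.sqrt(np.mean(chunk ** 2))
        # Fraction of adjacent sample pairs whose signs differ.
        zcr = np.mean(np.signbit(chunk[:-1]) != np.signbit(chunk[1:]))
        vectors.append([rms, zcr])
    return np.array(vectors).reshape(-1, 2)

def summary_vector(trajectory):
    """Collapse a trajectory into one vector: per-feature mean and standard deviation."""
    return np.concatenate([trajectory.mean(axis=0), trajectory.std(axis=0)])

sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440.0 * t)            # one second of a 440 Hz test tone

trajectory = feature_trajectory(tone, sample_rate)  # first approach: path of points
summary = summary_vector(trajectory)                # second approach: single vector
print(trajectory.shape, summary.shape)              # (98, 2) (4,)
```

The trajectory grows by one row for every new window, which is why it suits real-time updating, while the summary vector has the same length no matter how long the file is.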
Feature extraction is typically performed with a time-frequency analysis technique such as the Short Time Fourier Transform (STFT). Time-frequency analysis techniques represent the energy distribution of the signal in a time-frequency plane and differ in how they subdivide this plane into regions.
Features based on the STFT are very common and have the advantage of fast calculation using the Fast Fourier Transform (FFT) algorithm. Although the exact STFT parameters used to calculate them differ from system to system, their basic description is the same.
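As a rough illustration of that basic description, the following Python sketch computes a short-time power spectrum with NumPy's FFT routines. The parameter values (25 ms window, 10 ms shift, 512-point FFT) and the Hamming window are assumptions picked for the example, since these choices vary between systems; the function name stft_power is hypothetical.

```python
import numpy as np

def stft_power(signal, sample_rate, window_ms=25.0, shift_ms=10.0, n_fft=512):
    """Short-time power spectrum: one row per window, one column per frequency bin."""
    signal = np.asarray(signal, dtype=float)
    win = int(round(sample_rate * window_ms / 1000.0))
    hop = int(round(sample_rate * shift_ms / 1000.0))
    taper = np.hamming(win)   # assumed window shape; reduces spectral leakage
    rows = []
    for start in range(0, len(signal) - win + 1, hop):
        frame = signal[start:start + win] * taper
        # rfft zero-pads the frame to n_fft samples (n_fft should be >= win).
        spectrum = np.fft.rfft(frame, n=n_fft)
        rows.append(np.abs(spectrum) ** 2)
    power = np.array(rows).reshape(-1, n_fft // 2 + 1)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)   # bin centre frequencies in Hz
    return power, freqs

sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000.0 * t)        # one second of a 1 kHz test tone

power, freqs = stft_power(tone, sample_rate)
peak_hz = freqs[np.argmax(power[0])]
print(power.shape, peak_hz)                  # (98, 257) 1000.0
```

Each row of power describes how the energy in one window is distributed over frequency, so the full array is a sampled version of the time-frequency plane. Features such as MFCCs are usually computed from a representation of this kind.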