The Concept Of Segmentation Of Spoken Content
Segmentation is accomplished by applying the acoustic analyses to the signal and looking for transitions. The transitions define segments of the signal, which can then be treated like individual sounds. For example, a recording of a concert could be scanned automatically for applause sounds to determine the boundaries between musical pieces. Similarly, after training the system to recognize a certain speaker, a recording could be segmented and scanned for all the sections where that speaker was talking. Segmentation techniques create information about spoken content that is useful for IR. Segmentation of spoken content can take place either based on direct analysis of the audio content or by using the ASR transcripts.
The importance of forming audio and topical segments has been recognized in the management of speech content. Segmenting the audio prior to recognition can help to improve the quality of the ASR transcripts. In general, ASR systems do not process a continuous stream of speech; instead, they break it into smaller pieces and generate a hypothesis string for each. Segments created for the purposes of ASR may be of fixed length or may be divided at pauses (points where the speech signal drops in energy). Such segments may correspond approximately to utterances, but whether they are also useful for SCR will depend on the application. In general, the quality of the segmentation will depend strongly on the segmentation algorithm used and on the nature of the audio signal being segmented.
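Pause-based segmentation of this kind can be sketched with a simple short-time energy threshold. The sketch below is illustrative rather than a description of any particular ASR front end; the frame length, hop size, threshold ratio, and minimum pause length are all assumed parameters.

```python
import numpy as np

def segment_at_pauses(signal, rate=16000, frame_ms=25, hop_ms=10,
                      energy_ratio=0.1, min_pause_frames=30):
    """Split a mono signal into (start, end) sample ranges at pauses.

    A frame counts as a pause when its short-time energy falls below
    `energy_ratio` times the mean frame energy; a run of at least
    `min_pause_frames` pause frames ends the current segment.
    All names and defaults here are illustrative assumptions.
    """
    frame = int(rate * frame_ms / 1000)
    hop = int(rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame) // hop)
    energy = np.array([float(np.sum(signal[i * hop:i * hop + frame] ** 2))
                       for i in range(n_frames)])
    is_pause = energy < energy_ratio * energy.mean()

    segments, seg_start, run = [], None, 0
    for i, pause in enumerate(is_pause):
        if pause:
            run += 1
            # close the current segment once the pause is long enough
            if run == min_pause_frames and seg_start is not None:
                segments.append((seg_start * hop, (i - run + 1) * hop))
                seg_start = None
        else:
            run = 0
            if seg_start is None:  # speech (re)starts here
                seg_start = i
    if seg_start is not None:
        segments.append((seg_start * hop, len(signal)))
    return segments
```

A fixed ratio of the mean energy is a deliberately crude threshold; practical systems typically use adaptive thresholds or model-based voice activity detection instead.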
Using segmented words can increase precision and thus improve the quality of the added terms. Terms for extraction are therefore indexed as segmented words within the side collection for selection purposes. The segmentation is based on lexical knowledge derived from the audio data. For retrieval of spoken documents, however, indexing is still based on overlapping characters: all segmented words are also converted into overlapping character sequences before being added back to the spoken document.
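This dual representation can be illustrated with a small sketch in which each segmented word is kept as a word-level term for selection and also expanded into overlapping character bigrams for retrieval. The function names and the choice of bigrams are assumptions for illustration, not details of the original system.

```python
def char_ngrams(text, n=2):
    """Overlapping character n-grams of a string, e.g. bigrams."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def index_terms(segmented_words, n=2):
    """Index each segmented word together with its overlapping
    character n-grams, so word-level terms from the side collection
    and the character-based spoken-document index coexist."""
    terms = []
    for word in segmented_words:
        terms.append(word)                   # word-level term (selection)
        terms.extend(char_ngrams(word, n))   # character n-grams (retrieval)
    return terms
```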
The segmentation approaches used in the systems mentioned above, and in other retrieval systems, cover a wide range of methods. In general, the methods can be divided into two groups: audio classification and change detection.
One part of a speech retrieval system concerns identifying different audio classes. The main classes considered are speech, music, noise, and silence, but depending on the application, more specific classes such as noisy speech, speech over music, and different classes of noise have also been considered. The task of segmenting or classifying audio into different classes has been implemented using a number of different schemes. Two aspects that must be considered are the selection of features and the selection of a classification model. Different features have been proposed, based on different observations of the characteristics that separate speech, music, and other possible classes of audio. Features are generally grouped according to the time scale over which they are extracted. The simplest features proposed include time-domain and spectral features; time-domain features typically represent a measure of the energy. Cepstral coefficients have been used with great success in speech recognition systems and have subsequently been shown to be quite successful in audio classification tasks as well.
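Two of the simplest time-domain features described above, short-time energy and zero-crossing rate, can be sketched as follows. The frame length is an assumed parameter; real classifiers would combine such features with spectral and cepstral ones.

```python
import numpy as np

def frame_features(signal, rate=16000, frame_ms=25):
    """Per-frame short-time energy and zero-crossing rate.

    Energy tends to separate silence from everything else, while the
    zero-crossing rate helps distinguish, e.g., voiced speech from
    noisier signals. Frame length and rate are illustrative defaults.
    """
    frame = int(rate * frame_ms / 1000)
    features = []
    for i in range(len(signal) // frame):
        f = signal[i * frame:(i + 1) * frame]
        energy = float(np.mean(f ** 2))
        # fraction of adjacent sample pairs whose sign changes
        zcr = float(np.mean(np.abs(np.diff(np.sign(f))) > 0))
        features.append((energy, zcr))
    return features
```

Such per-frame feature vectors would then be fed to a classifier (e.g. a Gaussian mixture model or a simple threshold rule) that assigns each frame to one of the audio classes.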