The Use Of Subword-Based Audio Indexing In Chinese Spoken Document Retrieval

This research is based on local Hong Kong television news broadcasts in Cantonese, the predominant Chinese dialect spoken in Hong Kong, Macau, South China, and many overseas Chinese communities. Cantonese is monosyllabic with a rich tonal structure: it has approximately 1,600 distinct tonal syllables formed with six lexical tones. The syllable unit is a natural choice for Chinese spoken document retrieval, by virtue of the monosyllabic nature of the language and its dialectal variations: syllables can fully characterize the language and provide high phonological coverage for spoken documents.

According to LI Yuk Chi, there are several thousand unique characters in the Chinese writing system. Unlike Amharic, the definition of a Chinese word is vague, as there is no explicit word delimiter in Chinese text. Hence, given a Chinese sentence (i.e. a sequence of characters), one must perform word tokenization, referencing a Chinese word lexicon, in order to segment the character sequence into a word sequence. The inherent ambiguity of Chinese word tokenization creates a problem for Chinese information retrieval: if a given word appears in both the query and the document but is not segmented as such, matching the word during retrieval will fail. This suggests that word-based indexing for Chinese retrieval may be problematic because the "correct" words are difficult to obtain. As a result, some have adopted subword-based indexing, i.e. indexing based on Chinese characters instead of words. However, indexing solely on single characters loses the sequential information that captures lexical content. Overlapping character bigrams retain some sequential constraints for retrieval, and these constraints allow character-bigram indexing to achieve retrieval performance comparable to Chinese word-based retrieval. Unlike Amharic, a Chinese word has few inflectional variations, so stemming is generally unnecessary. Stop word removal for Chinese is possible in principle; however, due to the ambiguity of word tokenization, a Chinese character that constitutes a stop word in one context may be part of a content word in another. As a result, stop word removal may not be applied in Chinese information retrieval.
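The overlapping-bigram idea described above can be sketched in a few lines. This is a minimal illustration, not the author's implementation; Latin letters stand in for Chinese characters:

```python
def char_bigrams(text):
    """Return the overlapping character bigrams of a character sequence.

    A sequence of N characters yields N-1 bigrams, each sharing one
    character with its neighbor, which preserves local ordering
    information that single-character indexing discards.
    """
    return [text[i:i + 2] for i in range(len(text) - 1)]

# Using Latin letters as stand-ins for Chinese characters:
char_bigrams("ABCD")  # -> ["AB", "BC", "CD"]
```

Because every adjacent character pair becomes an index term, a query word that the tokenizer might segment incorrectly is still matched through its constituent bigrams.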

The author uses overlapping Chinese syllable n-grams for indexing Chinese spoken documents. Indexing Chinese audio in terms of syllables implies the use of syllable recognition technology. If the audio documents are represented in terms of syllables, the textual queries need to be transformed into syllables as well, and the retrieval mechanism must also match on syllables. Transformation of textual queries into syllables may be accomplished by pronunciation dictionary lookup. It should be noted that syllable errors from speech recognition are present in the audio indices but absent from the query transformation. This creates a mismatch in retrieval, i.e. the queries are "clean" but the documents are "erroneous".
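The query-side transformation amounts to a dictionary lookup per character. The sketch below uses a hypothetical miniature lexicon with Jyutping romanization; the entries are illustrative, not taken from the author's actual pronunciation dictionary:

```python
# Hypothetical miniature pronunciation dictionary (Jyutping romanization).
PRON_DICT = {
    "香": "hoeng1",
    "港": "gong2",
    "新": "san1",
    "聞": "man4",
}

def query_to_syllables(query):
    """Map each character of a textual query to its Cantonese syllable.

    Unknown characters are skipped here; a real system would need
    out-of-vocabulary handling and pronunciation disambiguation,
    since many Chinese characters have more than one reading.
    """
    return [PRON_DICT[ch] for ch in query if ch in PRON_DICT]

query_to_syllables("香港新聞")  # -> ["hoeng1", "gong2", "san1", "man4"]
```

Note that this lookup is deterministic and error-free, which is precisely why the query side stays "clean" while the recognizer-derived document indices are "erroneous".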

Chinese spoken document retrieval involves the use of a query (spoken or textual) to retrieve relevant documents from a spoken document collection, such as audio tracks from video clips or audio clips from radio broadcasts. The author is interested in the use of Chinese textual queries to retrieve Chinese spoken documents, where a list of documents is retrieved and ranked according to degree of relevance. The author maps the query text into syllables by pronunciation dictionary lookup for matching during retrieval, and proposes to perform Chinese spoken document retrieval based on subword units, namely syllable n-grams. To make the retrieval engine more robust, the author investigates two techniques: query expansion and document expansion. A vector space model is used for retrieval.

Chi evaluates the retrieval performance based on the ranked retrieval list output by the retrieval engine, using the Average Inverse Rank (AIR) and the Mean Average Precision (MAP) as evaluation criteria. The use of subword (syllable-based) indexing for Chinese spoken document retrieval includes:

The incorporation of sequential constraints in syllable-based audio indexing by means of overlapping syllable n-grams, and the contribution of such constraints towards retrieval performance.

The use of tone information, comparing retrieval results between indexing with base syllables (tone excluded) and indexing with tonal syllables (tone included).
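The two evaluation criteria can be sketched as follows. This assumes AIR is computed as the mean of the inverse rank of the known relevant document per query (definitions vary; this is our simplifying assumption, not necessarily Chi's exact formulation):

```python
def average_inverse_rank(ranks):
    """AIR: mean of 1/rank over queries, assuming one known relevant
    document per query (a simplifying assumption)."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def average_precision(ranked, relevant):
    """Average precision of one ranked list against a relevant-doc set:
    the mean of precision@k taken at each rank k where a hit occurs."""
    hits, total = 0, 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP over (ranked_list, relevant_set) pairs, one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

With these conventions, a system that always ranks the relevant document first gets AIR = 1.0, and lower ranks pull the score toward 0.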

The Cantonese spoken documents are derived from a video archive of news broadcasts. Each video clip is accompanied by a short textual summary, which is very brief in content and is not a transcription of the audio track. The text is in Big5 encoding, the major encoding method used in Hong Kong. The textual summaries average 150 Chinese characters in length, ranging from 140 to 1,700 characters. The video clips average 1.5 minutes in duration, ranging from 11 seconds to 25 minutes.

The author transformed textual queries in Big5 format into Cantonese syllables by pronunciation dictionary lookup; indexed the audio tracks with a Cantonese syllable recognizer using monosyllables, overlapping syllable bigrams, overlapping syllable trigrams, and skipped syllable bigrams; and estimated the Chinese word-based text retrieval benchmark at an Average Inverse Rank (AIR) of 0.834. The use of overlapping syllable bigrams (tonal syllables) delivers comparable performance, with AIR = 0.830. Results based on speech recognition outputs using overlapping syllable bigrams without tone information gave AIR = 0.479. The strength of this work is its exploration of different indexing alternatives (monosyllables, overlapping syllable bigrams, overlapping syllable trigrams, and skipped syllable bigrams) for indexing the audio information of the Chinese language. However, its effectiveness is limited by the errors of the automatic speech recognizer; improving the recognition front end, for instance with feature extraction methods such as MFCC, could help.
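The four indexing alternatives compared above can be generated from one syllable sequence. The skipped-bigram definition below (pairs separated by a fixed number of intervening syllables) is our assumption about what is meant; tone-stripped syllable labels like "hoeng1" are illustrative:

```python
def syllable_ngrams(syllables, n):
    """Overlapping syllable n-grams: n=1 gives monosyllables,
    n=2 overlapping bigrams, n=3 overlapping trigrams."""
    return [tuple(syllables[i:i + n]) for i in range(len(syllables) - n + 1)]

def skipped_bigrams(syllables, skip=1):
    """Syllable pairs separated by `skip` intervening syllables
    (assumed definition); skip=0 reduces to ordinary bigrams."""
    gap = skip + 1
    return [(syllables[i], syllables[i + gap])
            for i in range(len(syllables) - gap)]

seq = ["hoeng1", "gong2", "san1", "man4"]
syllable_ngrams(seq, 2)   # -> [("hoeng1","gong2"), ("gong2","san1"), ("san1","man4")]
skipped_bigrams(seq)      # -> [("hoeng1","san1"), ("gong2","man4")]
```

Skipped bigrams add robustness against recognizer errors: if the middle syllable of a trigram is misrecognized, the skipped pair spanning it can still match.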

11 February 2020