The Main Approaches Of Language Identification
This section is concerned with different language identification methods. Language identification is the process of identifying the language of a spoken utterance or of a text document. Language differences are only one part of the observed variation, which also arises from speakers, uttered messages and environmental conditions. In the following, we briefly describe the major language identification approaches.
The task of language identification serves various areas of Natural Language Processing. Language identification supports applications such as e-mail routing and filtering engines, text mining applications, identification of the language or encoding of WWW pages, information retrieval systems, and content-based and language-specific web crawlers and search engines. With the increasing number of different languages on the Internet, the importance of language identification also grows. Language identification is also needed for the morphological processing of a document: the language of a document needs to be known before techniques such as stemming or parsing can be applied. Likewise, if a lexicon is used for spell checking, the program needs to know which language-specific rules to apply. While there are many methods for performing language identification, discussed in the following, an N-gram approach has proven effective.
Common Words and Unique Letter Combinations
The main idea of the common words approach is to store highly frequent words for each language in a database. The document to be classified is then compared against all the word lists in question. Using a scoring system, the word list with the most occurrences in the document indicates the language of the document. This seems to be a good approach for large documents; however, several drawbacks have to be mentioned. Especially for short texts, i.e. input of fewer than ten words, the probability is quite low that one of the common words occurs in the input, which leads to misclassification of the language. Also, for morphologically complex languages such as Amharic, where tokenization is not easily possible, this approach will not work in the desired way. An advantage of the common words approach is that it is easy to implement and that it requires less computational power than a tri-gram approach, as there are fewer calculations to be done.
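As an illustration, the scoring idea could be sketched as follows; the word lists, language names and function name here are hypothetical examples, not part of any particular implementation.

```python
# Hypothetical word lists; a real system stores the most frequent words
# of each language in a database.
COMMON_WORDS = {
    "english": {"the", "of", "and", "to", "in", "is"},
    "german": {"der", "die", "und", "von", "das", "ist"},
    "french": {"le", "la", "et", "les", "des", "est"},
}

def identify_by_common_words(text):
    """Score the text against each word list; the list with the most matches wins."""
    tokens = text.lower().split()
    scores = {lang: sum(tok in words for tok in tokens)
              for lang, words in COMMON_WORDS.items()}
    return max(scores, key=scores.get), scores

print(identify_by_common_words("the quality of the translation is good"))
```

For a long document in one of the covered languages, many tokens hit the matching word list; for a ten-word input, possibly none do, which is exactly the weakness noted above.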
Statistical Approach
The statistical approach uses Markov models to calculate the probability that a document originated from a given language model. As a first step, a set of character-level language models is prepared from training data, i.e. sample texts in different languages. This is done by segmenting the strings and entering them into a transition matrix that contains the probabilities of occurrence of all character sequences.
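A minimal sketch of this training step, assuming a character-bigram transition matrix estimated by relative frequencies (the function name, data layout and sample text are illustrative choices, not prescribed by the method):

```python
from collections import defaultdict

def train_transition_model(sample_text):
    """Estimate character bigram transition probabilities p(b | a) from a sample text."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for a, b in zip(sample_text, sample_text[1:]):
        counts[a][b] += 1
        totals[a] += 1
    # Normalize the counts row by row to obtain the transition matrix.
    return {a: {b: c / totals[a] for b, c in row.items()}
            for a, row in counts.items()}

english_model = train_transition_model("the quick brown fox jumps over the lazy dog")
print(english_model["t"])  # distribution of characters following 't'
```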
In language identification with Markov models, the probability is calculated that a document derives from one of the existing language models, i.e. the probability that a string S over an alphabet X was produced by that model. In the example below, the probability of a string S is defined by the probability of the sequence of characters s_1 . . . s_n:

p(S) = p(s_1 \dots s_n) = p(s_1) \prod_{i=2}^{n} p(s_i \mid s_{i-1})    (2.14)

where p(s_1) is the initial state distribution and p(s_i \mid s_{i-1}) are the transition probabilities. Changing the formula by introducing a fixed context of k states, we get a Markov model with limited history: now only the last k previous characters are taken into account. The new initial states are p(s_1 \dots s_k) and the transition probabilities are p(s_{i+k} \mid s_i \dots s_{i+k-1}). For language identification we can now create a language model in which the occurring strings reflect the distribution in that language. The alphabet X is the set of characters of that language.
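Equation (2.14) could be evaluated in log space along the following lines; the dictionaries for the initial and transition probabilities and the small smoothing floor for unseen events are assumptions of this sketch rather than part of the original formulation.

```python
import math

def log_prob_bigram(string, initial, transitions, floor=1e-6):
    """Log of equation (2.14): p(S) = p(s_1) * prod_i p(s_i | s_{i-1}).

    'initial' maps a character to its initial-state probability and
    'transitions' maps a character to a dict of next-character probabilities;
    'floor' is a small probability assigned to unseen events.
    """
    logp = math.log(initial.get(string[0], floor))
    for prev, cur in zip(string, string[1:]):
        logp += math.log(transitions.get(prev, {}).get(cur, floor))
    return logp
```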
Now, given our observation X, one needs to find out which phenomenon caused it, the phenomenon being the language to be found (the language model). Given two phenomena A and B and an observation X, the task is to find out which of them caused the observation. Using Bayes' theorem, we get:

p(A, X) = p(A \mid X)\, p(X) = p(X \mid A)\, p(A)    (2.15)

which can be reduced to

p(A \mid X) = p(X \mid A)\, p(A)    (2.16)

since p(X) = 1, as the observation has occurred. Because we do not know the prior probability of A, we assume a uniform probability for A and B. Now the probability that a string S has been produced by the language model A can be computed as follows:

p(S \mid A) = p(s_1 \dots s_k \mid A) \prod_{i=k+1}^{N} p(s_i \mid s_{i-k} \dots s_{i-1}, A)    (2.17)

Taking the logarithm of equation (2.17) gives us the following formula to calculate the probability of a string S given the language model A:

\log p(S \mid A) = \sum_{w_1 \dots w_{k+1} \in S} T(w_1 \dots w_{k+1}, S)\, \log p(w_{k+1} \mid w_1 \dots w_k, A)    (2.18)

where T(w_1 \dots w_{k+1}, S) stands for the number of times that the (k+1)-gram w_1 \dots w_{k+1} occurs in S. The probability p(S \mid language model) is computed for all models; the highest probability indicates the model that produced the string.
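A sketch of how equations (2.17) and (2.18) might be applied for classification; the model layout, the smoothing floor and the context length k are assumptions of this sketch, the initial-state term of (2.17) is omitted for brevity, and under the uniform prior assumed above the model with the highest p(S | model) also maximizes the posterior.

```python
import math
from collections import Counter

def log_prob_kgram(string, model, k=2, floor=1e-6):
    """Evaluate equation (2.18): sum over (k+1)-grams of count * log probability.

    'model' is assumed to map a k-character context to a dict of
    next-character probabilities; 'floor' smooths unseen events.
    """
    # T(w_1 ... w_{k+1}, S): counts of all (k+1)-grams occurring in S.
    grams = Counter(string[i:i + k + 1] for i in range(len(string) - k))
    logp = 0.0
    for gram, count in grams.items():
        context, nxt = gram[:k], gram[k]
        logp += count * math.log(model.get(context, {}).get(nxt, floor))
    return logp

def identify_language(string, models, k=2):
    """Score the string against every language model; the highest log probability wins."""
    scores = {lang: log_prob_kgram(string, m, k) for lang, m in models.items()}
    return max(scores, key=scores.get)
```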
Compression-Based Approach with Prediction by Partial Matching
Another method is the compression-based approach with prediction by partial matching. In this method, sample texts from various languages are compressed using the PPM (prediction by partial matching) algorithm. A document to be identified is then also compressed, and the number of bits needed to encode this document is compared across the existing language models. The main idea of PPM is to predict an upcoming character from the set of previous characters within a fixed context. Each upcoming character is assigned a probability, which depends on whether the character has appeared before in a sequence with the previous characters. If a certain sequence has not been observed so far, the escape probability is used and the context size is reduced by one. The escape probability for a sequence is derived from the number of times that sequence has been observed followed by a character.
In the lower-order model, a check is again conducted to see whether a sequence ending with that character exists. If the character has never been seen in any sequence, the context size is set to -1. At this level all characters are predicted equally: the character receives a probability of 1 divided by (the total number of characters minus the number of characters that have already occurred).
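The escape-and-back-off behaviour described above could be approximated as in the following sketch; the PPM-C-style escape estimate, the fixed alphabet size and the class layout are assumptions here, and a real PPM compressor involves further details such as exclusions and arithmetic coding of the resulting probabilities into bits.

```python
from collections import defaultdict

class SimplePPM:
    """Greatly reduced sketch of PPM-style prediction with escape and back-off."""

    def __init__(self, order=2, alphabet_size=256):
        self.order = order
        self.alphabet_size = alphabet_size
        # counts[context][char]: how often 'char' followed 'context' in training.
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, text):
        for i in range(len(text)):
            for k in range(self.order + 1):
                if i - k >= 0:
                    self.counts[text[i - k:i]][text[i]] += 1

    def predict(self, context, char):
        """Probability of 'char' after 'context', escaping to shorter contexts as needed."""
        prob, seen = 1.0, set()
        for k in range(min(self.order, len(context)), -1, -1):
            ctx = context[len(context) - k:] if k else ""
            row = self.counts.get(ctx, {})
            total, distinct = sum(row.values()), len(row)
            if total == 0:
                continue
            if char in row:
                return prob * row[char] / (total + distinct)
            # Escape: multiply in the escape probability, drop to a shorter context.
            prob *= distinct / (total + distinct)
            seen |= set(row)
        # Order -1: remaining characters are predicted with equal probability.
        return prob / (self.alphabet_size - len(seen))
```

Summing -log2 of these predicted probabilities over a document approximates the number of bits needed to encode it under a given language model, and the model requiring the fewest bits indicates the language.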