Improving Precision In Information Retrieval For Kafinoonoo Language Using Stemming

Kafi-noonoo is a language of the Omotic family and, like its neighbor Wolayta, is morphologically complex. As the word distribution showed, Kafi-noonoo is more complex than English and Afan-Oromo: a single word can appear in more variant forms than in either of those two languages. A conflation technique such as stemming is therefore necessary for a language of this kind. In this thesis, two methods were applied separately.

First, a rule-based method was applied to 2075 words; it achieved 87.5% accuracy with a dictionary reduction of 40.8%. This shows that a stemmer for Kafi-noonoo significantly reduces dictionary size by conflating variant words to the same stem, and the accuracy is also encouraging. The experimental results were promising, and using the stemmer in an IR system for the language improves the system's performance, as the evaluation at the end shows. Under the rule-based method, the rules were clustered into 7 categories based on their conditions. A condition may depend on the original word or be a simple test on the stem (for example, does the stem end with a double vowel?). When the conditions of a specific rule were fulfilled, an action was taken: either simply removing the suffix or replacing it with other letters. The technique used in this paper was iterative longest match: suffixes are stripped repeatedly for as long as the criteria are fulfilled, and if a word ending matches more than one ending pattern, the longest matching pattern is used. Since Kafi-noonoo has no prefixation at all, the actions were applied only to the suffix of a given word.
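The iterative longest-match procedure described above can be sketched in Python as follows. This is a minimal illustration, not the thesis's actual rule set: the suffixes "naa", "aa" and "ito", the length conditions, and the test words are hypothetical placeholders rather than real Kafi-noonoo morphology, and the names `Rule`, `strip_once` and `stem` are assumptions introduced here. The sketch also assumes that when the longest matching ending fails its condition, the next longest ending is tried.

```python
from typing import Callable, List, NamedTuple


class Rule(NamedTuple):
    suffix: str                       # ending pattern to match
    replacement: str                  # "" means the suffix is simply removed
    condition: Callable[[str], bool]  # test applied to the candidate stem


def min_length(n: int) -> Callable[[str], bool]:
    """Condition: the remaining stem must have at least n letters."""
    return lambda stem: len(stem) >= n


# Hypothetical placeholder rules, NOT real Kafi-noonoo suffixes.
RULES: List[Rule] = [
    Rule("naa", "", min_length(3)),
    Rule("aa", "", min_length(3)),
    Rule("ito", "i", min_length(2)),  # replacement instead of plain removal
]


def strip_once(word: str, rules: List[Rule]) -> str:
    """Apply the longest matching rule whose condition holds, if any."""
    for rule in sorted(rules, key=lambda r: len(r.suffix), reverse=True):
        if word.endswith(rule.suffix):
            candidate = word[: len(word) - len(rule.suffix)]
            if rule.condition(candidate):
                return candidate + rule.replacement
    return word  # no rule applied


def stem(word: str, rules: List[Rule] = RULES, max_passes: int = 10) -> str:
    """Strip suffixes repeatedly until no rule applies (iterative longest match)."""
    current = word.lower()
    for _ in range(max_passes):  # guard against rules that could cycle
        stripped = strip_once(current, rules)
        if stripped == current:
            break
        current = stripped
    return current


if __name__ == "__main__":
    for w in ["matnaa", "banaa", "matitonaa"]:  # invented test words
        print(w, "->", stem(w))
```

With these placeholder rules, "matitonaa" is reduced in two passes (first "naa" is removed, then "ito" is replaced by "i"), which is the iterative part of the technique. Since the language is described as having no prefixes, the sketch only ever looks at word endings.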

Infixes do occur in Kafi-noonoo, but rarely; handling them was not included in this research because of its difficulty and the limited time. In the rule-based method, some techniques were adopted from Porter for developing the stemmer (clustering of rules, the measure, etc.). Applying the proposed stemmer produced two categories of error: under-stemming and over-stemming. In the experiment, under-stemmed words accounted for 4.27% of the given corpus and over-stemmed words for 7.4%. The rules in this work are not sufficient relative to the complexity of the language, owing to the limited time and to that complexity. Second, for comparison, N-gram stemming was also applied, but its accuracy was 82.5%, which is lower than that of the rule-based method. This shows that the rule-based method is better than N-gram stemming for Kafi-noonoo, and its accuracy might increase further if more rules are integrated. The obstacle in N-gram stemming is that unrelated words have a high probability of receiving similar values. In general, the stemmer conflates only inflectional and derivational affixes. It does not conflate compounds or irregular forms, because of the language's complexity and the time limitation.
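The obstacle noted for N-gram stemming can be illustrated with a small Python sketch. The text does not say which similarity measure was used, so this assumes the Dice coefficient over character bigrams, a common choice for N-gram conflation. The function names are hypothetical, and the example words are English, chosen only to show the effect.

```python
def char_ngrams(word: str, n: int = 2) -> set:
    """Return the set of distinct character n-grams in a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def dice(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient: 2 * shared n-grams / (n-grams in a + n-grams in b)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0  # words shorter than n have no n-grams
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


if __name__ == "__main__":
    # Related forms share 3 of their 6 and 5 bigrams.
    print(round(dice("walking", "walked"), 3))
    # Unrelated words share 4 of their 5 and 4 bigrams and score higher.
    print(round(dice("stable", "table"), 3))
```

Here the unrelated pair scores 8/9 while the related pair scores only 6/11, so any similarity threshold that groups "walking" with "walked" would also group "stable" with "table". This is the kind of false conflation that can lower N-gram accuracy relative to hand-written rules.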

Finally, we tested the stemmer in an information retrieval environment, evaluating retrieval both before and after stemming. We found that Kafi-noonoo text retrieval performed better in terms of recall after stemming. Precision, however, decreased from its initial value, since stemming improves recall at the cost of precision. We conclude that the Kafi-noonoo stemmer had a considerable effect on the effectiveness of text retrieval, especially in improving recall.
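The recall and precision trade-off described above can be made concrete with a short Python sketch. The document identifiers and result sets are invented for illustration and are not the thesis's data; `precision_recall` is a hypothetical helper name.

```python
def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Compute (precision, recall) for one query from two sets of document ids."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Invented toy data: four documents are relevant to the query.
    relevant = {"d1", "d2", "d3", "d4"}

    # Without stemming, only exact word forms match: few but accurate results.
    before = {"d1", "d2"}
    # With stemming, variant forms also match: more relevant documents are
    # found, along with some irrelevant ones.
    after = {"d1", "d2", "d3", "d5", "d6"}

    print("before stemming:", precision_recall(before, relevant))
    print("after stemming: ", precision_recall(after, relevant))
```

In this toy case recall rises from 0.5 to 0.75 while precision falls from 1.0 to 0.6, the same direction of change reported for the Kafi-noonoo retrieval experiment.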

18 March 2020