Sentiment Analysis of Bengali Sports News Comments
Sentiment analysis is an emerging topic in the field of natural language processing. It focuses on identifying and categorizing the positive, negative, or neutral polarity of a text towards a particular topic. Most research on sentiment analysis targets the English language, but the volume of Bengali text is increasing day by day. A large number of online news portals publish their articles in Bengali, and a few of them have a comment section that gives people the opportunity to express their opinions. This report presents research on comments on Bengali sports news published in different newspapers. We collected comments and separated them into sentences labeled by sentiment. The work is nearly complete. After collecting the data, we first pre-processed it by removing punctuation marks, tokenizing sentences, removing stop words, and so on. We then applied several machine learning algorithms: Naïve Bayes, Multinomial NB, Bernoulli NB, Logistic Regression, LinearSVC, NuSVC, and SGD Classifier.
Introduction
Sentiment analysis is the process of determining the emotional tone behind a series of words, used to gain an understanding of the attitudes, opinions, and emotions expressed within an online mention. In general, sentiment analysis aims to determine the attitude of a speaker, writer, or other subject with respect to some topic, or the overall contextual polarity of or emotional reaction to a document, interaction, or event. Sentiments are inherently subjective; different people may interpret the attitude of the same text differently. Sentiment analysis is extremely useful in social media monitoring because it gives an overview of the wider public opinion behind certain topics.
The Internet is a global platform for sharing opinions. Thousands of people share their opinions on social media and blogs and post reviews of different products and services. Online news portals publish articles in different categories. As a result, online text content is growing rapidly. People need to make decisions by analyzing product reviews, news articles, social media posts, and so on. It is also important to know the opinions of individual journalists and columnists, as well as public opinion, on issues such as sports, business and trade, and international and national affairs. It is hard to analyze the opinions and feelings in a huge volume of news content manually. Sentiment analysis focuses on determining whether a piece of text is positive, negative, or neutral. It is also known as opinion mining: extracting the opinion or feelings of a writer. It is a process for detecting someone's attitude towards a particular product or service.
Sentiment analysis is important for making decisions on topics such as politics, sports, financial conditions, product reviews, and other matters of judgment. Humans are subjective by nature, which is why opinions are important. Business organizations, governments, service-providing organizations, and sports lovers analyze public opinion to identify opportunities, make decisions, and make progress. A buyer wants to read reviews before buying a product, a sports organization needs to know public opinion to understand the expectations of its audience, a government needs to know public opinion before making policy, and even psychological investigation requires sentiment information. Decision makers therefore depend on the content of online media such as news, reviews, micro-blogs, and posts on social sites. Thus, besides individuals, companies that are anxious to understand how their products and services are perceived should have a system for automatically analyzing consumer sentiment expressed in online media.
Newspapers show the current happenings of the world and can be called a mirror of society. The sports section of a news portal carries news about football, cricket, tennis, hockey, sports clubs, team leaders and other players, scores, and sports issues. People read sports news to learn about scores, players' activities, the policies of sports organizations, their goals for the future, and local, national, or international matters that will affect them. Some online news portals have comment sections where people can express their opinions, which is helpful for learning what others think about a specific topic. Today, people want to know which topic in the country or the world carries the most positive sentiment and which carries the most negative sentiment. Sentiment analysis can find the polarity of public opinion towards any topic. This study may help people to assess the current state of sports issues in any country at any time. Various statistical and linguistic techniques have been developed for sentiment analysis. These methods have mostly been applied to English, and there is considerable scope for work on Bengali. In our study, we use the popular Naïve Bayes and Support Vector Machine classifiers to detect the sentiment of Bengali sports news comments.
Related Previous Works
Horoscopes consist of future predictions for each of the twelve zodiac signs and are very popular in India. Tirthankar Ghosal and Sajal K. Das focus on sentiment analysis of Bengali daily horoscopes using SVM with unigram features. They crawled the daily horoscope section of a leading Bengali newspaper and labeled each sentence as expressing positive or negative emotion. Turney (2002) classified reviews as positive (thumbs up) or negative (thumbs down) using unsupervised machine learning. Liu (2012) describes past work on the different methods of sentiment classification, such as supervised and unsupervised approaches, the origin of sentiment analysis, and its different domains, such as movie reviews, political data, and news articles.
Md. Zahurul Islam and Naushad UzZaman present the compilation methodology and some statistics of a Bangla news corpus, the "Prothom Alo corpus", which is the first of its kind for Bangla, and observe that it shows the typical behavior of Zipf's curve. Amandeep Kaur and Vishal Gupta survey the main approaches to sentiment extraction: subjective lexicons, N-gram modeling, and machine learning.
Shaika Chowdhury and Wasifa Chowdhury (2014) used Support Vector Machine (SVM) and Maximum Entropy (MaxEnt) classifiers to extract sentiments or opinions from Bangla microblog posts and to identify the overall polarity of each text as either negative or positive. In their comparative analysis, SVM obtained the best score.
Tanzir Altaf and Sabir Ismail (2016) used feature sets and a supervised classifier, and proposed a method that recognizes sentiment or opinion and extracts a unique feature in order to arrive at a better approach to understanding sentiment in Bangla text. Sentiment analysis (SA) of the Bengali language has evolved over time. Much of the work has been done by the Indian authors Amitava Das and Sivaji Bandyopadhyay, who collected data and determined the subjectivity of sentences; Das and Bandyopadhyay (2010) and Das and Bandyopadhyay (2011) represent much of their work. Our seniors have also done some work on SA in their undergraduate theses. They tested Bangla text with cosine similarity using TF-IDF and with naive Bayes using a POS tagger and a stemmer. Some of them worked on sentiment analysis of news articles, and they also preprocessed their data. Still, SA for the Bangla language in Bangladesh needs more attention to achieve better results, so that people in Bangladesh can get real benefit from it.
Methodology
Data Set Collection
To start a sentiment analysis process, it is always necessary to build a sentiment lexicon and annotated data for machine learning. The details of resource acquisition are described below. Web content in Bangla is increasing, and the growing resources and data give researchers the chance to analyze and extract information from them. We visited more than 24 Bengali newspapers but found only 5 with a comment section. Most of them had no comments; only the Prothom-alo newspaper had a large number of comments. We therefore collected our data from the popular Bengali newspaper Prothom-alo, using a web crawler.
All the comments were divided into sentences. We found that different paragraphs in political news articles show different sentiments.
We annotated our data manually in three categories: positive (p), negative (n), and neutral (u). Sentences that express hope, happiness, gratitude, patriotism, affection to novel, etc. were annotated with the positive label, and sentences that express hate, frustration, complaint, etc. were annotated with the negative label. We then eliminated the neutral sentences.
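The labeling scheme above can be illustrated with a minimal sketch. The tab-separated "label, sentence" layout and the function names `parse_annotated` and `drop_neutral` are assumptions made for this example; the report does not describe how the annotations were stored.

```python
# Label codes follow the text: p = positive, n = negative, u = neutral.
VALID_LABELS = {"p", "n", "u"}


def parse_annotated(lines):
    """Parse 'label<TAB>sentence' lines into (label, sentence) pairs.

    The tab-separated layout is an assumption made for this sketch.
    """
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        label, _, sentence = line.partition("\t")
        if label not in VALID_LABELS or not sentence:
            raise ValueError(f"malformed annotation line: {line!r}")
        pairs.append((label, sentence.strip()))
    return pairs


def drop_neutral(pairs):
    """Keep only positive and negative sentences, as described in the text."""
    return [(label, sentence) for label, sentence in pairs if label != "u"]
```

After this step only two classes remain, so the later classification is a binary positive/negative problem.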
Data Preprocessing and Normalizing
As we collected our data from a news portal using a web crawler, in UTF-8 encoding, there was an additional problem related to Bangla text. We had to collect all the text data from the news corpus while leaving out all the HTML tags. Our total work can be divided into two parts.
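A minimal sketch of such a cleaning pipeline is given below, covering HTML tag removal, sentence splitting, punctuation removal, tokenization, and stop-word removal. It is an illustration under stated assumptions, not the pipeline actually used in this work: the function names, the regular expression for tags, the punctuation set, and the very short stop-word list are all hypothetical.

```python
import html
import re
import string

# Hypothetical, deliberately tiny stop-word list; a real list would be far longer.
STOP_WORDS = {"এবং", "ও", "এই", "যে"}

# Crude tag matcher; an assumption that is adequate only for simple markup.
TAG_RE = re.compile(r"<[^>]+>")

# ASCII punctuation plus the Bengali danda, double danda, and curly quotes.
# Punctuation is listed explicitly instead of using a \w-based pattern,
# because Python's \w does not match Bengali vowel signs and would split words.
PUNCTUATION = string.punctuation + "।॥‘’“”"
PUNCT_TABLE = str.maketrans({ch: " " for ch in PUNCTUATION})


def strip_html(raw):
    """Remove HTML tags and decode entities such as &amp;."""
    return html.unescape(TAG_RE.sub(" ", raw))


def split_sentences(text):
    """Split on the Bengali danda and on ! and ?, dropping empty pieces."""
    parts = re.split(r"[।!?]+", text)
    return [part.strip() for part in parts if part.strip()]


def tokenize(sentence):
    """Replace punctuation with spaces, split on whitespace, drop stop words."""
    cleaned = sentence.translate(PUNCT_TABLE)
    return [token for token in cleaned.split() if token not in STOP_WORDS]
```

For instance, `split_sentences(strip_html(raw_comment))` would yield the sentences of one crawled comment, and `tokenize` can then be applied to each sentence before feature extraction.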
Results
The performance of the classification methods can be measured using the following parameters: precision, recall, and accuracy. These are explained using four terms: true positive, true negative, false positive, and false negative.
- True Positive (TP) is defined as the number of sentences, from the test set, correctly labeled by the classifier as belonging to a particular class or label.
- True Negative (TN) is defined as the number of sentences, from the test set, correctly labeled by the classifier as not belonging to a particular class or label.
- False Positive (FP) is defined as the number of sentences, from the test set, incorrectly labeled by the classifier as belonging to a particular class or label.
- False Negative (FN) is defined as the number of sentences, from the test set, that are not labeled by the classifier as belonging to a particular class or label but should have been.
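The four counts defined above lead to the standard definitions: precision is the number of true positives divided by the sum of true and false positives, recall is the number of true positives divided by the sum of true positives and false negatives, and accuracy is the number of correct decisions divided by all decisions. The following sketch computes them for one target class; the function names and the zero-division convention (returning 0.0) are choices made for this example only.

```python
def confusion_counts(gold, predicted, target):
    """Count TP, TN, FP, FN for one target class over paired label lists."""
    tp = tn = fp = fn = 0
    for g, p in zip(gold, predicted):
        if p == target and g == target:
            tp += 1  # predicted target, truly target
        elif p == target:
            fp += 1  # predicted target, truly another class
        elif g == target:
            fn += 1  # missed a sentence that belongs to target
        else:
            tn += 1  # correctly left out of target
    return tp, tn, fp, fn


def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(tp, tn, fp, fn):
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0
```

With gold labels `["p", "p", "n", "n", "p"]`, predictions `["p", "n", "n", "p", "p"]`, and target class `"p"`, the counts are TP = 2, TN = 1, FP = 1, FN = 1, giving a precision and recall of 2/3 and an accuracy of 0.6.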
Discussions
We annotated the comment data carefully, but the annotation was somewhat noisy and confusing, because some sentences contain positive words as well as negative words. Annotating those sentences was quite problematic for us; in such cases, we annotated the data according to our own judgment. Because of this noisy data, accuracy may have decreased a little. Moreover, classifying this large collection requires more insight into the data, which means extracting more features. Acquiring those features requires processing more data, which in turn requires a higher-performing system. In the future, we will try to collect more data to get more accurate results.
Conclusion
Our main goal was to design and build a Bengali corpus, so we surveyed different domains. In the end, we found that public comments on sports news form a distinctive section for corpus design, although this type of sentiment analysis work already exists. This work focuses only on Bengali sports comments taken from Prothom-alo. Using this data, we ran our experiments with 7 different algorithms. We obtained a maximum accuracy of 70.75% with the Multinomial Naive Bayes classifier. By collecting more data, we can build a large sentiment analysis corpus of Bangla text. We used only a small part of the data we collected; in the future, we will use the full data set and try to add more data. This work can be extended to mining reviews from other sections of newspapers.