Bangla Content Categorization Using Text Based Supervised Learning Methods
The widespread and growing availability of text documents in electronic form increases the importance of automatic methods for analyzing their content. In particular, Bangla content generation has grown considerably in recent years as the number of Bangla-speaking social media users has increased. In this paper, we present a supervised learning based Bangla content classification method. We have created a large Bangla content dataset and made it publicly available. We evaluated several machine learning algorithms on this dataset using text based features. In our experiments, logistic regression performed best among the algorithms tested.
Text categorization is an active research area of text mining in which documents are classified using supervised, unsupervised or semi-supervised learning. Traditionally, this task was done manually, but manual classification is labor intensive and expensive to scale. In addition, sentiment analysis, or opinion mining, has become popular and has helped in building better products, understanding user opinion, and making and managing business decisions.
Bangla is the sixth most widely spoken language in the world, with a speaker population that now exceeds 250 million. It is the primary language of Bangladesh and the second most spoken language in India. Its use online is accelerating with the massive growth of social media, where users share their views and opinions on any topic of interest. This results in huge volumes of user-generated information on microblogging sites, which is used in many applications. One such application is product review mining, where companies analyze the reviews provided by consumers to decide which products should be improved and to make decisions about product sales. Conversely, consumers read the reviews of earlier buyers to decide what to buy and what to avoid. Completing this entire process requires analyzing millions of reviews. This is where text mining makes the work easier.
Among the various machine learning approaches to document categorization, the most popular is supervised learning, in which the underlying input-output relation is learned from a limited number of labeled training examples and output values are then predicted for unseen inputs. A number of supervised learning techniques, such as Neural Network, K-Nearest Neighbor, Decision Tree, Naive Bayes, Support Vector Machine, and N-grams, have been used for text document categorization. Although text categorization has long been studied in other languages, advances in Bangla are recent. The few works on Bangla document categorization include N-gram techniques, the Naive Bayes classifier and Stochastic Gradient Descent. However, we have observed that most of the work in the literature lacks annotated corpora. Most of the methods are not comparable to each other, since they use different datasets and do not share them publicly. In addition, the datasets were not large enough. Moreover, the methods themselves are not available for later use, which makes comparison nearly impossible. Our work is motivated by these observations.
In this paper we aim to categorize Bangla documents. We extracted articles from the top news article provider in Bangla over a period of three months and created a large dataset. Several supervised machine learning techniques were used to classify these articles using text based features. Among all the classifiers tested, logistic regression performed best. We have also developed a web application based on our method. We have made our data extraction tool and the datasets available for use by other researchers. The main contributions of this paper are as follows:
- A large, publicly available Bangla document dataset.
- A publicly available tool for extracting Bangla articles from news provider websites.
- A method for classifying Bangla documents.
- A publicly available tool for Bangla content categorization.
The techniques most frequently used for text categorization are K-Nearest Neighbor (KNN), Naïve Bayes Classifier (NB), Decision Tree (DT), Neural Network (NNet) and Support Vector Machine (SVM). We also tried several other algorithms, both older and newer. In total, we used six algorithms for classification and compared them to determine which classification technique gives the best result. A brief description of the six algorithms is given below.
Algorithm 1: K-Nearest Neighbor (KNN)
The KNN algorithm is one of the simplest classification algorithms. It is a non-parametric, lazy learning algorithm. Its purpose is to use a database in which the data points are separated into several classes to predict the class of a new sample point. It is called a lazy learner because it does not build a generalized model from the training data points; it simply stores them and defers all computation to prediction time. KNN is based on feature similarity: how closely the features of a sample resemble those of the training points determines how the sample is classified. Pseudocode for the algorithm is given below.
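As a minimal illustration of the nearest-neighbor vote described above (a Python sketch, not the authors' pseudocode), the following assumes Euclidean distance, a simple majority vote and k = 3. The toy feature vectors and the category labels are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every stored training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Lazy learning: nothing was fitted beforehand, we only rank stored points
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # Majority vote among the k nearest neighbors
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy 2-D feature vectors with hypothetical category labels
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = ["sports", "sports", "sports", "politics", "politics", "politics"]

prediction = knn_predict(train_points, train_labels, (1.1, 0.9), k=3)
print(prediction)
```

All the work happens inside `knn_predict`, which is why prediction cost grows with the size of the stored training set.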
Algorithm 2: Gaussian Naïve Bayes
Because it assumes normally distributed features, Gaussian Naive Bayes is best suited to cases where all features are continuous. Before turning to Gaussian NB, we first look at the Naive Bayes probabilistic model.

Mathematically, given an example to be classified, represented by a feature vector $x = (x_1, \dots, x_n)$, NB assigns to it the probabilities

$$p(C_k \mid x_1, \dots, x_n)$$

for each of the $K$ classes $C_k$ in the dataset. Learning this multivariate distribution directly would require a large amount of data. Thus, to simplify the task of learning, we assume that the features are conditionally independent of each other given the class. This leads to the use of Bayes' theorem,

$$p(C_k \mid x) = \frac{p(C_k)\, p(x \mid C_k)}{p(x)}.$$

In plain English, the above equation may be read as

$$\text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}}.$$

By the definition of conditional probability, the numerator is just the joint probability distribution $p(C_k, x_1, \dots, x_n)$, which may be factored through the chain rule,

$$p(C_k, x_1, \dots, x_n) = p(x_1 \mid x_2, \dots, x_n, C_k)\, p(x_2 \mid x_3, \dots, x_n, C_k) \cdots p(x_{n-1} \mid x_n, C_k)\, p(x_n \mid C_k)\, p(C_k).$$

Now, through the assumption of conditional independence of features, i.e. each feature $x_i$ is conditionally independent of every other feature $x_j$ for $j \neq i$ given the class, we get

$$p(x_i \mid x_{i+1}, \dots, x_n, C_k) = p(x_i \mid C_k),$$

leading us to the expression of the joint probability model as

$$p(C_k \mid x_1, \dots, x_n) \propto p(C_k, x_1, \dots, x_n) = p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k).$$

When the data at hand is continuous, the assumption is that the continuous values for each class are distributed according to a Gaussian distribution. Recall that the probability density function of the normal (Gaussian) distribution is given by

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right),$$

where $\sigma^2$ represents the variance of the values in $x$, while $\mu$ represents the mean of the values in $x$.

So, for Gaussian NB, suppose we have training data consisting of a continuous attribute $x$. We first segment the data by class. Then, we compute the mean and the variance of $x$ per class. We let $\mu_k$ be the mean of the values in $x$ for class $C_k$, and $\sigma_k^2$ be the variance of the values in $x$ for class $C_k$. Now, assume we have collected some observation value $x_i$. The probability density of $x_i$ given class $C_k$ is then

$$p(x = x_i \mid C_k) = \frac{1}{\sqrt{2\pi\sigma_k^2}} \exp\left(-\frac{(x_i - \mu_k)^2}{2\sigma_k^2}\right).$$
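The per-class mean and variance estimation and the resulting class score can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the implementation used in the experiments: it works in log space to avoid numerical underflow, uses the population variance, and adds a small variance floor (`1e-9`, an arbitrary choice) so that a constant feature does not cause division by zero. The toy data and class names are hypothetical.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate the prior and the per-feature mean and variance of each class."""
    by_class = defaultdict(list)
    for features, label in zip(X, y):
        by_class[label].append(features)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Population variance plus a small floor (assumption) to avoid zero variance
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def log_gaussian(x, mean, var):
    """Log of the normal density with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def predict_gaussian_nb(model, features):
    """Return the class maximizing log p(C_k) + sum_i log p(x_i | C_k)."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        scores[label] = math.log(prior) + sum(
            log_gaussian(x, m, v) for x, m, v in zip(features, means, variances)
        )
    return max(scores, key=scores.get)


# Toy continuous features with hypothetical class names
X = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2), (4.0, 5.0), (4.2, 4.8), (3.8, 5.2)]
y = ["a", "a", "a", "b", "b", "b"]

model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, (1.1, 2.1)))
```

Summing log densities is equivalent to multiplying the densities in the joint probability model, and it leaves the arg max unchanged.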
Algorithm 3: Support Vector Machine (SVM)
A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples. In two-dimensional space this hyperplane is a line dividing the plane into two parts, with each class lying on one side. For a dataset consisting of a feature set and a label set, an SVM classifier builds a model to predict classes for new examples, assigning each new data point to one of the classes. If there are only two classes, it is called a binary SVM classifier.
There are two kinds of SVM classifiers:
- Linear SVM classifier
- Non-linear SVM classifier
Linear Classifier: In the linear classifier model, we assume that the training examples are plotted as points in space. These data points are expected to be separated by an apparent gap. The model predicts a straight hyperplane dividing the two classes. The primary focus while drawing the hyperplane is on maximizing the distance from the hyperplane to the nearest data point of either class. The resulting hyperplane is called the maximum-margin hyperplane.
In a linear classifier, a data point is considered a p-dimensional vector (a list of p numbers), and we separate the points using a (p-1)-dimensional hyperplane. There can be many hyperplanes that separate the data linearly, but the best hyperplane is considered to be the one that maximizes the margin, i.e., the distance between the hyperplane and the closest data point of either class.
The maximum-margin hyperplane is determined by the data points that lie nearest to it, since it is the distance between the hyperplane and these points that we have to maximize. The data points that influence the hyperplane in this way are known as support vectors.
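To make the notion of margin concrete, the following Python sketch compares two candidate separating lines on a toy dataset and reports, for each, the margin and the points that attain it. It does not train an SVM: the two hyperplanes `(w, b)` are chosen by hand for illustration, the labels are assumed to be +1 and -1, and all names are hypothetical.

```python
import math


def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def margin_and_support_vectors(w, b, points, labels, tol=1e-9):
    """Return the geometric margin and the points lying exactly at that margin.

    Labels must be +1 or -1. A negative margin means that the hyperplane
    misclassifies at least one point.
    """
    distances = [label * signed_distance(w, b, x) for x, label in zip(points, labels)]
    margin = min(distances)
    support = [x for x, d in zip(points, distances) if abs(d - margin) <= tol]
    return margin, support


points = [(1.0, 1.0), (0.0, 0.0), (4.0, 4.0), (6.0, 3.0)]
labels = [-1, -1, 1, 1]

# Two hand-picked separating lines: x + y - 5 = 0 and x - 3 = 0
margin_a, support_a = margin_and_support_vectors((1.0, 1.0), -5.0, points, labels)
margin_b, support_b = margin_and_support_vectors((1.0, 0.0), -3.0, points, labels)

print(margin_a, support_a)
print(margin_b, support_b)
```

Both lines separate the two classes, but the first leaves a wider gap to its nearest points, so a maximum-margin criterion prefers it.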
Non-Linear Classifier: In the real world, the classes in a dataset generally overlap to some extent. In that case, separating the data into classes with a straight linear hyperplane is not a good choice. Instead, non-linear classifiers are created by applying the kernel trick to maximum-margin hyperplanes. In non-linear SVM classification, the data points are mapped into a higher dimensional space.
It often happens that the data points are not linearly separable in a p-dimensional (finite) space. To solve this, it was proposed to map the p-dimensional space into a much higher dimensional space, where a separating hyperplane corresponds to a non-linear boundary in the original space. Such non-linear boundaries can be drawn using the kernel trick. Each kernel is defined by a non-linear kernel function, which implicitly builds a high dimensional feature space. Many kernels have been developed. Some standard kernels are:
- Polynomial (homogeneous) Kernel: The polynomial kernel function can be written as $k(x_i, x_j) = (x_i \cdot x_j)^d$, where $k(x_i, x_j)$ is the kernel function, $x_i$ and $x_j$ are vectors in the feature space and $d$ is the degree of the polynomial.
- Polynomial (non-homogeneous) Kernel: In the non-homogeneous kernel, a constant term is added: $k(x, y) = (x \cdot y + c)^d$. The constant term $c$ is a free parameter that influences the combination of features, and $x$ and $y$ are vectors in the feature space.
- Radial Basis Function Kernel: Also known as the RBF kernel, it is one of the most popular kernels. It uses the squared Euclidean distance as its distance measure: $k(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$, where $x$ and $x'$ are vectors in the feature space and $\gamma$ is a free parameter. It can produce highly non-linear decision boundaries. The choice of parameter is critical: a poorly chosen value of $\gamma$ can lead to overfitting the data.
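The three kernels above can be written directly as small Python functions. The sketch below also illustrates the kernel trick for one simple case: for 2-D inputs, the homogeneous polynomial kernel of degree 2 gives the same value as an ordinary dot product after an explicit mapping into 3-D, without ever constructing that mapping. The function names, the example vectors and the default `gamma=0.5` are hypothetical choices for illustration.

```python
import math


def dot(x, y):
    """Ordinary dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, d=2, c=0.0):
    """(x.y + c)^d; c = 0 gives the homogeneous polynomial kernel."""
    return (dot(x, y) + c) ** d


def rbf_kernel(x, y, gamma=0.5):
    """exp(-gamma * ||x - y||^2), using the squared Euclidean distance."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def degree2_feature_map(x):
    """Explicit 3-D map whose dot product equals the degree-2 homogeneous kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


x, y = (1.0, 2.0), (3.0, 0.5)

# Computed in the original 2-D space
kernel_value = polynomial_kernel(x, y, d=2)
# Computed after explicitly mapping both points into 3-D
explicit_value = dot(degree2_feature_map(x), degree2_feature_map(y))

print(kernel_value, explicit_value, rbf_kernel(x, y))
```

For the RBF kernel the corresponding feature space is infinite dimensional, so only the kernel form is practical.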
Algorithm 4: Random Forest
Random forest is a supervised classification algorithm. One of its main advantages is that it can be used for both classification and regression problems. It works by creating a forest of some number of decision trees. The robustness of the algorithm depends on the number of trees in the forest: in general, the more trees, the higher the accuracy, although the gains diminish as trees are added.
The random forest algorithm builds on the decision tree concept. Decision trees are intuitive models that use a top-down approach, in which the root creates binary splits until a stopping criterion is met. The binary splitting of nodes provides a predicted value depending on the internal nodes leading to the terminal nodes. In a classification problem, a decision tree outputs a predicted target class for each of the terminal nodes produced. Given a training dataset with targets and features, the decision tree algorithm produces a set of rules, which can then be used to perform prediction on the test dataset.
In the decision tree algorithm, selecting the nodes and forming the rules is usually done using information gain or Gini index calculations. In the random forest algorithm, by contrast, the process of finding the root node and splitting the feature nodes is randomized: each tree is built on a random sample of the training data and considers only a random subset of the features when splitting. The random forest algorithm works in two stages. The first stage is creating the random forest. The second stage is performing prediction with the random forest classifier created.
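The two stages can be sketched in Python as follows. This is a deliberately simplified illustration, not the configuration used in the experiments: each "tree" is a depth-one decision stump, each stump is trained on a bootstrap sample and splits on a single randomly chosen feature, the threshold is picked by lowest weighted Gini impurity, and prediction is a majority vote. The function names, the toy data, `n_trees=25` and `seed=0` are all hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(X, y, feature):
    """Pick the threshold on one feature with the lowest weighted Gini impurity."""
    best = None
    for threshold in sorted({row[feature] for row in X}):
        left = [label for row, label in zip(X, y) if row[feature] <= threshold]
        right = [label for row, label in zip(X, y) if row[feature] > threshold]
        if not left or not right:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if best is None or score < best[0]:
            best = (score, threshold, majority(left), majority(right))
    if best is None:
        # The feature is constant in this sample: always predict the majority class
        label = majority(y)
        return (feature, float("inf"), label, label)
    _, threshold, left_label, right_label = best
    return (feature, threshold, left_label, right_label)


def fit_forest(X, y, n_trees=25, seed=0):
    """Stage 1: build the forest from bootstrap samples and random features."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(len(X)) for _ in range(len(X))]  # bootstrap sample
        feature = rng.randrange(len(X[0]))  # one random feature per stump
        forest.append(fit_stump([X[i] for i in sample], [y[i] for i in sample], feature))
    return forest


def predict_forest(forest, row):
    """Stage 2: every stump votes and the majority class wins."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return majority(votes)


# Toy data with hypothetical class names
X = [(1.0, 1.0), (1.5, 0.5), (2.0, 1.5), (6.0, 6.5), (6.5, 7.0), (7.0, 6.0)]
y = ["a", "a", "a", "b", "b", "b"]

forest = fit_forest(X, y)
print(predict_forest(forest, (0.5, 0.2)))
```

A full random forest grows deeper trees and draws a fresh random subset of features at every split; the stump version keeps the sketch short while preserving the bootstrap, random-feature and voting steps.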