Application Of Parallel/Distributed Data Mining Algorithms For Text Classification
Data stratification scrutiny is the critical examination of how data, now available in abundance in every field, can be stratified, i.e. classified. When considering different text classification techniques, the key challenge lies in their overall performance. In this paper we discuss the need for parallel/distributed data mining algorithms for text classification. Three classification algorithms are taken into consideration: the k-NN classifier, the Centroid classifier and the Naive Bayes classifier. The paper reports findings on the algorithms' efficiency, namely the accuracy of correct classification and the speed of execution. The empirical results show that the Centroid classifier is the most accurate classifier in this case, with up to 95% accuracy, compared to k-NN at 92% and Naive Bayes at 91.5%.
Introduction
The performance of classification algorithms is becoming more and more crucial nowadays. As the volume of data produced daily keeps growing, the accuracy and speed desired for classification can no longer be achieved with traditional approaches. New models and paradigms should be used to process the data in parallel, using distributed environments.
In data mining, the classification task is also known as supervised learning because it uses a set of labeled training data to learn (generate a model); based on that model, it then classifies new instances. Each classification task consists of two phases: a training step and a testing step. The data source is likewise divided into two parts: the training data set and the testing data set. One of the key things to consider when providing the training and test data to the classifier is the ratio between them. Keeping 2/3 of the data for training and 1/3 for testing is preferable, as this facilitates the learning process of the classifier and in turn yields accurate results. The classification algorithm is applied to the training data set, and its accuracy is measured on the testing data set.
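As a rough sketch of this setup (not the paper's implementation; the helper name and the dummy corpus are ours), a labeled corpus can be shuffled and split 2/3 for training, 1/3 for testing:

```python
import random

def train_test_split(data, train_fraction=2 / 3, seed=42):
    """Shuffle the labeled corpus, then keep 2/3 for training and 1/3 for testing."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# 12 dummy (document, label) pairs just to show the 2/3 : 1/3 ratio
corpus = [("doc%d" % i, "Sports" if i % 2 else "Politics") for i in range(12)]
train, test = train_test_split(corpus)
print(len(train), len(test))  # 8 4
```

Shuffling before the split matters: if the corpus is stored grouped by class, a plain cut would leave some classes unseen during training.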
This paper shows the results of experiments based on two data sets. The first dataset contains a total of 1191 news articles categorized into three classes: Sports, Politics and Technology. The second dataset contains a total of 49034 news articles categorized into three classes: Baseball, Basketball and Football (US sports).
k-NN Classifier
The k-NN (k-nearest neighbors) algorithm belongs to the group of lazy learners, which means that k-NN does not generate a model; it uses the whole training dataset as a reference to classify new instances. The k in k-NN is a parameter of the algorithm, which can be k = 1, 2, 3, ..., n. This value tells us the number of neighbors the algorithm will take into consideration when classifying a new text document. If k = 1, the class of the new test instance is decided by the class of the most similar instance in the training dataset. For k > 1, a voting method is used to classify the instance. The core part of k-NN is choosing a distance metric to calculate the similarity (distance) between the test data and the training data. Different distance metrics have been proposed; to choose the best one, several factors have to be taken into consideration: 1) the dataset, 2) the task (classification or regression), 3) the format of the data. Some of the most widely used distance metrics are Euclidean distance, Manhattan distance, cosine similarity and the Jaccard coefficient. The k-NN classifier is a simple algorithm, easy to understand and very easy to implement, and it works well with both numerical and nominal attributes. However, it has some drawbacks: it does not generate a model, and when dealing with big data sets it leads to space and time issues. k-NN is considered a slow classifier, and it requires additional experiments to determine the best value for the k parameter.
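The procedure above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the toy corpus and function names are ours, cosine similarity over raw term frequencies stands in for the distance metric, and majority voting decides the class for k > 1:

```python
import math
from collections import Counter

def vectorize(text):
    """Represent a document as a bag-of-words term-frequency dict."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-frequency dicts."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(train, text, k=3):
    """train: list of (text, label). Majority vote among the k most similar docs."""
    v = vectorize(text)
    ranked = sorted(train, key=lambda tl: cosine_similarity(vectorize(tl[0]), v),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

train = [
    ("the team won the match", "Sports"),
    ("players scored in the game", "Sports"),
    ("parliament passed the new law", "Politics"),
    ("the minister gave a speech", "Politics"),
    ("election results announced", "Politics"),
]
print(knn_classify(train, "the team scored in the match", k=3))  # Sports
```

Note how the whole training set is scanned for every query, which is exactly the testing-phase cost the paper measures later.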
Centroid Classifier
The Centroid classifier is an algorithm from the group of eager learners. It is very similar to the k-NN classifier; however, the way the two algorithms classify new instances differs a lot. Both algorithms calculate a distance (cosine similarity), but the k-NN classifier calculates the distance between the test instance and all training instances, while the centroid-based classifier generates an average vector for each class and uses that average vector to calculate the similarity with test instances. Here we can clearly see the time advantage of the Centroid classifier over k-NN. The training step of the Centroid classifier is performed in the following way: assume we have training samples that are already labeled (classified), {(x1, y1), (x2, y2), ..., (xn, yn)}, where yi defines the class and xi the training instance from the training dataset. For each class yi we calculate the per-class centroid ci, i.e. the average of the training vectors belonging to that class: ci = (1/|Si|) Σx∈Si x, where Si is the set of training instances labeled yi.
The classification function is very fast: to classify an instance, we calculate the similarity between the instance and all centroids and pick the class of the most similar centroid. Researchers consider it one of the best text classification algorithms, although the Centroid classifier may lead to misclassification when applied to very similar classes.
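The two steps, averaging each class into a centroid and then comparing a query against the centroids only, can be sketched as follows. Again this is our illustrative code with a toy corpus, not the paper's implementation:

```python
import math
from collections import Counter, defaultdict

def vectorize(text):
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def train_centroids(train):
    """Average the term-frequency vectors of each class into one centroid."""
    sums, counts = defaultdict(Counter), Counter()
    for text, label in train:
        sums[label].update(vectorize(text))
        counts[label] += 1
    return {label: {t: f / counts[label] for t, f in vec.items()}
            for label, vec in sums.items()}

def centroid_classify(centroids, text):
    """Compare the query against one centroid per class, not every document."""
    v = vectorize(text)
    return max(centroids, key=lambda label: cosine_similarity(centroids[label], v))

train = [
    ("the team won the match", "Sports"),
    ("players scored in the game", "Sports"),
    ("parliament passed the new law", "Politics"),
    ("the minister gave a speech", "Politics"),
    ("election results announced", "Politics"),
]
centroids = train_centroids(train)
print(centroid_classify(centroids, "the team scored in the match"))  # Sports
```

Classification touches one vector per class instead of one per training document, which is where the speed advantage over k-NN comes from.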
Naive Bayes
Naive Bayes is a classification technique based on Bayes' theorem, which in statistics and probability theory relates the prior probability to the posterior probability: P(A|B) = P(B|A)P(A)/P(B), where A and B are events, P(A) and P(B) are the probabilities of A and B, P(A|B) is the conditional probability of A given B, and P(B|A) is the probability of B given A.
There are two reasons why this classifier is called "naive": the bag-of-words assumption and the conditional-independence assumption. The Naive Bayes algorithm treats all features (words) independently.
Both the learning and the classification phases of the Naive Bayes classifier are fast compared to other classification algorithms, and it updates its knowledge/model as the number of training samples increases. Its biggest drawback is the conditional-independence assumption: because the classifier treats all features independently, its accuracy may decrease.
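A minimal sketch of the idea, assuming a multinomial model with Laplace (add-one) smoothing; the toy corpus and function names are ours, not the paper's:

```python
import math
from collections import Counter, defaultdict

def train_nb(train):
    """Count documents per class and word occurrences per class."""
    class_docs, word_counts, vocab = Counter(), defaultdict(Counter), set()
    for text, label in train:
        class_docs[label] += 1
        for w in text.lower().split():
            word_counts[label][w] += 1
            vocab.add(w)
    return class_docs, word_counts, vocab

def nb_classify(model, text):
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        # log P(class) plus a log P(word | class) term per word,
        # with words treated as conditionally independent ("naive")
        score = math.log(class_docs[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for w in text.lower().split():
            score += math.log((word_counts[label][w] + 1)
                              / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

train = [
    ("the team won the match", "Sports"),
    ("players scored in the game", "Sports"),
    ("parliament passed the new law", "Politics"),
    ("the minister gave a speech", "Politics"),
    ("election results announced", "Politics"),
]
model = train_nb(train)
print(nb_classify(model, "the team scored in the match"))  # Sports
```

The incremental-update property mentioned above follows from the model being nothing but counts: adding a training document only increments counters.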
Results and Discussion
The algorithms were tested on multiple data sets.
Accuracy
In experiment 1, the testing data consisted of 237 news articles, which were passed through each classification algorithm: 49 politics news articles, 86 technology news articles and 102 sports news articles. The results show that the Centroid classifier outperforms k-NN in terms of accuracy.
Further, the experiment analyzes the accuracy of the algorithms for each class independently. The results showed that Centroid is the most accurate classifier for technology news articles.
For the third class (sports news articles), the Centroid algorithm correctly classified 101 of the 102 news articles, while k-NN misclassified 4 sports news articles. The Naive Bayes classifier also demonstrated good accuracy. From the experiments we concluded that the Naive Bayes classifier can be highly accurate when the amount of training data is sufficiently large (as in the case of sports news articles). Although its execution time was not good enough, it is considered a good choice for classifying text documents.
Execution time
This subsection presents the execution time in seconds taken by each classification algorithm (for a single instance) using the traditional, single-machine approach, measured on the first data set:

                                        k-NN    Naive Bayes    Centroid
  Training phase (seconds)               -         612            90
  Testing phase, single document (s)     25        1.5           0.42

The Centroid classifier is a fast algorithm: its training phase takes approximately 90 seconds on the first data set and its testing phase 0.42 seconds per document. k-NN has no training phase, as it is a lazy learner (it does not generate a model), but it performed poorly in the testing phase, taking up to 25 seconds per document. This time increases linearly with the number of training documents, i.e. as the training set grows, the execution time grows with it. For a real-life implementation of the k-NN classifier, we would have to adapt it to process data in a parallel manner, because these experiments showed poor execution times.
The Naive Bayes classifier gave undesirable results in terms of execution time in the training phase. This time increases linearly with the number of training documents: if we double the training instances, the training phase of Naive Bayes becomes twice as slow. On the other hand, its testing phase is fast.
Findings
Both k-NN and the Centroid classifier showed that they can achieve high classification accuracy on text documents. When the number of documents is not very high, the algorithms can perform very well in terms of both accuracy and execution time. However, when we increase the number of news articles in the training set, the execution time increases as well. As the amount of data on the Web keeps growing, the traditional algorithms are no longer enough; we are therefore required to process the data using parallel/distributed programming models.
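One simple form of such parallelism, sketched here under our own assumptions (a toy stand-in classifier, a single machine rather than a distributed cluster), is to fan the testing phase out over worker processes, since each document is classified independently:

```python
from multiprocessing import Pool

def classify(doc):
    """Toy stand-in for any of the three classifiers discussed above."""
    return "Sports" if "match" in doc else "Politics"

def classify_parallel(docs, workers=4):
    """Distribute the test documents across worker processes."""
    with Pool(workers) as pool:
        return pool.map(classify, docs)

if __name__ == "__main__":
    docs = ["the team won the match", "parliament passed a law"] * 500
    labels = classify_parallel(docs)
    print(labels[:2])
```

The per-document cost is unchanged, but with p workers the wall-clock time of the testing phase drops roughly by a factor of p, which is the kind of speed-up the slow k-NN testing phase above would need.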
Conclusions
This paper demonstrated the application and performance of three classification techniques on text documents (news articles). The k-NN (k-nearest neighbors), Centroid classifier and Naive Bayes algorithms were applied to two datasets: one with a total of 1191 news articles in three categories (politics, technology and sports), and one with 49034 news articles in three categories (football, basketball and baseball). The results showed that the Centroid classifier was dominant in terms of accuracy, whereas the k-NN classifier turned out to be moderately accurate but a very slow classifier.