Methods Of Large Scale Text Classification In Natural Language
Abstract
Text classification is the task of assigning unlabelled natural language documents to a predefined set of categories. The classification task can depend on various factors such as the structure of the data, the size of the data processed, and so on. Many real-world problems, however, need to consider a huge amount of data to be classified from many sources. Large scale text classification classifies text into thousands of classes, and in some cases each document may belong to only a single class, while in others to more than one class. Hierarchical relations can offer extra information to a classification system, which can improve scalability and accuracy. This work aims at a survey of the various methods used for text classification in NLP, including both machine learning and deep learning techniques. It also describes the evaluation measures commonly used for classification systems.
Keywords - Natural Language Processing, Large scale text classification, Vector space model, Convolutional neural network, Recurrent neural network
Introduction
Text classification deals with the problem of assigning documents to a predefined set of classes. Consider the case of binary classification, where there is just one class and each document either belongs to it or not. Spam filtering is such an example, where emails are classified as fraudulent or not. In machine learning, a classifier can be trained on positive and negative instances in order to perform the classification automatically, but such classifiers are rarely 100% correct even in the simplest cases. In large scale text classification, the volume of documents to be processed is very large (hundreds of thousands or even millions), leading to a large vocabulary (the unique words in the documents, also known as types). One aspect of multi-label classification is that the classes can be connected to each other, for example through parent-child relations composing a hierarchy. A class taxonomy offers extra information to a classification system, which can be exploited either to improve scalability or to improve accuracy. It can also affect the evaluation of a classification system.
The unavailability of datasets prompted researchers to initiate a series of challenges on Large Scale Hierarchical Text Classification (LSHTC), open to participants all over the world. The LSHTC challenge was conducted in four editions from December 2009 until 2014 and attracted more than 150 teams from around the world (USA, Europe and Asia). Subsequent workshops were held at the conferences ECIR 2010, ECML 2011, ECML 2012 and WSDM 2014, where the findings of the challenges were presented. The LSHTC initiative aimed at assessing the performance of classification systems in large-scale settings with a large number of classes. It included tracks of various scales in terms of classes, multi-task classification and unsupervised classification. Two corpora, one from Wikipedia and one from the ODP Web directory data, were mainly used in the workshops and may be downloaded from the permanent LSHTC website. This motivated many researchers to do extended work in the text classification area.
Text Classification Process
The goal of text classification is to automatically classify text documents into one or more defined categories. Classes are selected from a previously established taxonomy (a hierarchy of categories or classes). The task of representing a given document in a form suitable for a data mining system is referred to as document representation. Since data can be structured or unstructured, the form of representation is very important for the classification process, i.e. the documents must be represented as instances with a fixed number of attributes. Plain-text documents are thus converted into a fixed number of attributes in a training set. This process can be done in several ways.
Word Based Representation: The process of assigning one of the parts of speech to a given word in a document is termed Part Of Speech tagging, commonly referred to as POS tagging. Parts of speech include nouns, verbs, adverbs, adjectives, pronouns, conjunctions and their sub-categories. A Part Of Speech tagger, or POS tagger, tags the words automatically. Taggers use several kinds of information for tagging words, such as dictionaries, lexicons, rules, and so on. Dictionaries contain the category or categories of a particular word; that is, a word may belong to more than one category. For example, "run" is both a noun and a verb. Taggers use probabilistic information to resolve this ambiguity, as in the sketch below.
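A minimal POS-tagging sketch using NLTK (an assumed dependency, not part of the original survey; its tokenizer and tagger models must be downloaded once):

import nltk

nltk.download("punkt", quiet=True)                       # tokenizer model
nltk.download("averaged_perceptron_tagger", quiet=True)  # POS-tagging model

sentence = "They run a small shop and run every morning."
tokens = nltk.word_tokenize(sentence)

# Each token is paired with a Penn Treebank tag; the ambiguous word
# "run" (noun vs. verb) is resolved probabilistically from context.
print(nltk.pos_tag(tokens))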
Graph Based Representation: Bag-of-words is the typical, standard way to model a text document and is suitable for capturing word frequency, but BOW overlooks structural and semantic information. In a graph representation, mathematical constructs are used to model relationships and structural information effectively. A text can be suitably represented as a graph in which feature terms are depicted as vertices and edges capture the relations between the feature terms. Computations for various operations, such as term weighting and ranking, which are useful in many information retrieval applications, are supported by this model. Graph-based representation is an appropriate way to represent text documents and improves the results of analysis over the traditional model for various text applications. A document is modeled as a graph where terms are represented by vertices and relations between terms are represented by edges:

G = {Vertex, EdgeRelation}

There are generally five different types of vertices in the graph representation:

Vertex = {F, S, P, D, C}, where F = feature term, S = sentence, P = paragraph, D = document, C = concept.

EdgeRelation = {Syntax, Statistical, Semantic}

Edge relations between two feature terms may differ depending on the context of the graph:
- Words occurring together in a sentence, paragraph, section or document.
- Common words in a sentence, paragraph, section or document.
- Co-occurrence within a fixed window of n words.
- Semantic relations: words having similar meaning, words spelled the same way but having different meanings, opposite words.

Term significance is not effectively captured by the bag-of-words approach. Relationships between texts can be maintained by keeping a structural representation of the data, which leads to better classification performance; a sketch of such a representation follows.
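As an illustration of the co-occurrence edge relation above, a minimal graph-of-words sketch, assuming the networkx package is available; the window size n = 3 is an arbitrary choice:

import networkx as nx

def graph_of_words(tokens, window=3):
    # Vertices are feature terms; an edge links two terms that co-occur
    # within the window, weighted by their co-occurrence count.
    g = nx.Graph()
    for i, term in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if term != other:
                w = g.get_edge_data(term, other, {"weight": 0})["weight"]
                g.add_edge(term, other, weight=w + 1)
    return g

doc = "graph based representation models relations between feature terms".split()
g = graph_of_words(doc)
print(g.number_of_nodes(), g.number_of_edges())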
B. Constructing Vector Space Model

The Vector Space Model, or VSM, is a representation of a set of documents as vectors in a common vector space and is fundamental to a host of IR operations, ranging from scoring documents on a query to document classification and document clustering. The VSM is an algebraic model for representing text documents as vectors of identifiers, such as index terms. Feature subset selection for the text document classification task uses an evaluation function that is applied to a single word. Scoring of individual words can be performed using measures like Document Frequency (DF), Term Frequency (TF), etc. The feature extraction approach does not weight terms in order to discard the lower-weighted ones, as feature selection does, but compacts the vocabulary based on feature co-occurrences.

1) TF-IDF: Term Frequency-Inverse Document Frequency uses all tokens in the dataset as the vocabulary. TF is the frequency of a token in each document. IDF down-weights tokens by the number of documents in which they occur, typically as the logarithm of the collection size divided by that document count. The intuition for this measure is: an important word in a document will occur frequently and should be given a high score; but if the word occurs in too many documents, it is probably not distinctive and is thus assigned a lower score. The formula for this measure is:

tfidf(t, d, D) = tf(t, d) * idf(t, D),

where t denotes the term, d denotes each document and D denotes the collection of documents. A short sketch follows the lists below.
Advantages
- Easy to compute
- Provides a basic metric for extracting the most descriptive terms in a document
- Makes it easy to compute the similarity between two documents
Disadvantages
- TF-IDF is based on the bag-of-words (BoW) model. Since it uses a bag of words, it does not capture the position of words in text, semantics, co-occurrences in different documents, etc.
- TF-IDF is only useful as a lexical level feature
- It cannot capture semantics (e.g. as compared to topic models or word embeddings)
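The sketch mentioned above: building a TF-IDF vector space with scikit-learn (an assumed dependency; the toy corpus is illustrative only):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "large scale text classification",
    "text classification with a class hierarchy",
    "deep learning for natural language",
]

vectorizer = TfidfVectorizer()       # tokenizes, counts TF, applies IDF
X = vectorizer.fit_transform(docs)   # sparse document-term matrix

print(vectorizer.get_feature_names_out())
print(X.shape)                       # (number of documents, vocabulary size)

# Cosine similarity between documents 0 and 1; rows are L2-normalized
# by default, so the dot product is the cosine similarity.
print((X[0] @ X[1].T).toarray()[0, 0])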
2) Principal Component Analysis: PCA is a classical multivariate data analysis tool and a very good dimensionality reduction technique. Suppose there are N data samples and each sample is expressed with n observed variables x1, x2, ..., xn; we can then form a sample data matrix. PCA uses the variance of each feature to maximize separability, and it is an unsupervised algorithm. The steps of PCA, implemented in the sketch after this list, are:

- Standardize the data.
- Obtain the eigenvectors and eigenvalues from the covariance matrix or correlation matrix.
- Sort the eigenvalues in descending order and choose the k eigenvectors that correspond to the k largest eigenvalues, where k is the number of dimensions of the new feature subspace (k ≤ n).
- Construct the projection matrix W from the selected k eigenvectors.
- Transform the original dataset X via W to obtain the k-dimensional feature subspace Y.
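A NumPy sketch following these steps (NumPy is an assumed dependency; X is a random placeholder matrix):

import numpy as np

def pca(X, k):
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # 1. standardize
    cov = np.cov(Xs, rowvar=False)              # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # 2. eigenvalues/eigenvectors
    order = np.argsort(eigvals)[::-1]           # 3. sort descending, take top k
    W = eigvecs[:, order[:k]]                   # 4. projection matrix W
    return Xs @ W                               # 5. k-dimensional subspace Y

X = np.random.rand(100, 10)   # N = 100 samples, n = 10 observed variables
Y = pca(X, k=3)
print(Y.shape)                # (100, 3)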
Application of Text Classification Algorithms
Data mining algorithms in Natural Language Processing are used to get insights from large amounts of text data. Such an algorithm is a set of heuristics and calculations that creates a model from data. The algorithm first analyzes the data provided, then identifies specific types of patterns or trends. It uses the results of this analysis over many iterations to find the optimal parameters for creating the mining model. These parameters are then applied across the entire data set to extract actionable patterns and detailed statistics.

A. Machine Learning techniques for text classification

Machine Learning (ML) is an area of Artificial Intelligence (AI) comprising a set of statistical techniques for problem solving. In order to apply ML techniques to NLP problems, the unstructured text is converted into a structured format.
- Naive Bayes classification: The Naive Bayes classifier is a supervised classifier which gives an approach to express positive, negative and neutral sentiments in text. It categorizes words into their respective labels using the idea of conditional probability. The advantage of using Naive Bayes for text classification is that it needs only a small data set for training. The raw data from the web undergoes preprocessing: removal of numerals, foreign words, HTML tags and special symbols, yielding a set of words. Words are then labelled as positive, negative or neutral, a step performed manually by human experts. This preprocessing produces word-category pairs for the training set. Consider a word y from the test set (the unlabelled word set) and a window of n words (x1, x2, ..., xn) from a document. The conditional probability of the data point y belonging to the category of the n words from the training set is given (under the naive independence assumption) by:

P(y | x1, ..., xn) = P(y) * P(x1 | y) * ... * P(xn | y) / P(x1, ..., xn)
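A minimal sketch of this approach using scikit-learn (an assumed dependency; the texts and labels are toy stand-ins for the manually tagged training pairs described above):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["great product, works well", "terrible, waste of money", "it is okay"]
labels = ["positive", "negative", "neutral"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)   # word-count features per document

clf = MultinomialNB()                 # estimates P(y) and P(x_i | y) from counts
clf.fit(X, labels)

print(clf.predict(vectorizer.transform(["works great"])))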
- J48 algorithm for sentiment prediction: J48 is a decision tree based classifier used to generate rules for the identification of target terms. The feature space is divided into distinct regions, followed by the classification of instances into class labels in a hierarchical mechanism. Larger training collections are handled more efficiently by this method than by other classifiers. In the test phase, the level of a node is eventually lifted when a nearby feature satisfies the label condition of an interior feature in the same branch of the tree. The two branches of the decision tree are built step by step through the assignment of word labels. The J48 algorithm uses an entropy function for testing the classification of terms from the test set, where a term can be a unigram, bigram or trigram. Additional features of J48 are handling of missing values, decision tree pruning, continuous attribute value ranges, derivation of rules, etc.
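J48 itself is the Weka implementation of the C4.5 decision tree; as a rough stand-in, the sketch below uses scikit-learn's DecisionTreeClassifier with the entropy splitting criterion (an assumed dependency, and CART-based rather than true C4.5), with terms as unigrams, bigrams and trigrams as noted above:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# ngram_range=(1, 3) makes the terms unigrams, bigrams and trigrams.
vectorizer = CountVectorizer(ngram_range=(1, 3))
X = vectorizer.fit_transform(["good service", "bad service", "good food"])
y = ["positive", "negative", "positive"]

# Entropy-based splits over the term feature space, as in J48/C4.5.
tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(tree.predict(vectorizer.transform(["good"])))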
B. Deep Learning techniques for text classification
Deep learning is a technique in machine learning that achieves great power and flexibility by learning to represent the world as a nested hierarchy of concepts, with each concept defined in relation to simpler concepts, and more abstract representations computed in terms of less abstract ones. Two deep learning techniques are discussed below.
- Convolutional Neural Network: CNNs have been broadly used in image processing, where they have demonstrated highly accurate results. In NLP, where the inputs are texts or sentences represented as a matrix, each row of the matrix corresponds to one token, typically a word, though it could be a character. That is, each row is a vector that represents a word. Commonly, these vectors are word embeddings (low-dimensional representations), but they could also be one-hot vectors that index the word into a vocabulary.
For a 10-word sentence using a 100-dimensional embedding, we would have a 10 x 100 matrix as our input. For example, consider the sentence classification CNN depicted in Figure 2.3. Here, three filter region sizes are shown, 2, 3 and 4, each of which has 2 filters. Feature maps of variable length are produced by convolving the filters over the sentence matrix. Then 1-max pooling is performed over each map, i.e. the largest number from each feature map is recorded. Hence a univariate feature is generated from each of the six maps, and these 6 features are concatenated to form a feature vector for the penultimate layer. The final softmax layer then receives this feature vector as input and uses it to classify the sentence; here binary classification is assumed, hence two possible output states, as in the sketch below.
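A minimal PyTorch sketch of the architecture just described (PyTorch is an assumed dependency; the vocabulary size and random inputs are placeholders):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceCNN(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=100, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # One convolution per region size (2, 3, 4); 2 filters each.
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, 2, kernel_size=k) for k in (2, 3, 4)]
        )
        self.fc = nn.Linear(3 * 2, n_classes)   # 6 concatenated features

    def forward(self, token_ids):                    # (batch, sentence_len)
        x = self.embed(token_ids).transpose(1, 2)    # (batch, embed_dim, len)
        # Convolve, then 1-max pool each variable-length feature map.
        feats = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(feats, dim=1))      # logits for 2 classes

model = SentenceCNN()
logits = model(torch.randint(0, 5000, (1, 10)))      # one 10-word "sentence"
print(F.softmax(logits, dim=1))                      # two output states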
- Recurrent Neural Network: The concept behind RNNs is to make use of sequential data. In a conventional neural network we assume that all inputs (and outputs) are independent of one another, but for many tasks that is a poor assumption. If you want to predict the next word in a sentence, you had better know which words preceded it. RNNs are called recurrent because they perform the same task for each element of a sequence, with the output depending on the previous computations. Another way to think about RNNs is that they have a "memory" which captures information about what has been computed so far. In principle, RNNs can make use of information in arbitrarily long sequences, but in practice they are limited to looking back only a few steps. The uses of RNN language models are two-fold: first, they allow us to score arbitrary sentences based on how likely they are to occur in the real world, which gives us a measure of syntactic and semantic correctness; secondly, a language model allows us to generate new text. The figure below shows an RNN being unrolled (or unfolded) into a full network. By unrolling we simply mean that we write out the network for the complete sequence. For instance, if the sequence we care about is a sentence of 5 words, the network would be unrolled into a 5-layer neural network, one layer for each word, as in the sketch that follows.
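A minimal PyTorch sketch of unrolling an RNN over a 5-word sentence (PyTorch assumed; the embeddings are random placeholders):

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=100, hidden_size=64, batch_first=True)

sentence = torch.randn(1, 5, 100)   # batch of 1, 5 words, 100-dim embeddings
outputs, h_n = rnn(sentence)        # same cell applied at each of the 5 steps

print(outputs.shape)   # (1, 5, 64): one hidden state per unrolled step
print(h_n.shape)       # (1, 1, 64): final "memory" after the 5th word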
C. Performance measures of sentiment analysis
- Precision: Precision for a class C is the fraction of the number of documents correctly classified into class C over the total number of documents classified into class C:

Precision = TP / (TP + FP)

Here TP, FN, FP and TN refer respectively to the number of true positive, false negative, false positive and true negative instances.
- Recall: Recall is the fraction of the number of correctly classified documents over the total number of documents that belong to class C:

Recall = TP / (TP + FN)
- F-measure: The F-measure or F1-measure is a combination of recall and precision used for performance evaluation. The F1 measure is a derived effectiveness measurement: the resultant value is the harmonic mean of precision and recall:

F-measure = (2 * Precision * Recall) / (Precision + Recall)

A small sketch computing all three measures is given below.
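A Python sketch computing the three measures for one class C from the counts defined above (the values are illustrative only, not from any experiment):

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)              # TP / (TP + FP)
    recall = tp / (tp + fn)                 # TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    return precision, recall, f1

print(precision_recall_f1(tp=80, fp=20, fn=40))   # (0.8, 0.666..., 0.727...)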
Observation
Comparing the various methods of text classification, some methods work well only with small datasets, and most of the ML techniques were found to be fast only on small datasets, whereas most real-world problems in NLP deal with data at large scale. Considering a class taxonomy or hierarchy is also important, since it offers extra information to a classification system that can improve scalability and accuracy. Thus, for complex problems, deep learning was found to be more promising; moreover, with deep learning, learning can be done unsupervised. Since the data to be classified can come from varied sources, the representation of the data, the features to be selected, the classification approach (whether ML or DL) and the evaluation measure to be used all depend mostly on the context.
Conclusions
Text classification assigns one or more classes to a document according to its content. Classes are automatically selected from a previously established taxonomy to make the process fast and efficient. Deep learning is a technology that has become an essential part of machine learning workflows. Deep learning has been used extensively in natural language processing (NLP) because it is well suited to learning the complex underlying structure of a sentence and the semantic proximity of various words. Various evaluation measures were also described for checking the accuracy of a classification system.