PSO-Based Weight Optimization for Similarity Measures to Address Binary Queries Using Customer Reviews in a Supervised Manner
Introduction
This work proposes a novel idea of combining a question-answer dataset and a review dataset so that, instead of a knowledge-based system that fetches an answer to a question, the proposed model finds the most relevant reviews corresponding to the question. The relevance of a review depends on its relevance score. The similarity between a question and a review is calculated using cosine similarity, TF-IDF based similarity, and WordNet and word-embedding based measures. The model then applies a PSO based weight-optimization technique, where the weights are the contributions of each similarity measure, in order to achieve higher accuracy in identifying relevant reviews. The model is evaluated in terms of how well the sentiment extracted from the top-ranked reviews agrees with the answer to the question in the Q/A dataset. Particle swarm optimization (PSO) is a population-based stochastic optimization technique.
In PSO, each candidate solution is a "particle" in the search space. All particles have fitness values, evaluated by the fitness function to be optimized, and velocities which direct their movement. The particles fly through the problem space by following the current optimum particles. In this context, a particle is a weight vector that denotes the contribution of each similarity measure toward the desired goal.
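The following minimal Python sketch illustrates this idea. The fitness function is a hypothetical placeholder that scores how well a weighted combination of the four similarity measures (cosine, TF-IDF, WordNet, word embedding) separates relevant from irrelevant reviews; the constants and toy data are illustrative, not the proposed model's actual settings.

```python
# Minimal PSO sketch for optimizing similarity-measure weights.
import numpy as np

N_PARTICLES, N_WEIGHTS, N_ITER = 20, 4, 100   # one weight per similarity measure
W, C1, C2 = 0.7, 1.5, 1.5                      # inertia and acceleration constants

def fitness(weights, sims, labels):
    """Hypothetical fitness: accuracy of the weighted similarity score.

    sims   -- matrix of shape (n_pairs, 4): cosine, TF-IDF, WordNet, and
              word-embedding similarity for each question/review pair
    labels -- 1 if the review is relevant to the question, else 0
    """
    scores = sims @ weights
    return np.mean((scores > 0.5) == labels)

def pso(sims, labels):
    rng = np.random.default_rng(0)
    x = rng.random((N_PARTICLES, N_WEIGHTS))          # particle positions (weight vectors)
    v = np.zeros_like(x)                              # particle velocities
    pbest = x.copy()
    pbest_fit = np.array([fitness(p, sims, labels) for p in x])
    gbest = pbest[pbest_fit.argmax()].copy()
    for _ in range(N_ITER):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        # Velocity update: inertia + pull toward personal best + pull toward global best.
        v = W * v + C1 * r1 * (pbest - x) + C2 * r2 * (gbest - x)
        x = np.clip(x + v, 0.0, 1.0)
        fit = np.array([fitness(p, sims, labels) for p in x])
        improved = fit > pbest_fit
        pbest[improved], pbest_fit[improved] = x[improved], fit[improved]
        gbest = pbest[pbest_fit.argmax()].copy()
    return gbest / gbest.sum()                        # normalized contribution of each measure

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    sims = rng.random((200, N_WEIGHTS))               # toy similarity scores
    labels = (sims.mean(axis=1) > 0.5).astype(int)    # toy relevance labels
    print(pso(sims, labels))
```

In this sketch each particle is exactly the weight vector described above, and the swarm's global best, normalized to sum to one, gives the learned contribution of each similarity measure.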
Literature Review
Text Mining with Information Extraction
This report summarizes experiments on a corpus of career-opportunity postings in the IT domain. These experiments demonstrate that the predictive rules obtained by applying KDD can be used to improve the recall of extracted information. The report also describes a technique called DISCOTEX (DISCOvery from Text EXtraction) that uses both IE and KDD for text mining. First, a database is constructed by applying a learned information extraction system to a collection of natural-language documents. Standard data mining algorithms are then applied to this extracted data, discovering knowledge that can be used to improve the performance of information extraction and other tasks.
In this proposed text mining method, IE plays a pivotal role: it preprocesses a collection of texts and passes the extracted information to the data mining component. A knowledge base of extraction rules is acquired by training on a collection of text documents, and the learned rules are then applied to new text. A database is constructed from the document collection using the IE extraction patterns, producing a corpus of structured records. Standard KDD methods are then used to discover new relationships. Each slot-value pair in the extracted database is treated as a discrete binary feature, and the discovered knowledge is stored in the form of prediction rules. RIPPER and APRIORI are also used to extract interesting rules. The discovered knowledge is then tested for accuracy on an independent dataset.
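As a rough illustration of this representation, the sketch below converts hypothetical extracted records into binary feature vectors, one feature per slot-value pair. The records are invented for illustration, in the spirit of DISCOTEX's job-posting domain; the actual pipeline is more involved.

```python
# Each slot-value pair extracted by IE becomes a discrete binary feature.
records = [
    {"language": {"java", "sql"}, "platform": {"windows"}},
    {"language": {"c++"},         "platform": {"windows", "linux"}},
]

# Collect the vocabulary of slot-value pairs across all records.
features = sorted({(slot, val) for r in records
                   for slot, vals in r.items() for val in vals})

# One binary vector per record; rule miners such as RIPPER or APRIORI
# can then be run over these vectors to discover prediction rules.
vectors = [[int(val in r.get(slot, set())) for slot, val in features]
           for r in records]

for f, col in zip(features, zip(*vectors)):
    print(f, list(col))
```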
To test the discovered predictive rules, they are applied to predict information already present in a user-labeled database. The experimental results indicate that DISCOTEX achieves performance comparable to a rule miner trained on a manually created database.
Ranking of Web Documents using Semantic Similarity
Web content is in general difficult to extract information from, as it is written in natural language. Many traditional search engines use lexical matching and link analysis to produce result sets for user queries; these result sets often fall short of user expectations because they include many unimportant pages. There are also cases where web pages have the same meaning but use different sets of words. Hence, an approach is required that explores not only the keywords in the text but also the relationships between the keywords. An enhanced ranking of web pages can then be produced by establishing semantic similarity between a document and the query given by the user, thus removing irrelevant pages. This paper explores this new ranking model. The architecture of the model is based on three components: an ontology processor, a ranker module, and a document processor.
First, keywords are extracted from the document using syntactic analysis. A domain dictionary of related words and their synonyms is built first, and the words in the dictionary are assigned relevance weights using a fuzzy set approach. Each word in the document is mapped to the weighted dictionary words, sentence by sentence. Each sentence thus receives a relevance value, and these values are integrated using a statistical approach to form the relevance of paragraphs and, in turn, of the entire page. Concepts extracted from the document are compared with the user query using the ontology processor, and the maximum value obtained gives the actual relevance of the web page with respect to the user query. The similarity computation of this new approach was found to be better than that of traditional models.
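The paper leaves the statistical integration step unspecified; the sketch below assumes a simple averaging scheme over hypothetical fuzzy dictionary weights, purely to make the bottom-up aggregation from sentences to paragraphs to the page concrete.

```python
# Bottom-up relevance aggregation: word weights -> sentence -> paragraph -> page.
# The dictionary weights are hypothetical fuzzy relevance values for a domain;
# the averaging scheme is an assumption, not the paper's exact method.
dictionary = {"engine": 0.9, "motor": 0.9, "fuel": 0.7, "car": 0.6}

def sentence_relevance(sentence):
    # Map each word to its dictionary weight (0 if absent) and average.
    words = sentence.lower().split()
    return sum(dictionary.get(w, 0.0) for w in words) / len(words)

def page_relevance(paragraphs):
    # Sentence scores roll up to paragraph scores, then to a page score.
    para_scores = [sum(map(sentence_relevance, p)) / len(p) for p in paragraphs]
    return sum(para_scores) / len(para_scores)

page = [["the engine uses fuel", "the motor is efficient"],
        ["a car needs regular service"]]
print(page_relevance(page))
```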
Word Importance-based Similarity of Documents Metric (WISDM)
This paper presents the Word Importance-based Similarity of Documents Metric (WISDM), a new technique for fast and scalable computation of the similarity of scientific documents. Document distance calculation and relevance-based classification of documents are needed because users often have little idea of the parameters they should look for, and a distance metric over documents addresses this problem. WISDM is an embedding-based technique that takes advantage of TF-IDF to enhance text-similarity calculations while significantly improving performance and reducing memory usage. The paper focuses on the unsupervised version of WISDM. WISDM has two major components: a TF-IDF model and a word2vec model. The TF-IDF model identifies key words or key phrases by scoring them. The final result is a matrix in which each row is the word2vec embedding of a key token. The distance between two documents is computed using the RV coefficient. In statistics, the RV coefficient is a multivariate generalization of the squared Pearson correlation coefficient; it measures the closeness of two sets of points, each represented as a matrix. The performance of this technique was measured on a collection of test documents, and the results were compared to the established methods SIF and WMD. WISDM outperforms these methods with only a small compromise in precision.
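The sketch below computes an RV coefficient between two documents' embedding matrices. Forming the Gram matrices in the shared embedding dimension, so that the documents may contain different numbers of key tokens, is an assumption here, not necessarily the exact orientation used by WISDM.

```python
# RV coefficient between two documents represented as matrices of
# word2vec embeddings of their key tokens.
import numpy as np

def rv_coefficient(X, Y):
    """X: (n_tokens_1, d) and Y: (n_tokens_2, d) embedding matrices."""
    sx, sy = X.T @ X, Y.T @ Y                     # d x d configuration matrices
    num = np.trace(sx @ sy)
    den = np.sqrt(np.trace(sx @ sx) * np.trace(sy @ sy))
    return num / den                              # 1 = identical configurations

doc_a = np.random.default_rng(1).normal(size=(12, 50))   # 12 key tokens, d = 50
doc_b = np.random.default_rng(2).normal(size=(9, 50))    # 9 key tokens
print(rv_coefficient(doc_a, doc_b))
```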
Text Similarity Calculations Using Text and Syntactical Structures
This paper studies how related texts can be treated as modified versions of each other, where modifications take the form of insertion, deletion, or replacement of some text with other text. Units extracted with the help of POS tagging can be used in place of the actual words. Converting documents to their syntactical structures greatly reduces their dimensionality, and the loss of information is smaller than in methods that use actual phrases. A document's sentences are first converted into ordered sequences of POS tags, which are then given as input to a Longest Common Subsequence (LCS) algorithm that determines the count and size of the LCSs discovered. Documents are compared and ranked according to the similarity of their POS tags, and documents that rank highly in this comparison are then compared using actual words and phrases. The proposed method was evaluated on two sets of unrelated data, and encouraging results were obtained with this two-stage similarity determination.
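A minimal sketch of the first stage follows: sentences reduced to POS-tag sequences are compared with a standard dynamic-programming LCS. The tag sequences are hypothetical and could come from any POS tagger, such as nltk.pos_tag.

```python
# Stage one: compare sentences by the LCS of their POS-tag sequences.
def lcs_length(a, b):
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

s1 = ["DT", "NN", "VBZ", "DT", "JJ", "NN"]   # e.g. "the cat sees the small dog"
s2 = ["DT", "JJ", "NN", "VBZ", "DT", "NN"]   # e.g. "the big cat sees the dog"

# Normalized LCS length serves as the first-stage similarity score;
# only high-scoring pairs proceed to word-level comparison.
print(lcs_length(s1, s2) / max(len(s1), len(s2)))
```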
Addressing Complex and Subjective Product-Related Queries with Customer Reviews
This paper aims to learn whether a previously given review about a product is relevant to a current query, based on a large set of already answered questions. Each review is treated as an expert that votes on the response to the current question, and a relevance function is learned based on which reviews answer questions correctly. The aim is thus to surface relevant answers to the question from existing reviews. Questions asked by customers can be subjective, depending on the personal experience of the user, and linguistically complex, which means that finding an answer based on word similarity alone can be very difficult; the relevance of reviews must therefore account for linguistic differences such as the use of synonyms. The proposed technique has two components: first, a relevance function that computes the relevance of a review in the context of the question, and second, a prediction function that allows relevant reviews to vote for the correct answer. The aim is to remain agnostic about what makes a review relevant and to learn this from the data alone. A mixture of experts is used to combine the outputs of several classifiers by assigning a weight to each of them; the model thus learns both the classification and the relevance parameters at the same time. In the proposed technique, called MOQA (Mixtures of Opinions for Question Answering), the mixture-of-experts model is adapted to identify relevant reviews and give opinions on the current question. A scoring function s(r, q) is defined to express the relevance of a review r to a query q. Answering binary questions is comparatively easy, as each expert has to make a binary decision, leading to a bilinear scoring function.
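A minimal sketch of such a bilinear scoring function is given below, assuming TF-IDF-style feature vectors for questions and reviews and a randomly initialized parameter matrix M standing in for the trained MOQA model.

```python
# Bilinear relevance function s(r, q) = q^T M r over feature vectors.
import numpy as np

rng = np.random.default_rng(0)
d = 100                                   # feature dimensionality (e.g. TF-IDF vocabulary)
M = rng.normal(scale=0.1, size=(d, d))    # learned during training; random here

def s(r, q):
    """Bilinear relevance of review r to question q."""
    return q @ M @ r

q = rng.random(d)                         # feature vector of the question
reviews = rng.random((5, d))              # feature vectors of five candidate reviews

# Reviews with higher s(r, q) get more voting weight in the mixture of experts.
print(sorted(range(5), key=lambda i: -s(reviews[i], q)))
```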
For open-ended questions, the aim is to train the model so that it assigns the highest score to the true answer; the Area Under the Curve (AUC) of the ranking function should be maximized. Hence, during training, the scoring function takes the candidate answer rather than the question itself, and it is ensured that the score of the correct answer to a question is higher than that of a non-answer. Better results were found for open-ended questions than for binary questions, and similar performance was achieved for objective and subjective questions during the experiments.
Feature Selection and Ensemble Construction: A Two-Step Method for Aspect Based Sentiment Analysis
In this paper, a cascaded framework is presented for feature selection and classifier ensemble construction using particle swarm optimization (PSO) for aspect based sentiment analysis. PSO optimizes a problem by repeatedly trying to improve candidate solutions: it starts with random solutions and then searches for the global optimum over iterations. The candidate solutions are called particles. Each particle changes position at a rate called its velocity, and stores both its own previous best position and the global best position. Each iteration adjusts every particle in the direction of these best positions. These steps can be summarized as: evaluate, compare, and imitate.
Finally, one best particle is found that meets the stopping criteria. Two other important concepts of PSO are velocity and neighborhood: every particle has a velocity vector that is updated at the end of each iteration, and the neighborhood defines how particles are connected within the swarm and is used to update their respective velocity vectors. Aspect based sentiment analysis is performed in two steps: aspect term extraction and sentiment classification. A variety of features are used for aspect term extraction: words, local context information, part-of-speech (PoS) tag, head word, head word PoS, chunk information, lemma, stop word, word length, prefix and suffix, frequent aspect term, dependency relation, WordNet, named entity information, character n-grams, aspect term list, word cluster, semantic orientation (SO) score, and orthographic features.
Opinions expressed by users in a review are classified into four sentiment classes: positive, negative, neutral, and conflict. A feature selection technique is developed using a binary version of PSO, as sketched below. The filtered, compact set of features performs better than the complete feature set used by the baseline model for both aspect term extraction and sentiment classification. A PSO based ensemble is then constructed and cascaded after the feature selection module, using features identified based on the properties of the different classifiers and domains.
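The sketch below illustrates a standard binary PSO of this kind, in which each particle is a bit mask over the candidate features and velocities are squashed through a sigmoid into bit-flip probabilities. The fitness function here is a hypothetical stand-in for the classifier's F-measure on a validation set.

```python
# Binary PSO for feature selection: particles are bit masks over features.
import numpy as np

def binary_pso(n_features, fitness, n_particles=10, n_iter=50,
               w=0.7, c1=1.5, c2=1.5):
    rng = np.random.default_rng(0)
    x = rng.integers(0, 2, size=(n_particles, n_features))   # bit masks
    v = np.zeros((n_particles, n_features))
    pbest, pbest_fit = x.copy(), np.array([fitness(p) for p in x])
    gbest = pbest[pbest_fit.argmax()].copy()
    for _ in range(n_iter):
        r1, r2 = rng.random(v.shape), rng.random(v.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        prob = 1.0 / (1.0 + np.exp(-v))              # sigmoid -> P(bit = 1)
        x = (rng.random(v.shape) < prob).astype(int)
        fit = np.array([fitness(p) for p in x])
        better = fit > pbest_fit
        pbest[better], pbest_fit[better] = x[better], fit[better]
        gbest = pbest[pbest_fit.argmax()].copy()
    return gbest                                     # selected feature subset

# Toy fitness: reward masks close to a hypothetical "ideal" subset; a real
# run would instead train and evaluate a classifier on the masked features.
ideal = np.array([1, 0, 1, 1, 0, 0, 1, 0])
print(binary_pso(8, lambda m: -np.abs(m - ideal).sum()))
```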
Three classifiers, namely Maximum Entropy (ME), Conditional Random Field (CRF), and Support Vector Machine (SVM), are used as base learning algorithms. Experiments on aspect term extraction and sentiment analysis in two different domains show the effectiveness of the proposed approach. The proposed ensemble achieves F-measure scores of 84.52% and 74.93% for aspect term extraction in the restaurant and laptop domains, respectively. For sentiment classification, accuracies of 80.07% and 75.22% are obtained for the restaurant and laptop domains, respectively.