Analysis of Sentiments Using Facial Features, Hand Gestures, Audio, and Textual Clues
Abstract:
Discovering the sentiments of other people has become an essential activity of the information age. In the past, people would ask friends or relatives for opinions before making a decision, and organizations conducted opinion polls and surveys to learn the feelings, views, and feedback of the general public about a product or service. In recent years, web documents that describe individual opinions and views have received growing attention; people routinely create reviews, comments, recommendations, feedback, and ratings. The market is now more interested in mining opinions from video data than from text data. We propose a novel methodology for multimodal sentiment analysis that harvests sentiments from Web videos using a model that draws on audio, visual, and textual modalities as sources of information. The approach combines feature extraction, multilevel fusion, and classification.
Introduction
To date, most work in sentiment analysis over the past decade has applied natural language processing to text reviews, and the available datasets and resources are largely restricted to text-based sentiment analysis. With the advent of social media, people now extensively express their opinions online, increasingly through videos (e.g., YouTube, Vimeo, VideoLectures), images (e.g., Instagram, Facebook), and audio (e.g., podcasts). There is therefore a growing demand to mine opinions and identify sentiments from these diverse modalities. Existing multimodal sentiment analysis systems rely on the emotions expressed in the video, facial features, and text. People who cannot speak use hand signs to convey their views, which go unnoticed by existing systems. Hence there is a need for a generic system capable of analyzing sentiments using all the available multimodal data. Here we use artificial neural networks (ANNs) for training on the dataset and for classification.
ANNs are nonlinear statistical data-modelling tools that capture complex relationships between inputs and outputs or discover patterns in data. Among their many advantages, the most recognized is that they learn directly from observed data. In this way, an ANN serves as a general function approximator: it helps estimate cost-effective methods for arriving at solutions when the underlying computing functions or distributions are not known explicitly. An ANN learns from data samples rather than entire data sets, which saves both time and money, and the underlying mathematical model is fairly simple, making ANNs a natural complement to existing data-analysis technologies.
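As a concrete illustration of this learning-from-samples idea, the following minimal Python sketch trains a small feedforward ANN as a nonlinear classifier on toy data. The feature dimensions, random labels, and the scikit-learn classifier are illustrative assumptions, not part of the original experiment.

```python
# Illustrative sketch (not the paper's implementation): a small feedforward
# ANN trained as a nonlinear classifier with scikit-learn's MLPClassifier.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# Toy stand-in for a fused multimodal feature matrix: 300 samples,
# 20 features, labelled -1 (negative), 0 (neutral), or 1 (positive).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = rng.integers(-1, 2, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
clf.fit(X_train, y_train)                 # learns weights by backpropagation
print("held-out accuracy:", clf.score(X_test, y_test))
```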
Related Work
The videos used in the experiment are obtained from the YouTube dataset and are no more than 3 minutes in duration. Each video is divided into frames, and each frame is pre-processed and then analyzed. The first 30 seconds, which contain titles and other introductory material, are skipped. The sentiment is predicted for each frame and the predictions are then combined.
Video features: analysis of facial expressions. Humans express emotions in a number of ways, to a large extent through the face, so facial expressions play a significant role in identifying emotions in a multimodal stream. A facial expression analyzer automatically identifies the emotional clues associated with facial expressions and classifies those expressions into sentiment categories. We use positive, negative, and neutral as the sentiment classes in the classification problem. In the annotations provided with the YouTube dataset, each video is segmented into sub-segments of a few seconds each, and every segment is annotated as 1, 0, or −1, denoting positive, neutral, or negative sentiment respectively.
Using MATLAB code, we converted all videos in the dataset into image frames and then extracted facial features from each frame. To extract facial characteristic points (FCPs) from the images, we used the facial recognition library Luxand FSDK, obtaining 66 FCPs per image. The FCPs were used to construct facial features, defined as distances between FCPs. GAVAM was also used to extract facial expression features from the face, and in our experiment the FSDK features were combined with the GAVAM features.
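The paper's conversion was done in MATLAB, with Luxand FSDK handling landmark extraction. Purely as an illustration of the frame-extraction step, the following hypothetical OpenCV sketch skips the first 30 seconds and saves one frame per second; the sampling rate and output naming are our assumptions.

```python
# Hypothetical OpenCV equivalent of the MATLAB frame-extraction step:
# skip the first 30 seconds (titles/intro), then save one frame per second.
import cv2

def extract_frames(video_path, out_prefix, skip_seconds=30):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25   # fall back if FPS is unreported
    frame_idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        t = frame_idx / fps
        # keep one frame per second after the introductory segment
        if t >= skip_seconds and frame_idx % int(fps) == 0:
            cv2.imwrite(f"{out_prefix}_{saved:05d}.png", frame)
            saved += 1
        frame_idx += 1
    cap.release()
    return saved
```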
If a segment of a video contains n images, we extract features from each image and average the feature values to compute the final facial expression feature vector for that segment.
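This pooling step can be expressed in a few lines. The sketch below assumes the per-frame features are stacked into a matrix; the dimensions shown (66 FCPs yielding 66·65/2 pairwise distances) are illustrative.

```python
# Minimal sketch of the segment-level pooling described above: average the
# per-frame feature vectors (e.g., pairwise FCP distances) into one vector.
import numpy as np

def segment_feature(frame_features):
    """frame_features: (n_frames, n_features) array for one video segment."""
    return np.asarray(frame_features).mean(axis=0)

# e.g., 12 frames, each with 66*65/2 pairwise FCP distances
frames = np.random.rand(12, 66 * 65 // 2)
print(segment_feature(frames).shape)  # (2145,)
```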
Analysis of hand signs: Hand signs are an effective tool for conveying people's sentiments. The use of sign language is not limited to individuals with impaired hearing or speech communicating with each other or with non-signers; it is often a prominent medium of communication in its own right. Instead of acoustically conveyed sound patterns, sign language uses manual communication to convey meaning, combining hand gestures and facial expressions with movements of other body parts such as the eyes and legs.
Audio features: Recent studies on speech-based emotion analysis have focused on identifying acoustic features such as fundamental frequency (pitch), intensity of utterance [19], bandwidth, and duration. The speaker-dependent approach gives much better results than the speaker-independent approach: accuracy of about 98% has been achieved using a Gaussian mixture model (GMM) as the classifier, with prosodic, voice-quality, and Mel-frequency cepstral coefficient (MFCC) features. However, the speaker-dependent approach is not feasible in applications that must handle a very large number of possible users (speakers).
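A hedged sketch of this kind of GMM-based audio classifier, using only MFCC features, might look like the following. The use of librosa, the diagonal covariance, and the number of mixture components are our assumptions rather than details from the cited study.

```python
# Sketch of a GMM-based audio sentiment classifier: one Gaussian mixture per
# class over MFCC frames, scored by average log-likelihood at test time.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(wav_path, n_mfcc=13):
    y, sr = librosa.load(wav_path, sr=None)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # (frames, 13)

def train_gmms(files_by_class, n_components=8):
    """files_by_class: dict mapping class label -> list of .wav paths."""
    return {
        label: GaussianMixture(n_components, covariance_type="diag").fit(
            np.vstack([mfcc_frames(f) for f in files])
        )
        for label, files in files_by_class.items()
    }

def predict(gmms, wav_path):
    feats = mfcc_frames(wav_path)
    # pick the class whose mixture assigns the highest average log-likelihood
    return max(gmms, key=lambda label: gmms[label].score(feats))
```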
Linguistic features: Analysis of text
Affective content recognition in text is a rapidly developing area of natural language processing that has attracted both research communities and industry in recent years. Sentiment analysis tools have numerous applications: they help companies understand customer sentiment about products, and political parties understand what voters feel about a party's actions and proposals. Significant work has been done on identifying the positive, negative, and neutral sentiment associated with words, multi-word expressions, phrases, sentences, and documents. Several researchers have also addressed the task of automatically identifying fine-grained emotions, such as anger, joy, surprise, fear, disgust, and sadness, expressed explicitly or implicitly in text. So far, approaches to text-based emotion and sentiment detection rely mainly on rule-based techniques, bag-of-words modelling with a large sentiment or emotion lexicon, or statistical approaches that assume the availability of a large dataset annotated with polarity or emotion labels.
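As a toy example of the lexicon-driven bag-of-words style mentioned above, the following sketch scores a sentence against small hand-made word lists; the lists are illustrative placeholders, not a real sentiment lexicon.

```python
# Toy lexicon-based polarity scorer in the bag-of-words style described
# above; the word lists are placeholders, not an actual sentiment lexicon.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "sad"}

def polarity(text):
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return 1 if score > 0 else -1 if score < 0 else 0  # pos / neg / neutral

print(polarity("I love this phone the camera is great"))  # 1
```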
Experiment
The experiment involves extracting the individual features from the visual data. Image processing is applied to recognize the facial features and the hand signs, and the results are fused to form a feature vector. Simultaneously, the audio signal (a .wav file) is analyzed and converted to text using a speech-to-text algorithm; the text present in the video's caption is also considered. The text is classified using a support vector machine (SVM). The results from the visual and audio data are then fused to form a final feature vector. The system is trained using the neural network and the dataset is built; the neural network then classifies the sentiment according to the previously trained dataset present in the system.
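A minimal sketch of the feature-level fusion and SVM classification step might look like the following; the feature dimensions and the random toy data are assumptions made purely for illustration.

```python
# Sketch of feature-level fusion: concatenate the visual, audio, and textual
# feature vectors per segment, then classify with an SVM (scikit-learn).
import numpy as np
from sklearn.svm import SVC

def fuse(visual_vec, audio_vec, text_vec):
    return np.concatenate([visual_vec, audio_vec, text_vec])

# toy training data: 100 segments with fused features and -1/0/1 labels
rng = np.random.default_rng(1)
X = np.stack([fuse(rng.normal(size=50), rng.normal(size=13), rng.normal(size=10))
              for _ in range(100)])
y = rng.integers(-1, 2, size=100)

svm = SVC(kernel="rbf").fit(X, y)
print(svm.predict(X[:5]))
```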
Proposed System
The proposed system is capable of analyzing the sentiments of a speaker using the available multimodal data, and it also predicts the sentiments and opinions of people who communicate through sign language. The training algorithm works as follows: assign random weights to all the linkages to start; using the inputs and the input-to-hidden linkages, compute the activation rate of the hidden nodes; using the hidden-node activations and the hidden-to-output linkages, compute the activation rate of the output nodes. The system then finds the error rate at the output nodes and recalibrates the linkages between the hidden and output nodes; using the weights and the error found at the output, it cascades the error down to the hidden nodes. The system then recalibrates the weights between the hidden and input nodes, and the process repeats until the convergence criterion is met, after which the final linkage weights are used to score the activation rate of the output nodes. A minimal sketch of this training loop is given below.
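The following numpy sketch implements the loop just described for a single hidden layer; the layer sizes, learning rate, fixed epoch count, and toy data are illustrative assumptions, not the system's actual configuration.

```python
# Minimal numpy sketch of the described training loop: random initial
# weights, forward pass, output error, error cascaded to the hidden layer,
# and weight recalibration repeated over many epochs.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))                   # fused multimodal features
y = (X.sum(axis=1) > 0).astype(float)[:, None]   # toy binary target

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1 = rng.normal(scale=0.1, size=(20, 16))        # input -> hidden linkages
W2 = rng.normal(scale=0.1, size=(16, 1))         # hidden -> output linkages
lr = 0.1

for epoch in range(500):
    h = sigmoid(X @ W1)                          # activation of hidden nodes
    out = sigmoid(h @ W2)                        # activation of output node
    err = out - y                                # error at the output node
    delta_out = err * out * (1 - out)            # gradient at the output
    delta_hid = (delta_out @ W2.T) * h * (1 - h) # error cascaded down
    W2 -= lr * (h.T @ delta_out) / len(X)        # recalibrate hidden->output
    W1 -= lr * (X.T @ delta_hid) / len(X)        # recalibrate input->hidden

print("training accuracy:", ((out > 0.5) == y).mean())
```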
Conclusion and Future Scope
In the experiment, features are extracted from the multimodal data using various techniques to capture the actual opinion expressed in the video, and the result of the analysis is presented in a form convenient to the user. The sentiment analysis of the video prevents the observer from becoming confused and helps them infer the intended information. Future work can improve the system so that it learns on its own using unsupervised classifiers. Applying sentiment analysis to mine huge amounts of data has become an important research problem, and sentiment classification is known to be domain dependent. Different types of classification algorithms should be combined in order to overcome their individual drawbacks, benefit from each other's merits, and enhance sentiment classification performance. Sentiment analysis can also be developed for new applications. The techniques and algorithms used for sentiment analysis have made good progress, but many challenges in this field remain unsolved and offer directions for future research.