Updated: Sep 23
In this blog I describe how to carry out sentiment analysis on a real-world dataset using the Python library spaCy. spaCy provides a concise API (Application Programming Interface) for accessing its methods and properties, which are backed by trained machine (and deep) learning models. Its pipelines output a wide range of document properties, such as tokens, each token's reference index, part-of-speech tags, entities, vectors, sentiment, vocabulary, etc.
We start with pre-processing the raw datasets, i.e. the yelp, imdb and amazon dataset files.
In the pre-processing step we merge the three separate dataset CSV files into a single CSV file. We also rename the column headers to 'Message' and 'Target'. The value '0' in the 'Target' column corresponds to negative sentiment and '1' corresponds to positive sentiment.
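The merge and rename step could look roughly like the sketch below. The three small DataFrames, the original column names ('text', 'label') and the output file name are stand-ins I made up for illustration; in practice each frame would be read from its own file with pd.read_csv.

```python
import pandas as pd

# Hypothetical stand-ins for the yelp, imdb and amazon files. In a real run
# each frame would come from pd.read_csv(...) on the corresponding file.
yelp = pd.DataFrame({"text": ["Great food.", "Slow service."], "label": [1, 0]})
imdb = pd.DataFrame({"text": ["A wonderful film."], "label": [1]})
amazon = pd.DataFrame({"text": ["Battery died in a day."], "label": [0]})


def merge_datasets(frames):
    """Stack the frames row-wise and rename the columns to Message/Target."""
    merged = pd.concat(frames, ignore_index=True)
    return merged.rename(columns={"text": "Message", "label": "Target"})


data = merge_datasets([yelp, imdb, amazon])

# Writing the single merged file (the file name here is an assumption):
# data.to_csv("merged_reviews.csv", index=False)
```

Using ignore_index=True gives the merged frame one continuous row index instead of repeating the indices of the three source files.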
In the processing phase, we import spaCy along with other Python libraries such as pandas and scikit-learn. We can use stop_words, the default set of stop words for the English language model in spaCy. Next, we simply iterate through each word in the input text and remove it if it exists in the stop word set of the spaCy language model.
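As a minimal sketch of the stop word removal, assuming a bare English tokenizer is enough for this step (spacy.blank("en") needs no downloaded model), the helper name remove_stop_words is my own:

```python
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

# A blank English pipeline only tokenizes; no trained model is required.
nlp = spacy.blank("en")


def remove_stop_words(text):
    """Drop every token whose lower-cased text is in spaCy's English stop word set."""
    doc = nlp(text)
    kept = [tok.text for tok in doc if tok.text.lower() not in STOP_WORDS]
    return " ".join(kept)


cleaned = remove_stop_words("This is a very good movie")
```

One thing to keep in mind for sentiment work: spaCy's default English set includes negations such as "not", so removing stop words blindly can change the meaning of a review. Whether to keep those words is a design choice for your own dataset.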
Next we carry out tokenization, the process of breaking a document down into words, punctuation marks, numeric digits, etc. We can also find the root of each word using spaCy lemmatization, which converts inflected forms (for example a verb's past tense or past participle) to their base form.
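Tokenization and lemmatization can be sketched as follows. I am assuming the trained pipeline en_core_web_sm here, since lemmas need a pipeline with a lemmatizer; if it is not installed, the sketch falls back to a bare tokenizer and uses the lower-cased token text in place of a lemma. The helper names are hypothetical.

```python
import spacy


def load_pipeline(name="en_core_web_sm"):
    """Load a trained pipeline if it is installed, else a bare English tokenizer."""
    try:
        return spacy.load(name)
    except OSError:
        # spacy.load raises OSError when the model package cannot be found.
        return spacy.blank("en")


nlp = load_pipeline()


def tokenize(text):
    """Split the text into words, punctuation marks, digits, etc."""
    return [tok.text for tok in nlp(text)]


def lemmatize(text):
    """Return one base form per token.

    token.lemma_ is empty when the pipeline has no lemmatizer, so the
    lower-cased token text is used as a fallback in that case.
    """
    return [tok.lemma_ or tok.text.lower() for tok in nlp(text)]


tokens = tokenize("Good food, bad service.")
lemmas = lemmatize("Good food, bad service.")
```

Note that punctuation marks come out as tokens of their own, which is why the comma and the full stop appear as separate items in tokens.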
Finally, we use the train_test_split function of the scikit-learn library to split the dataset into training (80%) and test (20%) datasets. Here the variable X represents the 'Message' column and Y represents the 'Target' column. We can then fit the model on the training dataset.
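A rough sketch of the split and fit is shown below. The classifier is not named above, so the TF-IDF vectorizer with logistic regression is purely my assumption, as are the ten sample rows and the random_state value.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Made-up sample rows standing in for the merged dataset.
data = pd.DataFrame({
    "Message": [
        "Great food and friendly staff", "I loved this movie",
        "Works perfectly, very happy", "Excellent sound quality",
        "A wonderful experience", "Terrible service, never again",
        "The plot was boring", "Stopped working after a week",
        "Awful battery life", "Worst meal I have had",
    ],
    "Target": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
})

X = data["Message"]  # input text
y = data["Target"]   # 0 = negative, 1 = positive

# 80% training, 20% test; random_state fixed only for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Assumed model: TF-IDF features fed into logistic regression.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
model.fit(X_train, y_train)

# Predict the label (0 or 1) for a new sentence.
prediction = model.predict(["The staff was friendly"])[0]
```

Wrapping the vectorizer and the classifier in one Pipeline means the vocabulary is learned from the training split only, and raw sentences can be passed straight to predict.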
As we can see, the sentiment result for the sentence 'PARITOSH IS SMILING BROADLY' is '1', that is, positive sentiment.
Contact us at email@example.com for more detailed work at Agilytics in the field of 'Natural Language Processing'.