Sentiment analysis is the process of categorizing a text's polarity. Text-based tweets, for example, might be classified as "positive," "negative," or "neutral." A model can be trained to predict the correct sentiment given the text and its labels. Techniques for sentiment analysis fall into three categories: lexicon-based approaches, machine learning approaches, and hybrid approaches. Multimodal sentiment analysis, aspect-based sentiment analysis, fine-grained opinion analysis, and language-specific sentiment analysis are a few subcategories of sentiment analysis research. More recently, high-performing sentiment classifiers have been trained with deep learning models such as RoBERTa and T5, and their performance is assessed with measures such as precision, recall, and F1 score.
Sentiment analysis was initially performed using lexicon-based methods, which fall into two categories: dictionary-based and corpus-based. Dictionary-based techniques classify sentiment by employing a dictionary of terms, such as those available in SentiWordNet and WordNet. Corpus-based sentiment analysis, by contrast, relies on statistical analysis of the contents of a collection of documents rather than a predetermined lexicon, employing methods such as hidden Markov models (HMM), conditional random fields (CRF), and k-nearest neighbors (k-NN).
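As a rough illustration of the dictionary-based idea, here is a minimal sketch that scores a sentence against a tiny hand-made lexicon, standing in for a real resource such as SentiWordNet:
# Toy lexicon: positive words score +1, negative words score -1
toy_lexicon = {"good": 1, "great": 1, "love": 1, "bad": -1, "terrible": -1, "hate": -1}

def lexicon_sentiment(text):
    # Sum the scores of any lexicon words found in the text
    score = sum(toy_lexicon.get(word, 0) for word in text.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(lexicon_sentiment("The plot was great but the acting was terrible"))  # neutral (+1 - 1 = 0)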
Machine-learning-based techniques proposed for sentiment analysis problems can be divided into two groups: (1) traditional models and (2) deep learning models.
Conventional models refer to traditional machine learning methods such as support vector machines (SVM), maximum entropy classifiers, and naïve Bayes classifiers. These algorithms take as input lexical features, verbs and adjectives, parts of speech, and sentiment-lexicon-based features. The choice of features largely determines the accuracy of these systems.
Traditional models do not always perform as well as deep learning models. Sentiment analysis can be performed with a variety of deep learning models, such as CNNs, DNNs, and RNNs, which tackle classification at the sentence, document, or aspect level; one such deep learning technique (BERT) is covered later in this article. Hybrid methodologies combine lexicon-based and machine learning approaches, and in most of them sentiment lexicons play an essential role.
The dataset features are 'date', 'favorite_count', 'followers_count', 'friends_count', 'full_text', 'retweet_count', 'retweeted', 'screen_name', 'tweet_id', and 'user_id'.
df.info()
The df.info() function prints a concise summary of the DataFrame: the column names, non-null counts, dtypes, and memory usage.
We used the ‘full_text’ field for further processing.
Text cleaning is a preprocessing step that removes words or components that carry little information and could reduce the effectiveness of sentiment analysis. Raw text typically contains stopwords, punctuation, and extra whitespace, so normalizing the sentences requires several steps. The following procedures were applied during cleaning to guarantee uniformity across the dataset.
● Handling null values
Missing data occurs when the value of one field, several fields, or an entire record is absent, and it is a major problem with real-world data. In Pandas, missing values are also called NA (Not Available) values. Datasets often arrive with missing data, either because it never existed or because it exists but was not collected. Pandas represents missing data with two values: None and NaN. The isna() function is used to examine missing values in a Pandas DataFrame.
df.isna().sum() #Finding any null values in each column
Here the 'retweeted' column has 2328 null values, so we redefine that column based on the 'retweet_count' column: if retweet_count is greater than or equal to 1, retweeted is set to True, otherwise it is set to False.
# Treating the nulls
# When retweeted is null and retweet_count equals 0, set it to False
df.loc[(df['retweeted'].isnull()) & (df['retweet_count']==0), 'retweeted'] = False
# When retweeted contains nulls and retweet_count is more than or equal to 1 then its "True"
df.loc[(df['retweeted'].isnull()) & (df['retweet_count']>=1), 'retweeted'] = True
# Also set to True any rows where retweeted is False but retweet_count is greater than or equal to 1
df.loc[(df['retweeted']==False) & (df['retweet_count']>=1),'retweeted']=True
● Time stamp splitting
In Python, Timestamp is the pandas equivalent of datetime and can be used in its place in most cases. This type is used for the entries of a DatetimeIndex and other timeseries-oriented pandas data structures. To make the datetime easier to process, the timestamp was split into date, time, year, month, and day, and the original datetime column was then dropped.
#timestamp splitting
df['Dates'] = pd.to_datetime(df['date']).dt.date
df['Time'] = pd.to_datetime(df['date']).dt.time
df[['year', 'month','day']] = df['Dates'].astype(str).str.split("-", expand = True)
df.drop(['date'], axis=1, inplace=True)
● Cleansing
Removing unwanted characters like special symbols, emojis, and hashtags, which do not contribute to sentiment analysis
● Text tokenization
Tokenization is the process of splitting a string of text into a list of tokens. The word_tokenize() function is a wrapper that calls tokenize() on an instance of the TreebankWordTokenizer class.
● Normalization
Converts all uppercase characters into lowercase characters to maintain consistency.
● Handling Contraction
Involves expanding contractions (e.g., "don't" → "do not") to their full forms to ensure consistency and accuracy during analysis.
● Removing stop words and punctuation
A stop word is a commonly used word (such as "the", "a", "an", "in") that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query. Stopwords are imported from nltk.corpus in order to remove stop words and punctuation.
● Lemmatization
Lemmatization is generally more effective than stemming. Rather than simply truncating words, it applies morphological analysis and takes a language's entire vocabulary into account. The goal is to remove only inflectional endings and return the lemma, or base form, of a word. Although NLTK offers several lemmatization techniques, this project uses the WordNet lemmatizer, one of the oldest and most widely used. WordNet is a publicly available lexical database of words and the semantic relationships between them, and it can be downloaded through the nltk package. (A short example follows this list, before everything is combined in the clean_text function.)
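As a quick, illustrative sketch of the tokenization, stopword-removal, and lemmatization steps described above (assuming the NLTK 'punkt', 'stopwords', and 'wordnet' resources have been downloaded):
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

sample = "The children aren't going to the parties"
tokens = [t.lower() for t in word_tokenize(sample)]                  # "aren't" is split into "are" and "n't"
tokens = [t for t in tokens if t not in stopwords.words('english')]  # drops "the", "are", "to", ...
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t) for t in tokens])                     # e.g. 'children' -> 'child', 'parties' -> 'party'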
import re
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def clean_text(text):
    def remove_mentions(text):
        # Regular expression pattern to match @-mentions
        mention_pattern = r'@[\w_]+'
        # Remove mentions using regular expression substitution
        cleaned_text = re.sub(mention_pattern, '', text)
        return cleaned_text
    # Remove mentions from the text
    text = remove_mentions(text)
    # Tokenization
    tokens = word_tokenize(text)
    # Lowercasing
    tokens = [token.lower() for token in tokens]
    # Handling contractions
    contractions = {
        "n't": "not",
        "'s": "is",
        "'re": "are",
        "'ve": "have"
    }
    tokens = [contractions[token] if token in contractions else token for token in tokens]
    # Removing stopwords and punctuation
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words and token not in string.punctuation]
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    # Stemming (optional; lemmatization is used instead)
    #stemmer = PorterStemmer()
    #tokens = [stemmer.stem(token) for token in tokens]
    return tokens
df['cleaned_full_text'] = df['full_text'].apply(clean_text)
To visualize the most frequently used words, we can use a word cloud.
import matplotlib.pyplot as plt
from wordcloud import WordCloud
# Plot the word cloud
# all_words is assumed to hold every token from the cleaned tweets
all_words = [token for tokens in df['cleaned_full_text'] for token in tokens]
allWords = " ".join(all_words)
wordCloud = WordCloud(width=500, height=300, random_state=21, max_font_size=119).generate(allWords)
plt.imshow(wordCloud, interpolation="bilinear")
plt.axis('off')
plt.show()
Sentiment Analysis
Sentiment analysis is the process of categorizing a text passage as neutral, negative, or positive. It seeks to understand people's opinions in order to help businesses grow, and it can focus on emotions (happy, sad, angry, etc.) in addition to polarity (positive, negative, and neutral). It employs a range of Natural Language Processing approaches: rule-based, automatic, and hybrid.
Sentiment analysis refers to the contextual interpretation of words that reveals the social sentiment associated with a brand and assists businesses in assessing whether the product they are producing will be in demand or not.
For sentiment analysis on this dataset, only two columns, 'Dates' and 'cleaned_full_text', are used. First, the polarity of each text is computed with the TextBlob library. TextBlob is a Python module built on top of NLTK that offers a simple API for basic NLP tasks. A 'polarity_score' column was created using TextBlob, and a sentiment was assigned to each text based on its polarity value: negative polarity maps to a negative sentiment, zero to neutral, and positive to positive. These sentiments are later encoded as 0, 1, and 2, respectively. Finally, a pie chart of the sentiment distribution was plotted.
df_new=df.drop(['favorite_count', 'followers_count', 'friends_count','retweet_count', 'retweeted', 'screen_name', 'tweet_id','user_id','Time','year','month','day'],axis=1)
# Save the DataFrame to a CSV file
output_file2 = 'output_file.csv'
df_new.to_csv(output_file2, index=False) # Set index=False to avoid saving the DataFrame index as a separate column
from textblob import TextBlob
def polarity(text):
    return TextBlob(text).sentiment.polarity

df_new['polarity_score'] = df_new['cleaned_full_text'].apply(lambda x : polarity(str(x)))
def sentiment(x):
    if x < 0:
        return 'negative'
    elif x == 0:
        return 'neutral'
    else:
        return 'positive'

df_new['polarity'] = df_new['polarity_score'].map(lambda x: sentiment(x))
import plotly.graph_objects as go
fig = go.Figure(data=[go.Pie(labels=df_new['polarity'].value_counts().index.tolist(),
                             values=df_new['polarity'].value_counts().tolist(),
                             marker=dict(colors=['#006400','#8B0001','#add8e3']))])
fig.update_layout(title_text='Proportion of Sentiments', title_x=0.5,
                  template='plotly_white')
fig.show()
Model Building
Grouping the tweets by sentiment: positive, negative & neutral
def get_data(df_new, senti):
    senti_df = df_new[df_new['polarity']==senti].reset_index()
    return senti_df

p_corpus = get_data(df_new,'positive')
p_corpus = pd.DataFrame(p_corpus)
n_corpus = get_data(df_new,'negative')
n_corpus = pd.DataFrame(n_corpus)
nt_corpus = get_data(df_new,'neutral')
nt_corpus = pd.DataFrame(nt_corpus)
#Redefining polarity to sentiments
df_new.drop('polarity_score',axis=1,inplace=True)
df_new.rename(columns={"polarity": "Sentiment"},inplace=True)
# Label Encoding "Sentiment" Column
df_new['Sentiment']=df_new['Sentiment'].replace(to_replace=['negative', 'neutral', 'positive'],value=[0,1,2])
Feature Extraction
Feature extraction is the process of converting raw data into numerical features so that the information contained in the original data set can be processed. It is helpful when you need to cut down on processing resources without sacrificing pertinent or crucial data, and it can also reduce the amount of redundant data in an analysis. Reducing the data, and letting the machine construct variable combinations (features), also speeds up the learning and generalization stages of the machine learning process. Two feature extraction techniques were applied here: Bag of Words and TF-IDF.
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report,confusion_matrix
corpus=[]
for i in df_new['cleaned_full_text']:
    review = i
    review = ' '.join(review)  # joining the tokens back into a sentence without stop words
    corpus.append(review)
Bag of Words/Count vectorization: The bag-of-words (BOW) model represents a text as a fixed-length vector by counting how many times each word appears in it; this process is commonly called "vectorization." Word occurrence counts make it possible to assess the similarity between documents and to compare them for topic modeling, document classification, and search applications.
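As a small, stand-alone illustration of these fixed-length count vectors (a toy example, separate from the pipeline code below):
from sklearn.feature_extraction.text import CountVectorizer
toy_docs = ["good movie good plot", "bad movie"]
toy_cv = CountVectorizer()
toy_vectors = toy_cv.fit_transform(toy_docs).toarray()
print(toy_cv.get_feature_names_out())  # ['bad' 'good' 'movie' 'plot'] (use get_feature_names() on older scikit-learn)
print(toy_vectors)                     # [[0 2 1 1], [1 0 1 0]]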
# Converting the words to vectors using Bag of Words
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(max_features=2500,ngram_range=(1,3)) # top 2500 features are taken
X=cv.fit_transform(corpus).toarray()
y=df_new['Sentiment']
x_train,x_test,y_train,y_test=train_test_split(X,y,test_size=0.20,random_state=1,stratify=y)
TF-IDF: TF-IDF stands for "Term Frequency – Inverse Document Frequency". It is a technique for quantifying words in a set of documents: a score is computed for each word to signify its importance in the document and in the corpus. The method is widely used in Information Retrieval and Text Mining.
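Concretely, the standard (unsmoothed) weight is tf-idf(t, d) = tf(t, d) × log(N / df(t)), where tf(t, d) is how often term t appears in document d, N is the total number of documents, and df(t) is the number of documents containing t, so words that are frequent in one document but rare across the corpus get high scores. (scikit-learn's TfidfVectorizer, used below, applies a smoothed variant of the idf term and L2-normalizes each document vector by default.)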
# Converting the words to vectors using TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
tf=TfidfVectorizer(ngram_range=(1,3),max_features=3000)
X=tf.fit_transform(corpus).toarray()
x_train,x_test,y_train,y_test=train_test_split(X,y,test_size=0.20,random_state=1,stratify=y)
We can use either Bag of Words or TF-IDF. Here we move forward with TF-IDF.
The machine learning algorithms used in the models are:
1. Naive Bayes
The Naïve Bayes algorithm is a supervised learning technique for classification problems, based on Bayes' theorem. Its primary application is text classification, where the training data is high-dimensional. One of the most straightforward and efficient classification algorithms, the Naïve Bayes classifier helps build machine learning models that make fast predictions. Being a probabilistic classifier, it predicts the class of an object based on probability. Among its best-known applications are article classification, sentiment analysis, and spam filtering.
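In terms of the underlying formula, for a document with words w1, …, wn the classifier picks the class c that maximizes P(c) · P(w1|c) · … · P(wn|c): the class prior multiplied by per-word likelihoods estimated from training counts, under the "naïve" assumption that words are conditionally independent given the class.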
from sklearn.naive_bayes import MultinomialNB
nb=MultinomialNB()
nb.fit(x_train,y_train)
train_pred=nb.predict(x_train)
test_pred=nb.predict(x_test)
print(classification_report(y_test, test_pred))
2. Random Forest
Random Forest is a classifier that builds a number of decision trees on various subsets of the given dataset and combines them to improve the predictive accuracy. Instead of relying on one decision tree, the random forest takes the prediction from each tree and predicts the final output based on the majority vote of those predictions.
from sklearn.ensemble import RandomForestClassifier
rfc=RandomForestClassifier()
rfc.fit(x_train,y_train)
train_pred=rfc.predict(x_train)
test_pred=rfc.predict(x_test)
print(classification_report(y_test, test_pred))
3. LinearSVC
Multi-class text classification is one of the most common applications of NLP and machine learning. There are several ways to approach this problem, and many machine learning algorithms perform relatively well depending on the quality of the data. LinearSVC is one of the algorithms that performs quite well on a range of NLP-based text classification tasks. However, if the requirement is to have a probability distribution over all the classes, LinearSVC in scikit-learn does not provide a function like predict_proba out of the box.
Instead, LinearSVC provides a decision_function method, which returns confidence scores for the samples. The confidence score for a sample is the signed distance of that sample to the hyperplane.
from sklearn.svm import LinearSVC
SVCmodel = LinearSVC()
SVCmodel.fit(x_train, y_train)
train_pred=SVCmodel.predict(x_train)
test_pred = SVCmodel.predict(x_test)
print(classification_report(y_test, test_pred))
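To illustrate the decision_function behaviour mentioned above, here is a small sketch reusing the fitted SVCmodel and x_test from this section:
import numpy as np
scores = SVCmodel.decision_function(x_test)   # one signed confidence score per class, shape (n_samples, n_classes)
pred_from_scores = np.argmax(scores, axis=1)  # taking the argmax reproduces SVCmodel.predict(x_test)
# Note: applying a softmax to these scores would only give rough, uncalibrated pseudo-probabilities.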
4. Logistic Regression
Logistic regression is a supervised machine learning approach employed primarily in classification problems, where the objective is to estimate the probability that an instance belongs to a particular class. It is called "regression" because it takes the output of a linear regression function as input and passes it through a sigmoid function to estimate the probability for the given class. The distinction from linear regression is that linear regression produces a continuous value that can take any value, whereas logistic regression predicts the probability that an instance belongs to a specific class.
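The sigmoid referred to above is σ(z) = 1 / (1 + e^(−z)), applied to the linear score z = w·x + b; it squashes any real-valued score into a probability between 0 and 1. For the three-class problem here, scikit-learn extends this to multiple classes (multinomial/softmax or one-vs-rest, depending on the solver settings).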
from sklearn.linear_model import LogisticRegression
LRmodel = LogisticRegression(C = 2, max_iter = 1000, n_jobs=-1)
LRmodel.fit(x_train, y_train)
train_pred=LRmodel.predict(x_train)
test_pred = LRmodel.predict(x_test)
print(classification_report(y_test, test_pred))
5. XGBoost
XGBoost is a distributed gradient boosting library optimized for efficiency and scalability in machine learning model training. It is an ensemble learning technique that generates a stronger prediction by aggregating the predictions of several weak models. XGBoost, short for "Extreme Gradient Boosting," is one of the most well-known and frequently used machine learning algorithms; it can handle large datasets and has achieved state-of-the-art performance in many machine learning tasks, including regression and classification.
# XGBoost model
from xgboost import XGBClassifier
XGB = XGBClassifier()
XGB.fit(x_train,y_train)
train_pred=XGB.predict(x_train)
test_pred=XGB.predict(x_test)
print(classification_report(y_test, test_pred))
6. BERT
In 2018, researchers at Google introduced BERT (Bidirectional Encoder Representations from Transformers), a Natural Language Processing model. When it was first proposed, it achieved state-of-the-art accuracy on a variety of NLP and NLU tasks, including:
● General Language Understanding Evaluation (GLUE)
● Stanford Question Answering Dataset (SQuAD) v1.1 and v2.0
● Situations With Adversarial Generations (SWAG)
In essence, BERT is the encoder stack of the transformer architecture. A transformer is an encoder-decoder network that uses self-attention on the encoder side and attention on the decoder side. The encoder stacks of BERT-BASE and BERT-LARGE have 12 and 24 layers, respectively, exceeding the 6 encoder layers of the Transformer described in the original paper. BERT-BASE and BERT-LARGE also use larger hidden representations (768 and 1024, respectively).
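These figures can be checked directly from the model configuration (a quick sanity check; assumes the 'bert-base-uncased' config can be downloaded):
from transformers import BertConfig
config = BertConfig.from_pretrained('bert-base-uncased')
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)  # 12, 768, 12 for the BASE model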
# df1 is assumed to be a copy of df_new, which holds the encoded 'Sentiment' column created above
df1 = df_new.copy()
df1['cleaned_full_text'] = [' '.join(map(str, l)) for l in df['cleaned_full_text']]
possible_label = df1.Sentiment.unique()
dict_label = {}
for index, label in enumerate(possible_label):
    dict_label[label] = index
df1["Label"] = df1["Sentiment"].replace(dict_label)
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(df1.index.values,
df1.Label.values,
test_size = 0.15,
random_state=17,
stratify = df1.Label.values)
df1.loc[X_train,'data_type'] = 'train'
df1.loc[X_test,'data_type'] = 'test'
Install transformers. For model building import BertTokenizer and TensorDataset.
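The corresponding setup might look like this (assuming a notebook environment where PyTorch is already available):
# !pip install transformers
from transformers import BertTokenizer
from torch.utils.data import TensorDataset
import torch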
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased',
do_lower_case = True)
#Encoding text by tokenizing using BERT Tokenizer
encoder_train = tokenizer.batch_encode_plus(df1[df1["data_type"]=='train'].cleaned_full_text.values,
                                            add_special_tokens = True,
                                            #return_attention_masks = True,
                                            truncation=True,
                                            padding='max_length',
                                            max_length = 256,
                                            return_tensors = 'pt',
                                            return_overflowing_tokens=False)

encoder_test = tokenizer.batch_encode_plus(df1[df1["data_type"]=='test'].cleaned_full_text.values,
                                           add_special_tokens = True,
                                           #return_attention_masks = True,
                                           truncation=True,
                                           padding='max_length',
                                           max_length = 256,
                                           return_tensors = 'pt',
                                           return_overflowing_tokens=False)
input_ids_train = encoder_train['input_ids']
attention_masks_train = encoder_train["attention_mask"]
labels_train = torch.tensor(df1[df1['data_type']=='train'].Label.values)
input_ids_test = encoder_test['input_ids']
attention_masks_test = encoder_test["attention_mask"]
labels_test = torch.tensor(df1[df1['data_type']=='test'].Label.values)
data_train = TensorDataset(input_ids_train,attention_masks_train,labels_train)
data_test = TensorDataset(input_ids_test,attention_masks_test,labels_test)
We will use a sequence classification model, since we have to classify the text from the dataset into multiple classes.
from transformers import BertForSequenceClassification
model = BertForSequenceClassification.from_pretrained('bert-base-uncased',
num_labels = len(dict_label),
output_attentions = False,
output_hidden_states = False)
From torch we will use DataLoader and RandomSampler to load the data in an iterable format while drawing different subsamples from the dataset.
from torch.utils.data import RandomSampler, SequentialSampler, DataLoader

dataloader_train = DataLoader(
    data_train,
    sampler = RandomSampler(data_train),
    batch_size = 16
)
dataloader_test = DataLoader(
    data_test,
    sampler = RandomSampler(data_test),
    batch_size = 32
)
from transformers import AdamW,get_linear_schedule_with_warmup
optimizer = AdamW(model.parameters(),lr = 1e-5,eps = 1e-8)
epochs = 6
scheduler = get_linear_schedule_with_warmup(
optimizer,
num_warmup_steps = 0,
num_training_steps = len(dataloader_train)*epochs
)
Defining Model metrics
from sklearn.metrics import f1_score
import numpy as np
def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average = 'weighted')
def accuracy_per_class(preds, labels):
    label_dict_reverse = {v: k for k, v in dict_label.items()}
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f"Class: {label_dict_reverse[label]}")
        print(f"Accuracy: {len(y_preds[y_preds==label])}/{len(y_true)}\n")
import random
import torch
from tqdm.notebook import tqdm
seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
Defining Evaluation
def evaluate(dataloader_val):
    model.eval()
    loss_val_total = 0
    predictions, true_vals = [], []
    for batch in tqdm(dataloader_val):
        batch = tuple(b.to(device) for b in batch)
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2]
                  }
        with torch.no_grad():
            outputs = model(**inputs)
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()
        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    loss_val_avg = loss_val_total/len(dataloader_val)
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
    return loss_val_avg, predictions, true_vals
# Training loop
for epoch in tqdm(range(1, epochs+1)):
    model.train()
    loss_train_total = 0
    progress_bar = tqdm(dataloader_train, desc = "Epoch: {:1d}".format(epoch), leave = False, disable = False)
    for batch in progress_bar:
        model.zero_grad()
        batch = tuple(b.to(device) for b in batch)
        inputs = {
            "input_ids":      batch[0],
            "attention_mask": batch[1],
            "labels":         batch[2]
        }
        outputs = model(**inputs)
        loss = outputs[0]
        # logits = outputs[1]
        loss_train_total += loss.item()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item()/len(batch))})
    # To save the model after each epoch:
    # torch.save(model.state_dict(), f'/kaggle/output/BERT_ft_epoch{epoch}.model')
    tqdm.write(f'\nEpoch {epoch}')
    loss_train_avg = loss_train_total/len(dataloader_train)
    tqdm.write(f'Training Loss: {loss_train_avg}')
    val_loss, predictions, true_vals = evaluate(dataloader_test)
    test_score = f1_score_func(predictions, true_vals)
    tqdm.write(f'Val Loss: {val_loss}\n Test Score: {test_score}')
# using the saved model
model.to(device)
# Finding accuracy
_,predictions,true_vals = evaluate(dataloader_test)
accuracy_per_class(predictions,true_vals)
from sklearn.metrics import accuracy_score
def accuracy_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return accuracy_score(labels_flat, preds_flat)
print("Accuracy Percentage {} %:".format(100*accuracy_score_func(predictions,true_vals)))
We trained for 6 epochs, reaching a weighted F1 score of about 95%, and also calculated the accuracy for each class. With the BERT model we obtained about 94% accuracy.
For more details, refer to the full code.
In this article, we built a basic understanding of how sentiment analysis is used to understand the public emotions behind people's tweets. As you've read, Twitter sentiment analysis involves preprocessing the data (tweets) using different methods and feeding it into ML models to obtain the best accuracy. Twitter sentiment analysis is used to identify and classify the sentiments expressed in a text source. Logistic Regression, SVM, and Naive Bayes are some of the ML algorithms that can be used for it. We also fine-tuned a BERT model, which achieved a weighted F1 score of about 95% (and about 94% accuracy) and was selected as the best model.
Reference: