Welcome to our guide on hotel rating classification using machine learning! In this blog, we’ll explore how advanced technology can help hotels better understand guest experiences and improve their services. Hotel ratings are crucial for both travelers and hotel managers, offering insights into accommodation quality and amenities. By analyzing factors like guest reviews, location, and amenities, machine learning algorithms can predict and classify hotel ratings accurately. Join us as we delve into the fascinating world of data-driven hospitality, uncovering the secrets to enhancing guest satisfaction and elevating hotel experiences for travelers worldwide. Let’s embark on this exciting journey together!
STEP 1:
Importing the important libraries and dataset
We start by importing necessary libraries like pandas for data handling and matplotlib/seaborn for visualization. Then, we handle data preprocessing tasks using NLTK, such as removing unnecessary words and punctuation and converting text to lowercase. Next, we import the hotel reviews dataset from an Excel file. After that, we build machine learning models using algorithms like Logistic Regression and Naive Bayes from the sklearn library. These models help classify hotel ratings based on guest reviews. Finally, we evaluate model performance using metrics like accuracy and confusion matrix to understand how well they predict ratings.
# import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

# data preprocessing
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')  # required for WordNetLemmatizer below
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize
import re
from collections import Counter
from wordcloud import WordCloud
# model building
from imblearn.over_sampling import RandomOverSampler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report,precision_score
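Step 1 mentions importing the hotel reviews dataset from an Excel file, but the load itself isn't shown. Here is a minimal sketch; the filename `hotel_reviews.xlsx` is an assumption, and the in-memory sample below is invented purely to illustrate the two columns (`Review`, `Rating`) the rest of the notebook relies on:

```python
import pandas as pd

# Hypothetical path -- replace with the actual Excel file for this project:
# data = pd.read_excel('hotel_reviews.xlsx')

# Made-up stand-in with the same schema the later code expects:
data = pd.DataFrame({
    'Review': [
        'great location and friendly staff',
        'room was dirty and the parking was expensive',
        'average stay, nothing special',
    ],
    'Rating': [5, 2, 3],
})
print(data.shape)             # (3, 2)
print(data.columns.tolist())  # ['Review', 'Rating']
```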
STEP 2:
Initial Exploration
# Descriptive Statistics
data.shape
data.info()
data.isnull().sum()
data.duplicated().sum()
STEP 3:
Exploratory Data Analysis (EDA)
The data exploration of hotel ratings unveiled interesting insights. Most guests gave positive reviews, with around 75% rating their experience as 4 or 5 stars. However, some negative feedback also surfaced, highlighting areas for improvement. Parking feedback indicated overall satisfaction, with some room for enhancement. Additionally, the analysis showcased popular booking sites, perceptions of cost, and reasons for trips. These findings help understand guest preferences and guide strategies for enhancing guest satisfaction. By addressing both positive and negative aspects, hotels can improve services and maintain a positive reputation in the hospitality industry.
Let’s examine these insights one by one:
1: Distribution of ratings
data['Rating'].value_counts()

# Plotting the distribution of ratings as a pie chart
plt.figure(figsize=(8, 8))
labels = data['Rating'].value_counts().index
sizes = data['Rating'].value_counts().values
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=140, colors=['orange', 'pink', 'blue', 'lightskyblue', 'purple'])
plt.title('Distribution of Ratings')
plt.axis('equal')
plt.show()
2: Top 5 booking sites
booking_sites = data['Review'].str.extract(r'(\w+\.com)')[0].value_counts()
plt.figure(figsize=(10,5))
sns.barplot(y=booking_sites.head(5).index, x=booking_sites.head(5))
plt.title('Top 5 websites for booking a hotel')
plt.xlabel('Number of customers using the website to book the hotel')
plt.ylabel('Website name')
plt.show()
3: Parking feedback
parking_feedback = data[data['Review'].str.contains('parking', case=False)]

# Count positive and negative parking feedback
positive_parking_feedback = parking_feedback[parking_feedback['Rating'] >= 3]
negative_parking_feedback = parking_feedback[parking_feedback['Rating'] < 3]
# Plot the graph
plt.figure(figsize=(8, 5))
plt.bar(['Positive', 'Negative'], [len(positive_parking_feedback), len(negative_parking_feedback)], color=['green', 'red'])
plt.xlabel('Feedback')
plt.ylabel('Number of Reviews')
plt.title('Parking Feedback')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
4: Expensive vs affordable
expensive_reviews = data[data['Review'].str.contains('expensive|pricey', case=False)]
affordable_reviews = data[data['Review'].str.contains('affordable|cheap', case=False)]

# Count the number of reviews mentioning each sentiment
expensive_count = expensive_reviews.shape[0]
affordable_count = affordable_reviews.shape[0]
print("Number of people saying it's expensive:", expensive_count)
print("Number of people saying it's affordable:", affordable_count)
plt.figure(figsize=(8, 5))
plt.bar(['Expensive', 'Affordable'], [expensive_count, affordable_count], color=['salmon', 'lightgreen'])
plt.title('Perceived Cost of the Hotel')
plt.ylabel('Number of Reviews')
plt.grid(axis='y')
plt.show()
5: Family trip vs business trip
business_trip_count = data[data['Review'].str.contains('business trip', case=False)].shape[0]
family_trip_count = data[data['Review'].str.contains('family trip', case=False)].shape[0]

print("Number of people coming for a business trip:", business_trip_count)
print("Number of people coming for a family trip:", family_trip_count)
plt.figure(figsize=(8, 5))
plt.bar(['Business Trip', 'Family Trip'], [business_trip_count, family_trip_count], color=['skyblue', 'salmon'])
plt.title('Number of People by Trip Type')
plt.ylabel('Count')
plt.grid(axis='y')
plt.show()
6: Plotting histograms for review length based on rating
data['Length'] = data['Review'].apply(len)
data['num_words'] = data['Review'].apply(word_tokenize).apply(len)

# Create subplots for each rating category
fig, axes = plt.subplots(1, data['Rating'].nunique(), figsize=(15, 5), sharey=True)
# Iterate over each subplot and plot the histogram
for ax, (rating, sub_data) in zip(axes, data.groupby('Rating')):
    ax.hist(sub_data['Length'], color='#973aa8')
    ax.set_title(f'Rating {rating}')
    ax.set_xlabel('Length')
    ax.set_ylabel('Frequency')
plt.tight_layout()
plt.show()
correlation = data['Rating'].corr(data['Length'])
print("Correlation between Rating and Length:", correlation)
Create a sentiment target variable. If the rating is 4 or 5, the sentiment will be POSITIVE. If the rating is 1 or 2, the sentiment will be NEGATIVE. Otherwise, sentiment will be NEUTRAL.
# Sentiment analysis: categorize ratings into sentiment labels ('Positive', 'Negative', or 'Neutral')
data['Sentiment'] = data['Rating'].apply(lambda x: 'Positive' if x >= 4 else 'Negative' if x <= 2 else 'Neutral')
data['Sentiment'].value_counts()
NOTE: The counts above show that our data is imbalanced, so we will need to address this before model building.
7: Sentiment analysis (pie chart)
# Sentiment Analysis (Pie chart)
plt.figure(figsize=(6, 6))
data['Sentiment'].value_counts().plot(kind='pie', autopct='%1.1f%%', colors=['Pink', 'green', 'blue'])
plt.title('Sentiment Analysis')
plt.tight_layout()
plt.show()
STEP 4-
Text Cleaning and Preprocessing
Before building machine learning models, we preprocess the Review data into a format suitable for analysis. This involves lowercasing, tokenization, removing special characters, stopwords, and punctuation, as well as lemmatization to reduce words to their root forms.
# Define stop words
stop_words = set(stopwords.words("english"))

# Function to remove emojis
def remove_emoji(text):
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

# Function for data processing
def data_processing(text):
    text = text.lower()  # Convert the entire review text to lowercase
    text = re.sub(r"http\S+|www\S+|https\S+", '', text, flags=re.MULTILINE)  # Remove URLs
    text = re.sub(r'\S+@\S+', '', text)  # Remove emails from reviews
    text = re.sub(r'\d+', '', text)  # Remove digits from reviews
    text = text.strip()  # Remove leading/trailing whitespace
    text = remove_emoji(text)  # Remove emojis
    text = re.sub(r'[^\w\s]', '', text)  # Remove all punctuation from the reviews
    text_tokens = word_tokenize(text)
    filtered_text = [w for w in text_tokens if w not in stop_words]
    return " ".join(filtered_text)

# Apply data processing function to 'Review' column
data['transform_text'] = data['Review'].apply(data_processing)

# Function for cleaning
lemmatizer = WordNetLemmatizer()  # create once, not per word

def cleaning(text):
    clean_text = text.translate(str.maketrans('', '', string.punctuation)).lower()
    clean_text = [word for word in clean_text.split() if word not in stop_words]
    # Lemmatize each word (treating it as a verb)
    sentence = [lemmatizer.lemmatize(word, 'v') for word in clean_text]
    return ' '.join(sentence)

# Apply cleaning function to the preprocessed 'transform_text' column
data['transform_text'] = data['transform_text'].apply(cleaning)
This code prepares text data, such as hotel reviews, for sentiment analysis. The first function strips emojis, URLs, emails, digits, and extra spaces from the text while converting it to lowercase; it then tokenizes the text and removes common stop words like ‘and’ or ‘the’. The second function further cleans the text by removing punctuation and lemmatizing words to their root forms. This keeps the text consistent and ready for analysis: by eliminating noise, the sentiment model can focus on the informative parts of each review, helping businesses understand customer opinions and make better-informed decisions about their services.
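To see these steps in isolation, here is a compact, NLTK-free sketch of the same cleaning idea. Note that the stop-word set below is a tiny illustrative subset, not NLTK's full English list, and the helper name `clean_review` is made up for this example:

```python
import re
import string

# Tiny illustrative stop-word subset (NLTK's real list is much larger)
STOP_WORDS = {'the', 'and', 'a', 'was', 'is', 'in', 'at', 'of', 'to'}

def clean_review(text: str) -> str:
    text = text.lower()                                   # lowercase
    text = re.sub(r"http\S+|www\S+", '', text)            # strip URLs
    text = re.sub(r'\S+@\S+', '', text)                   # strip emails
    text = re.sub(r'\d+', '', text)                       # strip digits
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = [w for w in text.split() if w not in STOP_WORDS]
    return ' '.join(tokens)

print(clean_review('The room at http://example.com was GREAT, rated 10/10!'))
# -> 'room great rated'
```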
8: Total text length before and after cleaning
data['T_length'] = data['transform_text'].apply(len)
original_length = data['Length'].sum()
new_length = data['T_length'].sum()

print('Total text length before cleaning: {}'.format(original_length))
print('Total text length after cleaning: {}'.format(new_length))
plt.figure(figsize=(8, 6))
plt.bar(['Before Cleaning', 'After Cleaning'], [original_length, new_length], color=['skyblue', 'lightgreen'])
plt.title('Total Text Length Before and After Cleaning')
plt.ylabel('Total Length')
plt.xlabel('Cleaning Process')
plt.show()
9: Most common words in positive reviews
pos_reviews = data[data.Sentiment == 'Positive']
neg_reviews = data[data.Sentiment == 'Negative']

count = Counter()
for text in pos_reviews['Review'].values:
    for word in text.split():
        count[word] += 1

# Get the top 15 positive words
top_positive_words = count.most_common(15)
# Create a DataFrame for the top positive words
pos_words = pd.DataFrame(top_positive_words, columns=['word', 'count'])
# Plot the top positive words using Plotly Express
fig = px.bar(pos_words, x='count', y='word', title='Common words in positive reviews')
fig.show()
10: Most common words in negative reviews
count = Counter()
for text in neg_reviews['Review'].values:
    for word in text.split():
        count[word] += 1

# Get the top 15 negative words
top_negative_words = count.most_common(15)
# Create a DataFrame for the top negative words
neg_words = pd.DataFrame(top_negative_words, columns=['word', 'count'])
# Plot the top negative words using Plotly Express
fig = px.bar(neg_words, x='count', y='word', orientation='h', title='Common words in Negative reviews')
fig.show()
STEP 5:
Preparing Data for Machine Learning: Label Encoding, Vectorization, and Train-Test Split
# Split the data into X and Y
X = data['transform_text']
Y = data['Sentiment']

# Vectorizing the text data
vect = TfidfVectorizer()
X = vect.fit_transform(data['transform_text'])
# Splitting the dataset into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
# Encoding the target variable
labelEncoder = LabelEncoder()
y_train = labelEncoder.fit_transform(y_train)
y_test = labelEncoder.transform(y_test)
Here we prepare the data for training and testing the sentiment analysis models. First, we split the dataset into input features (X) and the target variable (Y), the sentiment category. Then we convert the text into numerical form using TF-IDF vectorization and split the dataset into training and testing sets. Lastly, we encode the sentiment labels as integers so the models can work with them. These steps ensure the data is properly formatted for training and evaluation.
STEP 6:
Handling imbalanced data
In sentiment analysis, data often exhibits class imbalance, where one sentiment category dominates the dataset while others are underrepresented. This imbalance can lead to biased models that favor the majority class, resulting in inaccurate predictions for minority classes.
# Plotting the distribution of sentiment classes
plt.figure(figsize=(10, 5))
data['Sentiment'].value_counts().sort_index().plot(kind='bar', color='purple')
plt.title('Distribution of Sentiment')
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.show()

np.bincount(y_train)
To address this, techniques like RandomOverSampler balance the class distribution by randomly duplicating existing samples from the minority classes (unlike SMOTE, which generates synthetic samples). This ensures that each sentiment category is adequately represented in the training data, allowing models to learn from all classes and make more accurate predictions for the minority ones. In this context, RandomOverSampler is a simple and effective way to mitigate class imbalance in sentiment analysis tasks.
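The duplication idea can be sketched with only scikit-learn's `resample` (the class counts below are invented; imblearn's RandomOverSampler does this same balancing for you):

```python
from collections import Counter
import numpy as np
from sklearn.utils import resample

# Made-up imbalanced labels: 8 Positive, 3 Negative, 1 Neutral
y = np.array(['Positive'] * 8 + ['Negative'] * 3 + ['Neutral'] * 1)
X = np.arange(len(y)).reshape(-1, 1)  # stand-in feature matrix

# Resample each class (with replacement) up to the majority count
majority = max(Counter(y).values())
X_parts, y_parts = [], []
for cls in np.unique(y):
    mask = y == cls
    X_cls, y_cls = resample(X[mask], y[mask], replace=True,
                            n_samples=majority, random_state=42)
    X_parts.append(X_cls)
    y_parts.append(y_cls)

X_bal = np.vstack(X_parts)
y_bal = np.concatenate(y_parts)
print(Counter(y))      # before: unequal class counts
print(Counter(y_bal))  # after: every class has 8 samples
```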
print("Before sampling class distribution:", Counter(y_train))
ros = RandomOverSampler()
ros_X_train, ros_y_train = ros.fit_resample(x_train, y_train)
print("After sampling class distribution:", Counter(ros_y_train))
STEP 7:
Model Building
After getting our data ready, we went on to create and test four models: Logistic Regression, Multinomial Naive Bayes, Linear SVC, and Random Forest. These models analyze hotel reviews to predict customer sentiments accurately. Our aim is to assist hotels in understanding customer feelings better and improving their services based on feedback. By using these models, we hope to offer insights that can enhance the overall customer experience. So, our focus is on building models that can understand and interpret customer sentiments from text reviews, ultimately helping hotels provide better services and make guests happier.
Model 1- Logistic Regression
logistic_reg = LogisticRegression(random_state=0)
logistic_reg.fit(ros_X_train, ros_y_train)
logistic_reg_pred = logistic_reg.predict(x_test)
logistic_reg_acc = accuracy_score(y_test, logistic_reg_pred)
print("Test accuracy: {:.2f}%".format(logistic_reg_acc*100))
print(confusion_matrix(y_test,logistic_reg_pred))
print("\n")
print(classification_report(y_test,logistic_reg_pred))
Model 2- MultinomialNB
mnb = MultinomialNB()
mnb.fit(ros_X_train, ros_y_train)
mnb_pred = mnb.predict(x_test)
mnb_acc = accuracy_score(y_test, mnb_pred)
print("Test accuracy: {:.2f}%".format(mnb_acc*100))
print(confusion_matrix(y_test,mnb_pred))
print("\n")
print(classification_report(y_test,mnb_pred))
Model 3- LinearSVC
svc = LinearSVC()
svc.fit(ros_X_train, ros_y_train)
svc_pred = svc.predict(x_test)
svc_acc = accuracy_score(y_test, svc_pred)
print("Test accuracy: {:.2f}%".format(svc_acc*100))
print(confusion_matrix(y_test,svc_pred))
print("\n")
print(classification_report(y_test,svc_pred))
Model 4- RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, criterion='entropy', random_state=0)
rf.fit(ros_X_train, ros_y_train)
rf_pred = rf.predict(x_test)
rf_acc = accuracy_score(y_test, rf_pred)
print("Test accuracy: {:.2f}%".format(rf_acc*100))
print(confusion_matrix(y_test,rf_pred))
print("\n")
print(classification_report(y_test,rf_pred))
In this code snippet, the accuracies of the four models (Logistic Regression, Multinomial Naive Bayes, Linear SVC, and Random Forest) are calculated using the accuracy_score function. The accuracies are stored in a list, and the best-performing model is determined by the highest accuracy. A bar plot then visualizes the accuracies of the different models, with the best model annotated on the plot.
# Calculate accuracies of all models
models = ["Logistic Regression", "Multinomial Naive Bayes", "Linear SVC", "Random Forest"]
accuracies = []
for model in [logistic_reg, mnb, svc, rf]:
    y_pred = model.predict(x_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)

# Determine the best model based on accuracy
best_model_index = np.argmax(accuracies)
best_model_name = models[best_model_index]
# Plotting the accuracies
plt.figure(figsize=(10, 5))
plt.bar(models, accuracies, color='skyblue')
plt.title('Accuracy of Different Models')
plt.xlabel('Model')
plt.ylabel('Accuracy')
plt.xticks(rotation=45)
plt.ylim(0, 1)
plt.grid(axis='y')
# Annotating the best model
plt.text(best_model_index, accuracies[best_model_index], f'Best Model: {best_model_name}', ha='center', va='bottom')
# Show plot
plt.tight_layout()
plt.show()
STEP 8:
Model Prediction
In the last part of our sentiment analysis journey, users can input their own reviews. Our trained logistic regression model then predicts if the sentiment of the review is positive, negative, or neutral. This feature lets users quickly understand the sentiment of their text. By using our model’s insights, people can make better decisions based on the sentiment labels provided. It’s a simple and effective way for anyone to get a sense of the sentiment in their writing, helping them better understand how their words might be perceived by others.
# Function to preprocess user input and make predictions using a logistic regression model
def predict_sentiment_logistic(model, vectorizer, user_input):
    # Apply the same text cleaning used during training
    user_input = cleaning(data_processing(user_input))
    # Vectorize the input using the same vectorizer used during training
    user_input_vectorized = vectorizer.transform([user_input])
    # Make prediction
    prediction = model.predict(user_input_vectorized)
    # Convert the predicted label back to the original sentiment
    predicted_sentiment = labelEncoder.inverse_transform(prediction)
    return predicted_sentiment[0]  # Return the predicted sentiment as a string
user_input = input("Enter your review: ")
predicted_sentiment_logistic = predict_sentiment_logistic(logistic_reg, vect, user_input)
print("Predicted sentiment (Logistic Regression):", predicted_sentiment_logistic)
This code helps predict the sentiment (positive, negative, or neutral) of a user’s review using a logistic regression model. It first prepares the user’s input by converting it into a format the model can understand. Then, it uses the trained model to predict the sentiment of the input. Finally, it shows the predicted sentiment to the user. So, if someone enters a review, this code can tell whether it’s positive, negative, or neutral, based on what the model has learned from past data.
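Putting the whole flow together, here is a self-contained miniature of train-then-predict. The toy reviews and labels below are invented for illustration, and this sketch skips the cleaning and oversampling steps from the full pipeline:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Invented toy training data
train_reviews = [
    'wonderful stay great staff', 'clean room lovely view',
    'terrible service dirty room', 'awful noisy never again',
]
train_labels = ['Positive', 'Positive', 'Negative', 'Negative']

# Vectorize and fit, mirroring the blog's pipeline in miniature
vect = TfidfVectorizer()
X_train = vect.fit_transform(train_reviews)
clf = LogisticRegression(random_state=0).fit(X_train, train_labels)

# Classify a new review with the SAME fitted vectorizer
new_review = 'great clean room wonderful staff'
prediction = clf.predict(vect.transform([new_review]))[0]
print(prediction)
```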
Sentiment analysis plays a vital role in understanding customer feedback, especially in industries like hospitality. We’ve seen how machine learning models such as Logistic Regression, Naive Bayes, Linear SVC, and Random Forest can accurately predict sentiments from text data. By preprocessing data, training models, and assessing their performance, we’ve achieved impressive results. These models offer businesses valuable insights from customer reviews, ultimately improving service quality and satisfaction. As we continue refining and implementing these models, we’re paving the way for smarter decision-making based on customer sentiments, ensuring better experiences for all.
I would like to acknowledge that @Krishnanayakbluezone is my project mate and @Iftekarpatel Sir is my mentor. I extend my sincere gratitude to @Iftekarpatel Sir for his invaluable guidance and support in providing insights and ensuring the successful completion of the project.
DATASET AND SOURCECODE: