Welcome to our guide on hotel rating classification using machine learning! In this blog, we’ll explore how advanced technology can help hotels better understand guest experiences and improve their services. Hotel ratings are crucial for both travelers and hotel managers, offering insights into accommodation quality and amenities. By analyzing factors like guest reviews, location, and amenities, machine learning algorithms can predict and classify hotel ratings accurately. Join us as we delve into the fascinating world of data-driven hospitality, uncovering the secrets to enhancing guest satisfaction and elevating hotel experiences for travelers worldwide. Let’s embark on this exciting journey together!
STEP 1:
Importing the important libraries and dataset
We start by importing necessary libraries like pandas for data handling and matplotlib/seaborn for visualization. Then, we handle data preprocessing tasks using NLTK, such as removing unnecessary words and punctuation and converting text to lowercase. Next, we import the hotel reviews dataset from an Excel file. After that, we build machine learning models using algorithms like Logistic Regression and Naive Bayes from the sklearn library. These models help classify hotel ratings based on guest reviews. Finally, we evaluate model performance using metrics like accuracy and confusion matrix to understand how well they predict ratings.
# import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

# data preprocessing
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')  # required for WordNetLemmatizer below
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize
import re
from collections import Counter
from wordcloud import WordCloud
# model building
from imblearn.over_sampling import RandomOverSampler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report,precision_score
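Step 1 mentions importing the hotel reviews dataset from an Excel file, but the load itself isn't shown. Here is a minimal sketch; the filename `hotel_reviews.xlsx` is an assumption, and the in-memory sample below is invented purely to illustrate the two columns (`Review`, `Rating`) the rest of the notebook relies on:

```python
import pandas as pd

# Hypothetical path -- replace with the actual Excel file for this project:
# data = pd.read_excel('hotel_reviews.xlsx')

# Made-up stand-in with the same schema the later code expects:
data = pd.DataFrame({
    'Review': [
        'great location and friendly staff',
        'room was dirty and the parking was expensive',
        'average stay, nothing special',
    ],
    'Rating': [5, 2, 3],
})
print(data.shape)             # (3, 2)
print(data.columns.tolist())  # ['Review', 'Rating']
```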
STEP 2:
Initial Exploration
# Descriptive Statistics
data.shape
data.info()
data.isnull().sum()
data.duplicated().sum()
STEP 3:
Exploratory Data Analysis (EDA)
The data exploration of hotel ratings unveiled interesting insights. Most guests gave positive reviews, with around 75% rating their experience as 4 or 5 stars. However, some negative feedback also surfaced, highlighting areas for improvement. Parking feedback indicated overall satisfaction, with some room for enhancement. Additionally, the analysis showcased popular booking sites, perceptions of cost, and reasons for trips. These findings help understand guest preferences and guide strategies for enhancing guest satisfaction. By addressing both positive and negative aspects, hotels can improve services and maintain a positive reputation in the hospitality industry.
Let’s examine these insights one by one:
1: Distribution of ratings
data['Rating'].value_counts()

# Plotting the distribution of ratings as a pie chart
plt.figure(figsize=(8, 8))
labels = data['Rating'].value_counts().index
sizes = data['Rating'].value_counts().values
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=140, colors=['orange', 'pink', 'blue', 'lightskyblue', 'purple'])
plt.title('Distribution of Ratings')
plt.axis('equal')
plt.show()
2: Top 5 booking sites
booking_sites = data['Review'].str.extract(r'(\w+\.com)')[0].value_counts()
plt.figure(figsize=(10,5))
sns.barplot(y=booking_sites.head(5).index, x=booking_sites.head(5))
plt.title('Top 5 websites for booking a hotel')
plt.xlabel('Number of customers using the website to book the hotel')
plt.ylabel('Website name')
plt.show()
3: Parking feedback
parking_feedback = data[data['Review'].str.contains('parking', case=False)]

# Count positive and negative parking feedback
positive_parking_feedback = parking_feedback[parking_feedback['Rating'] >= 3]
negative_parking_feedback = parking_feedback[parking_feedback['Rating'] < 3]
# Plot the graph
plt.figure(figsize=(8, 5))
plt.bar(['Positive', 'Negative'], [len(positive_parking_feedback), len(negative_parking_feedback)], color=['green', 'red'])
plt.xlabel('Feedback')
plt.ylabel('Number of Reviews')
plt.title('Parking Feedback')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
4: Expensive vs affordable
expensive_reviews = data[data['Review'].str.contains('expensive|pricey', case=False)]
affordable_reviews = data[data['Review'].str.contains('affordable|cheap', case=False)]

# Count the number of reviews mentioning each sentiment
expensive_count = expensive_reviews.shape[0]
affordable_count = affordable_reviews.shape[0]
print("Number of people saying it's expensive:", expensive_count)
print("Number of people saying it's affordable:", affordable_count)
plt.figure(figsize=(8, 5))
plt.bar(['Expensive', 'Affordable'], [expensive_count, affordable_count], color=['salmon', 'lightgreen'])
plt.title('Perceived Cost of the Hotel')
plt.ylabel('Number of Reviews')
plt.grid(axis='y')
plt.show()
5: Family trip vs business trip
business_trip_count = data[data['Review'].str.contains('business trip', case=False)].shape[0]
family_trip_count = data[data['Review'].str.contains('family trip', case=False)].shape[0]

print("Number of people coming for a business trip:", business_trip_count)
print("Number of people coming for a family trip:", family_trip_count)
plt.figure(figsize=(8, 5))
plt.bar(['Business Trip', 'Family Trip'], [business_trip_count, family_trip_count], color=['skyblue', 'salmon'])
plt.title('Number of People by Trip Type')
plt.ylabel('Count')
plt.grid(axis='y')
plt.show()
6: Plotting histograms for review length based on rating
data['Length'] = data['Review'].apply(len)
data['num_words'] = data['Review'].apply(word_tokenize).apply(len)

# Create subplots for each rating category
fig, axes = plt.subplots(1, data['Rating'].nunique(), figsize=(15, 5), sharey=True)
# Iterate over each subplot and plot the histogram
for ax, (rating, sub_data) in zip(axes, data.groupby('Rating')):
    ax.hist(sub_data['Length'], color='#973aa8')
    ax.set_title(f'Rating {rating}')
    ax.set_xlabel('Length')
    ax.set_ylabel('Frequency')
plt.tight_layout()
plt.show()
correlation = data['Rating'].corr(data['Length'])
print("Correlation between Rating and Length:", correlation)
Create a sentiment target variable. If the rating is 4 or 5, the sentiment will be POSITIVE. If the rating is 1 or 2, the sentiment will be NEGATIVE. Otherwise, sentiment will be NEUTRAL.
# Sentiment analysis: categorize ratings into sentiment labels ('Positive', 'Negative', or 'Neutral')
data['Sentiment'] = data['Rating'].apply(lambda x: 'Positive' if x >= 4 else 'Negative' if x <= 2 else 'Neutral')
data['Sentiment'].value_counts()
NOTE: The counts above show that our data is imbalanced, so we will need to address this before model building.
7: Sentiment analysis (pie chart)
# Sentiment Analysis (Pie chart)
plt.figure(figsize=(6, 6))
data['Sentiment'].value_counts().plot(kind='pie', autopct='%1.1f%%', colors=['Pink', 'green', 'blue'])
plt.title('Sentiment Analysis')
plt.tight_layout()
plt.show()
STEP 4-
Text Cleaning and Preprocessing
Before building machine learning models, we preprocess the Review data into a format suitable for analysis. This involves lowercasing, tokenization, removing special characters, stopwords, and punctuation, as well as lemmatization to reduce words to their root forms.
# Define stop words
stop_words = set(stopwords.words("english"))

# Function to remove emojis
def remove_emoji(text):
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

# Function for data processing
def data_processing(text):
    text = text.lower()  # Convert the entire review text to lowercase
    text = re.sub(r"http\S+|www\S+|https\S+", '', text, flags=re.MULTILINE)  # Remove URLs
    text = re.sub(r'\S+@\S+', '', text)  # Remove emails from reviews
    text = re.sub(r'\d+', '', text)  # Remove digits from reviews
    text = text.strip()  # Remove leading/trailing whitespace
    text = remove_emoji(text)  # Remove emojis
    text = re.sub(r'[^\w\s]', '', text)  # Remove all punctuation from the reviews
    text_tokens = word_tokenize(text)
    filtered_text = [w for w in text_tokens if w not in stop_words]
    return " ".join(filtered_text)

# Apply data processing function to 'Review' column
data['transform_text'] = data['Review'].apply(data_processing)

# Function for cleaning
lemmatizer = WordNetLemmatizer()  # create once, not per word

def cleaning(text):
    clean_text = text.translate(str.maketrans('', '', string.punctuation)).lower()
    clean_text = [word for word in clean_text.split() if word not in stop_words]
    # Lemmatize each word (treating it as a verb)
    sentence = [lemmatizer.lemmatize(word, 'v') for word in clean_text]
    return ' '.join(sentence)

# Apply cleaning function to the preprocessed 'transform_text' column
data['transform_text'] = data['transform_text'].apply(cleaning)
This code prepares text data, such as hotel reviews, for sentiment analysis. The first function strips emojis, URLs, emails, digits, and extra spaces from the text while converting it to lowercase; it then tokenizes the text and removes common stop words like ‘and’ or ‘the’. The second function further cleans the text by removing punctuation and lemmatizing words to their root forms. This keeps the text consistent and ready for analysis: by eliminating noise, the sentiment model can focus on the informative parts of each review, helping businesses understand customer opinions and make better-informed decisions about their services.
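To see these steps in isolation, here is a compact, NLTK-free sketch of the same cleaning idea. Note that the stop-word set below is a tiny illustrative subset, not NLTK's full English list, and the helper name `clean_review` is made up for this example:

```python
import re
import string

# Tiny illustrative stop-word subset (NLTK's real list is much larger)
STOP_WORDS = {'the', 'and', 'a', 'was', 'is', 'in', 'at', 'of', 'to'}

def clean_review(text: str) -> str:
    text = text.lower()                                   # lowercase
    text = re.sub(r"http\S+|www\S+", '', text)            # strip URLs
    text = re.sub(r'\S+@\S+', '', text)                   # strip emails
    text = re.sub(r'\d+', '', text)                       # strip digits
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = [w for w in text.split() if w not in STOP_WORDS]
    return ' '.join(tokens)

print(clean_review('The room at http://example.com was GREAT, rated 10/10!'))
# -> 'room great rated'
```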
8: Total text length before and after cleaning
data['T_length'] = data['transform_text'].apply(len)
original_length = data['Length'].sum()
new_length = data['T_length'].sum()

print('Total text length before cleaning: {}'.format(original_length))
print('Total text length after cleaning: {}'.format(new_length))
plt.figure(figsize=(8, 6))
plt.bar(['Before Cleaning', 'After Cleaning'], [original_length, new_length], color=['skyblue', 'lightgreen'])
plt.title('Total Text Length Before and After Cleaning')
plt.ylabel('Total Length')
plt.xlabel('Cleaning Process')
plt.show()
9: Most common words in positive reviews
pos_reviews = data[data.Sentiment == 'Positive']
neg_reviews = data[data.Sentiment == 'Negative']

count = Counter()
for text in pos_reviews['Review'].values:
    for word in text.split():
        count[word] += 1

# Get the top 15 positive words
top_positive_words = count.most_common(15)
# Create a DataFrame for the top positive words
pos_words = pd.DataFrame(top_positive_words, columns=['word', 'count'])
# Plot the top positive words using Plotly Express
fig = px.bar(pos_words, x='count', y='word', title='Common words in positive reviews')
fig.show()
10: Most common words in negative reviews
count = Counter()
for text in neg_reviews['Review'].values:
    for word in text.split():
        count[word] += 1

# Get the top 15 negative words
top_negative_words = count.most_common(15)
# Create a DataFrame for the top negative words
neg_words = pd.DataFrame(top_negative_words, columns=['word', 'count'])
# Plot the top negative words using Plotly Express
fig = px.bar(neg_words, x='count', y='word', orientation='h', title='Common words in Negative reviews')
fig.show()
STEP 5:
Preparing Data for Machine Learning: Label Encoding, Vectorization, and Train-Test Split
# Split the data into X and Y
X = data['transform_text']
Y = data['Sentiment']

# Vectorizing the text data
vect = TfidfVectorizer()
X = vect.fit_transform(data['transform_text'])
# Splitting the dataset into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
# Encoding the target variable
labelEncoder = LabelEncoder()
y_train = labelEncoder.fit_transform(y_train)
y_test = labelEncoder.transform(y_test)
Here we prepare the data for training and testing the sentiment analysis models. First, we split the dataset into input features (X) and the target variable (Y), the sentiment category. Then we convert the text into numerical form using TF-IDF vectorization and split the dataset into training and testing sets. Lastly, we encode the sentiment labels as integers so the models can work with them. These steps ensure the data is properly formatted for training and evaluation.
STEP 6:
Handling imbalanced data
In sentiment analysis, data often exhibits class imbalance, where one sentiment category dominates the dataset while others are underrepresented. This imbalance can lead to biased models that favor the majority class, resulting in inaccurate predictions for minority classes.
# Plotting the distribution of sentiment classes
plt.figure(figsize=(10, 5))
data['Sentiment'].value_counts().sort_index().plot(kind='bar', color='purple')
plt.title('Distribution of Sentiment')
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.show()

np.bincount(y_train)
To address this, techniques like RandomOverSampler balance the class distribution by randomly duplicating existing samples from the minority classes (unlike SMOTE, which generates synthetic samples). This ensures that each sentiment category is adequately represented in the training data, allowing models to learn from all classes and make more accurate predictions for the minority ones. In this context, RandomOverSampler is a simple and effective way to mitigate class imbalance in sentiment analysis tasks.
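The duplication idea can be sketched with only scikit-learn's `resample` (the class counts below are invented; imblearn's RandomOverSampler does this same balancing for you):

```python
from collections import Counter
import numpy as np
from sklearn.utils import resample

# Made-up imbalanced labels: 8 Positive, 3 Negative, 1 Neutral
y = np.array(['Positive'] * 8 + ['Negative'] * 3 + ['Neutral'] * 1)
X = np.arange(len(y)).reshape(-1, 1)  # stand-in feature matrix

# Resample each class (with replacement) up to the majority count
majority = max(Counter(y).values())
X_parts, y_parts = [], []
for cls in np.unique(y):
    mask = y == cls
    X_cls, y_cls = resample(X[mask], y[mask], replace=True,
                            n_samples=majority, random_state=42)
    X_parts.append(X_cls)
    y_parts.append(y_cls)

X_bal = np.vstack(X_parts)
y_bal = np.concatenate(y_parts)
print(Counter(y))      # before: unequal class counts
print(Counter(y_bal))  # after: every class has 8 samples
```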
print("Before sampling class distribution:", Counter(y_train))
ros = RandomOverSampler()
ros_X_train, ros_y_train = ros.fit_resample(x_train, y_train)
print("After sampling class distribution:", Counter(ros_y_train))
STEP 7:
Model Building
After getting our data ready, we went on to create and test four models: Logistic Regression, Multinomial Naive Bayes, Linear SVC, and Random Forest. These models analyze hotel reviews to predict customer sentiments accurately. Our aim is to assist hotels in understanding customer feelings better and improving their services based on feedback. By using these models, we hope to offer insights that can enhance the overall customer experience. So, our focus is on building models that can understand and interpret customer sentiments from text reviews, ultimately helping hotels provide better services and make guests happier.
Model 1- Logistic Regression
logistic_reg = LogisticRegression(random_state=0)
logistic_reg.fit(ros_X_train, ros_y_train)
logistic_reg_pred = logistic_reg.predict(x_test)
logistic_reg_acc = accuracy_score(y_test, logistic_reg_pred)
print("Test accuracy: {:.2f}%".format(logistic_reg_acc*100))
print(confusion_matrix(y_test,logistic_reg_pred))
print("\n")
print(classification_report(y_test,logistic_reg_pred))
Model 2- MultinomialNB
mnb = MultinomialNB()
mnb.fit(ros_X_train, ros_y_train)
mnb_pred = mnb.predict(x_test)
mnb_acc = accuracy_score(y_test, mnb_pred)
print("Test accuracy: {:.2f}%".format(mnb_acc*100))
print(confusion_matrix(y_test,mnb_pred))
print("\n")
print(classification_report(y_test,mnb_pred))
Model 3- LinearSVC
svc = LinearSVC()
svc.fit(ros_X_train, ros_y_train)
svc_pred = svc.predict(x_test)
svc_acc = accuracy_score(y_test, svc_pred)
print("Test accuracy: {:.2f}%".format(svc_acc*100))
print(confusion_matrix(y_test,svc_pred))
print("\n")
print(classification_report(y_test,svc_pred))
Model 4- RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, criterion='entropy', random_state=0)
rf.fit(ros_X_train, ros_y_train)
rf_pred = rf.predict(x_test)
rf_acc = accuracy_score(y_test, rf_pred)
print("Test accuracy: {:.2f}%".format(rf_acc*100))
print(confusion_matrix(y_test,rf_pred))
print("\n")
print(classification_report(y_test,rf_pred))
In this code snippet, the accuracies of the four models (Logistic Regression, Multinomial Naive Bayes, Linear SVC, and Random Forest) are calculated using the accuracy_score function. The accuracies are stored in a list, and the best-performing model is determined by the highest accuracy. A bar plot then visualizes the accuracies of the different models, with the best model annotated on the plot.
# Calculate accuracies of all models
models = ["Logistic Regression", "Multinomial Naive Bayes", "Linear SVC", "Random Forest"]
accuracies = []
for model in [logistic_reg, mnb, svc, rf]:
    y_pred = model.predict(x_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)

# Determine the best model based on accuracy
best_model_index = np.argmax(accuracies)
best_model_name = models[best_model_index]
# Plotting the accuracies
plt.figure(figsize=(10, 5))
plt.bar(models, accuracies, color='skyblue')
plt.title('Accuracy of Different Models')
plt.xlabel('Model')
plt.ylabel('Accuracy')
plt.xticks(rotation=45)
plt.ylim(0, 1)
plt.grid(axis='y')
# Annotating the best model
plt.text(best_model_index, accuracies[best_model_index], f'Best Model: {best_model_name}', ha='center', va='bottom')
# Show plot
plt.tight_layout()
plt.show()
STEP 8:
Model Prediction
In the last part of our sentiment analysis journey, users can input their own reviews. Our trained logistic regression model then predicts if the sentiment of the review is positive, negative, or neutral. This feature lets users quickly understand the sentiment of their text. By using our model’s insights, people can make better decisions based on the sentiment labels provided. It’s a simple and effective way for anyone to get a sense of the sentiment in their writing, helping them better understand how their words might be perceived by others.
# Function to preprocess user input and make predictions using a logistic regression model
def predict_sentiment_logistic(model, vectorizer, user_input):
    # Apply the same text cleaning used during training
    user_input = cleaning(data_processing(user_input))
    # Vectorize the input using the same vectorizer used during training
    user_input_vectorized = vectorizer.transform([user_input])
    # Make prediction
    prediction = model.predict(user_input_vectorized)
    # Convert the predicted label back to the original sentiment
    predicted_sentiment = labelEncoder.inverse_transform(prediction)
    return predicted_sentiment[0]  # Return the predicted sentiment as a string
user_input = input("Enter your review: ")
predicted_sentiment_logistic = predict_sentiment_logistic(logistic_reg, vect, user_input)
print("Predicted sentiment (Logistic Regression):", predicted_sentiment_logistic)
This code helps predict the sentiment (positive, negative, or neutral) of a user’s review using a logistic regression model. It first prepares the user’s input by converting it into a format the model can understand. Then, it uses the trained model to predict the sentiment of the input. Finally, it shows the predicted sentiment to the user. So, if someone enters a review, this code can tell whether it’s positive, negative, or neutral, based on what the model has learned from past data.
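Putting the whole flow together, here is a self-contained miniature of train-then-predict. The toy reviews and labels below are invented for illustration, and this sketch skips the cleaning and oversampling steps from the full pipeline:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Invented toy training data
train_reviews = [
    'wonderful stay great staff', 'clean room lovely view',
    'terrible service dirty room', 'awful noisy never again',
]
train_labels = ['Positive', 'Positive', 'Negative', 'Negative']

# Vectorize and fit, mirroring the blog's pipeline in miniature
vect = TfidfVectorizer()
X_train = vect.fit_transform(train_reviews)
clf = LogisticRegression(random_state=0).fit(X_train, train_labels)

# Classify a new review with the SAME fitted vectorizer
new_review = 'great clean room wonderful staff'
prediction = clf.predict(vect.transform([new_review]))[0]
print(prediction)
```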
Sentiment analysis plays a vital role in understanding customer feedback, especially in industries like hospitality. We’ve seen how machine learning models such as Logistic Regression, Naive Bayes, Linear SVC, and Random Forest can accurately predict sentiments from text data. By preprocessing data, training models, and assessing their performance, we’ve achieved impressive results. These models offer businesses valuable insights from customer reviews, ultimately improving service quality and satisfaction. As we continue refining and implementing these models, we’re paving the way for smarter decision-making based on customer sentiments, ensuring better experiences for all.
I would like to acknowledge that @Krishnanayakbluezone is my project mate and @Iftekarpatel Sir is my mentor. I extend my sincere gratitude to @Iftekarpatel Sir for his invaluable guidance and support in providing insights and ensuring the successful completion of the project.
DATASET AND SOURCECODE: