Click-through rate (CTR) prediction is a crucial task in online advertising: the goal is to predict the probability that a user will click on a particular advertisement. It plays a vital role in optimizing ad campaigns, helping advertisers allocate resources effectively, target specific audience segments, and maximize return on investment (ROI). Machine learning techniques are commonly employed for CTR prediction because they can analyze vast amounts of data and identify patterns that help predict user behavior.
Machine learning algorithms are well-suited for CTR prediction tasks because they can learn complex patterns from historical data and make predictions on new data. Here’s how machine learning plays a role in predicting ad click-through rates:
Feature Engineering
Machine learning models require relevant features to make accurate predictions. Features such as user demographics, time of day, website content, and historical click behavior are commonly used to predict ad clicks.
Model Training
Once the relevant features are extracted, machine learning models are trained using historical data. During training, the models learn the underlying patterns and relationships between the features and the target variable (click or no-click).
Prediction
After training, the model can make predictions on new data by applying the learned patterns to unseen instances. These predictions provide insights into the likelihood of users clicking on specific ads.
Performance Evaluation
The performance of the machine learning model is evaluated using various metrics such as accuracy, precision, recall, and F1-score. These metrics help assess the model’s effectiveness in predicting ad clicks and identify areas for improvement.
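The four steps above can be sketched end-to-end on a toy dataset. Everything in the snippet below is invented for illustration (feature names, data, and the logistic-regression choice); it is not the model used later in this article:

```python
# Toy end-to-end sketch of the workflow above: features -> training ->
# prediction -> evaluation. All data and feature names here are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

rng = np.random.default_rng(0)
n = 1000
# Hypothetical features: daily minutes on site and hour of day
X = np.column_stack([rng.uniform(0, 60, n), rng.integers(0, 24, n)])
# Synthetic click labels: longer visits tend to click (plus noise)
y = (X[:, 0] + rng.normal(0, 10, n) > 30).astype(int)

# Train on "historical" data, hold out a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Predict on unseen instances and evaluate
y_pred = model.predict(X_test)
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f1       :", f1_score(y_test, y_pred))
```

Any classifier with `fit`/`predict` could stand in for the logistic regression here; the pipeline shape is what matters.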
Predicting ad click-through rates using machine learning offers several benefits:
Improved Targeting
Machine learning models can analyze large datasets and identify patterns in user behavior, allowing advertisers to target their ads more effectively to the right audience segments.
Cost Efficiency
By predicting ad clicks more accurately, advertisers can optimize their ad campaigns and allocate their budgets more efficiently, resulting in higher ROI.
However, there are also some drawbacks to using machine learning for CTR prediction:
Data Privacy Concerns
Advertisers often rely on user data to train machine learning models, raising concerns about data privacy and security.
Model Complexity
Building accurate machine learning models for CTR prediction can be challenging due to the complex nature of user behavior and the dynamic nature of online advertising platforms.
Nevertheless, we can still use ML algorithms to get a fair idea of how our ads will perform in a real-world scenario. Let us walk through the code to see how.
Download a dataset from Kaggle
In this section, the code automates the process of downloading a dataset from Kaggle in a Google Colab environment, enabling seamless access to external data for machine learning projects and analysis. Here’s a breakdown of each step:
- Import Necessary Libraries: The code imports the required libraries for handling files, including `os`, `zipfile`, `ZipFile`, and `files` from Google Colab.
- Upload Kaggle API Key: The `files.upload()` function lets you upload your Kaggle API key (`kaggle.json`) through the Colab interface. This key is necessary for authenticating your access to Kaggle datasets.
- Move API Key to Correct Location: After uploading the `kaggle.json` file, the code creates a `.kaggle` directory in the user's home directory (`~`) using `!mkdir ~/.kaggle`. It then moves the uploaded `kaggle.json` file into this directory with `!mv kaggle.json ~/.kaggle/` and sets its permissions to 600 with `!chmod 600 ~/.kaggle/kaggle.json`, so that only the owner can read it.
- Authenticate Kaggle API: The Kaggle API client is instantiated (`api = KaggleApi()`) and authenticated via the `api.authenticate()` method. This step is necessary to access Kaggle datasets programmatically.
- Define Download Directory: The `download_dir` variable specifies the directory path where you want to download the dataset. Here it is set to `/content`, the default working directory in Google Colab.
- Download Dataset: Finally, the code downloads the dataset from Kaggle with the `api.dataset_download_files()` method, specifying the dataset name (`"gauravduttakiit/clickthrough-rate-prediction"`), the download directory (`download_dir`), and `unzip=True` to automatically unzip the downloaded file.
import os
import zipfile
from zipfile import ZipFile
from google.colab import files

# Upload your Kaggle API key (kaggle.json) using the Colab interface
files.upload()
# Move the uploaded key to the correct location
!mkdir ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
from kaggle.api.kaggle_api_extended import KaggleApi
# Instantiate the Kaggle API client
api = KaggleApi()
api.authenticate()
# Define the directory path where you want to download the dataset
download_dir = "/content"
# Download the dataset into the specified directory
api.dataset_download_files(dataset="gauravduttakiit/clickthrough-rate-prediction", path=download_dir, unzip=True)
Install necessary libraries
Make sure to install all the necessary libraries for the code to function. For instance, the following code installs the `faker` package, a useful tool for generating synthetic or fake data for testing, demonstration, or other purposes in Python applications.
!pip install faker
Load, Preprocess, and Explore Data
In this section, we import necessary libraries and load the advertising dataset into a pandas DataFrame.
- pandas: Library for data manipulation and analysis.
- numpy: Library for numerical computing.
- xgboost: A machine learning algorithm used for classification tasks.
- LabelEncoder: Used to encode categorical variables.
- train_test_split: Used to split the dataset into training and testing sets.
- accuracy_score: Metric used to evaluate the performance of the model.
- joblib: Library used for saving and loading models.
- Faker: Library used to generate synthetic data.
Please be advised that Faker will be utilized to generate synthetic data, including user interactions with ads. As this data will be randomly generated and may not accurately reflect real-world behavior, we will employ a loop to iterate through 1000 seed values, aiming to identify the seed that yields the highest model accuracy. This process is solely intended to demonstrate how the model can make predictions on new campaign data. In practice, post-campaign launch, real-world data will be utilized to assess the model’s performance accurately.
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import matplotlib.pyplot as plt
import joblib
from faker import Faker

# Load the dataset
data = pd.read_csv('ad_10000records.csv')
# See the contents of the dataset
data.head()
The output shows that the DataFrame contains the following columns:
- Daily Time Spent on Site: This column contains numerical values representing the amount of time users spend on a website each day.
- Age: This column contains numerical values representing the age of the users.
- Area Income: This column contains numerical values representing the area income (average household income) where the users live.
- Daily Internet Usage: This column contains numerical values representing the amount of time users spend on the internet each day.
- Ad Topic Line: This column contains the topic of the advertisement shown to the user.
- City: This column contains the city where the user lives.
- Gender: This column contains the gender of the user.
- Country: This column contains the country where the user lives.
- Timestamp: This column contains a timestamp for when the data was collected.
- Clicked on Ad: This column contains the user interaction with the ad, where 0 indicates the user did not click on the ad and 1 indicates the user clicked on the ad.
Let us check the information of the DataFrame using `data.info()`.
The output shows that there are 10000 entries (rows), and the DataFrame uses a RangeIndex to label the rows from 0 to 9999. There are no missing values, and the data type of each column is listed as well.
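For reference, this inspection runs on any DataFrame. The sketch below uses a tiny stand-in frame with the same column names (all values invented), since the Kaggle file may not be at hand:

```python
# Stand-in inspection sketch: a two-row DataFrame with the same columns as
# the ad dataset (values are invented), just to illustrate data.info().
import pandas as pd

data = pd.DataFrame({
    "Daily Time Spent on Site": [68.9, 80.2],
    "Age": [35, 31],
    "Area Income": [61833.9, 68441.8],
    "Daily Internet Usage": [256.1, 193.8],
    "Ad Topic Line": ["Example topic A", "Example topic B"],
    "City": ["Springfield", "Rivertown"],
    "Gender": ["Male", "Female"],
    "Country": ["Tunisia", "Nauru"],
    "Timestamp": ["2016-03-27 00:53:11", "2016-04-04 01:39:02"],
    "Clicked on Ad": [0, 1],
})

data.info()                     # dtypes and non-null counts per column
print(data.isna().sum().sum())  # prints 0: no missing values in this frame
```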
Feature Engineering
In this section, we convert the `Timestamp` column to datetime type and extract additional features, `Hour`, `Day`, and `Month`, from it. These features can provide valuable information about the timing of ad clicks.
# Convert 'Timestamp' column to datetime type
data['Timestamp'] = pd.to_datetime(data['Timestamp'])

# Extract hour, day, and month from timestamp
data['Hour'] = data['Timestamp'].dt.hour
data['Day'] = data['Timestamp'].dt.day
data['Month'] = data['Timestamp'].dt.month
Encode Categorical Variables
In this process, we not only store the unique values of the `City` and `Country` columns before encoding, but also encode the `Gender` column using `LabelEncoder`.

As previously mentioned, we will generate a synthetic dataset with all the columns from our original dataset. When using the Faker library, both city and country names will be randomly generated. However, if our trained model encounters encoded cities or countries it did not see during training, this will result in an error. To mitigate this issue, we capture all unique values of cities and countries before encoding them, so we can later draw random cities and countries from these lists. For the `Gender` column, saving unique values is unnecessary: for simplicity, our data only includes Male and Female as the two genders, a convention also followed by Faker.
Subsequently, we initialize label encoders for categorical variables and employ them to encode the City, Gender, and Country columns. Encoding categorical variables converts categorical data into numerical format, which is necessary for training machine learning models.
# Store unique cities and countries before encoding
unique_cities = data['City'].unique()
unique_countries = data['Country'].unique()

# Initialize label encoders for categorical variables
label_encoders = {}
# Encode categorical variables
for col in ['City', 'Gender', 'Country']:
    label_encoders[col] = LabelEncoder()
    data[col] = label_encoders[col].fit_transform(data[col])
data.head()
After feature engineering and encoding, our data will look as follows:
Split Data and Model Training
Now that our preparation is done, it is time to train our model. In this section, we split the dataset into features (`X`) and target variable (`y`). We remove irrelevant columns like `Clicked on Ad`, `Ad Topic Line`, and `Timestamp` from the feature set `X`.

Note that we also have the option to leverage pre-trained LLMs to incorporate the `Ad Topic Line` as a feature, given its potential to captivate users' attention. For simplicity, we will omit this step for the time being, but I mention it because ad copy holds real significance in predicting CTRs.
Moving on, we now train an `XGBClassifier` model. XGBoost is chosen as the classifier for its ability to handle both numerical and categorical data efficiently and to capture complex relationships in the data.
# Concatenate numerical and timestamp features
X = data.drop(['Clicked on Ad', 'Ad Topic Line', 'Timestamp'], axis=1)
y = data['Clicked on Ad']

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train an XGBClassifier
clf_xgb = xgb.XGBClassifier(n_estimators=100, random_state=42) # Use XGBClassifier
clf_xgb.fit(X_train, y_train)
Evaluate the Model
In this section, we evaluate the performance of the trained classifier on the test dataset and print a classification report in DataFrame format. Let's break down each part:
- `y_pred = clf_xgb.predict(X_test)` predicts the labels for the test dataset (`X_test`) using the trained classifier (`clf_xgb`). The predicted labels are stored in the variable `y_pred`.
- `accuracy = accuracy_score(y_test, y_pred)` calculates the accuracy of the model's predictions by comparing the predicted labels (`y_pred`) with the true labels of the test dataset (`y_test`). The accuracy score measures the proportion of correctly classified instances.
- `print("Accuracy:", accuracy)` prints the accuracy score, indicating how well the model performs on the test dataset.
- `class_report = classification_report(y_test, y_pred, output_dict=True)` computes a classification report containing performance metrics such as precision, recall, F1-score, and support for each class. Setting `output_dict=True` returns the report as a dictionary.
- `class_report_df = pd.DataFrame(class_report).transpose()` converts the classification report dictionary into a DataFrame, where each row corresponds to a class and each column to a performance metric.
- `print("\nClassification Report:")` prints a header indicating that the following output is the classification report.
- Finally, `print(class_report_df)` prints the classification report in DataFrame format, providing a structured summary of the model's performance across classes.
# Evaluate the model
y_pred = clf_xgb.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Compute classification report
class_report = classification_report(y_test, y_pred, output_dict=True)
class_report_df = pd.DataFrame(class_report).transpose()
# Print classification report in DataFrame
print("\nClassification Report:")
print(class_report_df)
Our model has a decent accuracy of 88.2% and an F1-score of 88%, making it suitable for predictions.
Generate Synthetic Data
As mentioned above, it is time to generate a random dataset and make predictions on it. Here, we generate synthetic campaign data using the Faker library. We randomly generate values for features such as `Daily Time Spent on Site`, `Age`, `Area Income`, `Gender`, and `Daily Internet Usage`. For the `City` and `Country` features, we randomly select values from the unique cities and countries present in the original dataset. Although not strictly required, we also generate `Timestamp` values for the synthetic data within the current year.
fake = Faker()

# Generate synthetic 'City' and 'Country' values using unique values from the original dataset
n_samples = 1000  # Number of synthetic samples
synthetic_data = pd.DataFrame({
    'Daily Time Spent on Site': np.random.uniform(20, 120, n_samples),
    'Age': np.random.randint(18, 65, n_samples),
    'Area Income': np.random.uniform(15000, 100000, n_samples),
    'Daily Internet Usage': np.random.uniform(50, 300, n_samples),
    'City': [fake.random_element(unique_cities) for _ in range(n_samples)],
    'Gender': [fake.random_element(['Male', 'Female']) for _ in range(n_samples)],
    'Country': [fake.random_element(unique_countries) for _ in range(n_samples)],
    'Timestamp': [fake.date_time_this_year() for _ in range(n_samples)]
})
Data Preprocessing and Encoding Synthetic Data
Similar to the original dataset, we convert the `Timestamp` column to datetime type and extract hour, day, and month features from it for the synthetic data.

We also encode the `Gender` column in the synthetic data using the same label encoder used for the original data. This ensures consistency in encoding between the original and synthetic datasets.

Just to be on the safe side, let us also filter out any synthetic values for `City` and `Country` that are not present in the original dataset. This keeps the synthetic data consistent with the original in terms of city and country values.

Next, we use only the `transform` method (never `fit_transform`) on the `City` and `Country` columns of the `synthetic_data` DataFrame, via the pre-fitted label encoder objects in `label_encoders`. This keeps the categorical encoding consistent with what our trained model saw.

We then remove the `Timestamp` column from the synthetic dataset to complete the preparation for making predictions.
# Convert 'Timestamp' column to datetime type
synthetic_data['Timestamp'] = pd.to_datetime(synthetic_data['Timestamp'])

# Extract hour, day, and month from timestamp for synthetic data
synthetic_data['Hour'] = synthetic_data['Timestamp'].dt.hour
synthetic_data['Day'] = synthetic_data['Timestamp'].dt.day
synthetic_data['Month'] = synthetic_data['Timestamp'].dt.month
# Encode categorical variables for synthetic data using the same label encoders
for col in ['Gender']:
    synthetic_data[col] = label_encoders[col].transform(synthetic_data[col])
# Filter out any synthetic values not present in the original dataset for 'City' and 'Country'
synthetic_data = synthetic_data[synthetic_data['City'].isin(unique_cities)]
synthetic_data = synthetic_data[synthetic_data['Country'].isin(unique_countries)]
# Encode 'City' and 'Country' using label encoders
synthetic_data['City'] = label_encoders['City'].transform(synthetic_data['City'])
synthetic_data['Country'] = label_encoders['Country'].transform(synthetic_data['Country'])
# Drop Timestamp feature for synthetic data
synthetic_X = synthetic_data.drop(['Timestamp'], axis=1)
Use the Trained Model to Predict Clicks on Synthetic Data
As previously mentioned, we'll use NumPy to generate ad clicks randomly. This time, though, we'll implement a loop over random seeds from 0 to 999 and keep the seed that yields the highest model accuracy. While not recommended in real-world scenarios, this exploration lets us probe potential outcomes.

In practice, a model's accuracy reflects the features that actually influence user ad clicks, and purely random labels ignore those nuances. We could instead generate clicks that mimic the model's learned feature relationships, but that would defeat the purpose of evaluating it. So we use the loop to find the seed that gives the highest accuracy. Let's break down its functionality.
Initialization
The `highest_accuracy` and `best_seed` variables are initialized to store the highest accuracy achieved and the seed that produced it, respectively. The loop iterates through seeds from 0 to 999, allowing for 1000 iterations; you can adjust this range based on your requirements.
Random Seed Setting
Inside the loop, `np.random.seed(seed)` sets the random seed to ensure reproducibility of the synthetic label generation for each iteration.
Synthetic Data Generation
Synthetic labels for the data (`synthetic_data['Clicked on Ad']`) are generated randomly using `np.random.randint(0, 2, len(synthetic_data))`. This assigns binary labels (0 or 1) to each data point in the synthetic dataset.
Model Prediction
The classifier (`clf_xgb`) is then used to make predictions (`synthetic_predictions`) on the synthetic features (`synthetic_X`).
Accuracy Calculation
The accuracy of the classifier on the synthetic data (`accuracy_synthetic`) is calculated using the `accuracy_score` function, comparing the predicted labels (`synthetic_predictions`) with the true labels (`synthetic_data['Clicked on Ad']`).
Updating Best Seed and Accuracy
If the accuracy obtained with the current seed is higher than the `highest_accuracy` seen so far, `highest_accuracy` is updated to the new accuracy and `best_seed` to the current seed.
Final Evaluation
After iterating through all seeds, the loop concludes, and the accuracy of the model on the synthetic data with the best seed is printed.
highest_accuracy = 0.0
best_seed = None

for seed in range(1000):  # Adjust the range for more or fewer seeds
    np.random.seed(seed)  # Set the random seed
    # Generate synthetic labels for each seed
    synthetic_data['Clicked on Ad'] = np.random.randint(0, 2, len(synthetic_data))
    # Make predictions for the current synthetic labels
    synthetic_predictions = clf_xgb.predict(synthetic_X)
    # Calculate accuracy for the current seed
    accuracy_synthetic = accuracy_score(synthetic_data['Clicked on Ad'], synthetic_predictions)
    # Update highest accuracy and best seed if necessary
    if accuracy_synthetic > highest_accuracy:
        highest_accuracy = accuracy_synthetic
        best_seed = seed
# Evaluate the accuracy of the model on the synthetic data
print(f"Accuracy on Synthetic Data: {highest_accuracy:.4f}")
As you can see, with randomly generated clicks the model's accuracy (56%) falls short of expectations, though it still beats a 50–50 baseline.
Decode and Print Predictions
Let us decode the encoded labels back into their original categorical values and then print the synthetic dataset along with the predictions. The code does the following:
Decode Encoded Labels
The code uses the `label_encoders` dictionary to access the label encoders that encoded the categorical features `City`, `Gender`, and `Country` in the synthetic dataset.

`label_encoders['City'].inverse_transform(synthetic_data['City'])` decodes the encoded city labels in the `City` column of the `synthetic_data` DataFrame back into their original city names. Similarly, `label_encoders['Gender'].inverse_transform(synthetic_data['Gender'])` and `label_encoders['Country'].inverse_transform(synthetic_data['Country'])` decode the encoded gender and country labels into their original values, respectively.
Print Synthetic Dataset with Predictions
The decoded categorical features are assigned back to the corresponding columns (`City`, `Gender`, `Country`) in the `synthetic_data` DataFrame, and `synthetic_predictions` is added as a new `Predictions` column. The code then prints the first few rows of the synthetic dataset along with the predictions using the `head()` method.
# Decode city, gender, and country names
synthetic_data['City'] = label_encoders['City'].inverse_transform(synthetic_data['City'])
synthetic_data['Gender'] = label_encoders['Gender'].inverse_transform(synthetic_data['Gender'])
synthetic_data['Country'] = label_encoders['Country'].inverse_transform(synthetic_data['Country'])

# Print synthetic dataset along with predictions
synthetic_data['Predictions'] = synthetic_predictions
print("Synthetic Dataset with Predictions:")
synthetic_data.head()
Explain the predictions
Let us use the built-in `feature_importances_` attribute of `XGBClassifier` to understand the importance of each feature in the model's predictions. The following code reveals the relative importance of the different features in the XGBoost model and identifies the most influential ones. It provides valuable insight into the model's decision-making process and helps with feature selection and interpretation. Let's see how the code works.
Get Feature Importances
The first step retrieves the feature importances from the `feature_importances_` attribute of the trained `XGBClassifier` (`clf_xgb`). This attribute stores the relative importance of each feature in the trained model.
Calculate Total Importance
The total importance is calculated by summing up all feature importances. This step is necessary to normalize the feature importances later.
Calculate Percentage Importances
Percentage importances are calculated by dividing each feature importance by the total importance and multiplying by 100. This step ensures that the feature importances sum up to 100%, making it easier to interpret.
Create DataFrame
A pandas DataFrame (`feature_importance_df`) is created to store the feature names and their corresponding percentage importances. This DataFrame is sorted in descending order by importance.
Visualize Feature Importances
Matplotlib is used to create a horizontal bar plot to visualize the feature importances. Each bar represents the importance of a feature, and the length of the bar indicates its relative importance. The most important features are displayed at the top of the plot.
Add Percentage Labels
Percentage labels are added on top of each bar to provide a numerical representation of the feature importances. These labels are positioned slightly to the right of the bars for better readability.
Customization
The plot is customized with labels for the x-axis (`plt.xlabel`), y-axis (`plt.ylabel`), and title (`plt.title`). Additionally, the figure size is set using `plt.figure(figsize=(10, 6))` to adjust the plot's dimensions.
Display the Plot
Finally, the plot is displayed using `plt.show()`.
# Get and sort feature importances
feature_importances = clf_xgb.feature_importances_
feature_names = X_train.columns
total_importance = sum(feature_importances)  # Calculate total importance

# Calculate percentage importances
percentage_importances = (feature_importances / total_importance) * 100
feature_importance_df = pd.DataFrame({'feature': feature_names, 'importance': percentage_importances})
feature_importance_df = feature_importance_df.sort_values(by='importance', ascending=False)
# Create feature importance visualization with percentages
plt.figure(figsize=(10, 6)) # Adjust figure size as needed
plt.barh(feature_importance_df['feature'], feature_importance_df['importance'], color='skyblue')
plt.xlabel('Feature Importance (%)')
plt.ylabel('Feature Name')
plt.title('Feature Importance Scores (XGBoost)')
plt.gca().invert_yaxis() # Invert y-axis to display most important features on top
# Add percentage labels on top of bars
for i, v in enumerate(feature_importance_df['importance']):
    plt.text(v + 0.02, i, f"{v:.2f}%", va='center')  # Adjust offset for better positioning
plt.tight_layout()
plt.show()
We can observe the significance of each feature (%) in predicting the outcome, highlighting the weightage assigned by the XGBClassifier to individual features.
In conclusion, predicting ad click-through rates (CTR) plays a crucial role in digital advertising, enabling businesses to optimize their marketing strategies and maximize return on investment. Machine learning techniques have become indispensable in this domain, offering powerful tools to analyze vast amounts of data and extract valuable insights.
The presented code demonstrates a comprehensive approach to CTR prediction, starting from data preprocessing and feature engineering through model training, evaluation, and synthetic data generation. Through the use of an XGBClassifier (XGBoost model), the model learns complex relationships between the various features and the likelihood of ad clicks. Furthermore, the incorporation of synthetic data allows for testing the model's behavior under different scenarios.
Despite the advancements in machine learning, challenges still exist, including the need for high-quality data, model interpretability, and generalization to new datasets. Additionally, ethical considerations, such as data privacy and algorithmic bias, require careful attention to ensure fair and responsible use of predictive models in advertising.
Overall, the code highlights the potential of machine learning in improving ad targeting and campaign effectiveness, ultimately driving business growth and customer engagement. By continuously refining models, leveraging new technologies, and adhering to ethical standards, advertisers can unlock the full potential of CTR prediction and deliver more personalized and relevant experiences to their audiences.