In today’s interconnected world, email has become one of the primary modes of communication. With its widespread use, however, comes the inevitable challenge of spam. Spam emails have been a nuisance for decades: they clutter inboxes, waste time, and sometimes even pose security risks. To combat this, spam detection models play a crucial role in filtering out unwanted emails, ensuring that users receive only relevant and legitimate messages in their inbox.
Efficient Resource Management
By filtering out spam emails, these models help users save time and streamline their email management process.
Enhanced Security
Spam emails often contain malicious links or attachments. Detection models help protect users from potential cybersecurity threats.
Improved User Experience
With fewer unwanted emails to sift through, users can focus on important communications, leading to a better overall experience.
Business Relevance
For businesses, spam filters are essential for maintaining professional communication channels and protecting sensitive information.
In today’s blog, we will create a spam detection model using a pre-trained language model, DistilBERT. Let’s start sieving the spam from the non-spam emails!
Import Libraries to Download Data from Kaggle
import os
from zipfile import ZipFile
from google.colab import files
Upload Kaggle API Key
We need to make sure that our Kaggle API key is secure and not exposed to other users. To do so, we run the following commands.
files.upload()
It prompts the user to upload files to the Colab environment. In this context, it’s used to upload the kaggle.json file, which contains your Kaggle API credentials. Download your Kaggle credentials in JSON format from kaggle.com.
!mkdir ~/.kaggle
It creates a directory named .kaggle in the user’s home directory (~), where the Kaggle API expects to find the kaggle.json file.
!mv kaggle.json ~/.kaggle/
It moves the uploaded kaggle.json file to the .kaggle directory. This is important because the Kaggle API looks for the kaggle.json file in this directory to authenticate the user.
!chmod 600 ~/.kaggle/kaggle.json
It sets the permissions of the kaggle.json file to read and write for the owner only. This ensures that only the owner (you) can access the API credentials stored in the kaggle.json file, which is important for security reasons.
In summary, this code sequence is essential for setting up Kaggle API authentication in Google Colab, allowing you to seamlessly access Kaggle datasets and competitions directly from your notebook. It ensures that your API credentials are stored securely and accessible to the Kaggle API.
files.upload()
!mkdir ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
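If you want to verify that authentication is set up correctly before downloading anything, a quick sanity check is to search Kaggle from the notebook. This assumes the kaggle CLI is available (it comes preinstalled on Colab); if the credentials are wrong, the command fails with an authentication error instead of listing datasets.
!kaggle datasets list -s "email spam"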
Download the Data
The provided code snippet demonstrates how to programmatically download datasets from Kaggle using the Kaggle API within a Python environment. Let’s break down the significance of each part of the code:
from kaggle.api.kaggle_api_extended import KaggleApi
This line imports the KaggleApi class from the kaggle_api_extended module. This class provides methods for interacting with the Kaggle API, such as downloading datasets and competition files and submitting entries.
api = KaggleApi()
It instantiates an object of the KaggleApi class, which allows you to interact with the Kaggle API using the methods provided by this class.
api.authenticate()
It authenticates the Kaggle API client using the credentials stored in the kaggle.json file located in the ~/.kaggle directory. Authentication is necessary to access Kaggle datasets and other resources via the API.
download_dir
It specifies the directory path where you want to download the dataset. In this case, I have set it to "/content", which is the default working directory in Google Colab notebooks.
api.dataset_download_files
This method uses the KaggleApi object to download the specified dataset. It takes several parameters, including the dataset name (dataset), the path where you want to save the dataset (path), and whether to unzip the downloaded files (unzip). By setting unzip=True, the downloaded files will be automatically extracted after download.
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()
download_dir = "/content"
api.dataset_download_files(dataset="purusinghvi/email-spam-classification-dataset", path=download_dir, unzip=True)
Load and Read the Data
Our dataset contains two columns:
1. label
- ‘1’ indicates that the email is classified as spam.
- ‘0’ denotes that the email is legitimate (not spam).
2. text
- This column contains the actual content of the email messages.
import pandas as pd  # needed here to read the CSV (also imported again below)

dataset_path = '/content/combined_data.csv'
df = pd.read_csv(dataset_path)
df.head(10)
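Spam corpora are often imbalanced, so it is worth checking the label distribution before training. A quick look, assuming the column names described above:
# Count and proportion of spam (1) vs. legitimate (0) emails
print(df['label'].value_counts())
print(df['label'].value_counts(normalize=True))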
Install a Library to Speed Up Our Training Process on the GPU
The following command ensures that you have the latest version of the accelerate library installed in your environment. This is important for compatibility with other libraries and frameworks, as well as for accessing the latest features and improvements introduced in newer versions of the library.
The accelerate library is a Hugging Face tool for optimizing the performance of PyTorch applications across different hardware. It provides utilities for running training and inference efficiently on CPUs, single or multiple GPUs, and TPUs, and recent versions of the Transformers Trainer rely on it for GPU training.
!pip install accelerate -U
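If you upgraded an already-installed copy, Colab sometimes requires a runtime restart before the new version is picked up. A quick way to confirm which version is active:
import accelerate
print(accelerate.__version__)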
Import Necessary Libraries for Our Model
Here, we import libraries required for data preprocessing, model training, and evaluation.
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, TrainingArguments, Trainer, EarlyStoppingCallback
from torch.utils.data import Dataset, DataLoader
import torch
Specify the Device
As our model will consume a lot of RAM, it is better to train it on a GPU. If your machine has a compatible GPU and you have installed the necessary CUDA toolkit and drivers, PyTorch can leverage the GPU for accelerated computation. This is especially beneficial for deep learning tasks, as GPUs are highly parallelizable and can perform matrix operations much faster than CPUs. Using a GPU can significantly reduce training times for deep learning models.
Not all machines have GPUs, and even among those that do, users may not always want to use them (e.g., if the GPU is being used for other tasks). By checking torch.cuda.is_available(), we ensure that our code is compatible with a wide range of environments. If a GPU is available, our code will utilize it; otherwise, it will fall back to using the CPU.
By setting device, we specify which device PyTorch should use for tensor computations. This allows us to write device-agnostic code that automatically adapts to the available hardware. Throughout our code, we can then send tensors and models to device using to(device), ensuring that computations are performed on the appropriate device.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
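As a small illustration of this device-agnostic pattern, the same line of code places a tensor on the GPU when one is available and on the CPU otherwise:
# Move a tensor to whichever device was selected above
x = torch.randn(2, 3).to(device)
print(x.device)  # prints e.g. cuda:0 on a GPU runtime, cpu otherwise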
Preprocess the Data
We preprocess the text data by converting it to lowercase to ensure consistency during tokenization and model training.
df['text'] = df['text'].apply(lambda x: x.lower())
Prepare the Dataset and Load the Model
We split the dataset into training and testing sets, load a pre-trained DistilBERT model, and move it to the specified device.
train_test_split(df['text'], df['label'], test_size=0.2)
It splits the dataset into training and testing sets, which is a fundamental step in supervised machine learning. By splitting the dataset, you create separate subsets for training the model (train_texts and train_labels) and evaluating its performance (test_texts and test_labels). This ensures that the model’s performance is assessed on unseen data, which helps prevent overfitting and provides a more accurate evaluation of the model’s generalization capabilities.
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
It initializes a tokenizer specific to the DistilBERT model. Tokenization is the process of converting raw text data into numerical input that the model can understand. The tokenizer splits the text into tokens (individual words or subwords) and converts them into numerical representations (indices) that the model can process.
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=2)
Here, the DistilBERT model is loaded from a pre-trained checkpoint. num_labels=2 specifies that the model will be used for binary classification, which is the case in spam detection, where there are two classes (e.g. spam/not spam).
model.to(device)
Finally, it moves the model to the specified device (CPU or GPU), ensuring that computations are performed on the selected hardware. In this case, device is chosen based on the availability of a GPU, as mentioned above. Utilizing GPU acceleration can significantly speed up the computations involved in training and inference, especially for large models like DistilBERT.
train_texts, test_texts, train_labels, test_labels = train_test_split(df['text'], df['label'], test_size=0.2)

model_name = 'distilbert-base-uncased'
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=2)
model.to(device)
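One optional refinement, not in the original split: because spam datasets are often imbalanced, passing stratify keeps the spam/not-spam ratio the same in both subsets, and random_state makes the split reproducible. A hedged variant of the split above:
train_texts, test_texts, train_labels, test_labels = train_test_split(
    df['text'], df['label'],
    test_size=0.2,
    stratify=df['label'],  # preserve the class ratio in train and test sets
    random_state=42        # fixed seed so the split is reproducible
)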
Tokenize the Training and Testing Datasets
We tokenize the text data using the DistilBERT tokenizer, ensuring proper formatting for model input.
- Tokenization: The tokenizer object is responsible for tokenizing the input text into tokens that the model understands. Tokenization involves splitting the text into individual words or subwords, and converting each token into its corresponding token ID. This process is essential because it allows the model to process textual data, which is represented as numerical input.
- Padding: In NLP tasks, inputs often have varying lengths. However, neural networks require fixed-size inputs. Padding is the process of adding special tokens (usually zeros) to the shorter sequences so that all sequences have the same length. This ensures that the input data can be efficiently batched together during training. Padding is crucial for maintaining consistency in the input shape and enabling efficient processing by the model.
- Truncation: Some sequences may exceed the maximum length supported by the model. Truncation involves removing tokens from the end of such sequences to ensure that they fit within the model’s maximum input length. Truncation is necessary to prevent memory errors and ensure that all inputs are compatible with the model’s input size constraints.
train_encodings = tokenizer(train_texts.tolist(), truncation=True, padding=True)
test_encodings = tokenizer(test_texts.tolist(), truncation=True, padding=True)
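To see what the tokenizer actually produces, you can encode a single short message and inspect the result. For DistilBERT the encoding is a dictionary with input_ids and attention_mask entries (the example text here is ours):
sample = tokenizer("free prize, click now!", truncation=True, padding=True)
print(sample.keys())        # dict_keys(['input_ids', 'attention_mask'])
print(sample['input_ids'])  # token IDs, wrapped in [CLS] ... [SEP] special tokens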
Create a Dataset Class
We define a custom Dataset class to prepare the data for model training. The code defines a custom PyTorch dataset class named EmailDataset. This class is designed to encapsulate the data required for training or evaluating a model on our spam classification task.
It also encapsulates both the input encodings and labels, provides methods for efficient data retrieval, and enables the determination of the dataset size. This abstraction helps streamline the training and evaluation process, making it easier to work with email data within the PyTorch ecosystem.
Also by creating a class, we can instantiate multiple instances of this class to represent different datasets, promoting modularity and making the code easier to maintain and extend.
class EmailDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)
Convert Data into Torch Dataset
We will use our class created above to convert our tokenized data into Torch Dataset objects for efficient handling during training.
train_dataset = EmailDataset(train_encodings, train_labels.tolist())
test_dataset = EmailDataset(test_encodings, test_labels.tolist())
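Although the Trainer handles batching internally, the DataLoader we imported earlier is a convenient way to sanity-check the dataset class. A minimal sketch pulling a single batch:
# Wrap the dataset and fetch one batch to confirm shapes and labels look right
loader = DataLoader(train_dataset, batch_size=4, shuffle=True)
batch = next(iter(loader))
print(batch['input_ids'].shape)  # (4, padded_sequence_length)
print(batch['labels'])           # tensor of 0/1 labels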
Define the Training Arguments
The following code defines the training arguments (training_args) for fine-tuning a transformer-based model using the Hugging Face Transformers library. Each argument plays an important role in configuring the training process.
Let’s discuss the importance of each argument:
output_dir
It specifies the directory where the trained model and other outputs (like evaluation results and checkpoints) will be saved.
num_train_epochs
It defines the number of training epochs. An epoch is one complete pass through the entire training dataset. This parameter controls the number of times the model will see the entire dataset during training.
per_device_train_batch_size
It specifies the batch size for each GPU during training. It determines the number of samples processed by the model in one forward and backward pass. A larger batch size can lead to faster training but requires more memory.
per_device_eval_batch_size
It is similar to per_device_train_batch_size, but specifies the batch size for evaluation (validation) data.
warmup_steps
It specifies the number of steps during which the learning rate will increase linearly from 0 to the specified learning rate. Warm-up steps are often used to prevent the model from diverging during the early stages of training.
weight_decay
It controls L2 regularization, which penalizes large weights in the model to prevent overfitting. It helps prevent the model from focusing too much on specific features of the training data.
logging_dir
It specifies the directory where training logs (like loss values and evaluation metrics) will be saved.
logging_steps
It determines how frequently training logs will be written during training. It specifies the number of training steps between each logging event.
evaluation_strategy
It specifies when to perform evaluation during training. In this case, it’s set to epoch, meaning evaluation will be performed at the end of each epoch.
save_strategy
It specifies when to save model checkpoints during training. In this case, it’s set to epoch, meaning a checkpoint will be saved at the end of each epoch.
fp16
It enables mixed precision training if CUDA (GPU) is available. Mixed precision training uses both 16-bit and 32-bit floating-point precision to speed up training while reducing memory usage.
load_best_model_at_end
It specifies whether to load the best model (based on evaluation performance) at the end of training. If set to True, the final model saved will be the one with the best performance on the validation set.
training_args = TrainingArguments(
output_dir='./results',
num_train_epochs=3,
per_device_train_batch_size=32,
per_device_eval_batch_size=64,
warmup_steps=500,
weight_decay=0.01,
logging_dir='./logs',
logging_steps=10,
evaluation_strategy="epoch",
save_strategy="epoch",
fp16=True if device.type == 'cuda' else False,
load_best_model_at_end=True
)
Basically, each of the arguments in the code helps configure the training process and can significantly impact the performance and efficiency of the model training. Properly tuning these parameters can lead to faster convergence, better generalization, and improved overall performance of the trained model.
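To put warmup_steps=500 in context, it helps to estimate how many optimization steps training actually takes. A rough sketch, reusing the values above (single-GPU assumption, so the effective batch size equals per_device_train_batch_size):
import math

steps_per_epoch = math.ceil(len(train_dataset) / 32)  # per_device_train_batch_size=32
total_steps = steps_per_epoch * 3                     # num_train_epochs=3
print(f"Warmup covers {500 / total_steps:.1%} of {total_steps} total steps")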
Create the Trainer and Train the Model
We create a Trainer instance and train the DistilBERT model using the specified training arguments and datasets. If you notice in the following code, we are using callbacks=[EarlyStoppingCallback(early_stopping_patience=3)].
Why are we using it?
The code callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
is using an early stopping callback during model training.
- Early Stopping: It is a technique used during the training of machine learning models to prevent overfitting and improve generalization. Instead of training the model for a fixed number of epochs, early stopping monitors the model’s performance on a validation dataset and stops training when the performance stops improving or starts deteriorating.
- EarlyStoppingCallback: It is a callback provided by the Transformers library, specifically for the Trainer class. A callback is a function that is executed at specific points during training. In this case, the EarlyStoppingCallback is called at the end of each evaluation step to check if the model’s performance on the validation dataset has improved.
- Early Stopping Patience: The early_stopping_patience parameter specifies how many evaluation steps to wait before stopping training if the model’s performance does not improve. In this example, the patience is set to 3, meaning that if the model’s performance does not improve for 3 consecutive evaluation steps, training will stop.
- Importance: The importance of early stopping is to prevent the model from overfitting to the training data. Overfitting occurs when the model learns to fit the training data too closely, capturing noise or random fluctuations instead of general patterns. Early stopping helps prevent overfitting by stopping training when the model’s performance on a separate validation dataset starts to degrade, indicating that further training may not improve generalization to unseen data.
In a nutshell, using the EarlyStoppingCallback with a specified patience parameter allows for more efficient training of machine learning models by automatically stopping training when it detects that further training is unlikely to improve performance on a validation dataset, thus helping to prevent overfitting.
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=test_dataset,
callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
)

trainer.train()
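One optional extension, not in the original code: the Trainer also accepts a compute_metrics function, which lets evaluation (and therefore early stopping and load_best_model_at_end, via metric_for_best_model="accuracy" in TrainingArguments) track accuracy instead of the default evaluation loss. A minimal sketch:
import numpy as np

def compute_metrics(eval_pred):
    # eval_pred unpacks into the raw logits and the true labels
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": (predictions == labels).mean()}

# Passed to the Trainer as: Trainer(..., compute_metrics=compute_metrics)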
Finally, our model is ready, and we can make some predictions.
Make Predictions on a Spam Email
We generate a sample spam email, tokenize it, and make predictions using the trained model. Let us focus on the # Make predictions part of the following code.
- We are using the logits to get the probabilities. The logits are computed by passing the input email through the model. Logits are unnormalized scores representing the model’s output before applying the softmax function. They contain information about the model’s confidence in each class. That is why we need these logits from the model to calculate a proper probability distribution using the softmax function below.
- The probabilities_sp variable computes the probabilities of the input email belonging to each class (spam or not spam) by applying the softmax function to the logits. Softmax converts logits into probabilities, ensuring they sum to 1 and represent the model’s confidence distribution over classes.
- The prediction is made using the prediction_sp variable. It determines the final prediction by selecting the class with the highest probability. This is done using the argmax function, which returns the index of the maximum value in the probabilities_sp tensor.
- Finally, the code prints the prediction, indicating whether the input email is classified as spam or not spam based on the model’s prediction.
# Generate a spam email
spam_email = """
Subject: Exclusive Offer for You!

Dear Subscriber,
You've been selected to receive an exclusive offer! Claim your prize now by clicking the link below:
Claim Your Prize: http://spammylink.com
Hurry, this offer is only available for a limited time!
Best Regards,
Spammy Marketing Team
"""
# Tokenize the spam email
spam_email_encoding = tokenizer(spam_email, truncation=True, padding=True, return_tensors='pt')
spam_email_encoding = {k: v.to(device) for k, v in spam_email_encoding.items()}
# Make predictions
model.eval()  # disable dropout so inference is deterministic after training
with torch.no_grad():  # no gradients are needed for inference
    logits = model(**spam_email_encoding).logits
probabilities_sp = torch.softmax(logits, dim=-1)
prediction_sp = torch.argmax(probabilities_sp)
print(f"The email is {'spam' if prediction_sp.item() == 1 else 'not spam'}")
Make Predictions on a Normal Email
Let us generate a sample normal email, tokenize it, and make predictions using the trained model.
# Generate a normal email
normal_email = """
Dear Team,

I trust this message finds you well. I wanted to bring to your attention a discrepancy in order number 12345.
Unfortunately, one item was not included in the delivery. Given that it was offered as a complimentary item,
adjusting the price may not resolve the issue.
Could you please advise on the best course of action to rectify this situation?
Looking forward to your prompt response.
Best Regards,
ABC
"""
# Tokenize the normal email
normal_email_encoding = tokenizer(normal_email, truncation=True, padding=True, return_tensors='pt')
normal_email_encoding = {k: v.to(device) for k, v in normal_email_encoding.items()} # Move the encoding to the device
# Make predictions
model.eval()  # disable dropout so inference is deterministic
with torch.no_grad():
    logits = model(**normal_email_encoding).logits
probabilities_norm = torch.softmax(logits, dim=-1)
prediction_norm = torch.argmax(probabilities_norm)
print(f"The email is {'spam' if prediction_norm.item() == 1 else 'not spam'}")
In conclusion, spam detection models are indispensable in modern email communication systems. These models, powered by advanced natural language processing techniques and pre-trained language models like DistilBERT, offer an effective solution to the persistent problem of spam emails. By filtering out unwanted and potentially harmful messages, they not only streamline email management but also enhance user experience and cybersecurity.
Spam detection models contribute to efficient resource management by reducing the time users spend sifting through irrelevant emails. With fewer distractions in their inbox, users can focus more effectively on important communications, thereby boosting productivity. Moreover, these models play a crucial role in enhancing email security by identifying and mitigating potential cybersecurity threats posed by spam emails. By filtering out malicious links, attachments, and phishing attempts, spam detection models help protect users’ sensitive information and safeguard against cyberattacks.
For businesses, spam filters are vital for maintaining professional communication channels and safeguarding brand reputation. By ensuring that customers receive only relevant and legitimate messages, businesses can uphold their credibility and trustworthiness. Additionally, spam detection models aid in compliance with regulatory requirements related to data protection and privacy, thereby reducing legal risks.
In summary, spam detection models are essential tools for modern email systems. By leveraging machine learning and pre-trained language models, they enhance user experience and productivity while contributing to cybersecurity and brand reputation management. As email continues to be a primary mode of communication in both personal and professional settings, the importance of spam detection models in maintaining a secure and efficient email ecosystem cannot be overstated.