Powerful ideas and great tools are at the forefront of machine learning’s success in recent years.
Over the last 60 years, researchers have been working on the idea of building machine-learning models based on our knowledge of the human brain, which gave birth to the field of deep learning. Thanks to programming and computing advances, we have access to the same open-source deep learning frameworks that top companies, such as Tesla and Uber, use.
PyTorch is probably the most widely used open-source deep learning framework. According to Papers with Code, as of December 2023, PyTorch is used in roughly 60% of machine learning paper implementations. Because of its widespread use, if you are planning to learn deep learning, you should be familiar with PyTorch.
PyTorch allows you to easily implement complex deep learning models in a few lines. However, to get the most out of its capabilities, you need to understand what it’s doing.
Here are the key fundamental building blocks and concepts to get started with PyTorch.
Tensors are the fundamental building block of deep learning and, therefore, of PyTorch.
You can think of a Tensor as a way to represent any type of data (images, sound, video…) in a numerical way
PyTorch offers a few simple ways of creating basic tensors.
import torch

tensor = torch.tensor([3,2,1]) # Creates a tensor
tensor
If you are familiar with Numpy arrays, you might notice that the tensor is very similar to a Numpy array.
import numpy as np

array = np.array([3,2,1]) # Creates a simple array
array
They are indeed quite similar; however, there are a few key differences that make tensors particularly useful for deep learning.
One of the main differences is that tensors can be used on a GPU. This allows the use of parallel processing to speed up operations. NumPy arrays are optimized for CPU usage and cannot be processed by a GPU.
Another key difference is that tensors must have uniform dimension sizes. This means that every element along a dimension must have the same size (every row must have the same number of columns). This ensures consistency and simplifies tensor operations.
The last difference is that tensors can only hold numeric data types.
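A quick sketch of the uniformity constraint: rows of equal length work, while a ragged nested list raises an error:

```python
import torch

# Uniform dimensions work: every row has the same number of columns
ok = torch.tensor([[1, 2, 3], [4, 5, 6]])
print(ok.shape)  # torch.Size([2, 3])

# Ragged rows are rejected -> ValueError
try:
    torch.tensor([[1, 2, 3], [4, 5]])
except ValueError as e:
    print(f"Ragged tensor rejected: {e}")
```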
When you work with tensors in PyTorch, there are 3 key attributes you need to be aware of:
- Shape of the tensor
- Tensor datatype
- What device is the tensor stored on
# Create a tensor
tensor = torch.rand(3, 4)

# Find out details about it
print(tensor)
print(f"Shape of tensor: {tensor.shape}")
print(f"Datatype of tensor: {tensor.dtype}")
print(f"Device tensor is stored on: {tensor.device}") # will default to CPU
The above tensor has 2 dimensions; a 2-dimensional tensor is commonly called a matrix.
A trick to know the number of dimensions is to count the number of brackets at the beginning of the tensor.
tensor_a = torch.tensor(2) # No brackets -> 0 dimensions; this is also known as a scalar
tensor_b = torch.tensor([3,4,5]) # 1 bracket -> 1 dimension; this is also known as a vector
tensor_c = torch.tensor([[[1, 2, 3], # 3 brackets -> 3 dimensions
                          [3, 6, 9],
                          [2, 4, 5]]])

print(f"Shape of tensor_a: {tensor_a.shape}, number of dimensions: {tensor_a.ndim}")
print(f"Shape of tensor_b: {tensor_b.shape}, number of dimensions: {tensor_b.ndim}")
print(f"Shape of tensor_c: {tensor_c.shape}, number of dimensions: {tensor_c.ndim}")
Why is the tensor shape important?
Deep learning involves performing operations on tensors. You can perform many operations on tensors; however, the most important is matrix multiplication, and matrices have strict rules about which shapes can be combined.
The two rules of matrix multiplication:
1. The inner dimensions must match: the number of columns of the first matrix must equal the number of rows of the second matrix.
- (4,3) @ (2,4) -> doesn’t work
- (3,4) @ (4,2) -> works
2. The resulting matrix has the shape of the outer dimensions: it has the same number of rows as the first matrix and the same number of columns as the second matrix.
- (3,4) @ (4,2) -> (3,2)
- (2,3) @ (3,2) -> (2,2)
These rules are probably the cause of the most common error you will encounter in deep learning: the shape error.
tensor_a = torch.rand(4,3)
tensor_b = torch.rand(2,3)

# Perform matrix multiplication
torch.matmul(tensor_a, tensor_b) # This won't work
As expected, we get a shape error because the inner dimensions do not match. However, we can see that the outer dimension of tensor_b matches the inner dimension of tensor_a. Therefore, we can get around this error by transposing tensor_b.
# You can use torch.matmul() or torch.mm()
tensor = torch.mm(tensor_a, tensor_b.T)

print(tensor)
print(f"The shape of tensor_a is: {tensor_a.shape}")
print(f"The shape of the transposed tensor_b is: {tensor_b.T.shape}")
print(f"The shape of the resulting tensor is: {tensor.shape}")
By default, tensors in PyTorch are of type torch.float32.
However, tensors can take a vast array of datatypes: torch.float64, torch.float16, torch.int8, and more.
The reason for all these different datatypes has to do with computing precision. You need to be aware that higher-precision values capture more detail; however, they also take up more memory.
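For instance, you can check the per-element memory footprint of a tensor with element_size(); a quick comparison of float32 versus float16:

```python
import torch

t32 = torch.rand(3, 4)            # default dtype: torch.float32
t16 = t32.type(torch.float16)     # half precision: less memory, less detail

print(t32.element_size())  # 4 bytes per element
print(t16.element_size())  # 2 bytes per element
```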
Datatypes are important because another common error you will face is a dtype mismatch. You can easily change a tensor’s dtype with torch.Tensor.type(dtype=None).
tensor_a = torch.rand(3,4)
print(f"tensor_a datatype is: {tensor_a.dtype}")
#Change tensor_a datatype
tensor_b = tensor_a.type(dtype=torch.float64)
print(f"tensor_b datatype is: {tensor_b.dtype}")
As we mentioned before, one of the key features of a tensor is that we can run calculations on a GPU.
PyTorch runs its acceleration on CUDA-compatible Nvidia GPUs. (“CUDA” stands for Compute Unified Device Architecture, which is Nvidia’s platform for parallel computing.)
To run code on a GPU, you first have to get access to one. You can get a free GPU via Google Colab or in Kaggle notebooks.
By default, tensors are created on the CPU, but fortunately, PyTorch makes it easy to move tensors to the GPU.
# Check if GPU is available
torch.cuda.is_available()
Another very common error you will encounter is device mismatch. This occurs when you try to perform operations on tensors that are on different devices (one on the CPU and another on the GPU).
To avoid those errors, we need to write device-agnostic code. To do so, we can define a device variable that uses the GPU if available and otherwise falls back to the CPU.
# Device-agnostic code
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
The next step is to move our tensors to the GPU. The great advantage of creating device-agnostic code is that the below code will run even if the GPU is not available.
# Create tensor (default on CPU)
tensor = torch.tensor([1, 2, 3]) # Tensor not on GPU
print(tensor, tensor.device)
# Move tensor to GPU (if available)
tensor_on_gpu = tensor.to(device)
print(f"Tensor is now on device: {tensor_on_gpu.device}")
Now that we are familiar with PyTorch’s key building blocks, it is time to look into the fundamental PyTorch workflow.
The most fundamental PyTorch workflow consists of:
1. Get the data ready: turn the data into tensors
2. Build a model
- Pick a loss function
- Build a training loop
3. Fit the model and make predictions
4. Evaluate the model
5. Improve the model through experimentation
6. Save and reload your trained model
We will focus on the second step. Our goal is to understand what PyTorch is doing so that we can create and tune complex neural networks in the future. To show the inner workings of PyTorch, we will use a simple example with known parameters.
# Generate some data
# Known parameters; this is what we will try to predict
a, b, c = 0.2, 0.3, -0.1
X = torch.arange(-1, 1, 0.001).unsqueeze(dim=1) # Adds a dimension (an extra set of brackets) -> necessary to avoid shape errors
y = a * X ** 2 + b * X + c
X[:10], y[:10]
Even though this is a simple example, we cannot forget to split our data.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
len(X_train), len(y_train), len(X_test), len(y_test)
Let’s also create a helper function to visualize our data.
import matplotlib.pyplot as plt

def plot_predictions(train_data=X_train,
                     train_labels=y_train,
                     test_data=X_test,
                     test_labels=y_test,
                     predictions=None):
    """
    Plots training data, test data and compares predictions
    """
    plt.figure(figsize=(10,7))

    # Plot the training data
    plt.scatter(train_data, train_labels, c='forestgreen', s=4, label="Training data")

    # Plot the test data
    plt.scatter(test_data, test_labels, c='bisque', s=4, label="Test data")

    # Plot the predictions if they exist
    if predictions is not None:
        plt.scatter(test_data, predictions, c='r', s=4, label="Predictions")

    # Show the legend
    plt.legend()

# Plot the data
plot_predictions()
PyTorch comes with many modules that make it very easy to create and train neural networks.
The first module we will use is nn.Module. This module contains the base class for all neural networks and will help us define our model and build the neural network layers.
import torch.nn as nn

class RegressionModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.a = nn.Parameter(torch.randn(1,
                                          requires_grad=True,
                                          dtype=torch.float))
        self.b = nn.Parameter(torch.randn(1,
                                          requires_grad=True,
                                          dtype=torch.float))
        self.c = nn.Parameter(torch.randn(1,
                                          requires_grad=True,
                                          dtype=torch.float))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.a * x ** 2 + self.b * x + self.c
Let’s unpack what is going on here.
We create a RegressionModel class that inherits from the nn.Module class. This allows us to use all the methods inside nn.Module.
We define the __init__ method to set the model’s components. In our case, we manually defined the components of the model; however, you will rarely have to do this. You will usually define the layers of your network, and those will automatically define the model parameters.
However, I wanted to draw your attention to the nn.Parameter class. This is a special class in PyTorch that represents learnable parameters. It is a subclass of torch.Tensor, which means it has the same properties as a tensor. When you create an instance of nn.Parameter as an attribute of an nn.Module class, the parameter gets registered in the module, which means that PyTorch keeps track of it. This might not seem like much, but it allows an optimizer algorithm to identify those parameters as the ones to be optimized during training.
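To see the registration in action, here is a minimal sketch with a toy module (TinyModel is just an illustrative name): only the nn.Parameter attribute shows up in named_parameters(), while a plain tensor attribute does not:

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.randn(1))  # registered: visible to optimizers
        self.plain = torch.randn(1)            # plain tensor: NOT registered

model = TinyModel()
names = [name for name, _ in model.named_parameters()]
print(names)  # ['w'] -- only the nn.Parameter is tracked
```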
Note that we use the method torch.randn(); this reflects one of the key concepts in deep learning: the initial parameters are random, and the model adjusts them as it learns from the data.
Every time we create a model by subclassing nn.Module, we are required to define the forward() method.
This method defines the computations your model performs on the input data (in tensor format) through the network layers to produce the output.
Once again, in our example, we define the computation directly because we know the actual formula. However, when you train your neural network, you don’t know the formula, so you will use PyTorch functionality to build it.
One key thing to remember is that you don’t call the forward() method directly; instead, you call the instance of your model class with the input data, and this indirectly triggers the forward() method.
# Create a random seed
torch.manual_seed(42)

# Instantiate our RegressionModel class
model = RegressionModel()

# Print the initial parameter values
print(model.state_dict())
# This calls the forward method and applies the defined transformation
predictions = model(X_train)
predictions
So what happened here?
When we called model(X_train), this indirectly called the forward() method on the X_train data.
This step used the initial random values for our defined parameters (a, b, c) and applied the computation defined in our forward() method, in our case:
a * x ** 2 + b * x + c
Let’s see an example so this becomes clearer.
# Extract the first value in our X_train tensor
x_0 = X_train[0][0]

# Extract the model parameter values into a list
values = [value.item() for value in model.state_dict().values()]
# Unpacking the list into variables
a_val, b_val, c_val = values
#Applies the initial weight to our computation
result = a_val * x_0 ** 2 + b_val * x_0 + c_val
print(f"The result of our computation on the first value of X_train is: {result}")
print(f"\nLet's check if the result value is the same as the first value in our predictions")
# torch.eq() checks if two tensors are equal
print(f"result == prediction: {torch.eq(result, predictions[0][0])}")
Great, now we have an understanding of our initial setup:
- Instantiate the model with random parameters
- Pass the data and apply the defined computation to create some predictions
However, because the parameters are random, we can expect the results to be pretty bad.
# Let's make some predictions on our test data and visualize the performance of our initial parameters
with torch.inference_mode(): # PyTorch context manager for better performance when testing your model
    y_preds = model(X_test)

plot_predictions(predictions=y_preds)
Clearly, our initial parameters do not do a good job of predicting, mostly because those are just random numbers.
The key concept of a Neural network is that we start with random parameters, and the model updates the parameters to better represent the data.
For our model to update its parameters, we need:
- Loss function: Measure how wrong the model predictions are compared to the truth labels. The lower, the better
- Optimizer: Tells your model how to update its internal parameters to lower the loss function
PyTorch offers two key modules:
- Autograd: automatically computes gradients of tensors
- torch.optim: helps with the implementation of various optimization algorithms
These two modules work together to update the model parameters to better fit the data.
Let’s see how they do that step-by-step.
# Define a loss function - in our case MAE (torch.nn.L1Loss())
loss_fn = nn.L1Loss()

# Define the optimizer - there are many optimizers, but common ones are
# optim.SGD() and optim.Adam()
optimizer = torch.optim.SGD(params=model.parameters(),
                            lr=0.01)
Let’s start with Autograd.
As we have mentioned before, when we train a neural network model, we start with random parameters, and the model adjusts those parameters to better fit the data. But the question is, how do we adjust those parameters?
The first step is to define a loss function that tells us how far the model predictions are from the truth.
Intuitively, to improve the model, we should adjust the parameters to minimize the loss function. Mathematically, this involves finding the point where the derivative (gradient) of the loss function with respect to the model’s parameters is zero.
This is where Autograd comes in. You don’t have to calculate the gradients manually. Autograd automates the calculation of derivatives by keeping track of all operations performed on tensors and efficiently computing the gradient of the loss with respect to each parameter.
To calculate the gradients for the current tensor, we call tensor.backward().
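Before applying this to our model, here is a minimal standalone sketch: the derivative of x ** 2 is 2 * x, so at x = 3 Autograd should return a gradient of 6:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)  # track operations on x
y = x ** 2                                 # Autograd records this computation
y.backward()                               # compute dy/dx

print(x.grad)  # tensor(6.) -> 2 * x at x = 3
```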
# Initial gradients for the model parameters
print(f"These are the initial random values for the model parameters: {model.state_dict()}")
print(f"The initial gradients are: {model.a.grad,model.b.grad,model.c.grad}")

# Let's compute the loss on the training predictions
loss = loss_fn(predictions,y_train)
print(f"\nLet's calculate the gradients thanks to Autograd")
# call loss.backward()
loss.backward()
# check the gradients again
print(f"The gradients for each parameter are: {model.a.grad,model.b.grad,model.c.grad}")
print(f"But the model parameters remain unchanged: {model.state_dict()}")
We can see that Autograd calculated the gradients for each parameter, but the model remains unchanged. The optimizer is responsible for updating model parameters based on the computed gradients.
To update the parameters, we just need to step the optimizer.
# Optimize the model based on the calculated gradients
optimizer.step()

print(f"The new parameters after optimization are: {model.state_dict()}")
print(f"The current gradients are: {model.a.grad,model.b.grad,model.c.grad}")
As you can see, the optimizer has updated the parameters.
Let’s check the model predictions with the new parameters.
# Let's make some predictions on our test data
with torch.inference_mode():
    y_preds = model(X_test)

plot_predictions(predictions=y_preds)
There seems to be some improvement. However, each time we step the optimizer, we only take one step in the direction indicated by the gradients. The size of the step depends on the optimizer’s parameters (more on this later).
Let’s step the optimizer again.
# Use the forward method with the new parameters
print(f"Model parameters: {model.state_dict()}")

predictions = model(X_train)
# let's compute the loss on the training predictions
loss = loss_fn(predictions,y_train)
# call loss.backward() to calculate the gradients
loss.backward()
print(f"The current gradients are: {model.a.grad,model.b.grad,model.c.grad}")
One important thing about the process:
After calling optimizer.step(), you need to call optimizer.zero_grad(); otherwise, every time you run loss.backward(), the gradients on the learnable parameters will accumulate.
As you can see, the above gradients are roughly double the initial gradients we calculated earlier.
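A minimal standalone sketch of this accumulation behavior on a single tensor:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

(x ** 2).backward()
print(x.grad)        # tensor(6.)

(x ** 2).backward()  # without zeroing, gradients accumulate
print(x.grad)        # tensor(12.) -> doubled

x.grad.zero_()       # clearing resets the gradient (this is what optimizer.zero_grad() does for each parameter)
print(x.grad)        # tensor(0.)
```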
# Let's clear the gradients
optimizer.zero_grad()
print(f"After clearing, the gradients are: {model.a.grad,model.b.grad,model.c.grad}")

print(f"\nLet's try to optimize again")
predictions = model(X_train)
# let's compute the loss on the training predictions
loss = loss_fn(predictions,y_train)
# call loss.backward() to calculate the gradients
loss.backward()
print(f"The current gradients are: {model.a.grad,model.b.grad,model.c.grad}")
optimizer.step()
print(f"\nThe new parameters after second optimization are:{model.state_dict()}")
print(f"Let's clear the gradients for next optimization")
optimizer.zero_grad()
print(f"After clearing, the gradients are: {model.a.grad,model.b.grad,model.c.grad}")
# Let's make some predictions on our test data
with torch.inference_mode():
    y_preds = model(X_test)

plot_predictions(predictions=y_preds)
As you can see, to get better results, we run through the above steps:
- Make predictions on the training set
- Calculate the loss using the defined loss function
- Calculate the gradients thanks to Autograd, using the loss.backward() method
- Update the parameters via the optimizer
This is where we can use the power of Python loops to automate the training. This is called the training loop.
But before we write the complete training loop, I want to come back to the size of the steps in the optimizer. When we define our optimizer, we pass it the model parameters and the learning rate (lr) hyperparameter.
The learning rate controls how big a step the optimizer takes when updating the parameters. A high lr means the optimizer will try larger updates; these can sometimes be too large, and the optimizer will fail to converge. A low lr means the optimizer will try smaller updates; these can sometimes be too small, and the optimizer will take too long to find the ideal values.
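Under the hood, plain SGD (without momentum) applies the update param = param - lr * grad, so the gradient is scaled by the learning rate. A small sketch verifying this against torch.optim.SGD on a single hypothetical parameter:

```python
import torch

param = torch.tensor([1.0], requires_grad=True)
optimizer = torch.optim.SGD([param], lr=0.1)

loss = (param ** 2).sum()  # gradient is 2 * param = 2.0
loss.backward()

# Manual SGD update computed before stepping the optimizer
expected = param.detach() - 0.1 * param.grad

optimizer.step()
print(param.data, expected)  # both tensor([0.8000])
```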
# Let's see the impact of the learning rate hyperparameter
# Set a random seed
torch.manual_seed(42)
model_0 = RegressionModel()
model_1 = RegressionModel()

optimizer_0 = torch.optim.SGD(params=model_0.parameters(),
                              lr=0.01)
optimizer_1 = torch.optim.SGD(params=model_1.parameters(),
                              lr=0.1)

# Initial model parameters
print(f"These are the initial random values for model_0 parameters: {model_0.state_dict()}")
print(f"These are the initial random values for model_1 parameters: {model_1.state_dict()}")
#Make predictions on training data with both models
predictions_0 = model_0(X_train)
predictions_1 = model_1(X_train)
# let's compute the loss on the training predictions
loss_0 = loss_fn(predictions_0,y_train)
loss_1 = loss_fn(predictions_1,y_train)
print(f"\nLet's calculate the gradients thanks to Autograd")
# call loss.backward()
loss_0.backward()
loss_1.backward()
# check the gradients for each model
print(f"\nThe gradients for each parameter in model_0 are: {model_0.a.grad,model_0.b.grad,model_0.c.grad}")
print(f"The gradients for each parameter in model_1 are: {model_1.a.grad,model_1.b.grad,model_1.c.grad}")
#Optimize the model parameters
optimizer_0.step()
optimizer_1.step()
#let's check the parameter values after the optimization with different `lr` values
print(f"\nThe parameters for model_0 with lr = 0.01 are : {model_0.state_dict()} ")
print(f"The parameters for model_1 with lr = 0.1 are : {model_1.state_dict()} ")
In the above code, the only difference between the two models is the learning rate (lr) parameter in the optimizer.
As you can see, the model with the higher learning rate has made bigger jumps in the parameter values. Some common starting values for the learning rate are 0.01, 0.001, and 0.0001; however, these can also be adjusted over time (this is called learning rate scheduling).
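As a sketch of learning rate scheduling, torch.optim.lr_scheduler.StepLR multiplies the learning rate by gamma every step_size epochs (the values below are just illustrative):

```python
import torch

params = [torch.randn(1, requires_grad=True)]
optimizer = torch.optim.SGD(params, lr=0.1)

# Every 10 epochs, multiply the learning rate by 0.5
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(20):
    optimizer.step()   # normally preceded by loss.backward()
    scheduler.step()   # update the learning rate

print(optimizer.param_groups[0]["lr"])  # 0.1 -> 0.05 -> 0.025
```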
Now let’s create the training loop.
# Instantiate the model with random parameters
model = RegressionModel()

# There are many loss functions available
# Choose the one that best suits your problem
loss_fn = nn.L1Loss()

# Define the optimizer - there are many optimizers, but common ones are
# optim.SGD() and optim.Adam() - pick the one that best suits your problem
optimizer = torch.optim.SGD(params=model.parameters(),
                            lr=0.01)
#Epochs represent how many times we are going to run the training loop
#This is another hyperparameter
epochs = 1000
for epoch in range(epochs):
    model.train() # Set the model to training mode

    # 1. Pass the data using the forward() method
    y_pred_train = model(X_train)

    # 2. Calculate the loss
    loss = loss_fn(y_pred_train, y_train)

    # 3. Zero the gradients on the optimizer
    optimizer.zero_grad()

    # 4. Calculate the gradients thanks to Autograd
    loss.backward()

    # 5. Update the model parameters with the optimizer
    optimizer.step()

    ## Testing loop to monitor overfitting
    model.eval() # Put the model in evaluation mode for testing
    with torch.inference_mode():
        # 1. Forward pass
        test_preds = model(X_test)

        # 2. Calculate the loss
        test_loss = loss_fn(test_preds, y_test)

    if epoch % 100 == 0:
        print(f"Epoch: {epoch} | Train loss: {loss} | Test loss: {test_loss}")
# Make predictions on test data
with torch.inference_mode():
    y_preds = model(X_test)
y_preds

plot_predictions(predictions=y_preds)
print(f"The known parameters are: a ={a}, b={b}, c={c}")
print(f"The model predicted parameters are: {model.state_dict()}")
The model managed to go from random parameters to very close to our initial parameters.
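As a closing note, step 6 of the workflow (save and reload your trained model) is typically done through the model’s state_dict; a minimal sketch, using a stand-in nn.Linear model and an example filename:

```python
import torch
import torch.nn as nn

# A toy model standing in for our trained RegressionModel
model = nn.Linear(1, 1)

# Save only the learned parameters (recommended over saving the whole model object)
torch.save(model.state_dict(), "model.pth")

# To reload: recreate the model architecture, then load the saved parameters
loaded_model = nn.Linear(1, 1)
loaded_model.load_state_dict(torch.load("model.pth"))
print(torch.equal(model.weight, loaded_model.weight))  # True
```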
This covers the fundamentals of PyTorch’s building blocks and workflow. We have seen how each step of the training loop works and why each is necessary for proper PyTorch functionality. This gives you a solid understanding to create and customize more complex models.