We will follow a guided example to make everything easier to understand. Suppose we feed a batch of 2 images of size 320×320 into the model. Our dataset has 20 classes, and each prediction layer uses 3 anchors. The model uses the three default prediction layers of the YOLOv5 architecture, with strides [P3: 8, P4: 16, P5: 32].
The input variables for the loss function are p and targets:
- p is a list of torch.Tensor objects, one per prediction layer (P3 for small, P4 for medium and P5 for large objects). Each tensor has shape (batch_size, num_anchors, num_cells_y, num_cells_x, 5+num_classes).
- targets is a torch.Tensor object of shape (num_targets, 6). Each row contains (img_id, class, x, y, w, h) for one ground truth. All coordinate and size values (x, y, w, h) are normalized to the range [0, 1]. So, if the bounding box of an object has values (280, 130, 30, 50), its normalized values would be (280/img_size_x, 130/img_size_y, 30/img_size_x, 50/img_size_y).
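As a quick check of the normalization described above, here is a minimal sketch. The helper name normalize_box is hypothetical (it is not part of YOLOv5), and the image size of 320×320 is taken from our guided example.

```python
def normalize_box(box, img_w, img_h):
    """Scale a pixel-space (x, y, w, h) box to the [0, 1] range."""
    x, y, w, h = box
    return (x / img_w, y / img_h, w / img_w, h / img_h)


# The box from the text, on a 320x320 image
print(normalize_box((280, 130, 30, 50), 320, 320))
# (0.875, 0.40625, 0.09375, 0.15625)
```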
In this case, the prediction heads would output 3 tensors of shape:
- P3: (2, 3, 320//8, 320//8, 5+20) = (2, 3, 40, 40, 25)
- P4: (2, 3, 320//16, 320//16, 5+20) = (2, 3, 20, 20, 25)
- P5: (2, 3, 320//32, 320//32, 5+20) = (2, 3, 10, 10, 25)
Let’s suppose that Image 1 has 3 objects and Image 2 has 2 objects. In total, we have 5 target objects (ground truths). Therefore, targets would have shape (5, 6).
Below is the code that defines the initial variables for our analysis:
import torch
device = 'cpu'
img_size = 320
num_classes = 20; num_layers = 3
anchor_t = 4.0
# Loss weights
balance = [4.0, 1.0, 0.4]
lambda_box = 0.05; lambda_obj = 0.7; lambda_cls = 0.3
anchors = torch.tensor([
# P3 anchors
[[ 1.25000, 1.62500],[ 2.00000, 3.75000],[ 4.12500, 2.87500]],
# P4 anchors
[[ 1.87500, 3.81250],[ 3.87500, 2.81250],[ 3.68750, 7.43750]],
# P5 anchors
[[ 3.62500, 2.81250],[ 4.87500, 6.18750],[11.65625, 10.18750]],
])
assert anchors.shape[0] == num_layers
num_anchors = anchors.shape[1]
targets = torch.tensor([
[ 0.00000, 14.00000, 0.49535, 0.50528, 0.15267, 0.56956],
[ 0.00000, 0.00000, 0.54872, 0.92491, 0.05361, 0.03183],
[ 0.00000, 0.00000, 0.36780, 0.98716, 0.06031, 0.02567],
[ 1.00000, 6.00000, 0.97072, 0.04398, 0.05856, 0.08796],
[ 1.00000, 16.00000, 0.70696, 0.10348, 0.32971, 0.16793],
])
batch_size = len(targets[:,:1].unique())
strides = [8, 16, 32]
p = [
    torch.randn((batch_size, num_anchors, img_size//strides[i], img_size//strides[i], 5 + num_classes))
    for i in range(num_layers)
]
print("Targets Shape:", targets.shape)
print("Anchors Shape:", anchors.shape)
for i, pi in enumerate(p):
    print(f"Layer P{i+3} Shape:", pi.shape)
====================================================================
>>> Targets Shape: torch.Size([5, 6])
>>> Anchors Shape: torch.Size([3, 3, 2])
>>> Layer P3 Shape: torch.Size([2, 3, 40, 40, 25])
>>> Layer P4 Shape: torch.Size([2, 3, 20, 20, 25])
>>> Layer P5 Shape: torch.Size([2, 3, 10, 10, 25])
The file we are going to analyze is utils/loss.py. The repository may change after this article is published, so the link provided may break or the code may differ. For that reason, I am leaving here the GitHub commit I used in the analysis, so you can open utils/loss.py at that commit and review the code we are going to examine now.
The important code part is defined in the ComputeLoss class. Let’s review each method of this class:
__init__
class ComputeLoss:
    sort_obj_iou = False
    # Compute losses
    def __init__(self, model, autobalance=False):
        """Initializes ComputeLoss with model and autobalance option, autobalances losses if True."""
        device = next(model.parameters()).device # get model device
        h = model.hyp # hyperparameters
        # Define criteria
        BCEcls = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([h["cls_pw"]], device=device))
        BCEobj = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([h["obj_pw"]], device=device))
        # Class label smoothing eqn 3
        self.cp, self.cn = smooth_BCE(eps=h.get("label_smoothing", 0.0)) # positive, negative BCE targets
        # Focal loss
        g = h["fl_gamma"] # focal loss gamma
        if g > 0:
            BCEcls, BCEobj = FocalLoss(BCEcls, g), FocalLoss(BCEobj, g)
        m = de_parallel(model).model[-1] # Detect() module
        self.balance = {3: [4.0, 1.0, 0.4]}.get(m.nl, [4.0, 1.0, 0.25, 0.06, 0.02]) # P3-P7
        self.ssi = list(m.stride).index(16) if autobalance else 0 # stride 16 index
        self.BCEcls, self.BCEobj, self.gr, self.hyp, self.autobalance = BCEcls, BCEobj, 1.0, h, autobalance
        self.na = m.na # number of anchors
        self.nc = m.nc # number of classes
        self.nl = m.nl # number of layers
        self.anchors = m.anchors
        self.device = device
This first part initializes the loss class. The important things to pay attention to here are:
- BCEcls and BCEobj are BCEWithLogitsLoss instances. Each accepts a weight for the positive-sample loss (pos_weight), which is useful when the dataset is imbalanced.
- Label smoothing is not used by default. Therefore, self.cp = 1 (class positive value) and self.cn = 0 (class negative value).
- Focal loss is also not used by default, g = h["fl_gamma"] = 0.
- The de_parallel() function removes DataParallel or DistributedDataParallel wrappers when multiple GPUs are used and returns the underlying single-GPU model. It is not important for understanding the loss function.
- self.balance will be [4.0, 1.0, 0.4] when the number of layers is 3, which is our case.
- We can ignore self.ssi, because we are not going to use autobalance, which iteratively updates the objectness loss weight (balance) of each layer. This feature is useful for optimizing the balance parameter for custom datasets, but it is not used by default.
- self.na is the number of anchors in each layer and by default is 3, self.nc is the number of classes in the dataset and self.nl is the number of prediction layers, which by default is also 3.
- self.anchors stores the predefined anchors of each prediction layer.
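To make the label smoothing point concrete, here is a small sketch of how the positive and negative BCE targets could be derived from the smoothing factor. The function below is my own re-implementation, written under the assumption that smooth_BCE follows the usual formulation (positive target 1 - eps/2, negative target eps/2); check utils/loss.py for the actual helper.

```python
def smooth_bce(eps=0.1):
    """Return (positive, negative) BCE targets for a label smoothing factor eps.

    Assumed formulation, not copied from the repository.
    """
    return 1.0 - 0.5 * eps, 0.5 * eps


cp, cn = smooth_bce(eps=0.0)      # default in our example: no smoothing
print(cp, cn)                     # 1.0 0.0

cp_s, cn_s = smooth_bce(eps=0.1)  # roughly 0.95 and 0.05
print(cp_s, cn_s)
```

With eps = 0 the targets reduce to the hard labels 1 and 0, which is why self.cp and self.cn take those values by default.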
build_targets
The __call__ method performs the forward pass, calculating the losses for each prediction layer. Before explaining how the __call__ method computes the loss, let’s first describe the build_targets method. This method is invoked in the initial lines of the __call__ function and it is responsible for assigning targets to cell anchors and preparing them for loss computation according to the YOLOv5 formulation. Let’s go through this step by step:
def build_targets(self, p, targets):
    """Prepares model targets from input targets (image,class,x,y,w,h) for loss computation, returning class, box,
    indices, and anchors.
    """
    na, nt = self.na, targets.shape[0] # number of anchors, targets
    tcls, tbox, indices, anch = [], [], [], []
    gain = torch.ones(7, device=self.device) # normalized to gridspace gain
    ai = torch.arange(na, device=self.device).float().view(na, 1).repeat(1, nt) # same as .repeat_interleave(nt)
    targets = torch.cat((targets.repeat(na, 1, 1), ai[..., None]), 2) # append anchor indices
    g = 0.5 # bias
    off = (
        torch.tensor(
            [
                [0, 0],
                [1, 0],
                [0, 1],
                [-1, 0],
                [0, -1], # j,k,l,m
                # [1, 1], [1, -1], [-1, 1], [-1, -1], # jk,jm,lk,lm
            ],
            device=self.device,
        ).float()
        * g
    ) # offset
This is the first part of the function. Steps:
- Store the values of number of anchors and targets (na = 3, nt = 5)
- Initialize the output of the function (tcls, tbox, indices, anch) with empty lists.
- Initialize the gain tensor that will be used later for scaling the targets in each layer (shape=(7,)).
- Map each target to each anchor
ai = torch.arange(na, device=self.device).float().view(na, 1).repeat(1, nt) # same as .repeat_interleave(nt)
targets = torch.cat((targets.repeat(na, 1, 1), ai[..., None]), 2) # append anchor indices
The purpose of the above 2 lines of code is to create a tensor that maps each target to each anchor. We have 3 anchors in each prediction layer, so we want to compare each target (GT) to each of the 3 anchors, resulting in 5*3=15 comparisons. To achieve this, we repeat the target tensor (Size([5,6])) 3 times along a new first dimension, creating a tensor of shape [3, 5, 6]. Then, we append the index of the anchor (ai) to each target array, resulting in a shape of [3, 5, 7], where each target contains (img_id, class, x, y, w, h, anchor_id).
Note: ai[..., None] is equivalent to ai.unsqueeze(-1), which adds a size-one dimension at the end. Size([3,5]) -> Size([3,5,1]).
ai
>>> tensor([[0., 0., 0., 0., 0.],
[1., 1., 1., 1., 1.],
[2., 2., 2., 2., 2.]]) # Anchor indices
targets
>>> tensor([
[[ 0.0000, 14.0000, 0.4954, 0.5053, 0.1527, 0.5696, 0.0000],
[ 0.0000, 0.0000, 0.5487, 0.9249, 0.0536, 0.0318, 0.0000],
[ 0.0000, 0.0000, 0.3678, 0.9872, 0.0603, 0.0257, 0.0000],
[ 1.0000, 6.0000, 0.9707, 0.0440, 0.0586, 0.0880, 0.0000],
[ 1.0000, 16.0000, 0.7070, 0.1035, 0.3297, 0.1679, 0.0000]],
[[ 0.0000, 14.0000, 0.4954, 0.5053, 0.1527, 0.5696, 1.0000],
[ 0.0000, 0.0000, 0.5487, 0.9249, 0.0536, 0.0318, 1.0000],
[ 0.0000, 0.0000, 0.3678, 0.9872, 0.0603, 0.0257, 1.0000],
[ 1.0000, 6.0000, 0.9707, 0.0440, 0.0586, 0.0880, 1.0000],
[ 1.0000, 16.0000, 0.7070, 0.1035, 0.3297, 0.1679, 1.0000]],
[[ 0.0000, 14.0000, 0.4954, 0.5053, 0.1527, 0.5696, 2.0000],
[ 0.0000, 0.0000, 0.5487, 0.9249, 0.0536, 0.0318, 2.0000],
[ 0.0000, 0.0000, 0.3678, 0.9872, 0.0603, 0.0257, 2.0000],
[ 1.0000, 6.0000, 0.9707, 0.0440, 0.0586, 0.0880, 2.0000],
[ 1.0000, 16.0000, 0.7070, 0.1035, 0.3297, 0.1679, 2.0000]]])
- Define the offsets in each grid direction: j, k, l, m (left, up, right, down)
# j,k,l,m
torch.tensor(
[[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1]],
device=self.device
).float() * g
These offsets will be subtracted from the built-targets grid coordinates (gxy - offsets), so a 1 actually represents a -1 unit in that dimension. For example, [1, 0] (j) indicates subtracting 1 unit in the x-dimension, referring to the left adjacent cell.
The g term scales these adjustments to 0.5 units, which is sufficient because positive offsets [0.5, 0] and [0, 0.5] are always subtracted from values with a decimal part less than 0.5, while negative offsets [-0.5, 0] and [0, -0.5] are always subtracted from values with a decimal part greater than 0.5. This consistently changes the integer part of the grid coordinate value, and therefore the grid cell index.
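The effect of the 0.5 offsets can be checked with a tiny sketch. The helper shifted_cell is hypothetical and only illustrates the arithmetic; the two points are taken from the targets of our example once scaled to the P3 grid.

```python
import torch

g = 0.5
# Directions: 0 = main cell, 1 = left (j), 2 = up (k), 3 = right (l), 4 = down (m)
off = torch.tensor([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1]]).float() * g


def shifted_cell(gxy, direction):
    """Cell index obtained after subtracting one direction offset from gxy."""
    return (gxy - off[direction]).long()


low = torch.tensor([28.2784, 4.1392])    # decimal parts below 0.5
print(shifted_cell(low, 0).tolist())     # [28, 4]  main cell
print(shifted_cell(low, 1).tolist())     # [27, 4]  left neighbor
print(shifted_cell(low, 2).tolist())     # [28, 3]  upper neighbor

high = torch.tensor([21.9488, 36.9964])  # decimal parts above 0.5
print(shifted_cell(high, 3).tolist())    # [22, 36] right neighbor
print(shifted_cell(high, 4).tolist())    # [21, 37] lower neighbor
```

In every case the half-unit shift is enough to move the truncated coordinate into the neighboring cell, which is the reason g = 0.5 suffices.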
Now, let’s explain the logic behind the target-anchors assignment:
for i in range(self.nl):
    anchors, shape = self.anchors[i], p[i].shape
    gain[2:6] = torch.tensor(shape)[[3, 2, 3, 2]] # xyxy gain
    # Match targets to anchors
    t = targets * gain # shape(3,n,7)
    if nt:
        # Matches
        r = t[..., 4:6] / anchors[:, None] # wh ratio
        j = torch.max(r, 1 / r).max(2)[0] < self.hyp["anchor_t"] # compare
        # j = wh_iou(anchors, t[:, 4:6]) > model.hyp['iou_t'] # iou(3,n)=wh_iou(anchors(3,2), gwh(n,2))
        t = t[j] # filter
        # Offsets
        gxy = t[:, 2:4] # grid xy
        gxi = gain[[2, 3]] - gxy # inverse
        j, k = ((gxy % 1 < g) & (gxy > 1)).T
        l, m = ((gxi % 1 < g) & (gxi > 1)).T
        j = torch.stack((torch.ones_like(j), j, k, l, m))
        t = t.repeat((5, 1, 1))[j]
        offsets = (torch.zeros_like(gxy)[None] + off[:, None])[j]
    else:
        t = targets[0]
        offsets = 0
For each prediction layer output (let’s assume we are using the P3 output, i = 0) we get the anchors for that layer, determine the output shape and scale x, y, w, h with respect to the grid size of that layer.
anchors, shape = self.anchors[i], p[i].shape
gain[2:6] = torch.tensor(shape)[[3, 2, 3, 2]] # xyxy gain
# Match targets to anchors
t = targets * gain # shape(3,n,7)
i
>>> 0
anchors
>>> tensor([[1.2500, 1.6250],
[2.0000, 3.7500],
[4.1250, 2.8750]])
shape
>>> torch.Size([2, 3, 40, 40, 25])
gain
>>> tensor([1., 1., 40., 40., 40., 40., 1.])
t
>>> tensor([[
[ 0.0000, 14.0000, 19.8140, 20.2112, 6.1068, 22.7824, 0.0000],
[ 0.0000, 0.0000, 21.9488, 36.9964, 2.1444, 1.2732, 0.0000],
[ 0.0000, 0.0000, 14.7120, 39.4864, 2.4124, 1.0268, 0.0000],
[ 1.0000, 6.0000, 38.8288, 1.7592, 2.3424, 3.5184, 0.0000],
[ 1.0000, 16.0000, 28.2784, 4.1392, 13.1884, 6.7172, 0.0000]],
[[ 0.0000, 14.0000, 19.8140, 20.2112, 6.1068, 22.7824, 1.0000],
[ 0.0000, 0.0000, 21.9488, 36.9964, 2.1444, 1.2732, 1.0000],
[ 0.0000, 0.0000, 14.7120, 39.4864, 2.4124, 1.0268, 1.0000],
[ 1.0000, 6.0000, 38.8288, 1.7592, 2.3424, 3.5184, 1.0000],
[ 1.0000, 16.0000, 28.2784, 4.1392, 13.1884, 6.7172, 1.0000]],
[[ 0.0000, 14.0000, 19.8140, 20.2112, 6.1068, 22.7824, 2.0000],
[ 0.0000, 0.0000, 21.9488, 36.9964, 2.1444, 1.2732, 2.0000],
[ 0.0000, 0.0000, 14.7120, 39.4864, 2.4124, 1.0268, 2.0000],
[ 1.0000, 6.0000, 38.8288, 1.7592, 2.3424, 3.5184, 2.0000],
[ 1.0000, 16.0000, 28.2784, 4.1392, 13.1884, 6.7172, 2.0000]]])
We then check if the anchors meet the requirement rmax < anchor_t, which we reviewed previously. Here, r (Size([3,5,2])) contains the rw and rh target-anchor ratios. Using torch.max(r, 1 / r).max(2)[0], we obtain rmax, while j (Size([3,5])) represents a boolean mask indicating whether each target-anchor pair meets the requirement. Finally, as the last step, t is filtered to only contain those that meet this requirement, resulting in a change in its size to [num_pairs_selected, 7]:
# Matches
r = t[..., 4:6] / anchors[:, None] # wh ratio
j = torch.max(r, 1 / r).max(2)[0] < self.hyp["anchor_t"] # compare
t = t[j] # filter
j
>>> tensor([[False, True, True, True, False],
[False, True, True, True, False],
[False, True, True, True, True]])
t[j]
>>> tensor([
[ 0.0000, 0.0000, 21.9488, 36.9964, 2.1444, 1.2732, 0.0000],
[ 0.0000, 0.0000, 14.7120, 39.4864, 2.4124, 1.0268, 0.0000],
[ 1.0000, 6.0000, 38.8288, 1.7592, 2.3424, 3.5184, 0.0000],
[ 0.0000, 0.0000, 21.9488, 36.9964, 2.1444, 1.2732, 1.0000],
[ 0.0000, 0.0000, 14.7120, 39.4864, 2.4124, 1.0268, 1.0000],
[ 1.0000, 6.0000, 38.8288, 1.7592, 2.3424, 3.5184, 1.0000],
[ 0.0000, 0.0000, 21.9488, 36.9964, 2.1444, 1.2732, 2.0000],
[ 0.0000, 0.0000, 14.7120, 39.4864, 2.4124, 1.0268, 2.0000],
[ 1.0000, 6.0000, 38.8288, 1.7592, 2.3424, 3.5184, 2.0000],
[ 1.0000, 16.0000, 28.2784, 4.1392, 13.1884, 6.7172, 2.0000]])
t[j].shape
>>> torch.Size([10, 7])
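The anchor filter can also be reproduced in isolation. The sketch below is self-contained and uses the P3 anchors and the normalized target sizes from our example; the function name match_anchors is an assumption made for illustration and does not exist in the repository.

```python
import torch


def match_anchors(gwh, anchors, anchor_t=4.0):
    """Boolean mask (num_anchors, num_targets): True where max(r, 1/r) < anchor_t."""
    r = gwh[None] / anchors[:, None]       # (na, nt, 2) width and height ratios
    r_max = torch.max(r, 1 / r).max(2)[0]  # worst ratio of each target-anchor pair
    return r_max < anchor_t


anchors_p3 = torch.tensor([[1.25, 1.625], [2.0, 3.75], [4.125, 2.875]])
wh = torch.tensor([
    [0.15267, 0.56956],
    [0.05361, 0.03183],
    [0.06031, 0.02567],
    [0.05856, 0.08796],
    [0.32971, 0.16793],
])
mask = match_anchors(wh * 40, anchors_p3)  # 40 = P3 grid size for a 320 px input
print(mask.sum().item())                   # 10 pairs pass the filter
```

The first target is rejected by all three anchors because its height is more than 4 times larger than any P3 anchor height, so it can only be matched in a coarser layer.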
Now that we have the target-anchor pairs that passed the filter, let’s assign them to the cell that contains their center point and also to the adjacent cells, as reviewed earlier, depending on the location of the center point within the cell.
# Offsets
gxy = t[:, 2:4] # grid xy
gxi = gain[[2, 3]] - gxy # inverse
j, k = ((gxy % 1 < g) & (gxy > 1)).T
l, m = ((gxi % 1 < g) & (gxi > 1)).T
To explain this part, which can be a little confusing at first, let’s clarify two things. First, the x % 1 operator is used to obtain the decimal part of a number x (e.g., 3.72 % 1 = 0.72). Second, the variables j, k, l, m correspond to each direction in the grid cell:
- The variables j and l store, respectively, whether the center point lies in the left/right half of the cell (relative to its vertical midline) and whether a cell exists to the left/right of the current one, i.e., we are not at the grid’s edge. If both conditions hold, the left/right adjacent cell is also selected.
- The variables k and m store, respectively, whether the center point lies in the upper/lower half of the cell (relative to its horizontal midline) and whether a cell exists above/below the current one, i.e., we are not at the grid’s edge. If both conditions hold, the upper/lower adjacent cell is also selected.
To compute the conditions for l and m, the strategy is to mirror the coordinates so that they are measured from the opposite corner of the grid (gxi = gain[[2, 3]] - gxy) and then apply the same conditions as for j and k, respectively. This process is illustrated in the figure below:
Once all conditions are computed, a large boolean mask is created to select all main cells (where the center point lies) and their respective adjacent cells selected (stored in j, k, l, m).
j = torch.stack((torch.ones_like(j), j, k, l, m))
t = t.repeat((5, 1, 1))[j]
j
>>> tensor([
# Select main cell, where the object center point lies
[ True, True, True, True, True, True, True, True, True, True],
# Select the adjacent cell to the left of the main cell
[False, False, False, False, False, False, False, False, False, True],
# Select the adjacent cell above the main cell
[False, True, False, False, True, False, False, True, False, True],
# Select the adjacent cell to the right of the main cell
[ True, True, True, True, True, True, True, True, True, False],
# Select the adjacent cell below the main cell
[ True, False, True, True, False, True, True, False, True, False]])
t.shape
>>> torch.Size([30, 7])
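The four neighbor conditions can be tried on individual points with the following sketch. The function neighbor_masks is a hypothetical wrapper around the same two expressions used in build_targets, and the grid size of 40 corresponds to the P3 layer of our example.

```python
import torch


def neighbor_masks(gxy, grid_wh, g=0.5):
    """Return boolean masks (left, up, right, down) for each center point."""
    gxi = grid_wh - gxy  # same points measured from the opposite grid corner
    left, up = ((gxy % 1 < g) & (gxy > 1)).T
    right, down = ((gxi % 1 < g) & (gxi > 1)).T
    return left, up, right, down


gxy = torch.tensor([
    [21.9488, 36.9964],  # lower-right quarter of its cell
    [28.2784, 4.1392],   # upper-left quarter of its cell
])
left, up, right, down = neighbor_masks(gxy, torch.tensor([40.0, 40.0]))
print(left.tolist(), up.tolist())     # [False, True] [False, True]
print(right.tolist(), down.tolist())  # [True, False] [True, False]
```

The first point selects its right and lower neighbors and the second its left and upper neighbors, which agrees with the first and last columns of the mask printed above.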
So, the number of built-targets ranges from the number of target-anchor pairs that passed the filter (10) to three times that number (30), because up to two additional cells can be selected per main cell. In this example every pair selects two adjacent cells, so we reach the maximum of 30.
However, the built-targets stored at this point in t are not accurate because the (x, y) coordinates still refer to the main cell. Therefore, the offsets are computed using the previously defined direction offsets (off) and are stored to be applied in the next and final step.
offsets = (torch.zeros_like(gxy)[None] + off[:, None])[j]
As the last step of this function, we prepare the final built-targets for loss computation:
    # Define
    bc, gxy, gwh, a = t.chunk(4, 1) # (image, class), grid xy, grid wh, anchors
    a, (b, c) = a.long().view(-1), bc.long().T # anchors, image, class
    gij = (gxy - offsets).long()
    gi, gj = gij.T # grid indices
    # Append
    indices.append((b, a, gj.clamp_(0, shape[2] - 1), gi.clamp_(0, shape[3] - 1))) # image, anchor, grid
    tbox.append(torch.cat((gxy - gij, gwh), 1)) # box
    anch.append(anchors[a]) # anchors
    tcls.append(c) # class
return tcls, tbox, indices, anch
In this step, the corresponding grid cell indices for each built-target are computed using the previously calculated offsets (gij = (gxy - offsets).long()). This operation extracts the integer part (cell indices) of the modified (x, y) coordinates:
gxy[12:17]
>>> tensor([[14.7120, 39.4864],
[14.7120, 39.4864],
[28.2784, 4.1392],
[21.9488, 36.9964],
[14.7120, 39.4864]])
offsets[12:17]
>>> tensor([[ 0.0000, 0.5000],
[ 0.0000, 0.5000],
[ 0.0000, 0.5000],
[-0.5000, 0.0000],
[-0.5000, 0.0000]])
gxy[12:17] - offsets[12:17]
>>> tensor([[14.7120, 38.9864],
[14.7120, 38.9864],
[28.2784, 3.6392],
[22.4488, 36.9964],
[15.2120, 39.4864]])
gij[12:17]
>>> tensor([[14, 38],
[14, 38],
[28, 3],
[22, 36],
[15, 39]])
Finally, the function returns 4 outputs:
- indices (list[tuple[Tensor]]): A list of 3 tuples, one per layer, each containing four tensors: image (sample) indices, anchor indices, and grid cell indices along y and x. They are used to extract the model predictions to which a ground truth object has been assigned.
- tbox (list[Tensor]): A list of 3 tensors, one per layer, containing the target bounding boxes (x, y, w, h). The (x, y) values are offsets relative to the assigned grid cell and range from -0.5 to 1.5, while the (w, h) values are expressed in grid units of that layer, ranging from 0 to the number of grid cells along the corresponding axis.
- anch (list[Tensor]): A list of 3 tensors, one for each layer, containing the anchor (w, h) values of each selected cell anchor.
- tcls (list[Tensor]): A list of 3 tensors, one for each layer, containing the target class labels.
__call__
Once we’ve built the targets, the worst part is over.
def __call__(self, p, targets): # predictions, targets
    """Performs forward pass, calculating class, box, and object loss for given predictions and targets."""
    lcls = torch.zeros(1, device=self.device) # class loss
    lbox = torch.zeros(1, device=self.device) # box loss
    lobj = torch.zeros(1, device=self.device) # object loss
    tcls, tbox, indices, anchors = self.build_targets(p, targets) # targets
    # Losses
    for i, pi in enumerate(p): # layer index, layer predictions
        b, a, gj, gi = indices[i] # image, anchor, gridy, gridx
        tobj = torch.zeros(pi.shape[:4], dtype=pi.dtype, device=self.device) # target obj
        n = b.shape[0] # number of targets
        if n:
            # pxy, pwh, _, pcls = pi[b, a, gj, gi].tensor_split((2, 4, 5), dim=1) # faster, requires torch 1.8.0
            pxy, pwh, _, pcls = pi[b, a, gj, gi].split((2, 2, 1, self.nc), 1) # target-subset of predictions
For each prediction layer, we extract the predictions that are responsible for detecting an object. These specific predictions, selected from the entire prediction tensor (pi) using indices calculated in build_targets, are used to compute the box loss, objectness loss, and class loss. The remaining predictions, which are not assigned to a ground truth, will only contribute to the computation of the objectness loss.
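The selection pi[b, a, gj, gi] relies on PyTorch advanced indexing: four index tensors of the same length pick one prediction vector per built-target. Here is a minimal sketch with made-up index values (they are illustrative assumptions, not the output of build_targets):

```python
import torch

torch.manual_seed(0)
nc = 20
pi = torch.randn(2, 3, 40, 40, 5 + nc)  # P3 predictions: (batch, anchor, gy, gx, 5 + nc)

# Hypothetical assignments for three built-targets
b = torch.tensor([0, 0, 1])             # image index
a = torch.tensor([0, 2, 1])             # anchor index
gj = torch.tensor([36, 39, 1])          # grid row (y)
gi = torch.tensor([21, 14, 38])         # grid column (x)

selected = pi[b, a, gj, gi]             # (3, 25): one prediction vector per built-target
pxy, pwh, pobj, pcls = selected.split((2, 2, 1, nc), 1)
print(selected.shape, pxy.shape, pwh.shape, pobj.shape, pcls.shape)
```

Each row of selected is the 25-value prediction of a single cell anchor, which split then separates into box center, box size, objectness and class logits.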
Bounding Box Regression Loss
# Regression
pxy = pxy.sigmoid() * 2 - 0.5
pwh = (pwh.sigmoid() * 2) ** 2 * anchors[i]
pbox = torch.cat((pxy, pwh), 1) # predicted box
iou = bbox_iou(pbox, tbox[i], CIoU=True).squeeze() # iou(prediction, target)
lbox += (1.0 - iou).mean() # iou loss
This part is straightforward: we apply the formulas to the bounding box predictions, calculate the CIoU (Complete Intersection over Union), and compute the loss as (1 - CIoU). The final box loss is averaged over the number of built-targets in that layer.
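To see what the two decoding expressions do to the raw network outputs, consider this small sketch. The function decode_box is a hypothetical wrapper around the same two lines, and the anchor is the second P3 anchor of our example.

```python
import torch


def decode_box(raw_xy, raw_wh, anchor_wh):
    """Map raw outputs to cell-relative xy offsets and grid-unit wh."""
    pxy = raw_xy.sigmoid() * 2 - 0.5               # bounded to (-0.5, 1.5)
    pwh = (raw_wh.sigmoid() * 2) ** 2 * anchor_wh  # bounded to (0, 4 * anchor)
    return pxy, pwh


anchor = torch.tensor([[2.0, 3.75]])
zero = torch.zeros(1, 2)
pxy, pwh = decode_box(zero, zero, anchor)
print(pxy.tolist())  # [[0.5, 0.5]]  a raw output of 0 points at the cell center
print(pwh.tolist())  # [[2.0, 3.75]] and reproduces the anchor size
```

The (-0.5, 1.5) range of the center matches the range of the target offsets stored in tbox, and the upper bound of 4 times the anchor size matches the anchor_t = 4.0 threshold used when matching targets to anchors.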
Define Objectness Targets
# Objectness
iou = iou.detach().clamp(0).type(tobj.dtype)
if self.sort_obj_iou:
    j = iou.argsort()
    b, a, gj, gi, iou = b[j], a[j], gj[j], gi[j], iou[j]
if self.gr < 1:
    iou = (1.0 - self.gr) + self.gr * iou
tobj[b, a, gj, gi] = iou # iou ratio
In this part, since self.sort_obj_iou is false and self.gr is always 1.0, we can simplify it to:
# Objectness
iou = iou.detach().clamp(0).type(tobj.dtype)
tobj[b, a, gj, gi] = iou # iou ratio
This part prepares the target objectness score for computing the objectness loss in a later step. We set the target objectness score (tobj) of the predictions that should detect an object to the CIoU calculated in the previous step, detached from the computation graph and clamped to a minimum of 0 (the CIoU can be negative).
This could alternatively be set to 1.0, indicating only that an object is present. However, by setting it to the CIoU value, the model learns to predict how well its bounding box encloses the target object (tobj[b, a, gj, gi] = iou), instead of simply predicting the presence of an object regardless of the bounding box quality (tobj[b, a, gj, gi] = 1.0). This approach, as mentioned by Glenn Jocher in a GitHub Issue, helps sort out low-accuracy detections during Non-Maximum Suppression (NMS).
Classification Loss
# Classification
if self.nc > 1: # cls loss (only if multiple classes)
    t = torch.full_like(pcls, self.cn, device=self.device) # targets
    t[range(n), tcls[i]] = self.cp
    lcls += self.BCEcls(pcls, t) # BCE
This part is straightforward as well. We apply the binary cross-entropy (BCE) loss to the class predictions. The variable t contains the binary class targets for each built-target, where self.cp (1.0 by default) marks the class the object belongs to and self.cn (0 by default) marks all other classes. Remember, YOLOv5 is designed to support multi-label objects, meaning an object can belong to multiple classes simultaneously (e.g., a dog and a husky). Similar to the bounding box loss, the class loss is averaged: all contributions are summed and divided by the number of built-targets times the number of classes. This comes from the default ‘mean’ reduction of BCEWithLogitsLoss.
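The construction of the class targets and the effect of the ‘mean’ reduction can be verified with a short sketch. The logits and class indices below are illustrative values, and the plain BCEWithLogitsLoss (no pos_weight, no focal loss) is an assumption that matches the defaults discussed earlier.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, nc = 4, 20
cp, cn = 1.0, 0.0                    # no label smoothing
pcls = torch.randn(n, nc)            # class logits of the built-targets
tcls = torch.tensor([14, 0, 6, 16])  # class index of each built-target

t = torch.full_like(pcls, cn)        # start with all-negative targets
t[range(n), tcls] = cp               # mark the positive class in each row

loss_mean = nn.BCEWithLogitsLoss()(pcls, t)
loss_sum = nn.BCEWithLogitsLoss(reduction="sum")(pcls, t)
print(torch.allclose(loss_mean, loss_sum / (n * nc)))  # True
```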
Objectness Loss
obji = self.BCEobj(pi[..., 4], tobj)
lobj += obji * self.balance[i] # obj loss
The last part is the objectness loss, which is the binary cross-entropy (BCE) loss between the predicted objectness values and the previously computed target objectness values (0 where no object should be detected and the clamped CIoU otherwise). Here, too, the BCE reduction parameter is left at its default, ‘mean’. Since all the predictions of the layer are used, the summed loss is divided by (batch_size * num_anchors * num_cells_x * num_cells_y). Finally, we apply the objectness loss weight of the corresponding layer, defined in self.balance.
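The following sketch mimics the objectness loss of a single layer. The assigned cells and their CIoU values are made up for illustration, and the plain BCEWithLogitsLoss again assumes the default hyperparameters.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
pi_obj = torch.randn(2, 3, 10, 10)           # objectness logits of P5: (batch, anchor, gy, gx)
tobj = torch.zeros_like(pi_obj)              # target is 0 wherever no object is assigned

# Hypothetical built-targets with their clamped CIoU values
b, a = torch.tensor([0, 1]), torch.tensor([2, 0])
gj, gi = torch.tensor([4, 7]), torch.tensor([5, 1])
tobj[b, a, gj, gi] = torch.tensor([0.62, 0.35])

balance_i = 0.4                              # P5 weight from self.balance
obji = nn.BCEWithLogitsLoss()(pi_obj, tobj)  # mean over batch * anchors * cells
lobj_i = obji * balance_i                    # contribution of this layer to lobj

manual = nn.BCEWithLogitsLoss(reduction="sum")(pi_obj, tobj) / pi_obj.numel()
print(torch.allclose(obji, manual))          # True
```

Because most cells have a target of 0, the objectness loss is dominated by background cells, which is one reason a per-layer balance weight is applied.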
Final Output
Finally, after repeating these steps for each layer and aggregating the loss results, we apply the corresponding weights to each loss component and return the results.
lbox *= self.hyp["box"]
lobj *= self.hyp["obj"]
lcls *= self.hyp["cls"]
bs = tobj.shape[0] # batch size
return (lbox + lobj + lcls) * bs, torch.cat((lbox, lobj, lcls)).detach()
This function returns two outputs: the first one is the final aggregated loss, which is scaled by the batch size (bs), and the second one is a tensor with each loss component separated and detached from the PyTorch graph. In the train.py file (line 383), you can see that the former output will be used to backpropagate the gradients, while the latter one is solely for visualization in the progress bar during training and for computing the running mean losses. Therefore, it’s important to bear in mind that the actual loss being used is not the same as what you are visualizing, as the first one is scaled and dependent on the size of each input batch. This distinction can be important when training with dynamic input batch sizes.
After this intensive analysis covering every aspect of the current YOLOv5 loss implementation, a good way to conclude would be to express it in a mathematical formulation. I believe having a mathematical formulation for this loss function, as implemented in the official source code we have just examined, can be valuable.
The following formula represents the loss function for an input batch of B samples/images:
- The “gt” superscript indicates ground truth.
- L is the number of prediction layers, X and Y are the number of cells along each axis, A is the number of anchors, C is the number of classes and B is the batch size.
- Each model prediction (the last dimension of the output tensor in each layer) is in the form (x, y, w, h, obj, *classes). In the formulas, “b” represents the bounding box (x, y, w, h), “O” represents the confidence or objectness score (obj), and “y” represents the array of class predictions (*classes).
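As a reading aid, the loss computed by the code we have just reviewed can be sketched in LaTeX as follows. This is my own transcription of the code under the default settings (no focal loss, no label smoothing, no autobalance). The symbols N_l (number of built-targets in layer l), w_l (the layer weight from self.balance) and the subscripts on X and Y are notation introduced here for the sketch.

```latex
\mathcal{L} = B \left( \lambda_{box}\,\mathcal{L}_{box} + \lambda_{obj}\,\mathcal{L}_{obj} + \lambda_{cls}\,\mathcal{L}_{cls} \right)

\mathcal{L}_{box} = \sum_{l=1}^{L} \frac{1}{N_l} \sum_{n=1}^{N_l}
    \left( 1 - \mathrm{CIoU}\left( b_{l,n},\, b_{l,n}^{gt} \right) \right)

\mathcal{L}_{obj} = \sum_{l=1}^{L} \frac{w_l}{B \cdot A \cdot X_l \cdot Y_l}
    \sum_{i=1}^{B} \sum_{a=1}^{A} \sum_{j=1}^{Y_l} \sum_{k=1}^{X_l}
    \mathrm{BCE}\left( O_{l,i,a,j,k},\, O_{l,i,a,j,k}^{gt} \right)

\mathcal{L}_{cls} = \sum_{l=1}^{L} \frac{1}{N_l \cdot C} \sum_{n=1}^{N_l} \sum_{c=1}^{C}
    \mathrm{BCE}\left( y_{l,n,c},\, y_{l,n,c}^{gt} \right)
```

Here the objectness target equals max(CIoU, 0) for the cell anchors assigned to a ground truth and 0 elsewhere, the lambda factors correspond to self.hyp["box"], self.hyp["obj"] and self.hyp["cls"], and a layer with no built-targets (N_l = 0) contributes nothing to the box and class terms.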