September 22, 2022

# Instance segmentation loss functions

Instance segmentation aims to generate a binary mask for each object detected in the scene. There are two main categories of instance segmentation algorithms: detection-based algorithms and single-shot algorithms. The former generates higher-quality masks, while the latter is faster.

In the 1st article in the series, Instance segmentation algorithms overview, we dive deep into the types and approaches used to perform instance segmentation, and the 2nd part, Instance segmentation evaluation criteria, presents evaluation metrics used by instance segmentation models. This article, in turn, focuses on loss functions used to train instance segmentation algorithms. The most commonly used is the focal loss, but it is not the only one out there.

For instance segmentation tasks, we can use the following loss functions:

• Weighted binary cross-entropy loss
• Focal Loss
• Dice Loss
• Generalized IoU
• Boundary loss
• Lovasz softmax loss

## Weighted binary cross-entropy loss

It is a binary cross-entropy loss that handles class imbalance. Let’s imagine a seagull instance segmentation model. The input to the model and its output are presented in Figure 1.

Figure 1. The left image is an input to the instance segmentation network and the right image is a segmentation mask obtained as an output of the instance segmentation network.

The network predicts black pixels as the background class (0) and white pixels as the object class (1). In this scenario, the network predicted the mask of the seagull poorly. However, if we calculate the loss using standard binary cross-entropy, the loss would be low (close to zero), which suggests that the network performs well. Why is that?

$$BCE=-\frac{1}{N}\sum_{i=1}^{N}\left [ y_{i}\log(p_{i})+(1-y_{i})\log(1-p_{i}) \right ]$$

It is because of a huge class imbalance. The number of black pixels (negative class) is significantly higher than the number of white pixels (positive class). Because most of the black pixels were correctly classified as background, the overall loss is small and the model appears to perform well. To handle class imbalance we can use weighted binary cross-entropy (WBCE). WBCE takes into account the number of instances in the given class.
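The effect can be reproduced with a few lines of NumPy. The following is a minimal sketch, assuming a hypothetical 20x20 mask in which only 5% of the pixels belong to the object; the mask, the predicted probabilities and the `bce` helper are illustrative and not taken from the seagull example.

```python
import numpy as np

def bce(y, p, eps=1e-7):
    """Mean binary cross-entropy over all given pixels."""
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

# Hypothetical ground truth: 20 object pixels out of 400 (5%).
y = np.zeros((20, 20))
y[8:12, 8:13] = 1.0

# A poor prediction: confident background everywhere,
# but only a weak 0.3 probability on the object itself.
p = np.full((20, 20), 0.02)
p[8:12, 8:13] = 0.3

loss_all_pixels = bce(y, p)
loss_object_pixels = bce(y[y == 1], p[y == 1])
print(f"BCE over all pixels:    {loss_all_pixels:.3f}")
print(f"BCE over object pixels: {loss_object_pixels:.3f}")
```

Under these assumptions, the loss averaged over the whole image stays below 0.1, while the loss measured on the object pixels alone is above 1: the many well-classified background pixels hide the badly predicted object.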

$$WBCE=-\frac{1}{N}\sum_{i=1}^{N}\left [ \alpha y_{i}\log(p_{i})+(1-\alpha )(1-y_{i})\log(1-p_{i}) \right ]$$

The $$\alpha$$ parameter is a class weight. In the seagull example, the $$\alpha$$ value should be:

$$\alpha=\frac{\text{average number of black pixels in the image}}{\text{average number of all pixels in the image}}\approx 0.95$$

Incorporating a weight parameter into the binary cross-entropy favors the class with a small number of instances. In the given example, the white pixels class is the one with a small number of instances, so the weighted BCE loss pays more attention to correctly classifying white pixels and less attention to correctly classifying black pixels. Therefore, the loss in the given example would be dominated by the poorly predicted white pixels, even though most of the black pixels were classified correctly.
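To see how the weight shifts attention, the sketch below splits the weighted BCE into its object and background parts. It assumes a hypothetical 20x20 mask and hand-picked probabilities; `weighted_bce_terms` is an illustrative helper, not a library function. Setting `alpha = 0.5` weights both classes equally, which is plain BCE up to a constant factor.

```python
import numpy as np

def weighted_bce_terms(y, p, alpha, eps=1e-7):
    """Return (object_term, background_term); their sum is the weighted BCE."""
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    n = y.size
    object_term = -np.sum(alpha * y * np.log(p)) / n
    background_term = -np.sum((1.0 - alpha) * (1.0 - y) * np.log(1.0 - p)) / n
    return float(object_term), float(background_term)

# Hypothetical ground truth: 20 object pixels out of 400.
y = np.zeros((20, 20))
y[8:12, 8:13] = 1.0

# Hypothetical prediction: slightly noisy background, weak object response.
p = np.full((20, 20), 0.1)
p[8:12, 8:13] = 0.3

alpha = 1.0 - y.mean()  # share of background pixels, 0.95 here

obj_plain, bg_plain = weighted_bce_terms(y, p, alpha=0.5)
obj_weighted, bg_weighted = weighted_bce_terms(y, p, alpha=alpha)

share_plain = obj_plain / (obj_plain + bg_plain)
share_weighted = obj_weighted / (obj_weighted + bg_weighted)
print(f"object share of the loss, equal weights: {share_plain:.2f}")
print(f"object share of the loss, alpha={alpha:.2f}: {share_weighted:.2f}")
```

With equal weights, most of the loss in this toy case comes from the background; with the class weight, the object pixels account for the bulk of it.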

## Focal Loss

Focal loss is an improved version of binary cross-entropy loss. It handles the class imbalance problem and, in addition, introduces a gamma parameter that focuses training on hard to classify cases.

$$FL(p_{t})=\left\{\begin{matrix} -\alpha _{t}(1-p_{t})^{\gamma }\log(p_{t}) & ,y=1 \\ -(1-\alpha _{t})p_{t}^{\gamma }\log(1-p_{t}) & otherwise \\ \end{matrix}\right.$$

where $$\alpha_{t}$$ is a parameter handling class imbalance: it is higher for a class with fewer instances and lower for a class with many instances. $$p_{t}$$ is the predicted probability of an object belonging to the given class, $$(1-p_{t})^{\gamma}$$ is the modulating factor that separates harder from easier samples, and $$\gamma$$ is a parameter that, when increased, puts more focus on harder to classify examples.

If we look at the focal loss equation, it looks very similar to the weighted binary cross-entropy equation. The $$\alpha$$ parameter works exactly the same way in both: it is the weight parameter that handles class imbalance. In the seagull example, $$\alpha$$ would be high for white pixels and small for black pixels. With $$\gamma = 0$$, the focal loss would be identical to the weighted binary cross-entropy (once averaged over the pixels). The additional $$\gamma$$ parameter handles the balance between hard and easy to classify examples.
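The relation between the two losses can be checked numerically. The sketch below is a minimal NumPy version, assuming `p` holds the predicted probability of the positive class for each pixel and that the per-pixel losses are averaged; the function names and the sample values are illustrative.

```python
import numpy as np

def focal_loss(y, p, alpha, gamma, eps=1e-7):
    """Mean binary focal loss; p is the predicted probability of class 1."""
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    positive = -alpha * (1.0 - p) ** gamma * np.log(p)        # used where y == 1
    negative = -(1.0 - alpha) * p ** gamma * np.log(1.0 - p)  # used otherwise
    return float(np.mean(np.where(y == 1, positive, negative)))

def weighted_bce(y, p, alpha, eps=1e-7):
    """Mean weighted binary cross-entropy, for comparison."""
    p = np.clip(p, eps, 1.0 - eps)
    terms = alpha * y * np.log(p) + (1.0 - alpha) * (1.0 - y) * np.log(1.0 - p)
    return float(-np.mean(terms))

# Hypothetical labels and predictions for six pixels.
y = np.array([1, 1, 0, 0, 0, 0])
p = np.array([0.9, 0.4, 0.1, 0.2, 0.6, 0.05])

fl_gamma0 = focal_loss(y, p, alpha=0.95, gamma=0.0)
fl_gamma2 = focal_loss(y, p, alpha=0.95, gamma=2.0)
wbce = weighted_bce(y, p, alpha=0.95)
print(fl_gamma0, wbce, fl_gamma2)
```

With `gamma=0.0` the two functions return the same value, and any positive `gamma` can only lower the loss, because the modulating factor never exceeds one.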

Focal loss focuses more on the hard to classify examples, giving them more attention and reducing the loss for the easy to classify ones. Let’s imagine an object classification task where we want to correctly classify dogs, wolves, and planes. While distinguishing between a dog and a plane or a wolf and a plane is fairly easy, the model may struggle to achieve good results on identifying a dog and a wolf as separate objects. Here comes the power of focal loss. The modulating factor $$(1-p_{t})^{\gamma}$$ leaves the loss almost unchanged if the model predicts a low probability for the correct class label, and strongly decreases the loss if the predicted probability for the correct class is high, so the hard cases carry relatively more weight.

The influence of gamma on the loss function is best illustrated by Figure 2. It shows that the loss value is squeezed toward zero for high-confidence correct predictions, while the loss stays high when a low probability is predicted for the correct class.

The gamma parameter controls the shape of the curve.

The authors of the focal loss describe it as follows: “The focal loss introduces one new hyperparameter, the focusing parameter γ, that controls the strength of the modulating term. When γ = 0, our loss is equivalent to the CE loss. As γ increases, the shape of the loss changes so that “easy” examples with low loss get further discounted, see Figure 1. FL shows large gains over CE as γ is increased. With γ = 2, FL yields a 2.9 AP improvement over the α-balanced CE loss.”

The higher the $$\gamma$$ parameter value, the lower the loss for well-classified examples, so we can turn the attention of the model more towards ‘hard-to-classify’ examples.
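The discounting of easy examples can be made concrete by evaluating the positive-class term for a few probabilities. This is only an illustrative sketch: `focal_term` is a hypothetical helper, the class weight is left out, and `p_t` is taken to be the probability predicted for the correct class.

```python
import numpy as np

def focal_term(p_t, gamma):
    """Unweighted focal term -(1 - p_t)^gamma * log(p_t) for the correct class."""
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

# From a hard example (0.1) to a very easy one (0.99).
p_t = np.array([0.1, 0.5, 0.9, 0.99])

for gamma in (0.0, 1.0, 2.0, 5.0):
    print(f"gamma={gamma}: {np.round(focal_term(p_t, gamma), 5)}")

# How much gamma=2 shrinks the cross-entropy (gamma=0) value:
easy_ratio = focal_term(0.9, 2.0) / focal_term(0.9, 0.0)  # equals (1 - 0.9)^2
hard_ratio = focal_term(0.1, 2.0) / focal_term(0.1, 0.0)  # equals (1 - 0.1)^2
```

For `gamma = 2`, the loss of an easy example with `p_t = 0.9` is scaled by 0.01, while a hard example with `p_t = 0.1` keeps 0.81 of its cross-entropy value.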

Focal loss can be easily adapted to the instance segmentation scenario. The weight (alpha) parameter handles the white-black pixel class imbalance the same way as the weighted binary cross-entropy does. The gamma parameter allows focusing more on hard to classify pixels. For example, correctly classifying pixels close to the center of an object is a fairly easy task. Thus, the network will predict the object’s central pixels as white with high confidence. On the other hand, classifying edge pixels is challenging. Those pixels may be classified as belonging to the object or not. The model should pay more attention to those pixels in the learning process. Focal loss does it with the gamma term.

## Dice Loss

Dice loss is widely used in medical image segmentation tasks. It tackles the problem of class imbalance. The dice loss formula is given with the following equation:

$$D_{loss}=1-DSC$$

where the DSC is a Dice coefficient given by the equation:

$$DSC= \frac{2\sum_{i}p_{i}g_{i}}{\sum_{i}p_{i}+\sum_{i}g_{i}}$$

or

$$DSC= \frac{2\sum_{i}p_{i}g_{i}}{\sum_{i}p_{i}^{2}+\sum_{i}g_{i}^{2}}$$

where $$p_{i}$$ and $$g_{i}$$ represent predicted and ground truth pixels, respectively. Each pixel takes a value of either 0 or 1.

The Dice coefficient is very similar to IoU. It is twice the area of overlap divided by the total area of the predicted and ground truth shapes. The main difference is in the denominator: IoU uses the area of the union and DSC uses the sum of the areas. The DSC is equal to 1 if the two areas overlap perfectly and it is equal to 0 if they do not overlap. To make it a valid loss, we just take 1 - DSC and, therefore, by minimizing the resulting loss function, we can train the model. The main disadvantage of dice loss is that, when there is no overlap, the DSC is zero (and the loss is stuck at one) independently of how far away from each other the ground truth and predicted pixels are.
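Both properties, the close relation to IoU and the insensitivity to distance, can be checked on small binary masks. The sketch below assumes hypothetical 10x10 masks and uses illustrative helper names; a small `eps` is added to the denominators to avoid division by zero.

```python
import numpy as np

def dice_coefficient(p, g, squared=False, eps=1e-7):
    """Dice coefficient; squared=True uses the variant with squared terms."""
    intersection = np.sum(p * g)
    if squared:
        denominator = np.sum(p ** 2) + np.sum(g ** 2)
    else:
        denominator = np.sum(p) + np.sum(g)
    return float(2.0 * intersection / (denominator + eps))

def dice_loss(p, g, squared=False):
    return 1.0 - dice_coefficient(p, g, squared=squared)

def iou(p, g, eps=1e-7):
    intersection = np.sum(p * g)
    union = np.sum(p) + np.sum(g) - intersection
    return float(intersection / (union + eps))

def square_mask(row, col, size=4, shape=(10, 10)):
    """Binary mask with a size x size square whose top-left corner is (row, col)."""
    mask = np.zeros(shape)
    mask[row:row + size, col:col + size] = 1.0
    return mask

ground_truth = square_mask(2, 2)
half_overlap = square_mask(2, 4)  # shares 8 of its 16 pixels with the ground truth
near_miss = square_mask(2, 6)     # touches the ground truth, no overlap
far_miss = square_mask(6, 6)      # far away, no overlap

print(dice_coefficient(half_overlap, ground_truth), iou(half_overlap, ground_truth))
print(dice_loss(near_miss, ground_truth), dice_loss(far_miss, ground_truth))
```

For the half-overlapping mask, the Dice coefficient is 0.5 and the IoU is one third, consistent with the identity DSC = 2·IoU / (1 + IoU). The two non-overlapping predictions both get a loss of one, regardless of their distance from the object.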

## Generalized IoU

IoU measures the overlapping area of two bounding boxes or, generally speaking, shapes and normalizes it by the area of their union. IoU is successfully used as an evaluation metric, however, it has problems when applied as a loss. IoU is equal to zero if there is no overlap between the ground truth and the predicted bbox. It means that it doesn’t matter if the two boxes are one pixel apart or 100 pixels apart: IoU would be zero, and the loss 1 - IoU would be stuck at its maximum of one, in both cases. What’s more, in that situation IoU stays constant under small shifts of the predicted box, so its gradient is zero and backpropagation receives no signal about the direction in which to move the prediction. This makes plain IoU a poor choice for a loss. To overcome this problem, Stanford researchers proposed a generalized IoU (GIoU) loss, which still provides a useful gradient when the shapes do not overlap.

$$GIoU=\frac{|A\cap B|}{|A\cup B|}-\frac{|C\setminus (A\cup B)|}{|C|}=IoU-\frac{|C\setminus (A\cup B)|}{|C|}$$

where A and B are the predicted area and the ground truth area, respectively, and C is the smallest convex hull enclosing A and B.

Adding the second term to the standard IoU ensures that the loss is smaller when two bounding boxes approach each other and greater when they are far away from each other. The GIoU ranges from -1 to 1. Negative values occur when the share of C (the area enclosing both bounding boxes) that is covered by neither A nor B is greater than the IoU, which is always the case for non-overlapping boxes with a gap between them. As the IoU component increases, the second term shrinks and the value of GIoU converges to IoU.

A loss function to be employed for IoU would be described with the following equation:

$$L_{IoU}=1-IoU$$

In the same way, the loss for the GIoU can be written as:

$$L_{GIoU}=1-GIoU$$
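The difference between the two losses shows up as soon as the boxes stop overlapping. The following sketch computes GIoU for axis-aligned boxes given as `(x1, y1, x2, y2)`. It assumes that C is taken as the smallest axis-aligned box enclosing both inputs, a common simplification for bounding boxes; the box coordinates are hypothetical.

```python
def giou(box_a, box_b):
    """Generalized IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection area (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    iou = intersection / union

    # Assumption: C is the smallest axis-aligned box enclosing A and B.
    enclosing = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return iou - (enclosing - union) / enclosing

def giou_loss(box_a, box_b):
    return 1.0 - giou(box_a, box_b)

ground_truth = (0.0, 0.0, 2.0, 2.0)
near_box = (3.0, 0.0, 5.0, 2.0)   # gap of 1 from the ground truth
far_box = (10.0, 0.0, 12.0, 2.0)  # gap of 8 from the ground truth

print(giou_loss(ground_truth, ground_truth))  # perfect match
print(giou_loss(ground_truth, near_box))
print(giou_loss(ground_truth, far_box))
```

Both non-overlapping boxes have an IoU of zero, so 1 - IoU could not tell them apart, whereas the GIoU loss is larger for the box that lies farther away.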

For multilabel datasets, GIoU is commonly averaged across classes, yielding the mean GIoU (mGIoU). You can find more information about generalized IoU here.

## Boundary loss

Dice and cross-entropy losses are based on integrals over the segmentation regions. Unfortunately, for highly imbalanced segmentations, such regional summations have values that differ by several orders of magnitude across classes, which affects training performance and stability. One way to handle it is to add class weights to the loss. Classes with fewer occurrences receive more attention and classes with many occurrences receive less. The other approach is adopted by boundary loss. Boundary loss uses the summation over the boundary regions rather than the summation over the overlapping regions. In other words, the distance between the ground truth contour and the predicted contour is taken into account rather than the overlapping area of the masks.

Figure 4 presents the idea of boundary loss. The core concept is to compute a distance $$Dist(\partial G,\partial S)$$, where $$\partial G$$ is the ground truth mask contour and $$\partial S$$ is the predicted mask contour. However, to make it differentiable and easily implementable, the loss is rewritten as an integral over the distance field $$\Delta S$$ between the ground truth boundary $$\partial G$$ and the predicted boundary $$\partial S$$. Each point q in the $$\Delta S$$ field holds its distance to the boundary $$\partial G$$: negative if q is inside the G area and positive if q is outside the G area. By using contour regions rather than overlap regions, the loss handles even highly imbalanced classes in image segmentation tasks.
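A signed distance field of this kind can be pre-computed from a binary ground truth mask. The brute-force sketch below is only an approximation under a stated assumption: the distance to the contour is replaced by the distance to the nearest pixel of the opposite class, which is simple to compute on a pixel grid. The function name and the 5x5 mask are hypothetical, and the pairwise computation is only suitable for small masks.

```python
import numpy as np

def signed_distance_map(g):
    """Approximate signed distance to the boundary of the binary mask g.

    Negative inside the object, positive outside. Assumption: the distance
    to the contour is approximated by the Euclidean distance to the nearest
    pixel of the opposite class.
    """
    g = g.astype(bool)
    all_pixels = np.argwhere(np.ones_like(g))  # (H*W, 2), row-major order
    foreground = np.argwhere(g)
    background = np.argwhere(~g)
    if len(foreground) == 0 or len(background) == 0:
        return np.zeros(g.shape)  # no boundary exists in the image

    def nearest_distance(targets):
        # Pairwise distances from every pixel to every target pixel.
        diffs = all_pixels[:, None, :] - targets[None, :, :]
        return np.linalg.norm(diffs, axis=2).min(axis=1)

    to_foreground = nearest_distance(foreground)
    to_background = nearest_distance(background)
    phi = np.where(g.ravel(), -to_background, to_foreground)
    return phi.reshape(g.shape)

# Hypothetical ground truth: a 3x3 object in a 5x5 image.
g = np.zeros((5, 5), dtype=int)
g[1:4, 1:4] = 1

phi = signed_distance_map(g)
print(np.round(phi, 2))
```

The center of the object gets the most negative value, and the values grow as the pixels move outward, which is the behavior the boundary loss relies on.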

### Boundary loss computation example

Let’s go through an example presented in table 1 in order to understand how boundary loss actually works.

The value obtained in step 4 is the final boundary loss. And we could stop here.

However, the obtained result is not exactly the integral over the $$\Delta S$$, but rather the integral over the S area, where each point within S holds its distance to the boundary $$\partial G$$, positive if the point lies outside G and negative if the point lies inside G. And it is fine, but why is that?

Let’s write the boundary loss with mathematical equations.

$$\frac{1}{2}Dist(\partial G, \partial S)= \int_{\Omega}\phi _{G}(q)s(q)dq-\int_{\Omega}\phi _{G}(q)g(q)dq$$

$$\phi _{G}(q)=\left\{\begin{matrix} -D_{G}(q) & if \hspace{5px}q\in G \\ D_{G}(q) & otherwise \\ \end{matrix}\right.$$

where s(q) is equal to one if q is inside the S area and zero otherwise, g(q) is the analogous indicator for the G area, and $$D_{G}(q)$$ is the distance from q to the nearest point of the contour $$\partial G$$. $$s_{\theta}(q)$$ replaces the binary s(q) value with the probability output of the softmax function.

The first integral in the first of the above equations corresponds to the multiplication of the predicted (orange) mask with the ground truth distance field and it is the final boundary loss value. The second integral in the first of the above equations subtracts the distance field inside the G area.

If we keep the second term, we would actually obtain the integral over the $$\Delta S$$ area.

However, the integral over distances inside the G area is constant and independent of the network parameters and thus can be omitted. The simplified boundary loss without the last term is presented in the equation below. It is exactly equal to the element-wise multiplication of the predicted orange mask with the pre-computed ground-truth distance mask in the example, summed over all pixels.

$$L_{B}(\theta )=\int _{\Omega}\phi _{G}(q)s_{\theta}(q)dq$$

The level set function $$\phi _{G}$$ is pre-computed directly from the ground truth region G during the dataset generation step.

The boundary loss described by the equation above is minimized (achieves its minimum value) when all negative values of the distance function, and only those, are included in the sum (i.e., the softmax predictions are equal to 1 for the pixels within the ground truth foreground and to 0 elsewhere).
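On a pixel grid the integral becomes a sum, so the loss reduces to an element-wise product followed by a summation. The sketch below uses a hypothetical one-dimensional "image" with hand-written signed distances to keep the arithmetic easy to follow; `boundary_loss` is an illustrative name.

```python
import numpy as np

def boundary_loss(probs, phi):
    """Discrete boundary loss: sum over pixels of phi_G(q) * s_theta(q)."""
    return float(np.sum(phi * probs))

# Hypothetical signed distances for a 1-D image whose object covers indices 2..4:
# negative inside the object, positive outside.
phi = np.array([2.0, 1.0, -1.0, -2.0, -1.0, 1.0, 2.0])

perfect = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0])  # matches the ground truth
shifted = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0])  # off by one pixel
empty = np.zeros(7)                                       # nothing predicted
everything = np.ones(7)                                   # whole image predicted

for name, prediction in [("perfect", perfect), ("shifted", shifted),
                         ("empty", empty), ("everything", everything)]:
    print(f"{name:>10}: {boundary_loss(prediction, phi):+.1f}")
```

Among these four predictions, the one that matches the ground truth has the lowest loss, because it collects every negative distance and none of the positive ones. The empty prediction scores exactly zero, which is the trivial solution discussed in the next paragraph.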

The boundary loss can be easily combined with standard regional losses (LR), such as Dice loss, and in practice it usually is. There is a trivial solution that the network may get stuck in. If there is an empty foreground prediction, so approximately all values of the softmax probabilities are zeros, the network has very low gradients. Therefore, this trivial solution is close to a local minimum or a saddle point. To avoid it, the authors of the boundary loss suggest combining it with a region-based loss. The region-based loss is the most important at the beginning of training. As the training process progresses, the boundary loss term starts to take over.

$$L_{R}(\theta) + \alpha L_{B}(\theta)$$

where $$\alpha \in \mathbb{R}$$ is a parameter balancing the two losses.
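One simple way to let the boundary term take over gradually is to increase $$\alpha$$ as training progresses. The sketch below assumes a hypothetical linear schedule; the start value, step size and upper bound are illustrative choices and are not prescribed by the text above.

```python
def alpha_schedule(epoch, start=0.01, step=0.01, max_alpha=1.0):
    """Hypothetical linear schedule: alpha grows by `step` every epoch."""
    return min(max_alpha, start + step * epoch)

def combined_loss(regional_loss, boundary_loss, alpha):
    """L_R + alpha * L_B for already computed loss values."""
    return regional_loss + alpha * boundary_loss

# Illustrative loss values, e.g. a Dice loss of 0.4 and a boundary loss of -2.0.
for epoch in (0, 10, 50, 200):
    alpha = alpha_schedule(epoch)
    total = combined_loss(0.4, -2.0, alpha)
    print(f"epoch {epoch:3d}: alpha={alpha:.2f}, total loss={total:+.2f}")
```

Early in training the regional loss dominates the total, and the contribution of the boundary loss grows with each epoch until `alpha` reaches its upper bound.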

## Lovasz softmax loss

If $$p_{t}$$ is equal to 1, $$\log(p_{t})$$ is equal to 0 and the loss is zero. When $$p_{t}$$ approaches 0, $$\log(p_{t})$$ goes to minus infinity and the loss grows without bound. For negative classes, when we want the predicted pixel to be equal to 0, we use the “otherwise” branch of the equation. In the image segmentation task, pixels that we want to include as the mask pixels are labeled as 1 and all other pixels (background) are labeled as 0. The image segmentation task is a good example of an imbalanced classification problem. The number of background pixels is usually significantly larger than the number of object pixels. Therefore, weighted binary cross-entropy or focal loss are good choices for the object segmentation loss function.

$$FL(p_{t}) = \begin{cases} -\alpha_{t}\left(1-p_{t}\right)^{\gamma} \log \left(p_{t}\right) & , y=1\\ -\left(1-\alpha_{t}\right) p_{t}^{\gamma} \log \left(1-p_{t}\right) & \text{otherwise}\\ \end{cases}$$

## Summary

The main challenge for image segmentation loss functions is to handle class imbalance properly in the loss computation. The pixels belonging to the object class usually occupy a small part of the image, and therefore the standard loss functions are highly biased toward the proper classification of the image background, not the foreground. The article walked through a few loss functions that handle class imbalance. For more information, please check out the links attached to each loss. Under those links, you can find the original scientific papers describing each loss. Please also check our articles about Image Segmentation and Image Segmentation metrics to better grasp the Image Segmentation task.