YOLO, YOLOv2, and YOLOv3: All You Want to Know

Amro Kamal
22 min read · Aug 3, 2019


YOLO

During the last few years, object detection has become one of the hottest areas of computer vision, and many researchers are racing to build the best object detection model. As a result, many state-of-the-art models have emerged, such as R-CNN, RetinaNet, and YOLO.

In this article, we’ll dive into one of the most powerful object detection algorithms: You Only Look Once (YOLO).

Object Detection:

Given an image or a video stream, an object detection model can identify which of a known set of objects might be present and provide information about their positions within the image.

Object detection is different from classification with localization, where we classify a single object and determine its location in the image; in object detection, the image may contain any number of objects.

Before diving into YOLO, we need to go through some terms:

1-Intersection over Union (IOU):

IOU is computed as the area of intersection of two boxes divided by the area of their union, so IOU must be ≥0 and ≤1.

When predicting bounding boxes, we want the IOU between the predicted bounding box and the ground truth box to be close to 1.

In the left image, IOU is very low, but in the right image, IOU is ~1.
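To make this concrete, here is a minimal Python sketch of IOU for two axis-aligned boxes, assuming each box is given in (x1, y1, x2, y2) corner format (the article later uses a center/width/height format, so convert first if needed):

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes.

    Each box is (x1, y1, x2, y2): top-left and bottom-right corners.
    """
    # Corners of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    # Clamp to 0 so non-overlapping boxes give an intersection of 0.
    intersection = max(0, x2 - x1) * max(0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection

    return intersection / union


print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```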

2- Precision:

Simply, we can define precision as the ratio of true positives (TP, correct predictions) to the total number of predicted positives (total predictions):

Precision = TP / (TP + FP)

For example, imagine we have 20 images, and we know that there are 120 cars in these 20 images.

Now, let’s suppose we input these images into a model, and it detects 100 cars (here the model said: I’ve found 100 cars in these 20 images, and I’ve drawn bounding boxes around every single one of them).

To calculate the precision of this model, we need to check the 100 boxes the model has drawn. If we find that 20 of them are incorrect, then the precision will be 80/100 = 0.8.

Here it is very important to notice that precision ignores the fact that the actual number of cars is 120.

We consider a prediction incorrect if the IOU between the predicted box and the ground truth box is less than some threshold value (0.5, 0.75, …).

3-Recall:

If we look at the precision example again, we find that it doesn’t consider the total number of cars in the data (120). So if there were 1000 cars instead of 120 and the model output 100 boxes with 80 of them correct, the precision would still be 0.8.

To solve this, we define another metric, called recall, which is the ratio of true positives (correct predictions) to the total number of ground truth positives (the total number of cars, 120):

Recall = TP / (TP + FN)

For our example, the recall = 80/120 ≈ 0.667.

Now we can notice that the recall measures how well we detect all the objects in the data.
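Putting the two formulas and the worked example together in a few lines of Python:

```python
# Numbers from the example above.
TP = 80          # correct boxes (IOU above the threshold)
FP = 20          # incorrect boxes out of the 100 the model drew
FN = 120 - TP    # ground-truth cars the model missed entirely

precision = TP / (TP + FP)   # 80 / 100 = 0.8
recall = TP / (TP + FN)      # 80 / 120 ≈ 0.667
print(precision, recall)
```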


4- Average Precision and Mean Average Precision(mAP):

A brief definition of Average Precision is: the area under the precision-recall curve.

AP combines both precision and recall together. It takes a value between 0 and 1 (higher is better). To get AP =1 we need both the precision and recall to be equal to 1. The mAP is the mean of the AP calculated for all the classes.

YOLO! What a name?

Many object detection systems need to go through the image more than once to detect all the objects, or have to run in two stages. YOLO doesn’t need to go through these boring processes: it only needs to look at the image once to detect all the objects. That is why they chose the name You Only Look Once, and it is also the reason YOLO is a very fast model.

YOLO (The first version):

YOLO divides the input image into an SxS grid. For example, the image below is divided into a 5x5 grid (YOLO actually chose S=7). If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object (we assign the object to the grid cell that contains the object’s center).

5x5 grid

YOLO runs a classification and localization problem on each of the 7x7=49 grid cells simultaneously. Since the classification and localization network can detect only one object, that means any grid cell can detect only one object.

Because of this grid idea, YOLO faces some problems:

1-Since we use a 7x7 grid, and any cell can detect only one object, the maximum number of objects YOLO can detect is 49.

2-If a grid cell contains more than one object, the model will not be able to detect all of them. This is the problem of close object detection that YOLO suffers from.

3-An object may span more than one grid cell (like the taxi in the image above), so the model may detect it more than once (in more than one cell). This problem is solved using non-max suppression, which we will talk about later.

All 49 cells make their predictions simultaneously, and that is why YOLO is considered a very fast model.

Each of the 7x7 grid cells predicts B bounding boxes (YOLO chose B=2), and for each box, the model outputs a confidence score (C). These confidence scores reflect how confident the model is that the box contains an object. Using this score, we can prevent the model from detecting backgrounds: if no object exists in the cell, the confidence score should be zero. Otherwise, we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.

But why do we want C = IOU? Since the ground truth box is drawn by hand, we are 100% sure that there is an object inside it; accordingly, any box with a high IOU with the ground truth box will surround the same object, so the higher the IOU, the higher the probability that an object lies inside the predicted box.

Although we have 7x7=49 grid cells and predict 2 boxes for each cell (98 boxes in total), the vast majority of these boxes will have very low confidence, so we can get rid of them.

In addition to the confidence score C, the model outputs 4 numbers ((x, y), w, h) to represent the location and dimensions of the predicted bounding box.

The (x,y) coordinates represent the center of the box relative to the bounds of the grid cell. The width and height are predicted relative to the whole image, so 0<(x,y,w,h)<1.

YOLO was trained to detect 20 different classes of objects (class meaning: cat, car, person, …). For any grid cell, the model outputs 20 conditional class probabilities, one for each class.

While each grid cell gives us a choice between two bounding boxes, we only have one class probability vector per cell. We will get rid of boxes with low confidence.

Output shape:

The predictions are encoded as an S × S × (B·5 + C) tensor, where C is the number of classes. For S=7, B=2, and C=20 this gives a 7x7x30 tensor.

Output Tensor
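To see what lives inside that tensor, here is a small NumPy sketch; the exact memory layout is an implementation detail, so the ordering below (the B boxes first, then the class probabilities) is an assumption for illustration:

```python
import numpy as np

S, B, C = 7, 2, 20
pred = np.random.rand(S, S, B * 5 + C)  # stand-in for a network output

# Assumed layout per cell: B boxes of (x, y, w, h, confidence),
# followed by C conditional class probabilities Pr(class_i | object).
cell = pred[3, 2]                     # the cell in row 3, column 2
boxes = cell[:B * 5].reshape(B, 5)    # (x, y, w, h, confidence) per box
class_probs = cell[B * 5:]            # shape (20,)

# Class-specific confidence per box: Pr(class_i | object) * confidence,
# which approximates Pr(class_i) * IOU.
scores = boxes[:, 4:5] * class_probs  # shape (B, C) = (2, 20)
print(scores.shape)
```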

Network Design:

YOLO uses a single convolutional network to simultaneously predict multiple bounding boxes and class probabilities for those boxes. This network is inspired by the GoogLeNet model for image classification, but instead of the inception modules used by GoogLeNet, YOLO simply uses 1×1 reduction layers followed by 3×3 convolutional layers. It has 24 convolutional layers followed by 2 fully connected layers. As we mentioned above, the final output of the network is the 7×7×30 tensor of predictions.

Loss function:

YOLO uses sum-squared error (SSE) for the loss function because it is easy to optimize. It tries to optimize the following multi-part loss:
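Written out (following the equation in the YOLO paper, and using this article’s convention that hatted symbols denote ground truth values):

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
  \mathbb{1}^{\text{obj}}_{ij}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
+{}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
  \mathbb{1}^{\text{obj}}_{ij}\left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2
  +\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\
+{}& \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{\text{obj}}_{ij}
  \left(C_i-\hat{C}_i\right)^2
 + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
  \mathbb{1}^{\text{noobj}}_{ij} \left(C_i-\hat{C}_i\right)^2 \\
+{}& \sum_{i=0}^{S^2} \mathbb{1}^{\text{obj}}_{i}
  \sum_{c \in \text{classes}} \left(p_i(c)-\hat{p}_i(c)\right)^2
\end{aligned}
```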

The first two terms represent the localization loss

Terms 3 & 4 represent the confidence loss

The last term represents the classification loss

We will go through these terms one by one, but before that, we need to consider 3 points:
1- The loss function penalizes classification error only if there is an object inside that grid cell.

2- Since we have B=2 bounding boxes for each cell, we need to choose one of them for the loss. This will be the box with the highest IOU with the ground truth box, so the loss only penalizes localization error for the box that is responsible for the ground truth box.

3-SSE weights localization error equally with classification error, which may not be ideal.

The first term:

This is the SSE between the predicted box center (x, y) and the ground truth center (x̂, ŷ). We sum over all 49 grid cells in the image, and for each cell, we sum over all B boxes (B=2).

To comply with points (1) & (2) above, YOLO uses a binary variable 1(obj)ij, so that:

1(obj)ij = 1 if an object appears in cell i and box j of that cell is responsible for that object; otherwise it is 0.

“1(obj)ij = 1 only if the box contains an object and is responsible for detecting this object (highest IOU)”

A box is responsible for detecting an object if it has the highest IOU with the ground truth box among the B boxes.

Since SSE weights localization error equally with classification error, which may not be ideal as we mentioned in point (3), YOLO uses a constant (λcoord) to give the localization error a higher weight in the loss function (they chose λcoord = 5).

The second term:

Here everything is similar to the first term, but we calculate the error in the box dimensions.

Why are we using the square roots of w and h? The 20 classes of objects that YOLO can detect have different sizes, and sum-squared error weights errors in large boxes and small boxes equally, while our error metric should reflect that small deviations matter less in large boxes than in small boxes. To partially address this, YOLO predicts the square root of the bounding box width and height instead of the width and height directly (for example, instead of predicting 0.9 for a large box and 0.09 for a small box, we predict √0.9 ≈ 0.949 and √0.09 = 0.3, respectively).

Now if the model makes an error of 5px in the width of both boxes, we can see that with the square root, the squared error becomes larger for the small box.

A small error (5px) in a large box is generally benign but the same small error in a small box has a much greater effect.
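A quick numeric check of this effect (assuming, for illustration, a 448px-wide image, so a 5px error corresponds to an error of 5/448 in normalized width):

```python
import math

w_large, w_small = 0.9, 0.09   # widths relative to the image
err = 5 / 448                  # a 5px error on a 448px-wide image

# Squared error on the raw widths: identical for both boxes.
raw_error = err ** 2

# Squared error on the square roots: much larger for the small box.
sq_large = (math.sqrt(w_large + err) - math.sqrt(w_large)) ** 2
sq_small = (math.sqrt(w_small + err) - math.sqrt(w_small)) ** 2
print(sq_large < sq_small)  # True: the small box is penalized more
```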

The third term:

This is the confidence error, where:

Ĉ = 1 (the ground truth confidence when an object is present)

0 ≤ C ≤ 1 (the predicted confidence)

The fourth term:

If there is no object in the grid cell, we don’t need to care about the classification and localization errors. All we need to care about is the confidence C (we want the confidence to be zero when there is no object), and for that, we use the variable:

1(noobj)ij = 1 if there is no object inside cell i, or there is an object but box j of that cell is not responsible for it; otherwise 0.

Since many grid cells do not contain any object (for example, 40 of the 49 cells might be empty), this pushes the confidence scores of those cells towards zero, the value of the ground truth confidence, often overpowering the gradient from cells that do contain objects. This can lead the training to diverge early. To remedy this, they decrease the loss from confidence predictions for boxes that don’t contain objects using the parameter λnoobj = 0.5.

The last term:

Here we sum the squared errors over all the class probabilities for the 49 grid cells.

Training:

First, they pretrained the convolutional layers of the network for classification on the ImageNet 1000-class competition dataset. For pretraining, they used the first 20 convolutional layers from the network described above, followed by an average-pooling layer and a fully connected layer with 1000 outputs, with an input size of 224×224. This network achieved a top-5 accuracy of 88%.

Then they removed the 1000-output fully connected layer, added four convolutional layers and two fully connected layers with randomly initialized weights, and increased the input resolution of the network from 224×224 to 448×448. After that, they trained the model for detection.

Non-maximal suppression:

Since YOLO uses a 7x7 grid, if an object occupies more than one grid cell, it may be detected in more than one cell.

But we want each object to be detected only once. For example, the taxi in this image may be detected 3 times, by the cells with indexes (3,0), (3,1), and (3,2), where the red box is the ground truth box (I drew these boxes by hand; in practice the taxi may be detected more than 3 times).

So how do we choose one of these boxes?

For each class (cars, pedestrians, cats,….) do:

1-Discard all boxes with confidence C < C-threshold (for example, C < 0.5).

2- Sort the predictions starting from the highest confidence C.

3-Choose the box with the highest C and output it as a prediction.

4-Discard any remaining box with IOU > IOU-threshold with the box output in the previous step.

5-Go back to step (3) and repeat until all remaining predictions have been checked.

Non-max suppression adds 2–3% in mAP.
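A minimal single-class sketch of this procedure in Python, reusing the iou helper from the IOU section (the box format and thresholds are illustrative):

```python
def non_max_suppression(boxes, scores, c_thresh=0.5, iou_thresh=0.5):
    """Greedy NMS for a single class.

    boxes  : list of (x1, y1, x2, y2) boxes
    scores : list of confidences, one per box
    Returns the indices of the boxes to keep as predictions.
    """
    # Step 1: discard low-confidence boxes.
    idxs = [i for i, s in enumerate(scores) if s >= c_thresh]
    # Step 2: sort by confidence, highest first.
    idxs.sort(key=lambda i: scores[i], reverse=True)

    keep = []
    while idxs:
        # Step 3: the most confident remaining box becomes a prediction.
        best = idxs.pop(0)
        keep.append(best)
        # Step 4: drop boxes that overlap it by more than the threshold.
        idxs = [i for i in idxs if iou(boxes[best], boxes[i]) < iou_thresh]
        # Step 5: loop until every prediction has been checked.
    return keep
```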

Fast YOLO and YOLO VGG-16:

Fast YOLO is a smaller, faster version of YOLO: it uses 9 convolutional layers instead of 24. It is faster than YOLO but has a lower mAP.

YOLO VGG-16 uses VGG-16 as its backbone instead of the original YOLO network. It is more accurate but slower than real-time.

Comparison to Other Detection Systems:

Real-Time Systems on PASCAL VOC 2007. We can notice that YOLO struggles to localize objects correctly.

Limitations of YOLO:

1-Since each grid cell predicts only two boxes and can only have one class, this limits the number of nearby objects that YOLO can predict, especially for small objects that appear in groups, such as flocks of birds.

2-YOLO can detect at most 49 objects per image.

3-Relatively high localization error.

YOLOv2:

YOLO makes a significant number of localization errors. Furthermore, it has relatively low recall. Thus, in the second version of YOLO, they focused mainly on improving recall and localization while maintaining classification accuracy. To achieve better performance, they used several ideas:

1-Batch Normalization: By adding batch normalization to all the convolutional layers in YOLO, they got more than a 2% improvement in mAP.

2-High Resolution Classifier: The original YOLO was trained as follows:

i-They trained the classifier network at 224×224 input size.

ii-Then they increased the resolution to 448 for detection.

This means that when switching to detection, the network has to simultaneously switch to learning object detection and adjust to the new input resolution. For YOLOv2, they initially trained the model on 224×224 images, then fine-tuned the classification network at the full 448×448 resolution for 10 epochs on ImageNet before training for detection. This gives the network time to adjust its filters to work better on higher resolution input. The high-resolution classification network gives an increase of almost 4% mAP.

3-Convolutional With Anchor Boxes (multi-object prediction per grid cell):

YOLO (v1) assigns each object to the grid cell that contains the center of the object. With this scheme, the red cell in the image above must detect both the man and his necktie, but since a grid cell can detect only one object, a problem arises. To solve this, the authors allowed each grid cell to detect more than one object using k bounding boxes.

To predict k bounding boxes, YOLOv2 uses the idea of anchor boxes.

What is an Anchor Box?

YOLO predicts the coordinates of bounding boxes directly, using fully connected layers on top of the convolutional feature extractor. Instead of predicting coordinates directly, another object detection model, Faster R-CNN, predicts bounding boxes using hand-picked anchor boxes. An anchor box is just a predefined width and height, and we predict each bounding box relative to an anchor box instead of relative to the whole image; this makes it easier for the network to learn. Using only convolutional layers (without fully connected layers), Faster R-CNN predicts offsets and confidences for these anchor boxes.

In this image, we have a grid cell (red) and 5 anchor boxes (yellow) with different shapes.

YOLOv2 uses the idea of anchor boxes, but instead of picking the k anchor boxes by hand, it looks for the anchor box shapes that make it easiest for the network to learn detection.

In the paper, they call the anchor box a “prior box.”

In this image, the 5 red boxes represent the average dimensions and locations of objects in the VOC 2007 dataset.

Someone may ask: how and why did they choose these 5 boxes? They ran k-means clustering on the training set bounding boxes for various values of k and plotted the average IOU with the closest centroid. But instead of using Euclidean distance, they used the IOU between the bounding box and the centroid as the similarity measure.


They chose k = 5 as a good trade-off between model complexity and high recall.
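A rough sketch of that clustering step, assuming the training boxes are given as (width, height) pairs normalized to the image size, with d(box, centroid) = 1 − IOU(box, centroid) as the distance:

```python
import random

def iou_wh(box, centroid):
    """IOU of two boxes that share the same center, so only their
    widths and heights matter."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union

def kmeans_anchors(wh_pairs, k=5, iters=100):
    """k-means over (width, height) pairs with distance d = 1 - IOU."""
    centroids = random.sample(wh_pairs, k)
    for _ in range(iters):
        # Assign each ground-truth box to its closest centroid
        # (highest IOU = smallest distance).
        clusters = [[] for _ in range(k)]
        for wh in wh_pairs:
            best = max(range(k), key=lambda j: iou_wh(wh, centroids[j]))
            clusters[best].append(wh)
        # Move each centroid to the mean shape of its cluster.
        for j, cluster in enumerate(clusters):
            if cluster:
                centroids[j] = (sum(w for w, _ in cluster) / len(cluster),
                                sum(h for _, h in cluster) / len(cluster))
    return centroids
```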

YOLOv2 predicts location coordinates relative to the location of the grid cell, which bounds the ground truth to fall between 0 and 1. The network predicts 5 bounding boxes for each cell, and 5 coordinates for each bounding box: tx, ty, tw, th, and to. If the cell is offset from the top left corner of the image by (cx, cy) and the bounding box prior (anchor box) has width and height pw, ph, then the predictions correspond to:
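In symbols (these are the equations from the YOLOv2 paper; σ is the logistic sigmoid, which keeps the predicted center inside its cell):

```latex
\begin{aligned}
b_x &= \sigma(t_x) + c_x \\
b_y &= \sigma(t_y) + c_y \\
b_w &= p_w \, e^{t_w} \\
b_h &= p_h \, e^{t_h} \\
\Pr(\text{object}) \cdot \text{IOU}(b,\text{object}) &= \sigma(t_o)
\end{aligned}
```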

For example, if we use 2 anchor boxes, the grid cell (2,2) in the image below will output 2 boxes (the blue and the yellow boxes). Let the black dotted boxes represent the 2 anchor boxes for that cell.

Now consider only the blue box. Instead of assigning the predicted blue box to the grid cell alone, as in YOLO, YOLOv2 assigns the blue box both to the grid cell and to one of its anchor boxes: the one with the highest IOU with the ground truth box. YOLOv2 uses the equations above to tie the blue box to the grid cell and the anchor box.

Network Architecture:

Darknet-19:

To balance complexity and accuracy, the authors propose a new classification model, Darknet-19, to be used as the backbone for YOLOv2.

Darknet-19 has 19 convolutional layers and 5 max-pooling layers. It achieved 91.2% top-5 accuracy on ImageNet, which is better than VGG-16 (90%) and the original YOLO network (88%).

Output shape:

The YOLOv2 output shape is 13×13×(k·(1+4+20)), where k is the number of anchor boxes and 20 is the number of classes. For k=5 the output shape will be 13×13×125.

Training:

The model was first trained for classification, then for detection.

1-Classification: They trained the Darknet-19 network on the standard ImageNet 1000-class classification dataset with a 224x224 input for 160 epochs. After that, they fine-tuned the network at the larger input size of 448x448 for 10 epochs. This gives a top-1 accuracy of 76.5% and a top-5 accuracy of 93.3%.

2-Detection: After training for classification, they removed the last convolutional layer from Darknet-19 and instead added three 3×3 convolutional layers followed by a 1×1 convolutional layer with the number of outputs needed for detection (13x13x125). A passthrough layer was also added so that the model can use fine-grained features from earlier layers.

Then they trained the network for 160 epochs on detection datasets (VOC and COCO datasets).

Multi-Scale Training:

To make YOLOv2 robust to running on images of different sizes, they trained the model on different input sizes. Since the model uses only convolutional and pooling layers, the input can be resized on the fly.

Instead of fixing the input image size, they changed it every few iterations: after every 10 batches, the network randomly chooses a new image dimension from the set {320, 352, 384, …, 608} (multiples of 32). Then they resize the network to that dimension and continue training. This means the same network can predict objects at different resolutions (input shapes).
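A toy sketch of that schedule (the resize/train calls are placeholders, since they depend on the framework):

```python
import random

# The network downsamples by a factor of 32, so the legal input
# sizes are multiples of 32 between 320 and 608.
SIZES = list(range(320, 609, 32))   # [320, 352, 384, ..., 608]

size = 416                          # a common starting resolution
for batch in range(1000):
    if batch % 10 == 0:             # every 10 batches...
        size = random.choice(SIZES) # ...pick a new input size
    # resize_network(size)          # placeholder: framework-specific
    # train_on_batch(...)           # placeholder: framework-specific
```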

Comparison to Other Detection Systems:

At the time of its release, YOLOv2 was state-of-the-art and faster than other detection systems across a variety of detection datasets. Furthermore, it can run at a variety of image sizes to provide a smooth trade-off between speed and accuracy.

Accuracy and speed on VOC 2007.
PASCAL VOC 2012 test detection results: YOLOv2 performs on par with state-of-the-art detectors like Faster R-CNN with ResNet and SSD512, and is 2–10× faster.
Results on COCO test-dev 2015.

YOLO9000:

Sometimes we need a model that can detect more than 20 classes, and that is what YOLO9000 offers: a real-time framework that detects more than 9000 object categories by jointly optimizing detection and classification.

As we mentioned previously, YOLOv2 was trained first for classification and then for detection, because classification datasets (which contain one object per image and no boxes) are different from detection datasets. For YOLO9000, the authors propose a mechanism for jointly training on classification and detection data.

During training, they mix images from both detection and classification datasets. When the network sees an image labeled for detection, we can backpropagate based on the full YOLOv2 loss function. When it sees a classification image, we only backpropagate the loss from the classification-specific parts of the architecture.

The idea of mixing detection and classification data faces a few challenges:

1-Detection datasets are small compared to classification datasets.

2-Detection datasets have only common objects and general labels, like “dog” or “boat”, while classification datasets have a much wider and deeper range of labels. For example, the ImageNet dataset has more than a hundred breeds of dog, like “German shepherd” and “Bedlington terrier.”

detection — classification

To merge these two datasets the authors created a hierarchical model of visual concepts and called it WordTree.

As we see, all the classes sit under the root (physical object). They trained the Darknet-19 model on WordTree. They extracted the 1000 classes of the ImageNet dataset from WordTree and added all of the intermediate nodes, which expands the label space from 1000 to 1369; they call this WordTree1k. The size of the output layer of Darknet-19 thus became 1369 instead of 1000.

For these 1369 predictions, we don’t compute one softmax over everything; instead, we compute a separate softmax over all synsets that are hyponyms of the same concept (i.e., over each group of sibling nodes in the tree).

Despite adding 369 additional concepts, Darknet-19 still achieves 71.9% top-1 accuracy and 90.4% top-5 accuracy.

The detector predicts a bounding box and a tree of probabilities, but since we use more than one softmax, we need to traverse the tree to find the predicted class.

We traverse the tree from top to bottom, taking the highest confidence path at every split, until we reach a node whose probability is below the probability threshold; we then predict that node’s class.

For example, if the input image contains a dog, the tree of probabilities will look like the tree below:

Instead of assuming every image has an object, we use YOLOv2’s objectness predictor to give us the value of Pr(physical object), which is the root of the tree.

The model outputs a softmax for each branch level. We choose the node with the highest probability (if it is higher than a threshold value) as we move from top to bottom. The prediction is the node where we stop.

In the tree above, the model will go through physical object => dog => hunting dog. It will stop at “hunting dog” and will not go down to sighthound (a type of hunting dog), because sighthound’s confidence is less than the confidence threshold, so the model predicts hunting dog, not sighthound.
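A small sketch of that traversal; the tree format and the probabilities below are made up for illustration:

```python
def predict_class(node, prob, threshold=0.5):
    """Walk the tree from the root, always taking the most probable
    child, and stop when no child clears the threshold.

    node : dict with keys 'name' and 'children' (a hypothetical format)
    prob : dict mapping node name -> predicted probability
    """
    while node['children']:
        best = max(node['children'], key=lambda c: prob[c['name']])
        if prob[best['name']] < threshold:
            break  # too uncertain to go deeper; stop here
        node = best
    return node['name']

# A tiny hand-made branch matching the dog example (made-up numbers).
tree = {'name': 'physical object', 'children': [
    {'name': 'dog', 'children': [
        {'name': 'hunting dog', 'children': [
            {'name': 'sighthound', 'children': []}]}]}]}
probs = {'dog': 0.9, 'hunting dog': 0.8, 'sighthound': 0.3}
print(predict_class(tree, probs))  # -> 'hunting dog'
```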

Performing classification in this manner also has benefits: performance degrades gracefully on new or unknown object categories. For example, if the network sees a picture of a dog but is uncertain which type of dog it is, it will stop at “dog” with high confidence and the output will be “dog”.

The combined dataset was created using the COCO detection dataset and the top 9000 classes from the full ImageNet release. YOLO9000 uses three priors (anchor boxes) instead of 5 to limit the output size. It learns to find objects in images using the detection data from COCO, and it learns to classify a wide variety of these objects using data from ImageNet.

When the network sees a detection image, we backpropagate loss as normal. When it sees a classification image we only backpropagate classification loss.

Since COCO does not have a bounding box label for many categories, YOLO9000 struggles to model some categories like “sunglasses” or “swimming trunks.”

YOLOv3:

It’s a little bigger but more accurate.

Bounding Box Prediction:

Same as YOLO9000, the network predicts 4 coordinates for each bounding box: tx, ty, tw, th. If the cell is offset from the top left corner of the image by (cx, cy) and the bounding box prior has width and height pw, ph, then the predictions correspond to the same equations we saw for YOLOv2 above.

YOLOv3 also predicts an objectness score (confidence) for each bounding box using logistic regression. This should be 1 if the bounding box prior overlaps a ground truth object by more than any other bounding box prior does. For example, prior 1 overlaps the first ground truth object more than any other prior (it has the highest IOU), and prior 2 overlaps the second ground truth object more than any other prior. The system assigns only one bounding box prior to each ground truth object. If a bounding box prior is not assigned to a ground truth object, it incurs no loss for coordinate or class predictions, only for objectness.

If a box does not have the highest IOU but does overlap a ground truth object by more than some threshold, we ignore the prediction (they use a threshold of 0.5).

Multi-label prediction:

In some datasets, like the Open Images Dataset, an object may have multiple labels. For example, an object can be labeled as a woman and as a person.

In this dataset, there are many overlapping labels. Using a softmax for class prediction imposes the assumption that each box has exactly one class, which is often not the case (as in the Open Images Dataset).

For this reason, YOLOv3 does not use a softmax; instead, it simply uses independent logistic classifiers, one for each class. During training, they used binary cross-entropy loss for the class predictions.

Using independent logistic classifiers, an object can be detected as a woman and as a person at the same time.
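A small NumPy sketch of the difference (the logits and labels below are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Raw class logits for one predicted box, e.g. ["person", "woman", "car"].
logits = np.array([2.0, 1.5, -3.0])

# A softmax forces the classes to compete and sum to 1 (one winner):
softmax = np.exp(logits) / np.exp(logits).sum()

# Independent logistic classifiers let several classes be "on" at once:
probs = sigmoid(logits)             # ~ [0.88, 0.82, 0.05]

# Binary cross-entropy against a multi-label target:
target = np.array([1.0, 1.0, 0.0])  # both "person" and "woman" are correct
bce = -(target * np.log(probs) + (1 - target) * np.log(1 - probs)).mean()
print(softmax.round(2), probs.round(2), round(float(bce), 3))
```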

Small objects detection:

YOLO struggles with small objects. With YOLOv3, however, we see better performance on small objects, largely thanks to the multi-scale predictions described below, which draw on finer-grained information from earlier feature maps. On the other hand, compared to the previous version, YOLOv3 has worse performance on medium and larger objects.

Feature Extractor Network (Darknet-53):

YOLOv3 uses a new network for feature extraction: a hybrid of the network used in YOLOv2 (Darknet-19) and residual networks, so it has some shortcut connections. It has 53 convolutional layers, so they call it Darknet-53.

Darknet-53 performs on par with state-of-the-art classifiers but with fewer floating point operations and higher speed.

After training on classification, the fully connected layer is removed from Darknet-53.

Predictions Across Scales:

Unlike YOLO and YOLOv2, which predict the output only at the last layer, YOLOv3 predicts boxes at 3 different scales, as illustrated in the image below.

This is a simple diagram of the network. I didn’t draw the shortcut connections, for simplicity.

At each scale, YOLOv3 uses 3 anchor boxes and predicts 3 boxes for each grid cell. Each object is still assigned to only one grid cell in one detection tensor.

Performance:

When we plot accuracy vs. speed on the AP50 metric (AP at IOU 0.5), we see that YOLOv3 has significant benefits over other detection systems.

However, YOLOv3’s performance drops significantly as the IOU threshold increases (e.g., IOU = 0.75), indicating that YOLOv3 struggles to get the boxes perfectly aligned with the object. Still, it is faster than the other methods.

Now I will leave you with this video from the YOLO website:

https://pjreddie.com/darknet/yolo/

Try it yourself with a pre-trained model:

The original YOLO model was written in Darknet, an open-source neural network framework written in C and CUDA that supports both CPU and GPU computation. You can follow this link to install Darknet and download the pre-trained weights.

For Windows, you can also use darkflow, a TensorFlow implementation of Darknet, but darkflow doesn’t offer an implementation of YOLOv3 yet.

If you are interested in running YOLO without a GPU, you can read about YOLO-LITE, a real-time object detection model developed to run on portable devices, such as a laptop or cellphone, that lack a GPU.

You can also visit this GitHub repository to learn about Tiny-YOLO, a version of YOLO for cellphones.

Resources:

YOLO: https://arxiv.org/pdf/1506.02640.pdf

YOLOv2 and YOLO9000: https://arxiv.org/pdf/1612.08242.pdf

YOLOv3: https://arxiv.org/pdf/1804.02767.pdf

