Residual Blocks & ResNets

Amro Kamal
Apr 4, 2021


Cover image: Marra Mountains (Sudan)

When deep neural networks started to boom in 2012, after the release of AlexNet (the winner of ILSVRC 2012), the common belief was that training a deeper neural network (increasing the number of layers) would always increase its performance, and many studies showed that depth is a crucial ingredient for the success of neural networks.
Since 2012, network depth has played one of the most important roles in the successes of deep learning.

AlexNet, with only 8 layers (5 convolutional and 3 fully connected), achieved a 16.4% top-5 error on the ImageNet dataset. In 2014 the Visual Geometry Group at Oxford trained the VGG networks with 16 and 19 layers (7.3% top-5 error), following an architecture similar in spirit to AlexNet. Also in 2014, Google revealed its first Inception model, GoogLeNet (6.7% top-5 error), with 22 layers.

We can see that the top-5 error decreased from 16.4% for the 8-layer AlexNet to 6.7% for the 22-layer GoogLeNet.

Of course, network depth was not the only factor behind the improvement of conv nets, since many training tricks and methods were invented during this period, but depth played an important role.

The question now is: is learning better networks as easy as stacking more layers? Is it just a matter of adding more and more layers?

In 2015, researchers from Microsoft answered this question by training two networks on the CIFAR-10 dataset, following the same general architecture for both, one with 20 layers and the other with 56 layers. The results, which may look surprising, show that the 20-layer network outperforms the 56-layer network (similar results were observed on the ImageNet dataset).

Figure 1: Training and test error for the 20-layer and 56-layer networks on the CIFAR-10 dataset. The deeper network (red) has a higher error (image source: paper)

The Degradation Problem:

These results show that getting a lower error is not just a matter of adding more layers to the network. As the experiment shows, the shallower network learns better than its deeper counterpart, which is quite counter-intuitive. This phenomenon of accuracy dropping as the number of layers increases is called the degradation problem. The figure below shows an approximation of how a conv net behaves as we make it deeper by adding more layers.

Figure 2: The degradation problem

From the figure, we can see that as the network depth increases, the accuracy gets saturated and then degrades rapidly. This degradation happens when the model depth (number of layers) exceeds a certain limit.

We also need to emphasize that this is not an overfitting problem: overfitting would make the training error very low and the test error high, but here even the training error increases (underfitting).

The Solution:

We therefore need a way to optimize very deep networks, with the goal of making networks of hundreds of layers more accurate than their shallower counterparts.

In 2015, a research group from Microsoft presented a solution to the degradation problem by introducing what they called a ‘deep residual learning framework’.

They introduced the idea of directly connecting the input of layer (L) to the input of layer (L+k), skipping the k layers in between. This connection is called a residual connection, skip connection, or shortcut connection. It simply performs an identity mapping, whose output is added to the output of the stacked layers before applying the activation function.

The block formed by the k layers together with the skip connection is called a residual block. Usually the number of layers inside the skip connection is two or three.
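To make the idea concrete, here is a minimal sketch of a two-layer residual block in PyTorch. The channel counts and layer sizes are illustrative, not the exact configuration from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """A minimal two-layer residual block (illustrative sizes)."""
    def __init__(self, channels):
        super().__init__()
        # Two 3x3 convs that keep the spatial size and channel count unchanged,
        # so the input can be added directly to their output.
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))  # first stacked layer
        out = self.bn2(self.conv2(out))        # second stacked layer (no activation yet)
        out = out + x                          # skip connection: add the identity mapping
        return F.relu(out)                     # activation applied after the addition

# Example: a block that maps a 64-channel feature map to a 64-channel feature map.
block = ResidualBlock(64)
y = block(torch.randn(1, 64, 56, 56))
print(y.shape)  # torch.Size([1, 64, 56, 56])
```

Note that the addition happens before the final activation, exactly as described above, and it only works because the two stacked layers preserve the shape of the input.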

Figure 3: The architecture of VGG-19, a 34-layer network without residual connections, and ResNet-34 (image source: paper)

We can understand the importance of these skip connections from several different viewpoints:

1-Simplifying the Learning of the layers inside the connection:

Let us consider H(x) as an underlying mapping to be fit by a few stacked layers (say, 2 layers), with x denoting the input to the first of these layers. Instead of hoping that these two layers together directly fit the desired underlying mapping H(x), we explicitly let them fit a residual mapping H(x) - x, by mapping x directly from the block’s input to its output as illustrated in the next figure.

Figure 4: Left: a regular block of layers. Right: a residual block with two layers. The two layers inside the skip connection learn the residual mapping H(x) - x instead of learning H(x)

This residual learning framework is based on a hypothesis presented by the authors: that it is easier to optimize the residual mapping H(x) - x than the original mapping H(x). The paper does not provide a rigorous justification for this, but in practice the residual mapping is often easier to optimize.
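In symbols (using the paper’s notation, where F denotes the mapping learned by the stacked layers and {W_i} their weights), the block computes:

```latex
\[
\mathcal{F}(x) := \mathcal{H}(x) - x,
\qquad
y = \mathcal{F}(x, \{W_i\}) + x
\]
```

So if the optimal mapping is close to the identity, the stacked layers only need to push F toward zero, which is easier than reproducing the identity from scratch.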

This is how the idea is explained in the paper, although a better explanation may be what we discuss in points (2) and (3) below.

2-Fighting Vanishing Gradient:

For a network with N layers, we use the chain rule to update the weights of each layer using the formula:

W_i = W_i - lr * dL/dW_i

where lr is the learning rate and dL/dW_i is the gradient of the loss with respect to the weights of layer i.

Consider the 3-layer network in the diagram, where X is the input, W_i are the weights of layer i, Z_i is the output of layer i before the activation, and A_i is the output after applying the activation function.

To find the derivative of the loss with respect to W_2, we just follow the arrows:

dL/dW_2 = (dL/dA_3 x dA_3/dZ_3) x (dZ_3/dA_2 x dA_2/dZ_2) x dZ_2/dW_2

This is a straightforward application of the chain rule. What matters here is that the length of this chain can be very long when updating the first layers of a deep network. Such a long chain of multiplications causes problems when the numbers being multiplied have very small or very large magnitudes: the result will be an extremely small or extremely large number.

Remember that we need this number (the gradient) to update the weights.

Usually in neural networks the elements of the chain are small numbers. When the product of the chain elements becomes too small, either the computer cannot represent the number at all or training becomes very slow (remember that W_i = W_i - lr * dL/dW_i). This problem is very common in deep networks. Since the gradient becomes very small, it is called the vanishing gradient problem.
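To see the effect numerically, here is a small sketch in plain NumPy-free Python. The per-layer factor of 0.25 is a made-up magnitude chosen for illustration, standing in for one term of the chain above:

```python
# Hypothetical per-layer chain-rule factor (e.g. activation derivative times a
# weight term). Magnitudes below 1 are common with saturating activations.
factor = 0.25   # assumed, for illustration only
lr = 0.01       # learning rate

for depth in [5, 20, 56]:
    grad = factor ** depth   # product of `depth` chain-rule terms
    update = lr * grad       # lr * dL/dW_i
    print(f"{depth:>2} layers: gradient ~ {grad:.2e}, weight update ~ {update:.2e}")

# 5 layers: gradient ~ 9.77e-04  -> usable update
# 56 layers: gradient ~ 1.93e-34 -> effectively zero, so early layers stop learning
```

The exact numbers depend entirely on the assumed factor, but the trend is the point: the deeper the layer is from the loss, the more terms get multiplied and the faster the update shrinks toward zero.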

If the gradient instead has a very large magnitude, the network faces what is called the exploding gradient problem.

The next figure shows how the gradient starts to vanish as we go back from the deeper layers (layer 7) to the shallower layers (layer 1).

But when we do backpropagation after adding the skip connections, the gradient reaches the weights of a layer through two paths, which reduces the chance of a vanishing gradient. If the gradient along path-1 (through the layers) starts to vanish, adding the gradient from path-2 (through the skip connection) makes the combined gradient much less likely to vanish.
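A short derivation makes this concrete (a sketch using the block form y = F(x) + x from earlier, not an argument taken from the paper). Differentiating the loss L with respect to the block input x gives one term per path:

```latex
\[
\frac{\partial L}{\partial x}
= \frac{\partial L}{\partial y}\left(\frac{\partial F}{\partial x} + I\right)
= \underbrace{\frac{\partial L}{\partial y}\,\frac{\partial F}{\partial x}}_{\text{path-1: through the layers}}
\;+\;
\underbrace{\frac{\partial L}{\partial y}}_{\text{path-2: through the skip connection}}
\]
```

Even if the path-1 term becomes tiny, the path-2 term passes the upstream gradient through unchanged, so the total gradient flowing to the earlier layers is much less likely to vanish.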

Although the paper does not point to the vanishing gradient as the main reason for the degradation problem, it is still one of the common problems that appear as we increase the depth of a network.

3-Simplifying the model:

Thanks to the skip connection, we can still obtain an effectively simple (shallow) model even after stacking a lot of layers. If the model can find weights that make H(x) - x = 0, the layers inside the connection act as if they do not exist, and the block behaves as if we were using a shallower network.

Residual Networks:

Utilizing the idea of residual connections, the authors trained a family of networks they called ResNets. A ResNet has a skip connection every 2 or 3 layers. Using a sequence of these residual blocks, they trained very deep networks with more than 150 layers. The paper presents several versions of ResNet with different numbers of layers, including ResNet-34 (34 layers), ResNet-50 (50 layers), ResNet-101 (101 layers), and ResNet-152 (152 layers).
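If you want to experiment with these architectures directly, pre-built versions are available, for example in torchvision. A minimal sketch (the `weights` argument assumes torchvision 0.13 or newer; older versions use `pretrained=True` instead):

```python
import torch
from torchvision import models

# Load a ResNet-34 with ImageNet-pretrained weights.
resnet34 = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)
resnet34.eval()

with torch.no_grad():
    logits = resnet34(torch.randn(1, 3, 224, 224))  # one random 224x224 RGB "image"
print(logits.shape)  # torch.Size([1, 1000]) -- one score per ImageNet class
```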

The table below shows the top-1 and top-5 errors on the ImageNet validation set for ResNets compared to other well-known networks such as VGG and GoogLeNet. We can see that ResNet had the lowest error at that time (2015).

Figure 5: ImageNet error (image source: paper)

The table only goes up to 2015 (when the paper was published), but today the idea of residual connections is used widely in most architectures designed after 2015.

The graphs in Figure 6 show the training error of the same base architectures with and without residual connections. We can see that without residual connections (left), the 18-layer network outperforms the 34-layer network, and neither of them gets much below a 30% error rate.

When using residual connections (right), the 34-layer network easily outperforms the 18-layer network, and the 34-layer residual network drops well below the 30% error rate.

Figure 6: Training curves for the 18-layer and 34-layer plain networks (left) and their residual counterparts (right) (image source: paper)

Residual Networks Architecture:

I will not discuss the architecture of every network in detail, since there is nothing particularly complex about it. Generally, ResNets have fewer filters and lower complexity than VGG nets. For example, ResNet-34 has 3.6 billion FLOPs (multiply-adds), which is only 18% of VGG-19’s 19.6 billion FLOPs.

We can also notice that the authors did not use pooling layers to downsample (except for a max pool after the first conv layer); instead they used convolutions with a stride of 2. They also used global average pooling after the last conv layer, which reduces the number of FC layers needed.

The table below summarizes the architecture of different versions of ResNet:

Figure 7: Architectures of the different ResNet versions (image source: paper)
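For blocks where the spatial size or channel count changes (the stride-2 stages mentioned above), the input can no longer be added directly to the block’s output. Here is a sketch of one way to handle this, using a 1x1 projection on the shortcut in the spirit of the paper’s projection option; the layer sizes are illustrative, matching the style of the earlier block:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DownsampleResidualBlock(nn.Module):
    """Residual block that halves the spatial size with a stride-2 conv instead of pooling."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=2, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # 1x1 stride-2 projection so the shortcut matches the new shape.
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 1, stride=2, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.shortcut(x))  # add the projected input, then activate

# Example: 64-channel 56x56 input -> 128-channel 28x28 output.
block = DownsampleResidualBlock(64, 128)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 128, 28, 28])
```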

Resources:

1- Deep Residual Learning for Image Recognition, He et al., 2015 (paper)
