The Residual Network (ResNet) Standard

ResNet, short for Residual Network, is one of the most influential models in the history of deep learning. Before ResNet, training very deep neural networks was difficult, largely because of the vanishing gradient problem: the signal used to update the weights becomes very small as it moves backward through many layers.

ResNet addresses this problem with skip connections, also known as shortcuts. These connections let the signal and the gradient flow around a group of layers instead of passing through every one of them. As a result, very deep networks with up to hundreds of layers, such as ResNet-50, ResNet-101, and ResNet-152, can be trained successfully.[1]

MATLAB Usage

ResNet-50 is easy to use in MATLAB through the resnet50 function. It is a common default choice for transfer learning experiments, because its behavior is well documented and well understood.[2]
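
The snippet below is a minimal sketch of this workflow, assuming the Deep Learning Toolbox and the ResNet-50 support package are installed. The folder name myImages, the number of classes, and the training settings are placeholder values, and the exact names of the final layers should be confirmed with net.Layers.

  % Load the pretrained ResNet-50 and prepare it for a new classification task.
  net    = resnet50;                      % requires the ResNet-50 support package
  lgraph = layerGraph(net);               % editable graph of the network layers

  numClasses = 5;                         % placeholder: number of target classes
  lgraph = replaceLayer(lgraph, 'fc1000', ...
      fullyConnectedLayer(numClasses, 'Name', 'new_fc'));
  lgraph = replaceLayer(lgraph, 'ClassificationLayer_fc1000', ...
      classificationLayer('Name', 'new_cls'));     % layer names: verify with net.Layers

  % Placeholder data: one subfolder per class inside myImages.
  imds    = imageDatastore('myImages', 'IncludeSubfolders', true, ...
      'LabelSource', 'foldernames');
  augimds = augmentedImageDatastore([224 224], imds);   % ResNet-50 expects 224x224 RGB input

  opts = trainingOptions('sgdm', 'InitialLearnRate', 1e-4, 'MaxEpochs', 5);
  trainedNet = trainNetwork(augimds, lgraph, opts);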

The Math Behind Vanishing Gradients and Training Problems

The main problem when training very deep neural networks is called the vanishing gradient problem. This happens when the gradients from backpropagation get smaller and smaller as they move to the earlier layers of the network. Because of this, the first layers almost stop learning.

The cause is the chain rule. During backpropagation, the gradient at each layer is multiplied by the gradients of all the layers that come after it, so the result is a product of many factors. When most of these factors are smaller than one, the product becomes extremely small.

If the network uses activation functions that can saturate, or even ordinary non-linear functions with unfavorable weight initializations, the gradients shrink exponentially fast and can approach zero. When this happens, the early layers cannot learn properly, and training slows down dramatically or stops making progress.[3]
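
The scale of this effect can be seen with a small back-of-the-envelope calculation (a sketch, not taken from the cited survey): the derivative of the sigmoid activation is at most 0.25, so a chain of saturating layers scales the gradient by at most a factor of 0.25 per layer.

  % Rough illustration of exponential gradient shrinkage in a deep plain network.
  L = 30;                              % hypothetical network depth
  perLayerFactor = 0.25;               % maximum derivative of the sigmoid activation
  gradientScale = perLayerFactor ^ L;  % upper bound on the surviving gradient magnitude
  fprintf('After %d layers the gradient is scaled by at most %.2g\n', L, gradientScale);
  % Prints about 8.7e-19: the earliest layers receive almost no learning signal.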

The researchers who developed ResNet made a further, surprising observation, known as the degradation problem. At first, it was assumed that very deep networks fail only because of vanishing gradients. However, even after the gradient problem was largely addressed with careful weight initialization and Batch Normalization (which keeps activations and gradients at a reasonable scale), very deep plain networks still performed worse than shallower ones.

For example, a plain 34-layer network had higher training error than an 18-layer network. This is counterintuitive, because the 34-layer network can in principle represent everything the 18-layer network can represent, and more. In real experiments, however, it trained worse and also had higher test error.

This observation is sometimes called the "ResNet Hypothesis." It shows that the problem is not only vanishing gradients: the deeper issue is that very deep networks are hard to optimize. The learning process becomes slow and difficult, and the model does not converge easily in such a complex, highly non-linear space.[3]

Residual Learning

To address this optimization problem, ResNet introduced residual learning, implemented with skip (shortcut) connections.

In a conventional deep network, a stack of layers tries to learn a target function H(x) directly from the input x. The network must learn the full mapping on its own, which becomes very hard as the network gets deeper.

ResNet changes this idea. It adds a skip connection that jumps over one or more layers. Because of this shortcut, the network does not try to learn H(x) directly. Instead, it learns a residual function:

F(x) := H(x) - x

The output of the block becomes:

Y = F(x) + x

So instead of learning the whole mapping from scratch, the layers only have to learn the difference between the input and the desired output.
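
To make the structure concrete, the sketch below builds one residual block with an identity shortcut as a MATLAB layer graph; the input size, filter count, and layer names are illustrative choices, not values taken from the original architecture.

  % One residual block: two 3x3 convolutions form F(x), the shortcut carries x.
  layers = [
      imageInputLayer([56 56 64], 'Name', 'in', 'Normalization', 'none')
      convolution2dLayer(3, 64, 'Padding', 'same', 'Name', 'conv1')
      batchNormalizationLayer('Name', 'bn1')
      reluLayer('Name', 'relu1')
      convolution2dLayer(3, 64, 'Padding', 'same', 'Name', 'conv2')
      batchNormalizationLayer('Name', 'bn2')
      additionLayer(2, 'Name', 'add')               % adds F(x) and the shortcut input
      reluLayer('Name', 'relu_out')];

  lgraph = layerGraph(layers);                      % sequential connections; shortcut still missing
  lgraph = connectLayers(lgraph, 'in', 'add/in2');  % skip connection: route x around F
  % plot(lgraph) or analyzeNetwork(lgraph) can be used to inspect the resulting block.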

The main advantage concerns optimization. If the optimal mapping is close to simply passing the input forward (an identity mapping), it is much easier for the network to drive the nonlinear layers toward F(x) = 0, which only requires pushing their weights toward zero. This is easier than forcing a stack of nonlinear layers to imitate an identity function exactly.

Because of this, adding more layers should not make the network worse. In theory, a deeper network can always perform at least as well as a shallower one, because the extra blocks can learn F(x) = 0 and simply reproduce the shallower network's behavior. In practice, good training settings are still needed for this to work reliably.[4]

The identity mapping means the original input x is added directly to the output of the following layers. This creates a second path along which the signal can move forward easily, and the gradient can flow backward without obstruction.

During backpropagation, the gradient of the loss L with respect to the input x gains an extra additive term from the shortcut. Because this term is added rather than multiplied through many layers, it avoids the mechanism by which gradients shrink toward zero. In plain networks, repeated multiplication can make the gradients almost disappear; here, the signal from the final output still reaches the first layers with useful magnitude. The early layers can therefore learn better, and training becomes more stable.
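
A short way to see this for a single block (written in scalar form for simplicity): since Y = F(x) + x, the chain rule gives

  ∂L/∂x = ∂L/∂Y · (∂F/∂x + 1)

The "+ 1" term comes from the identity shortcut, so even if ∂F/∂x is very small, the gradient reaching x is not wiped out by this block; over many blocks, these additive identity terms keep the early layers connected to the loss.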

Sometimes the shape of the feature maps changes inside the network: the spatial size shrinks, or the number of channels increases. In that case the skip connection cannot add x directly, because the dimensions no longer match. ResNet therefore places a small 1×1 convolution layer in the shortcut path to adjust the dimensions, so that both feature maps have the same shape and can be added without a mismatch. This keeps the block working correctly even when the size changes.
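
A possible MATLAB sketch of such a projection shortcut is shown below; the sizes are again illustrative, with the main branch halving the spatial resolution and doubling the channel count, and a 1×1 convolution with stride 2 doing the same on the shortcut path.

  % Downsampling residual block: the shortcut needs a 1x1 convolution so that
  % both inputs of the addition layer have the same spatial size and channel count.
  layers = [
      imageInputLayer([56 56 64], 'Name', 'in', 'Normalization', 'none')
      convolution2dLayer(3, 128, 'Stride', 2, 'Padding', 'same', 'Name', 'conv1')
      batchNormalizationLayer('Name', 'bn1')
      reluLayer('Name', 'relu1')
      convolution2dLayer(3, 128, 'Padding', 'same', 'Name', 'conv2')
      batchNormalizationLayer('Name', 'bn2')
      additionLayer(2, 'Name', 'add')
      reluLayer('Name', 'relu_out')];
  lgraph = layerGraph(layers);

  % Projection shortcut: 1x1 convolution, stride 2, 128 filters (56x56x64 -> 28x28x128).
  lgraph = addLayers(lgraph, [
      convolution2dLayer(1, 128, 'Stride', 2, 'Name', 'conv_proj')
      batchNormalizationLayer('Name', 'bn_proj')]);
  lgraph = connectLayers(lgraph, 'in', 'conv_proj');
  lgraph = connectLayers(lgraph, 'bn_proj', 'add/in2');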

References

  1. MathWorks. Train Residual Network for Image Classification. [Online]. Available at: https://www.mathworks.com/help/deeplearning/ug/train-residual-network-for-image-classification.html
  2. MathWorks. Pretrained Convolutional Neural Networks. MATLAB Documentation. Available at: https://www.mathworks.com/help/deeplearning/ug/pretrained-convolutional-neural-networks.html
  3. Xu, G., Wang, X., Wu, X., Leng, X. and Xu, Y., 2024. Development of skip connection in deep neural networks for computer vision and medical image analysis: A survey. arXiv preprint arXiv:2405.01725.
  4. Sandushi W. Understanding ResNet-50: Solving the Vanishing Gradient Problem with Skip Connections. Medium. Available at: https://medium.com/@sandushiw98/understanding-resnet-50-solving-the-vanishing-gradient-problem-with-skip-connections-5591fcb7ff74