Vision Transformers (ViT)

'''Vision Transformers (ViT)'''


A recent development in the field is the Vision Transformer (ViT). Instead of using the convolutional layers typical of CNNs, ViT uses the Transformer architecture, which was originally developed for Natural Language Processing (NLP). The image is cut into small fixed-size patches, each patch is flattened into a vector and treated like a token, and the resulting token sequence is processed with a self-attention mechanism.<ref name="MathWorksViT">MathWorks. ''Train Vision Transformer Network for Image Classification''. Available at: https://www.mathworks.com/help/deeplearning/ug/train-vision-transformer-network-for-image-classification.html</ref>
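
To make the patch-tokenization step concrete, here is a minimal sketch in plain MATLAB (not the Toolbox API). The 224×224 input, 16×16 patch size, and embedding dimension are illustrative assumptions in the spirit of ViT-B/16, not values prescribed by the MathWorks example.

<syntaxhighlight lang="matlab">
% Minimal sketch of ViT-style patch tokenization (illustrative values:
% 224x224x3 input, 16x16 patches, as in ViT-B/16).
img = rand(224, 224, 3);                 % stand-in for a real image
patchSize = 16;
n = 224 / patchSize;                     % 14 patches per side, 196 total

% Cut the image into 16x16x3 blocks and flatten each block into a row.
blocks = mat2cell(img, repmat(patchSize, 1, n), repmat(patchSize, 1, n), 3);
tokens = cell2mat(cellfun(@(p) p(:)', blocks(:), 'UniformOutput', false));

% tokens is 196x768: one "token" (flattened patch) per row. A learned
% linear projection then maps each row to the model dimension:
E = randn(patchSize^2 * 3, 512);         % assumed embedding matrix
X = tokens * E;                          % 196x512 token embeddings
</syntaxhighlight>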


'''Global Context:''' Unlike CNNs, which first look at small local features, ViT can capture the global context from the very first layer through self-attention. This helps it understand the whole image more directly.
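
The global receptive field follows directly from the attention computation: every patch token produces a weight for every other token. The toy single-head sketch below illustrates this with random matrices; all names and sizes are illustrative assumptions (a real ViT uses multiple heads and learned weights).

<syntaxhighlight lang="matlab">
% Toy single-head self-attention over 196 patch tokens.
X  = randn(196, 64);                     % token embeddings (196 patches)
Wq = randn(64, 64); Wk = randn(64, 64); Wv = randn(64, 64);

Q = X * Wq;  K = X * Wk;  V = X * Wv;
S = (Q * K') / sqrt(64);                 % 196x196 similarity scores
A = exp(S - max(S, [], 2));              % row-wise softmax ...
A = A ./ sum(A, 2);                      % ... gives attention weights
Y = A * V;   % each output row mixes information from ALL patches at once
</syntaxhighlight>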


'''MATLAB Support:''' MATLAB supports ViT models, such as <tt>visionTransformer</tt>, in the Computer Vision Toolbox. They are very powerful when fine-tuned on large datasets, but compared with CNN models they may need more data to train properly when starting from scratch.<ref name="MathWorksViT" />
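
As a rough sketch of how inference with the pretrained network might look (assuming the Computer Vision Toolbox Model for Vision Transformer Network support package is installed; <tt>peppers.png</tt> is a sample image that ships with MATLAB):

<syntaxhighlight lang="matlab">
% Minimal inference sketch with the pretrained ViT. Assumes the
% Vision Transformer support package is installed.
net = visionTransformer;                    % pretrained ViT, patch size 16
inputSize = net.Layers(1).InputSize;        % expected image size

img = imresize(im2single(imread("peppers.png")), inputSize(1:2));
scores = predict(net, dlarray(img, "SSC")); % class scores
[~, idx] = max(extractdata(scores));        % index of the top class
</syntaxhighlight>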
 
== References ==
<references />
