<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.hshl.de/wiki/index.php?action=history&amp;feed=atom&amp;title=Vision_Transformers_%28ViT%29</id>
	<title>Vision Transformers (ViT) - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.hshl.de/wiki/index.php?action=history&amp;feed=atom&amp;title=Vision_Transformers_%28ViT%29"/>
	<link rel="alternate" type="text/html" href="https://wiki.hshl.de/wiki/index.php?title=Vision_Transformers_(ViT)&amp;action=history"/>
	<updated>2026-04-19T20:02:10Z</updated>
	<subtitle>Revision history for this page on HSHL Mechatronik</subtitle>
	<generator>MediaWiki 1.43.0</generator>
	<entry>
		<id>https://wiki.hshl.de/wiki/index.php?title=Vision_Transformers_(ViT)&amp;diff=147002&amp;oldid=prev</id>
		<title>Ajay.paul@stud.hshl.de on 11 February 2026 at 13:11</title>
		<link rel="alternate" type="text/html" href="https://wiki.hshl.de/wiki/index.php?title=Vision_Transformers_(ViT)&amp;diff=147002&amp;oldid=prev"/>
		<updated>2026-02-11T13:11:16Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 13:11, 11 February 2026&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l1&quot;&gt;Line 1:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 1:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== Vision Transformers (ViT) ==&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt; &lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;A recent development in the field is the Vision Transformer (ViT). Instead of the convolution layers used by CNNs, ViT uses the Transformer model, which was originally developed for Natural Language Processing (NLP). The image is split into small fixed-size patches; each patch is turned into a vector and treated as a token, and the token sequence is then processed with a self-attention mechanism.&amp;lt;ref name=&amp;quot;MathWorksViT&amp;quot;&amp;gt;MathWorks. &amp;#039;&amp;#039;Train Vision Transformer Network for Image Classification&amp;#039;&amp;#039;. Available at: https://www.mathworks.com/help/deeplearning/ug/train-vision-transformer-network-for-image-classification.html&amp;lt;/ref&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;A recent development in the field is the Vision Transformer (ViT). Instead of the convolution layers used by CNNs, ViT uses the Transformer model, which was originally developed for Natural Language Processing (NLP). The image is split into small fixed-size patches; each patch is turned into a vector and treated as a token, and the token sequence is then processed with a self-attention mechanism.&amp;lt;ref name=&amp;quot;MathWorksViT&amp;quot;&amp;gt;MathWorks. &amp;#039;&amp;#039;Train Vision Transformer Network for Image Classification&amp;#039;&amp;#039;. Available at: https://www.mathworks.com/help/deeplearning/ug/train-vision-transformer-network-for-image-classification.html&amp;lt;/ref&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;/table&gt;</summary>
		<author><name>Ajay.paul@stud.hshl.de</name></author>
	</entry>
	<entry>
		<id>https://wiki.hshl.de/wiki/index.php?title=Vision_Transformers_(ViT)&amp;diff=147001&amp;oldid=prev</id>
		<title>Ajay.paul@stud.hshl.de on 11 February 2026 at 13:10</title>
		<link rel="alternate" type="text/html" href="https://wiki.hshl.de/wiki/index.php?title=Vision_Transformers_(ViT)&amp;diff=147001&amp;oldid=prev"/>
		<updated>2026-02-11T13:10:58Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 13:10, 11 February 2026&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l1&quot;&gt;Line 1:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 1:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;**&lt;/del&gt;Vision Transformers (ViT)&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;**&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== &lt;/ins&gt;Vision Transformers (ViT) &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;==&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;A recent development in the field is the Vision Transformer (ViT). Instead of the convolution layers used by CNNs, ViT uses the Transformer model, which was originally developed for Natural Language Processing (NLP). The image is split into small fixed-size patches; each patch is turned into a vector and treated as a token, and the token sequence is then processed with a self-attention mechanism.&amp;lt;ref name=&quot;MathWorksViT&quot;&amp;gt;MathWorks. &#039;&#039;Train Vision Transformer Network for Image Classification&#039;&#039;. Available at: https://www.mathworks.com/help/deeplearning/ug/train-vision-transformer-network-for-image-classification.html&amp;lt;/ref&amp;gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;A recent development in the field is the Vision Transformer (ViT). Instead of the convolution layers used by CNNs, ViT uses the Transformer model, which was originally developed for Natural Language Processing (NLP). The image is split into small fixed-size patches; each patch is turned into a vector and treated as a token, and the token sequence is then processed with a self-attention mechanism.&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&#039;&#039;&#039;Global Context:&#039;&#039;&#039; Unlike CNNs, which first look at small local features, ViT can see the global context from the first layer using self-attention. This helps it understand the whole image more directly.&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;**Global Context:** Unlike CNNs, which first look at small local features, ViT can see the global context from the first layer using self-attention. This helps it understand the whole image more directly.&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&#039;&#039;&#039;MATLAB Support:&#039;&#039;&#039; MATLAB supports ViT models, such as &amp;lt;tt&amp;gt;visionTransformer&amp;lt;/tt&amp;gt;, in the Computer Vision Toolbox. They are very powerful when fine-tuned on big datasets, but they may need more data than CNN models to train properly from scratch.&amp;lt;ref name=&quot;MathWorksViT&quot; /&amp;gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;**MATLAB Support:** MATLAB supports ViT models, such as `visionTransformer`, in the Computer Vision Toolbox. They are very powerful when fine-tuned on big datasets, but they may need more data than CNN models to train properly from scratch.&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== References ==&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;references /&amp;gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Ajay.paul@stud.hshl.de</name></author>
	</entry>
	<entry>
		<id>https://wiki.hshl.de/wiki/index.php?title=Vision_Transformers_(ViT)&amp;diff=147000&amp;oldid=prev</id>
		<title>Ajay.paul@stud.hshl.de: Created the page with “**Vision Transformers (ViT)**  A recent development in the field is the Vision Transformer (ViT). Instead of the convolution layers used by CNNs, ViT uses the Transformer model, which was originally developed for Natural Language Processing (NLP). The image is split into small fixed-size patches; each patch is turned into a vector and treated as a token, and the token sequence is then processed with a self-attention mechanism.  **Global Context:** Unlike CNNs, which first lo…”</title>
		<link rel="alternate" type="text/html" href="https://wiki.hshl.de/wiki/index.php?title=Vision_Transformers_(ViT)&amp;diff=147000&amp;oldid=prev"/>
		<updated>2026-02-11T13:07:22Z</updated>

		<summary type="html">&lt;p&gt;Die Seite wurde neu angelegt: „**Vision Transformers (ViT)**  A new change in the field is the Vision Transformer (ViT). Instead of using the normal convolution layers like CNNs, ViT use the Transformer model, which was first made for Natural Language Processing (NLP). The image is cut into small fixed-size patches, then each patch is turned into a vector and treated like a token. After that, it is processed using self-attention mechanism.  **Global Context:** Unlike CNNs that first lo…“&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;**Vision Transformers (ViT)**&lt;br /&gt;
&lt;br /&gt;
A recent development in the field is the Vision Transformer (ViT). Instead of the convolution layers used by CNNs, ViT uses the Transformer model, which was originally developed for Natural Language Processing (NLP). The image is split into small fixed-size patches; each patch is turned into a vector and treated as a token, and the token sequence is then processed with a self-attention mechanism.&lt;br /&gt;
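&lt;br /&gt;
A minimal sketch of the patch-to-token step (the image and patch sizes here are illustrative assumptions, not taken from the cited MathWorks page):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;matlab&amp;quot;&amp;gt;&lt;br /&gt;
% Illustrative sketch: turn an image into a sequence of patch tokens.&lt;br /&gt;
img = single(rand(224, 224, 3));   % assumed 224x224 RGB input&lt;br /&gt;
P = 16;                            % assumed patch size; (224/16)^2 = 196 patches&lt;br /&gt;
patches = mat2cell(img, repmat(P, 1, 224/P), repmat(P, 1, 224/P), 3);&lt;br /&gt;
rows = cellfun(@(p) reshape(p, 1, []), patches(:), 'UniformOutput', false);&lt;br /&gt;
tokens = cell2mat(rows);           % 196 x 768: one row (token) per patch&lt;br /&gt;
% A learned linear projection plus position embeddings would then map&lt;br /&gt;
% each row to the model dimension before the Transformer encoder.&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;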
&lt;br /&gt;
**Global Context:** Unlike CNNs, which first look at small local features, ViT can see the global context from the first layer using self-attention. This helps it understand the whole image more directly.&lt;br /&gt;
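&lt;br /&gt;
A minimal single-head self-attention sketch (token count and dimensions are illustrative assumptions) shows why every token can attend to every other token already in the first layer:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;matlab&amp;quot;&amp;gt;&lt;br /&gt;
% Illustrative sketch: scaled dot-product self-attention over patch tokens.&lt;br /&gt;
X  = rand(196, 64);                      % 196 tokens, assumed model dim 64&lt;br /&gt;
Wq = rand(64, 64); Wk = rand(64, 64); Wv = rand(64, 64);&lt;br /&gt;
Q = X*Wq; K = X*Wk; V = X*Wv;&lt;br /&gt;
S = (Q*K.') / sqrt(64);                  % 196 x 196: every token vs. every token&lt;br /&gt;
A = exp(S - max(S, [], 2)); A = A ./ sum(A, 2);   % row-wise softmax&lt;br /&gt;
Y = A*V;                                 % each output token mixes global context&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;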
&lt;br /&gt;
**MATLAB Support:** MATLAB supports ViT models, such as `visionTransformer`, in the Computer Vision Toolbox. They are very powerful when fine-tuned on big datasets, but they may need more data than CNN models to train properly from scratch.&lt;br /&gt;
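&lt;br /&gt;
A rough usage sketch (the model name, input handling, and layer access are assumptions based on the visionTransformer documentation; the ViT support package must be installed):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;matlab&amp;quot;&amp;gt;&lt;br /&gt;
% Illustrative sketch: load a pretrained ViT and score one image.&lt;br /&gt;
net = visionTransformer('base-16-imagenet-384');   % pretrained dlnetwork&lt;br /&gt;
inputSize = net.Layers(1).InputSize;               % e.g. [384 384 3] (assumed)&lt;br /&gt;
img = imresize(imread('peppers.png'), inputSize(1:2));&lt;br /&gt;
scores = predict(net, dlarray(single(img), 'SSCB'));   % class scores&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Fine-tuning would replace the classification head and retrain on the target dataset, which is usually far cheaper than training a ViT from scratch.&lt;/div&gt;</summary>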
		<author><name>Ajay.paul@stud.hshl.de</name></author>
	</entry>
</feed>