Vision Transformers (ViT)

A recent development in the field is the Vision Transformer (ViT). Instead of using the convolutional layers of CNNs, ViT applies the Transformer architecture, which was originally developed for Natural Language Processing (NLP). The image is cut into small fixed-size patches, each patch is flattened into a vector and treated like a token, and the resulting token sequence is then processed with the self-attention mechanism.
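To make the patch-and-token step concrete, here is a minimal MATLAB sketch (not from the original article): it cuts a 224×224 RGB image into 16×16 patches and flattens each patch into one token vector. The image, patch size, and embedding dimension are assumptions chosen for illustration, and the projection weights are random placeholders rather than learned parameters.

```matlab
% Minimal sketch (illustrative assumptions): tokenize an image for a ViT.
img = rand(224, 224, 3);            % stand-in for a real 224x224 RGB image
P   = 16;                           % patch size -> (224/16)^2 = 196 patches
nP  = size(img, 1) / P;             % patches per side

% Cut the image into P-by-P-by-3 blocks, then flatten each block
% into a single row vector ("token") of length P*P*3 = 768.
blocks = mat2cell(img, repmat(P, nP, 1), repmat(P, nP, 1), 3);
tokens = cell2mat(cellfun(@(b) reshape(b, 1, []), blocks(:), ...
                          'UniformOutput', false));

% A learned linear projection maps each token to the model width.
% Random weights here, just to show the shapes involved.
D = 512;                            % assumed embedding dimension
E = randn(P*P*3, D);
embedded = tokens * E;              % 196 tokens, each a D-dimensional vector
size(embedded)                      % -> [196 512]
```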

    • **Global Context:** Unlike CNNs, which first look at small local features, ViT can capture global context from the very first layer through self-attention. This helps it understand the whole image more directly (see the attention sketch after this list).
    • **MATLAB Support:** MATLAB provides ViT models, such as `visionTransformer`, in the Computer Vision Toolbox. They are very powerful when fine-tuned on large datasets, but compared to CNN models they may need more data to train properly when starting from scratch (a usage sketch follows below).
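To illustrate why self-attention provides global context from the first layer, the following MATLAB sketch (an illustration, not the toolbox implementation) computes one single-head self-attention step over a token matrix like the one built above; all weights are random placeholders, and the token count and widths are assumed values.

```matlab
% Single-head self-attention over N tokens (illustrative, random weights).
N = 196; Din = 512; d = 64;         % token count, input width, head width
X  = randn(N, Din);                 % token embeddings (e.g. from the sketch above)
Wq = randn(Din, d); Wk = randn(Din, d); Wv = randn(Din, d);

Q = X * Wq;  K = X * Wk;  V = X * Wv;

S = (Q * K') / sqrt(d);             % N-by-N scores: every token vs. every token
A = exp(S - max(S, [], 2));
A = A ./ sum(A, 2);                 % row-wise softmax -> attention weights
ctx = A * V;                        % each output row mixes all 196 patches,
                                    % so global context is available immediately
```

Because the score matrix `S` compares every patch with every other patch, even the first attention layer can relate distant image regions, whereas a CNN needs many stacked layers before its receptive field spans the whole image.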
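As a rough sketch of the toolbox workflow, the code below loads a pretrained ViT with `visionTransformer` and runs one image through it. This assumes the Computer Vision Toolbox Model for Vision Transformer Network support package is installed, and the exact interface (input size, returned network type) may differ between releases, so treat it as a starting point rather than a definitive recipe.

```matlab
% Hedged sketch: pretrained ViT inference with the Computer Vision Toolbox.
% Assumes the ViT support package is installed; check your release's docs.
net = visionTransformer;                      % pretrained ViT (ImageNet weights)
inputSize = net.Layers(1).InputSize;          % expected input, e.g. [384 384 3]

img = imread("peppers.png");                  % example image shipped with MATLAB
img = imresize(img, inputSize(1:2));

X = dlarray(single(img), "SSCB");             % spatial, spatial, channel, batch
scores = predict(net, X);                     % class scores
[~, idx] = max(extractdata(scores));          % index of the top class
% Mapping idx to an ImageNet class name is omitted here for brevity.
```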