Vision Transformers (ViT)

A recent development in the field is the Vision Transformer (ViT). Instead of using the convolutional layers of CNNs, ViT applies the Transformer architecture, which was originally developed for Natural Language Processing (NLP). The image is cut into small fixed-size patches, each patch is flattened into a vector and treated like a token, and the resulting token sequence is then processed with the self-attention mechanism.
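To make the patch-and-token step concrete, here is a minimal MATLAB sketch (not from the original article): it cuts a 224×224 RGB image into 16×16 patches and flattens each patch into one token vector. The image, patch size, and embedding dimension are assumptions chosen for illustration, and the projection weights are random placeholders rather than learned parameters.

```matlab
% Minimal sketch (illustrative assumptions): tokenize an image for a ViT.
img = rand(224, 224, 3);            % stand-in for a real 224x224 RGB image
P   = 16;                           % patch size -> (224/16)^2 = 196 patches
nP  = size(img, 1) / P;             % patches per side

% Cut the image into P-by-P-by-3 blocks, then flatten each block
% into a single row vector ("token") of length P*P*3 = 768.
blocks = mat2cell(img, repmat(P, nP, 1), repmat(P, nP, 1), 3);
tokens = cell2mat(cellfun(@(b) reshape(b, 1, []), blocks(:), ...
                          'UniformOutput', false));

% A learned linear projection maps each token to the model width.
% Random weights here, just to show the shapes involved.
D = 512;                            % assumed embedding dimension
E = randn(P*P*3, D);
embedded = tokens * E;              % 196 tokens, each a D-dimensional vector
size(embedded)                      % -> [196 512]
```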

    • **Global Context:** Unlike CNNs, which first look at small local features, ViT can capture global context from the very first layer through self-attention. This helps it understand the whole image more directly (see the attention sketch after this list).
    • **MATLAB Support:** MATLAB provides ViT models, such as `visionTransformer`, in the Computer Vision Toolbox. They are very powerful when fine-tuned on large datasets, but compared to CNN models they may need more data to train properly when starting from scratch (a usage sketch follows below).
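To illustrate why self-attention provides global context from the first layer, the following MATLAB sketch (an illustration, not the toolbox implementation) computes one single-head self-attention step over a token matrix like the one built above; all weights are random placeholders, and the token count and widths are assumed values.

```matlab
% Single-head self-attention over N tokens (illustrative, random weights).
N = 196; Din = 512; d = 64;         % token count, input width, head width
X  = randn(N, Din);                 % token embeddings (e.g. from the sketch above)
Wq = randn(Din, d); Wk = randn(Din, d); Wv = randn(Din, d);

Q = X * Wq;  K = X * Wk;  V = X * Wv;

S = (Q * K') / sqrt(d);             % N-by-N scores: every token vs. every token
A = exp(S - max(S, [], 2));
A = A ./ sum(A, 2);                 % row-wise softmax -> attention weights
ctx = A * V;                        % each output row mixes all 196 patches,
                                    % so global context is available immediately
```

Because the score matrix `S` compares every patch with every other patch, even the first attention layer can relate distant image regions, whereas a CNN needs many stacked layers before its receptive field spans the whole image.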
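As a rough sketch of the toolbox workflow, the code below loads a pretrained ViT with `visionTransformer` and runs one image through it. This assumes the Computer Vision Toolbox Model for Vision Transformer Network support package is installed, and the exact interface (input size, returned network type) may differ between releases, so treat it as a starting point rather than a definitive recipe.

```matlab
% Hedged sketch: pretrained ViT inference with the Computer Vision Toolbox.
% Assumes the ViT support package is installed; check your release's docs.
net = visionTransformer;                      % pretrained ViT (ImageNet weights)
inputSize = net.Layers(1).InputSize;          % expected input, e.g. [384 384 3]

img = imread("peppers.png");                  % example image shipped with MATLAB
img = imresize(img, inputSize(1:2));

X = dlarray(single(img), "SSCB");             % spatial, spatial, channel, batch
scores = predict(net, X);                     % class scores
[~, idx] = max(extractdata(scores));          % index of the top class
% Mapping idx to an ImageNet class name is omitted here for brevity.
```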