<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.hshl.de/wiki/index.php?action=history&amp;feed=atom&amp;title=Vision_Transformers_%28ViT%29</id>
	<title>Vision Transformers (ViT) - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.hshl.de/wiki/index.php?action=history&amp;feed=atom&amp;title=Vision_Transformers_%28ViT%29"/>
	<link rel="alternate" type="text/html" href="https://wiki.hshl.de/wiki/index.php?title=Vision_Transformers_(ViT)&amp;action=history"/>
	<updated>2026-04-19T20:02:10Z</updated>
	<subtitle>Revision history for this page on HSHL Mechatronik</subtitle>
	<generator>MediaWiki 1.43.0</generator>
	<entry>
		<id>https://wiki.hshl.de/wiki/index.php?title=Vision_Transformers_(ViT)&amp;diff=147002&amp;oldid=prev</id>
		<title>Ajay.paul@stud.hshl.de on 11 February 2026 at 13:11</title>
		<link rel="alternate" type="text/html" href="https://wiki.hshl.de/wiki/index.php?title=Vision_Transformers_(ViT)&amp;diff=147002&amp;oldid=prev"/>
		<updated>2026-02-11T13:11:16Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 13:11, 11 February 2026&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l1&quot;&gt;Line 1:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 1:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== Vision Transformers (ViT) ==&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt; &lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;A recent development in the field is the Vision Transformer (ViT). Instead of the convolution layers used by CNNs, ViT uses the Transformer model, which was originally developed for Natural Language Processing (NLP). The image is split into small fixed-size patches; each patch is turned into a vector and treated as a token, and the token sequence is then processed with a self-attention mechanism.&amp;lt;ref name=&amp;quot;MathWorksViT&amp;quot;&amp;gt;MathWorks. &amp;#039;&amp;#039;Train Vision Transformer Network for Image Classification&amp;#039;&amp;#039;. Available at: https://www.mathworks.com/help/deeplearning/ug/train-vision-transformer-network-for-image-classification.html&amp;lt;/ref&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;A recent development in the field is the Vision Transformer (ViT). Instead of the convolution layers used by CNNs, ViT uses the Transformer model, which was originally developed for Natural Language Processing (NLP). The image is split into small fixed-size patches; each patch is turned into a vector and treated as a token, and the token sequence is then processed with a self-attention mechanism.&amp;lt;ref name=&amp;quot;MathWorksViT&amp;quot;&amp;gt;MathWorks. &amp;#039;&amp;#039;Train Vision Transformer Network for Image Classification&amp;#039;&amp;#039;. Available at: https://www.mathworks.com/help/deeplearning/ug/train-vision-transformer-network-for-image-classification.html&amp;lt;/ref&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;/table&gt;</summary>
		<author><name>Ajay.paul@stud.hshl.de</name></author>
	</entry>
	<entry>
		<id>https://wiki.hshl.de/wiki/index.php?title=Vision_Transformers_(ViT)&amp;diff=147001&amp;oldid=prev</id>
		<title>Ajay.paul@stud.hshl.de on 11 February 2026 at 13:10</title>
		<link rel="alternate" type="text/html" href="https://wiki.hshl.de/wiki/index.php?title=Vision_Transformers_(ViT)&amp;diff=147001&amp;oldid=prev"/>
		<updated>2026-02-11T13:10:58Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 13:10, 11 February 2026&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l1&quot;&gt;Line 1:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 1:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;**&lt;/del&gt;Vision Transformers (ViT)&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;**&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== &lt;/ins&gt;Vision Transformers (ViT) &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;==&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;A recent development in the field is the Vision Transformer (ViT). Instead of the convolution layers used by CNNs, ViT uses the Transformer model, which was originally developed for Natural Language Processing (NLP). The image is split into small fixed-size patches; each patch is turned into a vector and treated as a token, and the token sequence is then processed with a self-attention mechanism.&amp;lt;ref name=&quot;MathWorksViT&quot;&amp;gt;MathWorks. &#039;&#039;Train Vision Transformer Network for Image Classification&#039;&#039;. Available at: https://www.mathworks.com/help/deeplearning/ug/train-vision-transformer-network-for-image-classification.html&amp;lt;/ref&amp;gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;A recent development in the field is the Vision Transformer (ViT). Instead of the convolution layers used by CNNs, ViT uses the Transformer model, which was originally developed for Natural Language Processing (NLP). The image is split into small fixed-size patches; each patch is turned into a vector and treated as a token, and the token sequence is then processed with a self-attention mechanism.&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&#039;&#039;&#039;Global Context:&#039;&#039;&#039; Unlike CNNs, which first look at small local features, ViT can see the global context from the first layer using self-attention. This helps it understand the whole image more directly.&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;**Global Context:** Unlike CNNs, which first look at small local features, ViT can see the global context from the first layer using self-attention. This helps it understand the whole image more directly.&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&#039;&#039;&#039;MATLAB Support:&#039;&#039;&#039; MATLAB supports ViT models, such as &amp;lt;tt&amp;gt;visionTransformer&amp;lt;/tt&amp;gt;, in the Computer Vision Toolbox. They are very powerful when fine-tuned on big datasets, but they may need more data than CNN models to train properly from scratch.&amp;lt;ref name=&quot;MathWorksViT&quot; /&amp;gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;**MATLAB Support:** MATLAB supports ViT models, such as `visionTransformer`, in the Computer Vision Toolbox. They are very powerful when fine-tuned on big datasets, but they may need more data than CNN models to train properly from scratch.&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== References ==&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&amp;lt;references /&amp;gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Ajay.paul@stud.hshl.de</name></author>
	</entry>
	<entry>
		<id>https://wiki.hshl.de/wiki/index.php?title=Vision_Transformers_(ViT)&amp;diff=147000&amp;oldid=prev</id>
		<title>Ajay.paul@stud.hshl.de: Created the page with “**Vision Transformers (ViT)**  A recent development in the field is the Vision Transformer (ViT). Instead of the convolution layers used by CNNs, ViT uses the Transformer model, which was originally developed for Natural Language Processing (NLP). The image is split into small fixed-size patches; each patch is turned into a vector and treated as a token, and the token sequence is then processed with a self-attention mechanism.  **Global Context:** Unlike CNNs, which first lo…”</title>
		<link rel="alternate" type="text/html" href="https://wiki.hshl.de/wiki/index.php?title=Vision_Transformers_(ViT)&amp;diff=147000&amp;oldid=prev"/>
		<updated>2026-02-11T13:07:22Z</updated>

		<summary type="html">&lt;p&gt;Die Seite wurde neu angelegt: „**Vision Transformers (ViT)**  A new change in the field is the Vision Transformer (ViT). Instead of using the normal convolution layers like CNNs, ViT use the Transformer model, which was first made for Natural Language Processing (NLP). The image is cut into small fixed-size patches, then each patch is turned into a vector and treated like a token. After that, it is processed using self-attention mechanism.  **Global Context:** Unlike CNNs that first lo…“&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;**Vision Transformers (ViT)**&lt;br /&gt;
&lt;br /&gt;
A recent development in the field is the Vision Transformer (ViT). Instead of the convolution layers used by CNNs, ViT uses the Transformer model, which was originally developed for Natural Language Processing (NLP). The image is split into small fixed-size patches; each patch is turned into a vector and treated as a token, and the token sequence is then processed with a self-attention mechanism.&lt;br /&gt;
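&lt;br /&gt;
A minimal sketch of the patch-to-token step (the image and patch sizes here are illustrative assumptions, not taken from the cited MathWorks page):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;matlab&amp;quot;&amp;gt;&lt;br /&gt;
% Illustrative sketch: turn an image into a sequence of patch tokens.&lt;br /&gt;
img = single(rand(224, 224, 3));   % assumed 224x224 RGB input&lt;br /&gt;
P = 16;                            % assumed patch size; (224/16)^2 = 196 patches&lt;br /&gt;
patches = mat2cell(img, repmat(P, 1, 224/P), repmat(P, 1, 224/P), 3);&lt;br /&gt;
rows = cellfun(@(p) reshape(p, 1, []), patches(:), 'UniformOutput', false);&lt;br /&gt;
tokens = cell2mat(rows);           % 196 x 768: one row (token) per patch&lt;br /&gt;
% A learned linear projection plus position embeddings would then map&lt;br /&gt;
% each row to the model dimension before the Transformer encoder.&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;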
&lt;br /&gt;
**Global Context:** Unlike CNNs, which first look at small local features, ViT can see the global context from the first layer using self-attention. This helps it understand the whole image more directly.&lt;br /&gt;
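&lt;br /&gt;
A minimal single-head self-attention sketch (token count and dimensions are illustrative assumptions) shows why every token can attend to every other token already in the first layer:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;matlab&amp;quot;&amp;gt;&lt;br /&gt;
% Illustrative sketch: scaled dot-product self-attention over patch tokens.&lt;br /&gt;
X  = rand(196, 64);                      % 196 tokens, assumed model dim 64&lt;br /&gt;
Wq = rand(64, 64); Wk = rand(64, 64); Wv = rand(64, 64);&lt;br /&gt;
Q = X*Wq; K = X*Wk; V = X*Wv;&lt;br /&gt;
S = (Q*K.') / sqrt(64);                  % 196 x 196: every token vs. every token&lt;br /&gt;
A = exp(S - max(S, [], 2)); A = A ./ sum(A, 2);   % row-wise softmax&lt;br /&gt;
Y = A*V;                                 % each output token mixes global context&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;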
&lt;br /&gt;
**MATLAB Support:** MATLAB supports ViT models, such as `visionTransformer`, in the Computer Vision Toolbox. They are very powerful when fine-tuned on big datasets, but they may need more data than CNN models to train properly from scratch.&lt;br /&gt;
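&lt;br /&gt;
A rough usage sketch (the model name, input handling, and layer access are assumptions based on the visionTransformer documentation; the ViT support package must be installed):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;matlab&amp;quot;&amp;gt;&lt;br /&gt;
% Illustrative sketch: load a pretrained ViT and score one image.&lt;br /&gt;
net = visionTransformer('base-16-imagenet-384');   % pretrained dlnetwork&lt;br /&gt;
inputSize = net.Layers(1).InputSize;               % e.g. [384 384 3] (assumed)&lt;br /&gt;
img = imresize(imread('peppers.png'), inputSize(1:2));&lt;br /&gt;
scores = predict(net, dlarray(single(img), 'SSCB'));   % class scores&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Fine-tuning would replace the classification head and retrain on the target dataset, which is usually far cheaper than training a ViT from scratch.&lt;/div&gt;</summary>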
		<author><name>Ajay.paul@stud.hshl.de</name></author>
	</entry>
</feed>