MobileNetV2: Efficiency for Edge Computing

From HSHL Mechatronik

Current version as of 27 March 2026, 15:30

For cases where computing power is limited, such as mobile apps or embedded devices, MobileNetV2 is often the best choice. It uses depthwise separable convolutions, which break one large convolution into two smaller steps. This substantially reduces the number of parameters and computations.[1]

Inference Speed: MobileNetV2 is designed to be fast. In benchmark tests, it can run inference in about 15 ms per image, which is much faster than most ResNet models.[1]

Accuracy vs. Size: Even though it is much smaller (around 3.5 million parameters compared to 25 million in ResNet-50), it still achieves good accuracy (about 71–72% top-1 on ImageNet). This makes it very suitable for the "Edge Computing" option in a Zwicky Box analysis.[2]

The Mathematical Foundation

The efficient design and fast speed of MobileNetV2 come from avoiding standard convolutions. In a standard CNN layer, each filter does everything at once: it covers a spatial neighborhood and mixes all input channels together. If a standard convolution receives an input of size ''h<sub>i</sub>'' × ''w<sub>i</sub>'' with ''d<sub>i</sub>'' channels and uses a ''k'' × ''k'' kernel to produce ''d<sub>j</sub>'' output channels, the computational cost is ''h<sub>i</sub>'' · ''w<sub>i</sub>'' · ''d<sub>i</sub>'' · ''d<sub>j</sub>'' · ''k''<sup>2</sup> multiply-adds. This cost grows quickly as the network gets deeper and gains more channels.
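To make the cost formula concrete, a short Python sketch counts the multiply-adds of a standard convolution layer; the layer sizes below are illustrative, not taken from the text:

```python
# Multiply-add count of a standard convolution layer:
# cost = h_i * w_i * d_i * d_j * k^2
def standard_conv_cost(h, w, d_in, d_out, k):
    return h * w * d_in * d_out * k * k

# Illustrative layer: 56x56 feature map, 64 -> 128 channels, 3x3 kernel.
cost = standard_conv_cost(56, 56, 64, 128, 3)
print(cost)  # 231211008 multiply-adds for this single layer
```

Even at this modest resolution, one layer already needs over 200 million multiply-adds, which is why the product of all five factors matters on mobile hardware.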

MobileNetV2 addresses this cost by using depthwise separable convolutions.[3] The idea was introduced in MobileNetV1 and refined in V2. It splits the single large convolution into two smaller steps: first a depthwise convolution, then a pointwise convolution.[4] This separates the spatial filtering from the channel mixing, which makes the layer much cheaper to run.

In the first step, the depthwise convolution applies one 2D filter (''k'' × ''k'') to each input channel independently.[3] Unlike a standard convolution, which mixes all channels together, this depthwise step learns only spatial features. Afterwards, the pointwise convolution uses a 1 × 1 filter to combine the outputs of the depthwise step.[4] This 1 × 1 convolution mixes the per-channel spatial features across channels, so it learns how channels relate without repeating the spatial computation.[4]
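The two steps can be sketched in pure Python on a tiny channel-major tensor (lists indexed as `x[channel][row][col]`, stride 1, no padding); the shapes and values are illustrative assumptions chosen to keep the arithmetic easy to check by hand:

```python
def depthwise_conv(x, filters):
    """Apply one k x k filter per input channel; channels stay separate."""
    d, h, w = len(x), len(x[0]), len(x[0][0])
    k = len(filters[0])
    out_h, out_w = h - k + 1, w - k + 1
    out = [[[0.0] * out_w for _ in range(out_h)] for _ in range(d)]
    for c in range(d):                      # each channel filtered alone
        for i in range(out_h):
            for j in range(out_w):
                out[c][i][j] = sum(x[c][i + a][j + b] * filters[c][a][b]
                                   for a in range(k) for b in range(k))
    return out

def pointwise_conv(x, weights):
    """Mix channels with a 1 x 1 convolution: weights[out_ch][in_ch]."""
    d_in, h, w = len(x), len(x[0]), len(x[0][0])
    return [[[sum(weights[o][c] * x[c][i][j] for c in range(d_in))
              for j in range(w)] for i in range(h)]
            for o in range(len(weights))]

# One 3x3 input channel of ones, a 2x2 summing filter of ones, then a
# single 1x1 output channel that scales the result by 2.
x = [[[1.0] * 3 for _ in range(3)]]
dw = depthwise_conv(x, [[[1.0] * 2 for _ in range(2)]])
pw = pointwise_conv(dw, [[2.0]])
print(dw)  # [[[4.0, 4.0], [4.0, 4.0]]]
print(pw)  # [[[8.0, 8.0], [8.0, 8.0]]]
```

Note how the depthwise loop never sums over channels, and the pointwise step never touches the `k × k` neighborhood: each function does exactly one of the two jobs a standard convolution would do at once.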

The benefit of this split becomes clear when we count the new cost. Because the two steps run one after another rather than jointly, their costs add instead of multiplying. The total cost of a depthwise separable convolution is the depthwise cost plus the 1 × 1 pointwise cost: ''h<sub>i</sub>'' · ''w<sub>i</sub>'' · ''d<sub>i</sub>'' · ''k''<sup>2</sup> + ''h<sub>i</sub>'' · ''w<sub>i</sub>'' · ''d<sub>i</sub>'' · ''d<sub>j</sub>''.[5] Factoring out the common terms simplifies this to ''h<sub>i</sub>'' · ''w<sub>i</sub>'' · ''d<sub>i</sub>'' (''k''<sup>2</sup> + ''d<sub>j</sub>'').[5]
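A short Python check confirms that the summed form and the factored form are the same quantity (layer sizes are illustrative assumptions):

```python
# Depthwise separable cost: depthwise term plus pointwise term,
# and the equivalent factored form h * w * d_in * (k^2 + d_out).
def separable_conv_cost(h, w, d_in, d_out, k):
    depthwise = h * w * d_in * k * k      # one k x k filter per channel
    pointwise = h * w * d_in * d_out      # 1 x 1 channel mixing
    return depthwise + pointwise

h, w, d_in, d_out, k = 56, 56, 64, 128, 3
total = separable_conv_cost(h, w, d_in, d_out, k)
factored = h * w * d_in * (k * k + d_out)
print(total, total == factored)  # 27496448 True
```

For the same illustrative layer, the separable version needs about 27.5 million multiply-adds instead of the standard convolution's roughly 231 million.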

Comparing this to a standard convolution layer, the reduction factor approaches 1 / ''d<sub>j</sub>'' + 1 / ''k''<sup>2</sup>. Because MobileNetV2 mostly uses 3 × 3 depthwise convolutions (''k'' = 3), the reduction is substantial.[5] The computational cost and total parameter count drop by a factor of 8 to 9 compared to standard convolutions, achieving this speedup with only a very small loss in accuracy.[4] This split lets the network stay deep and wide enough to learn complex image patterns without overloading the small processors found in mobile devices.[5]
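The reduction factor can be checked numerically; with the hypothetical output width of 128 channels used in the earlier examples, the speedup indeed lands in the stated 8–9× range:

```python
# Ratio of separable to standard cost:
# (k^2 + d_out) / (k^2 * d_out) = 1/d_out + 1/k^2
def reduction_factor(d_out, k):
    return 1 / d_out + 1 / k ** 2

ratio = reduction_factor(128, 3)   # k = 3 as in MobileNetV2
print(round(1 / ratio, 1))  # 8.4  -> roughly 8.4x fewer multiply-adds
```

As ''d<sub>j</sub>'' grows, the 1 / ''d<sub>j</sub>'' term vanishes and the speedup approaches ''k''<sup>2</sup> = 9, which is why the paper quotes a factor of 8 to 9 for 3 × 3 kernels.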

References

  1. Joshua, Chidiebere; Kotsis, Konstantinos; Ghosh, Sourangshu (2025). Comparative Evaluation of ResNet, EfficientNet, and MobileNet for Accurate Classification of Babylonian Sexagesimal Numerals.
  2. Ahmed, I. Why Your MobileNetV2 Model Performs Better Than ResNet50. Medium. Available at: https://medium.com/@imtiaz.ahmed2206/why-your-mobilenetv2-model-performs-better-than-resnet50-2a9998fda4c7
  3. GeeksforGeeks. MobileNetV2 Architecture in Computer Vision. Available at: https://www.geeksforgeeks.org/computer-vision/mobilenet-v2-architecture-in-computer-vision/
  4. Medium. A summary of the MobileNetV2: Inverted Residuals and Linear Bottlenecks paper. Available at: https://medium.com/codex/a-summary-of-the-mobilenetv2-inverted-residuals-and-linear-bottlenecks-paper-e19b187cb78a
  5. Sandler, M. et al. MobileNetV2: Inverted Residuals and Linear Bottlenecks. CVPR 2018. Available at: https://openaccess.thecvf.com/content_cvpr_2018/papers/Sandler_MobileNetV2_Inverted_Residuals_CVPR_2018_paper.pdf