MobileNetV2: Efficiency for Edge Computing

From HSHL Mechatronik
Ajay.paul@stud.hshl.de (talk | contribs)

Revision as of 26 March 2026, 16:59

For cases where computing power is limited, such as mobile apps or embedded devices, MobileNetV2 is usually a strong choice. It uses depthwise separable convolutions, which break one large convolution into two smaller steps. This greatly reduces the number of parameters and calculations.[1]

Inference Speed: MobileNetV2 is designed to be fast. In benchmark tests it can run inference in about 15 ms per image, which is much faster than most ResNet models.[1]

Accuracy vs. Size: Even though it is much smaller (around 3.5 million parameters compared to 25 million in ResNet-50), it still achieves good accuracy (about 71–72% top-1 on ImageNet). This makes it very suitable for the "Edge Computing" option in a Zwicky Box analysis.[2]

The Mathematical Foundation

The efficiency and speed of MobileNetV2 come from avoiding standard convolutions wherever possible. In a standard CNN layer, each filter does everything at once: it slides over the spatial dimensions and mixes all input channels at the same time. If a standard convolution takes an input of size h_i × w_i with d_i channels and uses a k × k kernel to produce d_j output channels, the computational cost is h_i · w_i · d_i · d_j · k². This cost grows quickly as the network gets deeper and the channel counts increase.
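As a rough sketch of this arithmetic (the layer dimensions below are invented for illustration, not taken from the actual MobileNetV2 architecture):

```python
def standard_conv_macs(h, w, d_in, d_out, k):
    """Multiply-accumulate count of a standard k x k convolution,
    following the cost formula h_i * w_i * d_i * d_j * k^2."""
    return h * w * d_in * d_out * k * k

# A 56 x 56 feature map, 64 input channels, 128 output channels, 3 x 3 kernel:
print(standard_conv_macs(56, 56, 64, 128, 3))  # 231211008, i.e. ~231 million MACs
```

Even this single mid-sized layer needs over 200 million multiply-accumulates, which illustrates why deep stacks of standard convolutions are a poor fit for edge hardware.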

MobileNetV2 avoids this cost by using depthwise separable convolutions.[3] The idea was introduced in MobileNetV1 and refined in V2. It splits the single standard convolution into two smaller steps: first a depthwise convolution, then a pointwise convolution.[4] This separates the spatial work from the channel work, which makes the layer much cheaper to compute.
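To make the two steps concrete, here is a minimal pure-Python sketch of a depthwise convolution followed by a pointwise convolution. It ignores padding, stride, and bias; real implementations (e.g. torch.nn.Conv2d with groups equal to the input channel count for the depthwise step) are far more efficient.

```python
def depthwise(x, kernels):
    """Apply one k x k filter to each channel separately (valid padding).
    x: list of d_in channels, each an H x W grid (list of lists).
    kernels: list of d_in filters, each k x k."""
    k = len(kernels[0])
    out = []
    for chan, ker in zip(x, kernels):
        h_out = len(chan) - k + 1
        w_out = len(chan[0]) - k + 1
        out.append([[sum(chan[i + a][j + b] * ker[a][b]
                         for a in range(k) for b in range(k))
                     for j in range(w_out)]
                    for i in range(h_out)])
    return out

def pointwise(x, weights):
    """1 x 1 convolution: mix channels at every spatial position.
    weights: d_out rows of d_in coefficients."""
    h, w = len(x[0]), len(x[0][0])
    return [[[sum(wc * x[c][i][j] for c, wc in enumerate(row))
              for j in range(w)]
             for i in range(h)]
            for row in weights]

x = [[[1, 1, 1], [1, 1, 1], [1, 1, 1]] for _ in range(2)]  # two 3x3 channels of ones
kers = [[[1, 1], [1, 1]] for _ in range(2)]                # one 2x2 all-ones filter per channel
mixed = pointwise(depthwise(x, kers), [[1, 1]])            # one output channel
print(mixed)  # [[[8, 8], [8, 8]]]
```

The depthwise step turns each 3 × 3 channel into a 2 × 2 map of fours (each 2 × 2 window sums four ones), and the single pointwise filter with weights [1, 1] adds the two maps into one 2 × 2 map of eights. Note that spatial filtering and channel mixing never happen in the same step.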

In the first step, the depthwise convolution applies a single 2D filter (k × k) to each input channel independently.[3] Unlike a standard convolution, which mixes all channels together, this step learns only spatial features. The pointwise convolution then uses a 1 × 1 filter to combine the outputs of the depthwise step.[4] This 1 × 1 convolution mixes the per-channel spatial features across channels, so the layer learns how channels relate without repeating the spatial computation.[4] The combined cost is h_i · w_i · d_i · (k² + d_j), which is smaller than the standard cost by a factor of roughly k² when d_j is large.
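The saving can be checked numerically. A small sketch, using invented layer dimensions purely for illustration:

```python
def standard_macs(h, w, d_in, d_out, k):
    return h * w * d_in * d_out * k * k   # h_i * w_i * d_i * d_j * k^2

def separable_macs(h, w, d_in, d_out, k):
    depthwise = h * w * d_in * k * k      # spatial step: one k x k filter per channel
    pointwise = h * w * d_in * d_out      # channel step: 1 x 1 mixing
    return depthwise + pointwise

h, w, d_in, d_out, k = 56, 56, 64, 128, 3
ratio = separable_macs(h, w, d_in, d_out, k) / standard_macs(h, w, d_in, d_out, k)
print(round(ratio, 3))  # 0.119, matching 1/d_j + 1/k^2 = 1/128 + 1/9
```

With a 3 × 3 kernel and 128 output channels, the separable version needs only about 12% of the multiply-accumulates of the standard convolution, i.e. a speedup close to the k² = 9 factor predicted above.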

References

  1. Joshua, Chidiebere; Kotsis, Konstantinos; Ghosh, Sourangshu (2025). Comparative Evaluation of ResNet, EfficientNet, and MobileNet for Accurate Classification of Babylonian Sexagesimal Numerals.
  2. Ahmed, I. Why Your MobileNetV2 Model Performs Better Than ResNet50. Medium. Available at: https://medium.com/@imtiaz.ahmed2206/why-your-mobilenetv2-model-performs-better-than-resnet50-2a9998fda4c7
  3. GeeksforGeeks. MobileNetV2 Architecture in Computer Vision. Available at: https://www.geeksforgeeks.org/computer-vision/mobilenet-v2-architecture-in-computer-vision/
  4. Medium. A summary of the MobileNetV2: Inverted Residuals and Linear Bottlenecks paper. Available at: https://medium.com/codex/a-summary-of-the-mobilenetv2-inverted-residuals-and-linear-bottlenecks-paper-e19b187cb78a