YOLO Model for Object Detection

YOLO Model for LEGO Parts Detection

The LEGO Parts Detection System is a computer vision application designed to automatically identify and classify specific LEGO bricks within images. This implementation utilizes the YOLO11 (You Only Look Once, version 11) architecture from the Ultralytics library. The model is trained on a custom dataset of approximately 400 images using Google Colab and NVIDIA GPU acceleration.

Overview

Object detection involves locating instances of objects of certain classes within an image. Unlike standard image classification (which assigns a single label to an image), this YOLO model predicts:

  1. Bounding Boxes: The spatial coordinates (x,y,w,h) of the LEGO part.
  2. Class Probabilities: The specific type of LEGO brick (e.g., "2x4 Brick", "Technic Pin").

The system uses the `yolo11s` (Small) model variant, optimized for a balance between inference speed and detection accuracy, making it suitable for real-time applications.
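
As an illustration of these two outputs, the following sketch (the weights file and image path are hypothetical placeholders) loads a model through the Ultralytics Python SDK and reads the predicted boxes and class labels:

from ultralytics import YOLO

model = YOLO("best.pt")              # hypothetical path to trained weights
results = model("lego_sample.jpg")   # hypothetical sample image

for r in results:
    for box in r.boxes:
        cls_name = r.names[int(box.cls)]   # predicted class, e.g. "brick_2x4"
        conf = float(box.conf)             # class probability / confidence score
        x, y, w, h = box.xywh[0].tolist()  # bounding-box center and size in pixels
        print(f"{cls_name} ({conf:.2f}): x={x:.0f}, y={y:.0f}, w={w:.0f}, h={h:.0f}")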

Dataset Preparation

The performance of the model relies on a curated dataset processed through the following pipeline:

Data Collection and Annotation

  • Source: 400 images containing target LEGO parts in various orientations, lighting conditions, and backgrounds.
  • Annotation Tool: Label Studio.
  • Label Format: YOLO standard format, where each image has a corresponding `.txt` file containing lines in the format:
<class_id> <x_center> <y_center> <width> <height>
All coordinates are normalized between 0 and 1.
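
As a concrete example, the following sketch (label path and image dimensions are illustrative placeholders) parses one such `.txt` file and converts the normalized values back to pixel coordinates:

# Parse a YOLO-format label file and convert normalized boxes to pixel coordinates.
img_w, img_h = 1920, 1080  # assumed image size for this example

with open("custom_data/labels/example_0001.txt") as f:
    for line in f:
        class_id, xc, yc, w, h = line.split()
        xc, yc = float(xc) * img_w, float(yc) * img_h
        w, h = float(w) * img_w, float(h) * img_h
        # Convert center/size to the top-left corner, e.g. for drawing
        x1, y1 = xc - w / 2, yc - h / 2
        print(f"class {class_id}: top-left=({x1:.0f}, {y1:.0f}), size={w:.0f}x{h:.0f}")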

Preprocessing

Before training, the dataset is split to ensure robust evaluation:

  • Training Set: 90% of images (used to update model weights).
  • Validation Set: 10% of images (used to evaluate performance during training).
  • Configuration: A `data.yaml` file is generated dynamically to map the directory paths and class names for the training engine.

Network Architecture

The YOLO11 architecture is a single-stage object detector. It processes the entire image in a single forward pass, distinguishing it from two-stage detectors like R-CNN. The architecture consists of three main components:

  • Backbone (Feature Extraction): A convolutional neural network (based on CSPDarknet) that downsamples the image to extract distinct features (edges, textures, shapes) at different scales. It uses C3k2 blocks (Cross Stage Partial networks with specific kernel sizes) to improve gradient flow and reduce computational cost.
  • Neck (Feature Fusion): Uses PANet (Path Aggregation Network) layers to combine features from different backbone levels. This ensures that the model can detect both large (close-up) and small (distant) LEGO parts effectively.
  • Head (Prediction): A decoupled head that separates the classification task (what is it?) from the regression task (where is it?). It outputs the final bounding boxes and class scores.
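
The composition of these three stages can be inspected directly from the pretrained checkpoint; a minimal sketch using the Ultralytics SDK, which prints a layer and parameter summary:

from ultralytics import YOLO

# Load the 'Small' YOLO11 variant and print its layer/parameter summary
model = YOLO("yolo11s.pt")
model.info()  # reports the number of layers, parameters, and GFLOPs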

Mathematical Description of Core Operations

SiLU Activation Function

The hidden layers of the network use the Sigmoid Linear Unit (SiLU) activation function. It allows for smoother gradient propagation compared to the traditional ReLU.

f(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}}
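
A minimal numerical sketch of this activation, using NumPy for illustration:

import numpy as np

def silu(x):
    """SiLU / Swish: x multiplied by the logistic sigmoid of x."""
    return x * (1.0 / (1.0 + np.exp(-x)))

print(silu(np.array([-2.0, 0.0, 2.0])))  # smooth, slightly negative for x < 0, near-linear for large x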

Intersection over Union (IoU)

To measure how well a predicted box overlaps with the ground truth box during training, the Intersection over Union metric is used:

IoU = \frac{\text{Area}(B_p \cap B_{gt})}{\text{Area}(B_p \cup B_{gt})}

where $B_p$ is the predicted bounding box and $B_{gt}$ is the ground-truth box.
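
A small illustrative implementation for axis-aligned boxes given as corner coordinates:

def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 = 0.1428...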

Loss Function

The model optimizes a composite loss function $L_{total}$ that combines three distinct error measurements:

L_{total} = \lambda_{box} L_{box} + \lambda_{cls} L_{cls} + \lambda_{dfl} L_{dfl}
  1. Box Loss ($L_{box}$): Measures the error in the coordinate predictions. YOLO11 typically uses CIoU (Complete IoU) loss, which accounts for overlap, center point distance, and aspect ratio consistency.
  2. Class Loss ($L_{cls}$): Measures the error in classification using Binary Cross Entropy (BCE):
    L_{cls} = -\sum_i \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]
  3. DFL Loss ($L_{dfl}$): Distribution Focal Loss, used to refine the localization of the bounding box boundaries.
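
For illustration, the BCE term can be evaluated directly from predicted class probabilities and one-hot targets; a small NumPy sketch with made-up numbers:

import numpy as np

def bce_loss(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy summed over all class outputs."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# One detection, three classes: ground truth is class 0 (e.g. "brick_2x4")
y_true = np.array([1.0, 0.0, 0.0])
y_pred = np.array([0.85, 0.10, 0.05])
print(bce_loss(y_true, y_pred))  # small value for a confident, correct prediction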

Non-Maximum Suppression (NMS)

During inference, the model may predict multiple overlapping boxes for a single LEGO part. NMS filters these to keep only the best prediction.

  1. Select the box with the highest confidence score.
  2. Calculate IoU between this box and all other boxes.
  3. Discard boxes whose IoU with the selected box exceeds a set threshold (e.g., 0.5), then repeat with the remaining boxes (see the sketch below).
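
A minimal, illustrative implementation of this greedy procedure, reusing the iou() helper sketched in the IoU section above:

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: boxes are (x1, y1, x2, y2) tuples, scores their confidences."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest-confidence remaining box
        keep.append(best)
        # Keep only boxes that do not overlap the selected box too strongly
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep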

Implementation

Training Configuration

The model is trained via the Ultralytics command-line interface (CLI); an equivalent call through the Python SDK is shown after the command. The training process runs for 60 epochs with an image size of 640 pixels.

!yolo detect train \
    data=/content/data.yaml \
    model=yolo11s.pt \
    epochs=60 \
    imgsz=640
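
The same run can be expressed with the Ultralytics Python SDK instead of the CLI; a minimal sketch:

from ultralytics import YOLO

model = YOLO("yolo11s.pt")              # pretrained 'Small' weights
model.train(data="/content/data.yaml",  # dataset configuration
            epochs=60,                  # passes over the training set
            imgsz=640)                  # input resolution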

Inference

Once trained, the best weights (`best.pt`) are used to predict classes on new images.

!yolo detect predict \
    model=runs/detect/train/weights/best.pt \
    source=data/validation/images \
    save=True
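
The equivalent Python SDK call, for reference; annotated result images are saved under the runs/ directory:

from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")
results = model.predict(source="data/validation/images", save=True)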

Core Logic and Code Implementation

The implementation relies on a specific workflow to bridge the gap between raw data and the Ultralytics YOLO engine. The core logic involves dynamically generating configuration files and defining the training hyperparameters.

Automated Configuration Generation

YOLO models require a specific YAML configuration file to locate the dataset and identify class names. Instead of manually creating this file, the system uses a Python function to parse the raw label map (`classes.txt`) and generate the `data.yaml` file programmatically.

This ensures that the class indices ($0, 1, 2...$) perfectly match the class names (e.g., "brick_2x4") during training.

import yaml

def create_data_yaml(path_to_classes_txt, path_to_data_yaml):
    """
    Parses a raw text file of class names and generates 
    the YAML configuration required by YOLO.
    """
    # 1. Read class names from the text file
    with open(path_to_classes_txt, 'r') as f:
        classes = [line.strip() for line in f.readlines() if line.strip()]

    # 2. Define the dictionary structure required by YOLO
    data = {
        'path': '/content/data',       # Root directory
        'train': 'train/images',       # Subpath to training images
        'val': 'validation/images',    # Subpath to validation images
        'nc': len(classes),            # Number of Classes
        'names': classes               # List of Class Names
    }

    # 3. Serialize the dictionary to a YAML file
    with open(path_to_data_yaml, 'w') as f:
        yaml.dump(data, f, sort_keys=False)

# Execution
create_data_yaml('/content/custom_data/classes.txt', '/content/data.yaml')
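
For illustration, with a hypothetical classes.txt containing the two class names brick_2x4 and technic_pin, the generated /content/data.yaml would look roughly like this:

path: /content/data
train: train/images
val: validation/images
nc: 2
names:
- brick_2x4
- technic_pin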

Data Partitioning Logic

To prevent overfitting, the raw dataset is split into training and validation subsets. The system uses a 90/10 split ratio. This logic is handled by an external utility script (`train_val_split.py`) which randomizes the files to ensure a representative distribution of LEGO parts in both sets.

# 90% Training data, 10% Validation data
!python train_val_split.py --datapath="/content/custom_data" --train_pct=0.9
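
The exact contents of train_val_split.py are not reproduced here; a simplified sketch of the same 90/10 partitioning logic, assuming image and label folders side by side under custom_data, might look like this:

import os
import random
import shutil

def split_dataset(datapath, outpath, train_pct=0.9, seed=42):
    """Randomly assign image/label pairs to train and validation folders."""
    random.seed(seed)
    images = sorted(os.listdir(os.path.join(datapath, "images")))
    random.shuffle(images)
    n_train = int(len(images) * train_pct)

    for idx, img in enumerate(images):
        subset = "train" if idx < n_train else "validation"
        label = os.path.splitext(img)[0] + ".txt"
        for kind, name in (("images", img), ("labels", label)):
            dst = os.path.join(outpath, subset, kind)
            os.makedirs(dst, exist_ok=True)
            shutil.copy(os.path.join(datapath, kind, name), dst)

split_dataset("/content/custom_data", "/content/data", train_pct=0.9)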

Model Initialization and Training

The core training loop is initiated via the Command Line Interface (CLI). The logic here defines the hardware constraints and the duration of the learning process.

# data:   path to the config generated above
# model:  load the 'Small' pretrained weights
# epochs: iterate over the dataset 60 times
# imgsz:  resize all inputs to 640x640 pixels
!yolo detect train \
    data=/content/data.yaml \
    model=yolo11s.pt \
    epochs=60 \
    imgsz=640

  • Pre-trained Weights (`yolo11s.pt`): The model uses Transfer Learning, starting with weights learned from the COCO dataset rather than random values. This significantly speeds up convergence for the custom LEGO dataset.
  • Image Size (`imgsz=640`): The native resolution of the network. LEGO images are automatically resized (downsampled or upsampled) to this dimension before entering the backbone.

Evaluation and Results

Upon completion of the training phase, the model's performance is qualitatively evaluated by running inference on unseen images from the validation set. The output consists of the original input images overlaid with prediction annotations.
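
For a quantitative complement, the Ultralytics SDK can also evaluate the trained weights against the validation split defined in data.yaml; a minimal sketch:

from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")
metrics = model.val(data="/content/data.yaml")  # evaluates on the 'val' split
print(metrics.box.map50)  # mean Average Precision at IoU 0.50
print(metrics.box.map)    # mAP averaged over IoU 0.50-0.95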