How It Works - AI Dog Detection Technology

The Pipeline at a Glance

When a dog walks past the camera, it triggers a series of AI models. Here's what happens in about 500 milliseconds:

Camera Feed (Scrypted NVR) → Detection (YOLOv5) → Crop & Filter (OpenCV) → Breed ID (ViT) → Recognition (DINOv2)

Stage 1: Spotting Dogs with YOLO

Model: YOLOv5s (ultralytics/yolov5)

The first stage runs YOLO object detection on the camera feed. This can be deployed in several ways: on an NVR like Scrypted, on a separate server processing RTSP streams, or directly on edge devices. YOLO (You Only Look Once) is well-suited for this because it's fast enough to process video in real-time.

When YOLO identifies a potential dog (class ID 16 in the COCO dataset, confidence > 30%), it saves the frame with bounding box coordinates. The downstream pipeline checks for new detections periodically and processes them in batches.
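The filter described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the `Detection` record and the `keep_dog_detections` helper are hypothetical names, and the sample detections are made up.

```python
from dataclasses import dataclass

DOG_CLASS_ID = 16        # "dog" in the 80-class COCO label list used by YOLOv5
MIN_CONFIDENCE = 0.30    # threshold quoted in the text


@dataclass
class Detection:
    class_id: int
    confidence: float
    box: tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def keep_dog_detections(detections: list[Detection]) -> list[Detection]:
    """Return only detections labelled as dogs above the confidence threshold."""
    return [
        d for d in detections
        if d.class_id == DOG_CLASS_ID and d.confidence > MIN_CONFIDENCE
    ]


# Made-up detections for illustration only.
detections = [
    Detection(16, 0.82, (120, 200, 380, 420)),  # confident dog
    Detection(16, 0.21, (10, 10, 40, 30)),      # dog, but below the threshold
    Detection(0, 0.95, (400, 50, 480, 300)),    # person (COCO class 0)
]
dogs = keep_dog_detections(detections)
```

Only the first detection survives; the box coordinates travel with it so the later cropping stage can use them.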

Why batch processing? Running ML models one image at a time is inefficient. Processing 30-50 images per batch yields about 3-5x better throughput on the GPU. The hourly batch job takes around 10-15 minutes to process a full day of detections.
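Grouping pending detections into fixed-size batches might look like the sketch below. The `iter_batches` helper and the file names are hypothetical; the default of 32 matches the batch size quoted later on this page.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int = 32) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Made-up file names standing in for saved detection frames.
pending = [f"detection_{i:03d}.jpg" for i in range(70)]
batch_sizes = [len(batch) for batch in iter_batches(pending)]
```

With 70 pending images this produces two full batches and one partial batch, so the model is invoked three times instead of seventy.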

Stage 2: Cropping & Quality Control

Tools: OpenCV, PIL

Not every detection is worth keeping. Dogs running fast produce motion blur. Distant dogs are only a few pixels. Some "dogs" are actually humans (YOLO occasionally misclassifies). This stage filters out low-quality detections.

Quality checks

Blur detection — Uses the Laplacian variance method. Sharp images score above 100; blurry ones are discarded.
Minimum size — Crops smaller than 64×64 pixels lack sufficient detail for breed classification.
Bounding box expansion — The crop is padded by 25% on each side to avoid cutting off ears or tails.
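The size check and the box padding can be sketched in plain Python. Two assumptions are made here that the text does not confirm: the 25% padding is taken relative to the box's own width and height, and the padded box is clamped to the frame. The function names are hypothetical. The blur score itself is not computed here; with OpenCV it is commonly obtained as `cv2.Laplacian(gray, cv2.CV_64F).var()`.

```python
MIN_CROP_SIZE = 64   # pixels, per the minimum-size check
PADDING = 0.25       # assumed: fraction of box width/height added on each side


def expand_box(box, frame_width, frame_height, padding=PADDING):
    """Pad (x1, y1, x2, y2) by a fraction of its size, clamped to the frame."""
    x1, y1, x2, y2 = box
    pad_x = (x2 - x1) * padding
    pad_y = (y2 - y1) * padding
    return (
        max(0, round(x1 - pad_x)),
        max(0, round(y1 - pad_y)),
        min(frame_width, round(x2 + pad_x)),
        min(frame_height, round(y2 + pad_y)),
    )


def is_large_enough(box, min_size=MIN_CROP_SIZE):
    """True when the box is at least min_size pixels in both dimensions."""
    x1, y1, x2, y2 = box
    return (x2 - x1) >= min_size and (y2 - y1) >= min_size


expanded = expand_box((100, 100, 300, 200), frame_width=1920, frame_height=1080)
near_edge = expand_box((0, 1000, 200, 1080), frame_width=1920, frame_height=1080)
```

Clamping matters for dogs at the edge of the frame: without it, the padded crop would request pixels that do not exist.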

When a dog triggers multiple frames in a single visit, the system retains only the sharpest image.
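One way to keep only the sharpest frame per visit is sketched below. It assumes each frame already carries a Laplacian-variance score and a `visit_id`; how frames are grouped into visits is not described on this page, so that identifier is hypothetical, as are the sample values.

```python
BLUR_THRESHOLD = 100.0  # Laplacian variance; higher means sharper


def sharpest_per_visit(frames):
    """Map visit_id -> path of the sharpest frame above the blur threshold.

    `frames` is an iterable of (visit_id, image_path, sharpness) tuples.
    """
    best = {}
    for visit_id, image_path, sharpness in frames:
        if sharpness <= BLUR_THRESHOLD:
            continue  # too blurry to keep
        if visit_id not in best or sharpness > best[visit_id][1]:
            best[visit_id] = (image_path, sharpness)
    return {visit_id: path for visit_id, (path, _) in best.items()}


# Made-up scores for illustration only.
frames = [
    ("visit_a", "a1.jpg", 85.0),    # blurry, dropped
    ("visit_a", "a2.jpg", 240.5),
    ("visit_a", "a3.jpg", 310.2),   # sharpest frame of visit_a
    ("visit_b", "b1.jpg", 60.0),    # the only frame, but blurry
]
kept = sharpest_per_visit(frames)
```

A visit whose frames are all blurry produces no entry at all, which matches the rule that blurry detections are discarded.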

Stage 3: Breed Classification

Model: wesleyacheng/dog-breeds-multiclass-image-classification-with-vit

This stage determines what kind of dog is in the frame. The system uses a Vision Transformer (ViT) model fine-tuned on dog breed images, capable of identifying 120+ breeds with high accuracy.

How ViT works

Unlike traditional CNNs that slide filters across an image, Vision Transformers divide the image into patches, flatten them into sequences, and process them with the same attention mechanism that powers GPT. Notably, the same architecture works effectively for both text and images.
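To make the patch step concrete, here is a small sketch of the arithmetic and the flattening. The 224×224 input and 16×16 patch size are typical ViT-Base settings and are assumed here; the page does not state the sizes this model uses. Both function names are hypothetical.

```python
def patch_grid(image_size: int = 224, patch_size: int = 16) -> tuple[int, int]:
    """Return (patches_per_side, total_patches) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side


def split_into_patches(image, patch_size):
    """Split a square 2-D grid (list of rows) into flattened patches, row-major."""
    size = len(image)
    patches = []
    for top in range(0, size, patch_size):
        for left in range(0, size, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


per_side, total = patch_grid()  # assumed ViT-Base style defaults
tiny = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 toy "image"
tiny_patches = split_into_patches(tiny, patch_size=2)
```

Under the assumed sizes the image becomes a sequence of 196 patches, which the transformer then treats much like a sentence of 196 tokens.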

# The model outputs probabilities for each breed
{
  "Golden_Retriever": 0.847,
  "Labrador_Retriever": 0.089,
  "Flat-Coated_Retriever": 0.031,
  ...
}
            

Handling mixed breeds

Mixed breeds like goldendoodles, labradoodles, and bernedoodles are common but weren't in the training data. The model often hedges between parent breeds for these dogs, so the system retains the top-3 predictions to handle ambiguous cases.
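Keeping the top-3 predictions could be as simple as the sketch below. The `top_k_breeds` helper is a hypothetical name and the scores are invented to resemble a goldendoodle-like dog; they are not real model output.

```python
def top_k_breeds(probabilities: dict[str, float], k: int = 3) -> list[tuple[str, float]]:
    """Return the k most probable breeds, highest first."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Invented scores for illustration only.
scores = {
    "Standard_Poodle": 0.41,
    "Golden_Retriever": 0.37,
    "Irish_Water_Spaniel": 0.09,
    "Labrador_Retriever": 0.05,
}
top3 = top_k_breeds(scores)
```

When the first two scores are this close, storing both parent breeds is more informative than committing to a single label.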

Stage 4: Individual Dog Recognition

Models: facebook/dinov2-base, ArcFace

This is the most sophisticated stage: recognizing that the golden retriever from Tuesday is the same one from last week. It works like facial recognition, applied to dogs.

Visual fingerprints with DINOv2

DINOv2 is a self-supervised vision model from Meta that learns to understand images without labels. It generates a 768-dimensional "embedding" for each dog—a mathematical fingerprint that captures what makes that dog visually unique.

Matching with ArcFace

Raw DINOv2 embeddings work reasonably well but aren't optimal for re-identification. A projection head trained with ArcFace loss, the same technique used in human face recognition systems, improves this. It compresses the 768 dimensions to 256 and pushes similar-looking but different dogs further apart in the embedding space, while keeping images of the same dog close together.
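The core idea of ArcFace is an angular margin added to the true class during training. The sketch below shows only that one step. The margin of 0.5 radians and scale of 64 are commonly cited defaults, assumed here for illustration and not this project's settings. It also skips the special handling real implementations apply when the angle plus margin exceeds π.

```python
import math


def arcface_target_logit(cos_theta: float, margin: float = 0.5, scale: float = 64.0) -> float:
    """Logit for the true class: scale * cos(theta + margin)."""
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding error
    theta = math.acos(cos_theta)
    return scale * math.cos(theta + margin)


def plain_logit(cos_theta: float, scale: float = 64.0) -> float:
    """Logit without a margin, as used for the non-target classes."""
    return scale * cos_theta


with_margin = arcface_target_logit(0.8)
without_margin = plain_logit(0.8)
```

Because the margin lowers the true-class logit, training has to pull embeddings of the same dog closer together to compensate, which is what makes a fixed similarity threshold workable at match time.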

# When a new dog is detected:
new_embedding = model(dog_image)  # 256-dim vector

# Compare against all known dogs
for known_dog in database:
    similarity = cosine_similarity(new_embedding, known_dog.embedding)
    if similarity > 0.65:
        # Same dog! Link to existing profile
        match = known_dog
        break
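else:
    # for/else: this branch runs only if the loop never hit `break`,
    # i.e. no known dog was similar enough
    # New dog! Create profile
    create_dog_profile(new_embedding)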
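The matching loop above relies on a `cosine_similarity` helper that is not shown. A dependency-free sketch of it, together with a first-match lookup mirroring that loop, might look like this. The `find_match` name and the toy 3-dimensional vectors are hypothetical; real embeddings would be 256-dimensional.

```python
import math

MATCH_THRESHOLD = 0.65  # similarity cutoff used in the matching loop


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def find_match(new_embedding, known_embeddings, threshold=MATCH_THRESHOLD):
    """Return the ID of the first known dog above the threshold, else None."""
    for dog_id, embedding in known_embeddings.items():
        if cosine_similarity(new_embedding, embedding) > threshold:
            return dog_id
    return None


known = {"dog_1": [1.0, 0.0, 0.0], "dog_2": [0.0, 1.0, 0.0]}  # toy vectors
match = find_match([0.1, 0.9, 0.0], known)
no_match = find_match([0.0, 0.0, 1.0], known)
```

Like the loop above, this returns the first profile that clears the threshold. Choosing the highest-scoring profile instead would be a reasonable variation if several dogs score above 0.65.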
            

Building dog profiles

Each recognized dog gets a profile that tracks: first and last visit dates, total visit count, all detected images, and the average breed confidence. Some dogs visit daily; others appear only once. Over time, the system builds familiarity with regular visitors.
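A profile with those fields could be modelled as in the sketch below. The `DogProfile` class, its field names, and the sample dates are hypothetical; the project's actual SQLite schema is not shown on this page.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DogProfile:
    first_visit: date
    last_visit: date
    visit_count: int = 0
    image_paths: list[str] = field(default_factory=list)
    avg_breed_confidence: float = 0.0

    def record_visit(self, day: date, image_path: str, breed_confidence: float) -> None:
        """Update counters and the running mean of breed confidence."""
        self.visit_count += 1
        self.first_visit = min(self.first_visit, day)
        self.last_visit = max(self.last_visit, day)
        self.image_paths.append(image_path)
        # Incremental mean: new_mean = old_mean + (x - old_mean) / n
        self.avg_breed_confidence += (
            breed_confidence - self.avg_breed_confidence
        ) / self.visit_count


# Made-up visits for illustration only.
profile = DogProfile(first_visit=date(2024, 5, 7), last_visit=date(2024, 5, 7))
profile.record_visit(date(2024, 5, 7), "visit_001.jpg", 0.80)
profile.record_visit(date(2024, 5, 14), "visit_002.jpg", 0.90)
```

The incremental mean avoids storing every past confidence value just to recompute the average on each visit.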

Privacy First

Model: openai/clip-vit-base-patch32

The camera inevitably captures people as well—mail carriers, neighbors, pedestrians. To protect privacy, the system actively filters out any human detections.

Two-stage human filtering

Geometric filter — Humans have different proportions than dogs. If a detection is tall and narrow (aspect ratio > 2.2), it's likely a person standing upright. This catches most cases instantly.
CLIP verification — For ambiguous cases, OpenAI's CLIP model performs zero-shot classification: "Is this a dog or a human?" If human confidence exceeds 40%, the detection is deleted.
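The two checks can be chained as in the sketch below. Two assumptions are made: "aspect ratio" is read as height divided by width, and every detection that passes the geometric check counts as ambiguous and goes to CLIP. The CLIP call is represented by a zero-argument callable so the sketch runs without the model; `is_human` and `fake_clip` are hypothetical names.

```python
ASPECT_RATIO_LIMIT = 2.2       # assumed to mean height / width
HUMAN_CONFIDENCE_LIMIT = 0.40  # CLIP "human" probability cutoff


def is_human(box_width, box_height, clip_human_probability):
    """Apply the cheap geometric check first, then fall back to the CLIP score.

    `clip_human_probability` is a zero-argument callable, so the slower CLIP
    model only runs when the geometric check is inconclusive.
    """
    if box_width <= 0 or box_height <= 0:
        raise ValueError("box dimensions must be positive")
    if box_height / box_width > ASPECT_RATIO_LIMIT:
        return True
    return clip_human_probability() > HUMAN_CONFIDENCE_LIMIT


calls = []  # records which scores the stand-in CLIP was actually asked for


def fake_clip(score):
    """Build a stand-in for the CLIP call that returns a fixed score."""
    def run():
        calls.append(score)
        return score
    return run


tall = is_human(100, 260, fake_clip(0.0))        # ratio 2.6: flagged without CLIP
dog_like = is_human(300, 200, fake_clip(0.05))   # CLIP score below the cutoff
unsure = is_human(200, 300, fake_clip(0.55))     # CLIP score above the cutoff
```

Running the geometric check first means the tall, narrow box never reaches CLIP, which is where the time savings come from.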

What gets deleted: Any detection flagged as human is immediately removed—both the image file and the database record. No human photos are retained, period.

GPU Acceleration with Apple Silicon

The entire pipeline runs on a Mac Mini M2 in a home server setup. Apple's Metal Performance Shaders (MPS) backend allows PyTorch to leverage the GPU, which significantly improves ML inference performance.
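Selecting the MPS backend in PyTorch usually comes down to a short availability check. The sketch below is one way to write it, with a CPU fallback so it also runs on machines without PyTorch or without a supported GPU; `pick_device` is a hypothetical helper name.

```python
def pick_device() -> str:
    """Return "mps", "cuda", or "cpu" depending on what PyTorch can use."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch not installed; nothing to accelerate
    mps = getattr(torch.backends, "mps", None)  # absent on older PyTorch builds
    if mps is not None and mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


device = pick_device()
# A model would then be moved to the chosen device with model.to(device).
```

On a Mac Mini M2 with a recent PyTorch build this would be expected to select "mps"; elsewhere it falls back to CUDA or the CPU.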

Per detection: ~100ms
Batch size: 32
Batch speedup: 3-5x
Monthly cost: $0

Why local processing matters

Privacy — Camera footage never leaves the local network
No API costs — Cloud vision APIs on every detection would be cost-prohibitive
Speed — No network round-trips means faster processing
Control — Full ownership of data and flexibility to adjust models

The Tech Stack

Camera — Reolink RLC-810WA (any RTSP-compatible camera works)
Object Detection — YOLOv5s (deployable via NVR, server, or edge device)
Breed Classification — ViT fine-tuned on dog breeds
Re-identification — DINOv2 + custom ArcFace head
Human Filtering — CLIP zero-shot classification
Backend — Python, Flask, SQLite
Hardware — Mac Mini M2, runs 24/7

Run It Yourself

The system is open source. Anyone with a camera pointed at an area where dogs pass by can set up their own instance. Requirements:

A compatible IP camera (most RTSP cameras work)
YOLO object detection — options include an NVR like Scrypted, a separate server, or edge deployment
A machine with a GPU for the ML pipeline (Apple Silicon, NVIDIA, etc.)
Python 3.10+ and approximately 10GB of storage for models

The setup process requires some familiarity with Python and command-line tools. Work is ongoing to simplify deployment.
