How DogBot Works

DogBot is an AI system that monitors a camera feed for dogs. When one walks by, it identifies the breed and, over time, learns to recognize individual dogs that visit regularly. The entire system runs locally on a Mac Mini—no cloud services, no subscriptions.

The Pipeline at a Glance

When a dog walks past the camera, it triggers a series of AI models. Here's what happens in about 500 milliseconds:

Camera Feed (Scrypted NVR) → Detection (YOLOv5) → Crop & Filter (OpenCV) → Breed ID (ViT) → Recognition (DINOv2)

Stage 1: Spotting Dogs with YOLO

YOLOv5s ultralytics/yolov5 ↗

The first stage runs YOLO object detection on the camera feed. Detection can be deployed in several ways: on an NVR like Scrypted, on a separate server processing RTSP streams, or directly on edge devices. YOLO (You Only Look Once) is well suited to this because it is fast enough to process video in real time.

When YOLO identifies a potential dog (class index 16 in the 80-class COCO label set that YOLOv5 uses, with confidence above 30%), it saves the frame along with the bounding box coordinates. The downstream pipeline checks for new detections periodically and processes them in batches.
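
The filtering rule above can be sketched in a few lines. This is a minimal illustration, not DogBot's actual code: it assumes detections arrive as `(x1, y1, x2, y2, confidence, class_index)` rows, the layout YOLOv5 reports in `results.xyxy`, and the names `DOG_CLASS`, `MIN_CONFIDENCE`, and `keep_dog_detections` are hypothetical.

```python
DOG_CLASS = 16         # "dog" in the 80-class COCO label list used by YOLOv5
MIN_CONFIDENCE = 0.30  # threshold quoted in the text


def keep_dog_detections(rows):
    """Return only the rows that are dogs with confidence above the threshold."""
    return [
        row for row in rows
        if int(row[5]) == DOG_CLASS and row[4] > MIN_CONFIDENCE
    ]


# Illustrative rows: (x1, y1, x2, y2, confidence, class_index)
detections = [
    (10, 20, 110, 220, 0.82, 16),  # confident dog: kept
    (30, 40, 90, 160, 0.25, 16),   # dog below the threshold: dropped
    (5, 5, 60, 200, 0.91, 0),      # person (class 0): dropped
]
dogs = keep_dog_detections(detections)
```

Only the first row survives; its bounding box is what the cropping stage would receive.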

Why batch processing? Running ML models one image at a time is inefficient. Processing 30-50 images per batch yields about 3-5x better throughput on the GPU. The hourly batch job takes around 10-15 minutes to process a full day of detections.
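
Grouping pending detections into fixed-size batches might look like the sketch below. The batch size of 32 matches the figure quoted later on this page; the helper name `batches` and the file names are hypothetical.

```python
def batches(items, batch_size=32):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical file names standing in for saved detection frames
pending = [f"detection_{i:03d}.jpg" for i in range(70)]
sizes = [len(batch) for batch in batches(pending)]
```

With 70 pending images this produces two full batches and one partial batch, so the model is called three times instead of seventy.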

Stage 2: Cropping & Quality Control

OpenCV PIL

Not every detection is worth keeping. Dogs running fast produce motion blur. Distant dogs are only a few pixels. Some "dogs" are actually humans (YOLO occasionally misclassifies). This stage filters out low-quality detections.

Quality checks

When a dog triggers multiple frames in a single visit, the system retains only the sharpest image.
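
The text does not say how sharpness is measured. One common heuristic is the variance of the Laplacian, which with OpenCV is usually written as `cv2.Laplacian(gray, cv2.CV_64F).var()`. The sketch below is a dependency-free version of that idea on tiny grayscale grids; treating it as DogBot's metric is an assumption, and the function and frame names are hypothetical.

```python
from statistics import pvariance


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over the interior pixels."""
    responses = []
    for y in range(1, len(gray) - 1):
        for x in range(1, len(gray[0]) - 1):
            responses.append(
                gray[y - 1][x] + gray[y + 1][x]
                + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
    return pvariance(responses)


def sharpest(frames):
    """Return the name of the frame with the highest sharpness score."""
    return max(frames, key=lambda name: laplacian_variance(frames[name]))


# A smooth gradient (no edges) versus a checkerboard (edges everywhere)
blurry = [[10 * x for x in range(4)] for _ in range(4)]
sharp = [[255 if (x + y) % 2 else 0 for x in range(4)] for y in range(4)]

frames = {"frame_a.jpg": blurry, "frame_b.jpg": sharp}
best = sharpest(frames)
```

A blurred frame has weak edges, so its Laplacian responses cluster near zero and the variance is low; the frame with the highest score is the one kept.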

Stage 3: Breed Classification

wesleyacheng/dog-breeds-multiclass-image-classification-with-vit ↗

This stage determines what kind of dog is in the frame. The system uses a Vision Transformer (ViT) model fine-tuned on dog breed images, capable of identifying 120+ breeds with high accuracy.

How ViT works

Unlike traditional CNNs, which slide filters across an image, Vision Transformers divide the image into fixed-size patches, flatten each patch into a vector, and process the resulting sequence with the same self-attention mechanism that powers GPT. Notably, the same architecture works effectively for both text and images.
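
The patch step can be illustrated on a toy image. This sketch covers only the splitting and flattening; a real ViT also applies a learned linear projection and position embeddings to each patch, which are omitted here. The 4x4 single-channel image and the patch size of 2 are chosen for readability and are not the model's actual settings.

```python
def patchify(image, patch):
    """Split a 2D image into non-overlapping patch x patch tiles.

    Each tile is flattened row by row. Assumes the image height and
    width are exact multiples of the patch size.
    """
    height, width = len(image), len(image[0])
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([
                image[y][x]
                for y in range(top, top + patch)
                for x in range(left, left + patch)
            ])
    return tokens


# 4x4 toy image whose pixel values are 0..15 in reading order
image = [[row * 4 + col for col in range(4)] for row in range(4)]
tokens = patchify(image, patch=2)
```

The result is a sequence of four "tokens" of four values each, which is the form the attention layers consume, much like a sentence of four words.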

    # The model outputs probabilities for each breed
    {
        "Golden_Retriever": 0.847,
        "Labrador_Retriever": 0.089,
        "Flat-Coated_Retriever": 0.031,
        ...
    }

Handling mixed breeds

Mixed breeds like goldendoodles, labradoodles, and bernedoodles are common but weren't in the training data. The model often hedges between parent breeds for these dogs, so the system retains the top-3 predictions to handle ambiguous cases.
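
Keeping the top three predictions is a small sorting step. In this sketch the probabilities and the label `Standard_Poodle` are made up to show a model hedging between the parent breeds of a goldendoodle; `top_k` is a hypothetical helper name.

```python
def top_k(probabilities, k=3):
    """Return the k (breed, probability) pairs with the highest probability."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Illustrative output for a mixed-breed dog
scores = {
    "Golden_Retriever": 0.41,
    "Standard_Poodle": 0.37,
    "Labrador_Retriever": 0.12,
    "Beagle": 0.02,
}
candidates = top_k(scores)
```

When the top two probabilities are this close, storing all three candidates preserves the ambiguity instead of forcing a single label.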

Stage 4: Individual Dog Recognition

facebook/dinov2-base ↗ ArcFace

This is the most sophisticated stage: recognizing that the golden retriever from Tuesday is the same one from last week. It works like facial recognition, applied to dogs.

Visual fingerprints with DINOv2

DINOv2 is a self-supervised vision model from Meta that learns to understand images without labels. It generates a 768-dimensional "embedding" for each dog—a mathematical fingerprint that captures what makes that dog visually unique.

Matching with ArcFace

Raw DINOv2 embeddings work reasonably well but aren't optimal for re-identification. A projection head trained with ArcFace loss, the same technique used in human face recognition systems, improves on them. It compresses the 768 dimensions to 256 and pushes the embeddings of different but similar-looking dogs apart while keeping images of the same dog close together.
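
The core idea of ArcFace is an angular margin added to the true class during training. The sketch below shows only that logit adjustment. The scale of 64 and margin of 0.5 are the values commonly cited from the ArcFace paper, not DogBot's settings (the source does not state them), and `arcface_logit` is a hypothetical name. Production implementations also handle the case where the angle plus margin exceeds π, which is skipped here.

```python
import math


def arcface_logit(cosine, is_target, scale=64.0, margin=0.5):
    """Scaled logit for one class; the true class gets an angular margin."""
    cosine = max(-1.0, min(1.0, cosine))  # guard against rounding outside [-1, 1]
    theta = math.acos(cosine)
    if is_target:
        theta += margin  # make the true class harder to satisfy
    return scale * math.cos(theta)


plain = arcface_logit(0.8, is_target=False)
with_margin = arcface_logit(0.8, is_target=True)
```

Because the margin lowers the true-class logit, the network has to place embeddings of the same dog closer together in angle to reach the same loss, which is what separates similar-looking individuals.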

    # When a new dog is detected:
    new_embedding = model(dog_image)  # 256-dim vector

    # Compare against all known dogs
    for known_dog in database:
        similarity = cosine_similarity(new_embedding, known_dog.embedding)
        if similarity > 0.65:
            # Same dog! Link to existing profile
            match = known_dog
            break
    else:
        # New dog! Create profile
        create_dog_profile(new_embedding)
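
A self-contained version of the matching step is sketched below with plain Python lists standing in for 256-dimensional embeddings. It differs from the snippet above in one respect: it returns the most similar known dog above the threshold rather than the first one found, which is a design variation and not necessarily what DogBot does. The dog IDs and vectors are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_match(new_embedding, known, threshold=0.65):
    """Return the ID of the most similar known dog above threshold, else None."""
    best_id, best_score = None, threshold
    for dog_id, embedding in known.items():
        score = cosine_similarity(new_embedding, embedding)
        if score > best_score:
            best_id, best_score = dog_id, score
    return best_id


# Toy 3-dimensional embeddings in place of real 256-dimensional ones
known = {"dog_001": [1.0, 0.0, 0.0], "dog_002": [0.6, 0.8, 0.0]}
match = find_match([0.7, 0.7, 0.1], known)
stranger = find_match([0.0, 0.0, 1.0], known)
```

A return value of `None` is the signal to create a new profile. Picking the best score avoids linking a dog to whichever profile happens to come first when two known dogs both clear the threshold.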

Building dog profiles

Each recognized dog gets a profile that tracks: first and last visit dates, total visit count, all detected images, and the average breed confidence. Some dogs visit daily; others appear only once. Over time, the system builds familiarity with regular visitors.
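
The fields listed above map naturally onto a small record type. This is a sketch of one possible shape, not DogBot's schema: the class name, field names, and the simplification of one image per visit are all assumptions.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DogProfile:
    first_visit: date
    last_visit: date
    visit_count: int = 0
    images: list = field(default_factory=list)
    confidence_sum: float = 0.0

    @property
    def average_breed_confidence(self):
        """Mean breed confidence over all stored images (0.0 if none)."""
        return self.confidence_sum / len(self.images) if self.images else 0.0

    def record_visit(self, day, image_path, breed_confidence):
        """Update the date range, counters, and image list for one visit."""
        self.first_visit = min(self.first_visit, day)
        self.last_visit = max(self.last_visit, day)
        self.visit_count += 1
        self.images.append(image_path)
        self.confidence_sum += breed_confidence


# Illustrative dates and file names
profile = DogProfile(first_visit=date(2024, 1, 2), last_visit=date(2024, 1, 2))
profile.record_visit(date(2024, 1, 2), "visit_1.jpg", 0.8)
profile.record_visit(date(2024, 1, 9), "visit_2.jpg", 0.9)
```

Storing a running sum rather than the average itself keeps the update cheap and avoids drift as more visits are added.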

Privacy First

openai/clip-vit-base-patch32 ↗

The camera inevitably captures people as well—mail carriers, neighbors, pedestrians. To protect privacy, the system actively filters out any human detections.

Two-stage human filtering

What gets deleted: Any detection flagged as human is immediately removed—both the image file and the database record. No human photos are retained, period.
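
Removing both the file and the record could look like the sketch below. The source does not say which database DogBot uses, so SQLite, the `detections` table, its columns, and the function name `purge_detection` are all assumptions made for illustration.

```python
import sqlite3
import tempfile
from pathlib import Path


def purge_detection(conn, detection_id):
    """Delete the image file and the database row for one detection."""
    row = conn.execute(
        "SELECT image_path FROM detections WHERE id = ?", (detection_id,)
    ).fetchone()
    if row is None:
        return False
    Path(row[0]).unlink(missing_ok=True)  # remove the image from disk
    conn.execute("DELETE FROM detections WHERE id = ?", (detection_id,))
    conn.commit()
    return True


# Demo with a temporary file and an in-memory database
workdir = Path(tempfile.mkdtemp())
image = workdir / "frame.jpg"
image.write_bytes(b"placeholder")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE detections (id INTEGER PRIMARY KEY, image_path TEXT)")
conn.execute("INSERT INTO detections VALUES (1, ?)", (str(image),))

removed = purge_detection(conn, 1)
remaining = conn.execute("SELECT COUNT(*) FROM detections").fetchone()[0]
```

Deleting the file before the row means a crash in between leaves a record pointing at a missing file, which is easier to detect and clean up than an orphaned photo of a person.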

GPU Acceleration with Apple Silicon

The entire pipeline runs on a Mac Mini M2 in a home server setup. Apple's Metal Performance Shaders (MPS) backend allows PyTorch to leverage the GPU, which significantly improves ML inference performance.
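
Selecting the MPS backend in PyTorch is typically a one-time check at startup. The sketch below uses `torch.backends.mps.is_available()`, which exists in recent PyTorch releases, and falls back to the CPU when PyTorch or MPS support is missing. The helper name `pick_device` is hypothetical.

```python
def pick_device():
    """Return "mps" when PyTorch can use the Apple GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed in this environment
    mps = getattr(torch.backends, "mps", None)  # absent in older PyTorch builds
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
```

Models and input tensors are then moved with `.to(device)`, so the same code runs on machines with or without Apple Silicon.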

Per detection: ~100ms
Batch size: 32
Batch speedup: 3-5x
Monthly cost: $0

Why local processing matters

The Tech Stack

Run It Yourself

The system is open source. Anyone with a camera pointed at an area where dogs pass by can set up their own instance. Requirements:

The setup process requires some familiarity with Python and command-line tools. Work is ongoing to simplify deployment.
