How DogBot Works
DogBot is an AI system that monitors a camera feed for dogs. When one walks by, it identifies the breed and, over time, learns to recognize individual dogs that visit regularly. The entire system runs locally on a Mac Mini—no cloud services, no subscriptions.
The Pipeline at a Glance
When a dog walks past the camera, it triggers a series of AI models. Here's what happens in about 500 milliseconds:
Stage 1: Spotting Dogs with YOLO
Model: YOLOv5s (ultralytics/yolov5)
The first stage runs YOLO object detection on the camera feed. This can be deployed in several ways: on an NVR like Scrypted, on a separate server processing RTSP streams, or directly on edge devices. YOLO (You Only Look Once) is well-suited for this because it's fast enough to process video in real time.
When YOLO identifies a potential dog (class ID 16 in the COCO dataset, confidence > 30%), it saves the frame with bounding box coordinates. The downstream pipeline checks for new detections periodically and processes them in batches.
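The filter described above can be sketched in a few lines. This is a minimal illustration, not DogBot's actual code: it assumes each detection arrives as a row of `[x1, y1, x2, y2, confidence, class_id]`, which is the layout YOLOv5 uses for its results, and the names `Detection` and `keep_dog_detections` are hypothetical.

```python
from typing import NamedTuple

COCO_DOG_CLASS_ID = 16   # "dog" in the 80-class COCO label list (from the text)
MIN_CONFIDENCE = 0.30    # confidence threshold quoted in the text


class Detection(NamedTuple):
    x1: float
    y1: float
    x2: float
    y2: float
    confidence: float
    class_id: int


def keep_dog_detections(rows):
    """Keep only rows labelled as dogs with confidence above the threshold.

    `rows` is any iterable of [x1, y1, x2, y2, confidence, class_id] sequences.
    """
    detections = [
        Detection(*row[:4], float(row[4]), int(row[5])) for row in rows
    ]
    return [
        d for d in detections
        if d.class_id == COCO_DOG_CLASS_ID and d.confidence > MIN_CONFIDENCE
    ]
```

Only the surviving rows would then be written out with their bounding boxes for the batch job to pick up.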
Why batch processing? Running ML models one image at a time is inefficient. Processing 30-50 images per batch yields about 3-5x better throughput on the GPU. The hourly batch job takes around 10-15 minutes to process a full day of detections.
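As a rough sketch of the batching idea, pending detections can be grouped into fixed-size chunks before they are handed to a model. The helper name `iter_batches` and the default of 32 are assumptions chosen to fall inside the 30-50 range mentioned above.

```python
from itertools import islice


def iter_batches(items, batch_size=32):
    """Yield successive lists of at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Each yielded list would be stacked into a single tensor and passed through the model in one forward pass, which is where the throughput gain comes from.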
Stage 2: Cropping & Quality Control
Tools: OpenCV, PIL
Not every detection is worth keeping. Dogs running fast produce motion blur. Distant dogs are only a few pixels. Some "dogs" are actually humans (YOLO occasionally misclassifies). This stage filters out low-quality detections.
Quality checks
- Blur detection — Uses the Laplacian variance method. Sharp images score above 100; blurry ones are discarded.
- Minimum size — Crops smaller than 64×64 pixels lack sufficient detail for breed classification.
- Bounding box expansion — The crop is padded by 25% on each side to avoid cutting off ears or tails.
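The size and padding checks above might look like the following. This sketch makes two assumptions the text does not confirm: "25% on each side" is read as 25% of the box's own width and height, and the 64-pixel minimum is applied to both dimensions of the crop. The function names are hypothetical.

```python
MIN_CROP_SIDE = 64   # minimum width and height in pixels (from the text)
PAD_FRACTION = 0.25  # padding per side, as a fraction of box size (from the text)


def expand_box(x1, y1, x2, y2, image_width, image_height, pad=PAD_FRACTION):
    """Pad a box on every side and clamp it to the image bounds."""
    width, height = x2 - x1, y2 - y1
    return (
        max(0, int(round(x1 - pad * width))),
        max(0, int(round(y1 - pad * height))),
        min(image_width, int(round(x2 + pad * width))),
        min(image_height, int(round(y2 + pad * height))),
    )


def is_large_enough(box, min_side=MIN_CROP_SIDE):
    """Return True if the box is at least `min_side` pixels in both dimensions."""
    x1, y1, x2, y2 = box
    return (x2 - x1) >= min_side and (y2 - y1) >= min_side
```

For a 100×200 box at (100, 100) in a 1920×1080 frame, `expand_box` returns `(75, 50, 225, 350)`.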
When a dog triggers multiple frames in a single visit, the system retains only the sharpest image.
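A small sketch of the blur score and the "keep the sharpest frame" rule follows. It computes the Laplacian variance with NumPy only, so it runs without OpenCV; with OpenCV installed, the usual one-liner is `cv2.Laplacian(gray, cv2.CV_64F).var()`, which differs slightly at the image border. The function names and the `(name, image)` input format are assumptions.

```python
import numpy as np

BLUR_THRESHOLD = 100.0  # sharpness cutoff quoted in the text


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over the image interior."""
    g = np.asarray(gray, dtype=np.float64)
    laplacian = (
        g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
        - 4.0 * g[1:-1, 1:-1]
    )
    return float(laplacian.var())


def sharpest_frame(frames, threshold=BLUR_THRESHOLD):
    """Return the name of the sharpest frame above the threshold, or None.

    `frames` is a list of (name, grayscale_image) pairs from a single visit.
    """
    scored = [(laplacian_variance(image), name) for name, image in frames]
    sharp = [entry for entry in scored if entry[0] > threshold]
    if not sharp:
        return None
    return max(sharp, key=lambda entry: entry[0])[1]
```

A uniform gray image scores 0 and is rejected, while an image with strong edges scores far above the threshold.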
Stage 3: Breed Classification
Model: wesleyacheng/dog-breeds-multiclass-image-classification-with-vit
This stage determines what kind of dog is in the frame. The system uses a Vision Transformer (ViT) model fine-tuned on dog breed images, capable of identifying 120+ breeds with high accuracy.
How ViT works
Unlike traditional CNNs that slide filters across an image, Vision Transformers divide the image into fixed-size patches, flatten each patch into a vector, and process the resulting sequence with the same attention mechanism that powers GPT. Notably, the same architecture works effectively for both text and images.
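To make the patch step concrete, here is a minimal NumPy sketch of turning an image into a sequence of flattened patches. It assumes 16×16 patches on a 224×224 RGB input, a common ViT configuration that the text does not confirm for this model, and it leaves out the learned linear projection and position embeddings that follow in a real ViT.

```python
import numpy as np


def patchify(image, patch_size=16):
    """Split an (H, W, C) image into a (num_patches, patch_size*patch_size*C) array."""
    height, width, channels = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    rows, cols = height // patch_size, width // patch_size
    # Separate the patch grid from the pixels inside each patch.
    patches = image.reshape(rows, patch_size, cols, patch_size, channels)
    # Reorder to (grid_row, grid_col, pixel_row, pixel_col, channel).
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One flattened vector per patch, in row-major grid order.
    return patches.reshape(rows * cols, patch_size * patch_size * channels)
```

Under these assumptions a 224×224×3 image becomes 196 vectors of length 768, which the transformer then treats much like a sequence of word tokens.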
Handling mixed breeds
Mixed breeds like goldendoodles, labradoodles, and bernedoodles are common but weren't in the training data. The model often hedges between parent breeds for these dogs, so the system retains the top-3 predictions to handle ambiguous cases.
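Keeping the top three predictions could be done roughly as follows. This sketch assumes the classifier exposes one raw score (logit) per breed. The function name and the example labels are hypothetical.

```python
import math


def top_k_breeds(logits, labels, k=3):
    """Return the k most likely (label, probability) pairs via a softmax."""
    peak = max(logits)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    probabilities = [value / total for value in exps]
    ranked = sorted(
        zip(labels, probabilities), key=lambda pair: pair[1], reverse=True
    )
    return ranked[:k]
```

For a goldendoodle, the first two entries would often be the two parent breeds with similar probabilities, which is the ambiguity the top-3 list is meant to preserve.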
Stage 4: Individual Dog Recognition
Models: facebook/dinov2-base, ArcFace
This is the most sophisticated stage: recognizing that the golden retriever from Tuesday is the same one from last week. It works like facial recognition, applied to dogs.
Visual fingerprints with DINOv2
DINOv2 is a self-supervised vision model from Meta that learns to understand images without labels. It generates a 768-dimensional "embedding" for each dog—a mathematical fingerprint that captures what makes that dog visually unique.
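One common way to compare such fingerprints is cosine similarity against a gallery of known dogs. The text does not say which metric or threshold DogBot uses, so the sketch below is purely an assumption, including the 0.6 threshold and the function names.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    denominator = np.linalg.norm(a) * np.linalg.norm(b)
    if denominator == 0.0:
        return 0.0
    return float(a @ b / denominator)


def best_match(query, gallery, threshold=0.6):
    """Return (dog_id, score) for the closest gallery entry.

    `gallery` maps dog_id -> embedding. If no entry reaches the
    (hypothetical) threshold, dog_id is None, meaning "new dog".
    """
    best_id, best_score = None, -1.0
    for dog_id, embedding in gallery.items():
        score = cosine_similarity(query, embedding)
        if score > best_score:
            best_id, best_score = dog_id, score
    if best_score < threshold:
        return None, best_score
    return best_id, best_score
```

A new photo of a known dog should land close to that dog's stored embedding, while an unseen dog falls below the threshold and would start a new profile.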
Matching with ArcFace
Raw DINOv2 embeddings work reasonably well but aren't optimal for re-identification. A projection head trained with ArcFace loss, the same technique used in human face recognition systems, improves on them. It compresses the 768 dimensions to 256 and pushes apart the embeddings of different dogs that look alike, so individuals are easier to tell apart.
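For readers curious about what ArcFace changes during training, the sketch below computes ArcFace-style logits with NumPy. The true-class logit becomes s·cos(θ + m) instead of s·cos(θ), which forces the model to leave an angular gap between identities. The scale and margin values are illustrative and not taken from DogBot, and the sketch omits the handling some implementations add for θ + m > π.

```python
import numpy as np


def arcface_logits(embeddings, class_weights, labels, scale=64.0, margin=0.5):
    """Return ArcFace logits of shape (batch, num_identities).

    embeddings:    (batch, dim) projected embeddings, e.g. dim = 256
    class_weights: (num_identities, dim) one learnable vector per dog
    labels:        (batch,) integer identity of each embedding
    """
    labels = np.asarray(labels)
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    # Cosine of the angle between each embedding and each identity vector.
    cosine = np.clip(x @ w.T, -1.0, 1.0)
    theta = np.arccos(cosine)
    rows = np.arange(len(labels))
    adjusted = cosine.copy()
    # Add the angular margin only to the true identity of each sample.
    adjusted[rows, labels] = np.cos(theta[rows, labels] + margin)
    return scale * adjusted
```

These logits would be fed to an ordinary cross-entropy loss. At inference time only the projected embeddings are used, and the identity vectors are discarded.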
Building dog profiles
Each recognized dog gets a profile that tracks: first and last visit dates, total visit count, all detected images, and the average breed confidence. Some dogs visit daily; others appear only once. Over time, the system builds familiarity with regular visitors.
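A profile with the fields listed above could be modelled as a small dataclass. The class and field names are hypothetical, and the sketch assumes one retained image per visit, so the average breed confidence is taken over visits.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DogProfile:
    dog_id: str
    first_seen: date
    last_seen: date
    visit_count: int = 0
    image_paths: list[str] = field(default_factory=list)
    breed_confidence_sum: float = 0.0

    @property
    def average_breed_confidence(self) -> float:
        if self.visit_count == 0:
            return 0.0
        return self.breed_confidence_sum / self.visit_count

    def record_visit(self, day: date, image_path: str, breed_confidence: float) -> None:
        """Update the profile with one new sighting."""
        self.first_seen = min(self.first_seen, day)
        self.last_seen = max(self.last_seen, day)
        self.visit_count += 1
        self.image_paths.append(image_path)
        self.breed_confidence_sum += breed_confidence
```

Storing the running sum rather than the average keeps each update constant-time and avoids rounding drift over many visits.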
Privacy First
Model: openai/clip-vit-base-patch32
The camera inevitably captures people as well: mail carriers, neighbors, pedestrians. To protect privacy, the system actively filters out any human detections.
Two-stage human filtering
- Geometric filter — Humans have different proportions than dogs. If a detection is tall and narrow (aspect ratio > 2.2), it's likely a person standing upright. This catches most cases instantly.
- CLIP verification — For ambiguous cases, OpenAI's CLIP model performs zero-shot classification: "Is this a dog or a human?" If human confidence exceeds 40%, the detection is deleted.
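The two filters above can be combined as in the following sketch. It reads "aspect ratio" as height divided by width, and it runs the CLIP check on every detection the geometric filter does not catch, since the text does not define which cases count as ambiguous. The `clip_human_probability` argument is a hypothetical callable that would wrap the real CLIP model and return the probability assigned to the "human" prompt.

```python
ASPECT_RATIO_LIMIT = 2.2   # height / width above this is treated as a person (from the text)
HUMAN_PROB_LIMIT = 0.40    # CLIP "human" probability cutoff (from the text)


def is_human(box, clip_human_probability):
    """Return True if the detection should be treated as a person.

    box: (x1, y1, x2, y2) in pixels.
    clip_human_probability: zero-argument callable, only invoked when the
    cheap geometric test does not already decide the case.
    """
    x1, y1, x2, y2 = box
    width, height = x2 - x1, y2 - y1
    if width <= 0 or height <= 0:
        raise ValueError("box must have positive width and height")
    # Stage 1: tall, narrow boxes are assumed to be standing people.
    if height / width > ASPECT_RATIO_LIMIT:
        return True
    # Stage 2: fall back to the (more expensive) CLIP check.
    return clip_human_probability() > HUMAN_PROB_LIMIT
```

Passing the CLIP step as a callable means the model is never run for detections the geometric filter already rejects.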
What gets deleted: Any detection flagged as human is immediately removed—both the image file and the database record. No human photos are retained, period.
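Removing both the image file and its database record might look like this. The table name `detections` and its `id` and `image_path` columns are assumptions made for the example. The real schema is not described in the text.

```python
import sqlite3
from pathlib import Path


def delete_detection(conn, detection_id):
    """Delete a detection's image file and its database row.

    Returns True if a row existed, False otherwise.
    """
    row = conn.execute(
        "SELECT image_path FROM detections WHERE id = ?", (detection_id,)
    ).fetchone()
    if row is None:
        return False
    # Remove the file first so a failure here leaves the row for a retry.
    Path(row[0]).unlink(missing_ok=True)
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM detections WHERE id = ?", (detection_id,))
    return True
```

Deleting the file before the row means a crash between the two steps leaves a record that can be retried, rather than an orphaned human photo on disk.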
GPU Acceleration with Apple Silicon
The entire pipeline runs on a Mac Mini M2 in a home server setup. Apple's Metal Performance Shaders (MPS) backend allows PyTorch to leverage the GPU, which significantly improves ML inference performance.
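Selecting the MPS backend in PyTorch usually comes down to a short availability check. The helper below is a sketch with a hypothetical name. It falls back to CUDA or the CPU so the same code runs on other machines, and it returns `"cpu"` if PyTorch is not installed.

```python
def pick_device():
    """Return the name of the best available PyTorch device."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # torch.backends.mps is absent on older PyTorch releases.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"
```

A model and its inputs would then be moved with `model.to(pick_device())` before inference.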
Why local processing matters
- Privacy — Camera footage never leaves the local network
- No API costs — Cloud vision APIs on every detection would be cost-prohibitive
- Speed — No network round-trips means faster processing
- Control — Full ownership of data and flexibility to adjust models
The Tech Stack
- Camera — Reolink RLC-810WA (any RTSP-compatible camera works)
- Object Detection — YOLOv5s (deployable via NVR, server, or edge device)
- Breed Classification — ViT fine-tuned on dog breeds
- Re-identification — DINOv2 + custom ArcFace head
- Human Filtering — CLIP zero-shot classification
- Backend — Python, Flask, SQLite
- Hardware — Mac Mini M2, runs 24/7
Run It Yourself
The system is open source. Anyone with a camera pointed at an area where dogs pass by can set up their own instance. Requirements:
- A compatible IP camera (most RTSP cameras work)
- YOLO object detection — options include an NVR like Scrypted, a separate server, or edge deployment
- A machine with a GPU for the ML pipeline (Apple Silicon, NVIDIA, etc.)
- Python 3.10+ and approximately 10 GB of storage for models
The setup process requires some familiarity with Python and command-line tools. Work is ongoing to simplify deployment.