What it is
Face detection is the oldest computer-vision task that still earns its keep in production. Viola-Jones in 2001 made it tractable on a CPU; MTCNN in 2016 made it accurate; BlazeFace in 2019 made it run at 60fps on a phone. In 2026 a face detector is small, fast, and free: the kind of primitive you reach for without thinking, the way you'd reach for `Math.min`.
What it returns is deliberately thin: a list of detections, each with a bounding box in pixel coordinates, a confidence score between 0 and 1, and, in BlazeFace-style detectors such as MediaPipe's, six keypoints (right eye, left eye, nose tip, mouth centre, right ear, left ear). Not enough to recognise a face. Just enough to point at one. The thinness is the feature: the same six landmarks are all you need to align a portrait, anchor a spatial effect, place a lip-sync target, or reason about gaze direction.
It is not face recognition. It does not know who the person is, only that there is one. That distinction is doing a lot of ethical and architectural work — detection is a tiny, well-bounded primitive; recognition is a different system with very different stakes. The two get confused in popular discourse and conflated in surveillance products, but inside a creative pipeline they sit on opposite shelves.
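To make the shape of that return value concrete, here is a minimal sketch of a single detection and two things the keypoints allow. The `Detection` class, its field names, and the keypoint labels are hypothetical, chosen for illustration; real detectors expose the same information under their own names.

```python
import math
from dataclasses import dataclass

# Hypothetical shapes for illustration; not any library's actual types.
Point = tuple[float, float]  # (x, y) in pixels, y grows downward


@dataclass(frozen=True)
class Detection:
    box: tuple[float, float, float, float]  # (x, y, width, height) in pixels
    score: float                            # confidence in [0, 1]
    keypoints: dict[str, Point]             # e.g. "right_eye", "left_eye", "nose_tip"


def eye_midpoint(det: Detection) -> Point:
    """Midpoint between the two eyes: a simple anchor for a spatial effect."""
    (rx, ry), (lx, ly) = det.keypoints["right_eye"], det.keypoints["left_eye"]
    return ((rx + lx) / 2, (ry + ly) / 2)


def roll_degrees(det: Detection) -> float:
    """Angle of the line from right eye to left eye, in image coordinates.

    0 means the eyes are level; a portrait can be aligned by rotating
    the image until this angle is 0.
    """
    (rx, ry), (lx, ly) = det.keypoints["right_eye"], det.keypoints["left_eye"]
    return math.degrees(math.atan2(ly - ry, lx - rx))


# Made-up numbers, purely illustrative.
face = Detection(
    box=(100.0, 80.0, 200.0, 200.0),
    score=0.93,
    keypoints={"right_eye": (150.0, 150.0), "left_eye": (250.0, 150.0)},
)
print(eye_midpoint(face))   # (200.0, 150.0)
print(roll_degrees(face))   # 0.0
```

Nothing here identifies anyone: the structure carries geometry and a score, which is the whole point of the primitive.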
Why it matters here
In the polyglot patch system that Gateway/Producer is reaching for, face detection is the first primitive that closed a real gap. Pilot 1 — atmospheric modulation around a subject — kept landing on the wrong anchor when we used segmentation alone: SAM 2 would happily return ten masks for a group photo, and the largest one was an arm or a piece of clothing, never the face. Subjects washed out. Background washed out wrongly. The image looked broken.
Adding a `find_faces` node to the graph fixed it in one hop. Detect faces; pick the largest, or the most central, or the one furthest back; that point becomes the spatial anchor for everything downstream. The same composition rendered with the wrong anchor and with the right anchor yields entirely different images. This is what makes detection feel less like a feature and more like a basic primitive: once you have it, you stop choosing wrong.
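The detect-then-pick step can be sketched in a few lines of plain Python. This is an illustration under assumptions, not Gateway's implementation: the `(x, y, width, height)` box layout and the helper names `center`, `largest`, and `most_central` are hypothetical.

```python
import math

# Hypothetical face boxes as (x, y, width, height) in pixels.
Box = tuple[float, float, float, float]


def center(box: Box) -> tuple[float, float]:
    """Centre of a box: the 2D point used as the spatial anchor."""
    x, y, w, h = box
    return (x + w / 2, y + h / 2)


def largest(boxes: list[Box]) -> Box:
    """The box covering the most pixels."""
    return max(boxes, key=lambda b: b[2] * b[3])


def most_central(boxes: list[Box], image_size: tuple[int, int]) -> Box:
    """The box whose centre is closest to the centre of the image."""
    image_center = (image_size[0] / 2, image_size[1] / 2)
    return min(boxes, key=lambda b: math.dist(center(b), image_center))


# Three made-up faces in a 1000x600 image.
faces = [(40, 60, 120, 120), (420, 200, 90, 90), (800, 100, 60, 60)]

anchor_largest = center(largest(faces))
anchor_central = center(most_central(faces, (1000, 600)))
print(anchor_largest)  # (100.0, 120.0)
print(anchor_central)  # (465.0, 245.0)
```

The two rules disagree on this input, which is exactly the situation where the choice of anchor changes the final image.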
Where face detection lives
Different communities have arrived at face detection from different doors — mobile AR, biometric KYC, photo apps, video calls — and built tools optimised for very different trade-offs.
MediaPipe Face Detection (BlazeFace)
Google's mobile-first detector. ~1MB TFLite model, ~100ms on CPU, 60fps on phone GPU. Six keypoints. What Gateway uses.
InsightFace / RetinaFace
The accuracy ceiling for crowded scenes. Five keypoints, optional 106-point landmark model, used in identity-matching pipelines.
OpenCV YuNet
Tiny model with a built-in OpenCV API (`cv2.FaceDetectorYN`) since 4.5.4; the ONNX weights are a separate download from the OpenCV Zoo. The default for anyone who already has OpenCV in their stack.
Apple Vision Framework
On-device on iOS / macOS. Bounding box + landmarks + face quality score. Free if you live in the Apple ecosystem.
MTCNN
The 2016 CNN-cascade architecture that everyone learned face detection on. Still in many tutorials. Outclassed in 2026 but historically important.
WIDER FACE benchmark
The dataset every modern detector reports against. Easy / Medium / Hard splits — 'Hard' contains tiny, occluded, or off-axis faces. Read the Hard score, not the Easy one.
What this teaches Producer
- Detection and selection are different operations and should stay separate. The detector returns a candidate list. A separate `selectors` vocabulary picks one — `largest_area`, `most_central`, `lowest_z` (furthest back), `highest_x` (rightmost). The same selectors work on faces, on segmentation regions, on any future detection type. Once we made that split, the orchestrator could reason about anchor choice in plain English without ever knowing it was looking at a face detector versus a SAM mask versus a saliency peak.
- The right primitive is often smaller than you think. We considered using a vision LLM to identify subjects ("the bride", "the speaker"), then realised that for *anchoring* we don't need identity at all — we need a 2D point with depth. A face detector returns six points per face; that is more than enough. Reaching for a smaller, faster, more bounded primitive when one exists is almost always the right move. Save the vision LLM for the moments when the language matters.
- Defaults lie about the territory. MediaPipe's default detection threshold is 0.5 — perfectly reasonable for a single-subject portrait, badly wrong for a busy group photo. A confidence sweep on the Bradley Cooper Oscars selfie produced this curve: 0.5 → 4 faces (front row only); 0.3 → +Brad Pitt; 0.2 → 6 faces with NO false positives (sweet spot); 0.1 → +Lupita Nyong'o + Channing Tatum but noise begins to creep in; 0.05 → ~25% false positives; 0.01 → unusable. Defaults are tuned for the median request; creative work usually wants the tails. Surface every threshold as a knob, document the discoveries, and at low confidence compose a selector with a noise filter (`filter_min_area`, `filter_confidence`) so the candidate list stays useful.
- Local primitives change the rhythm. Face detection runs in ~100ms on a CPU and costs nothing per call. That is a different category of primitive from a Replicate call that takes 5–10 seconds and costs a cent. When something is free and instant, you stop being precious about it — you call it on every frame, on every variation, on every iteration. The polyglot bundle should mark which primitives are local-cheap so the orchestrator can call them speculatively, the same way the JVM treats a `Math` call differently from a network round-trip.
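The split between detection, filtering, and selection described in these lessons can be sketched as follows. This is a minimal illustration, not Producer's code: the `Candidate` class, the `SELECTORS` registry, `select`, and `sweep` are hypothetical, the depth convention (lower `z` is further back) is assumed from the `lowest_z` selector name, and the sample data is invented. `most_central` is left out because it also needs the image size.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """Hypothetical common shape for a face, a mask, or a saliency peak."""
    box: tuple[float, float, float, float]  # (x, y, width, height) in pixels
    score: float                            # detector confidence in [0, 1]
    z: float = 0.0                          # depth sample; assumed: lower is further back

    @property
    def area(self) -> float:
        return self.box[2] * self.box[3]


Filter = Callable[[list[Candidate]], list[Candidate]]
Selector = Callable[[list[Candidate]], Candidate]


def filter_min_area(min_area: float) -> Filter:
    """Drop candidates too small to be a plausible subject."""
    return lambda cs: [c for c in cs if c.area >= min_area]


def filter_confidence(min_score: float) -> Filter:
    """Drop candidates below a confidence floor."""
    return lambda cs: [c for c in cs if c.score >= min_score]


# Selectors only see Candidate fields, so they never need to know
# which kind of detector produced the list.
SELECTORS: dict[str, Selector] = {
    "largest_area": lambda cs: max(cs, key=lambda c: c.area),
    "lowest_z": lambda cs: min(cs, key=lambda c: c.z),
    "highest_x": lambda cs: max(cs, key=lambda c: c.box[0] + c.box[2] / 2),
}


def select(cands: list[Candidate], selector: str,
           filters: Sequence[Filter] = ()) -> Optional[Candidate]:
    """Apply the noise filters in order, then pick one survivor (or None)."""
    for f in filters:
        cands = f(cands)
    return SELECTORS[selector](cands) if cands else None


def sweep(cands: list[Candidate], thresholds: Sequence[float]) -> dict[float, int]:
    """Count how many candidates survive each confidence threshold."""
    return {t: sum(c.score >= t for c in cands) for t in thresholds}


# Invented data: two clear faces, one faint back-row face, one speck of noise.
cands = [
    Candidate((100, 200, 180, 180), 0.91, z=0.8),
    Candidate((400, 220, 160, 160), 0.74, z=0.7),
    Candidate((700, 150, 60, 60), 0.22, z=0.2),
    Candidate((20, 20, 8, 8), 0.12, z=0.5),
]

print(sweep(cands, [0.5, 0.2, 0.1]))  # {0.5: 2, 0.2: 3, 0.1: 4}
back = select(cands, "lowest_z", filters=[filter_min_area(400)])
print(back.box)  # (700, 150, 60, 60)
```

Because running the detector at a low threshold is cheap locally, a sweep like this can be rerun per image, with the filters cleaning up whatever noise the lower threshold lets through.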
Try it yourself
- ai.google.dev/mediapipe/face_detector — Google's official documentation. Includes a working browser demo you can point at your webcam in five seconds.
- github.com/deepinsight/insightface — InsightFace + RetinaFace. The accuracy ceiling. Heavier than MediaPipe; worth it when faces are small or partially occluded.
- WIDER FACE benchmark — The dataset every face detector reports against. Look at the Hard split's example images to see what 'difficult' actually means.
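In Gateway, the whole face-aware anchor sequence is three chained calls:
    (
        Graph()
        # Low threshold so small or back-row faces make the candidate list.
        .add(find_faces, name="faces", min_detection_confidence=0.2)
        # Assumes a node named "depth" has been added elsewhere in the graph.
        .add(pick_anchor, after=["faces", "depth"],
             mode="select", source="faces", selector="largest_area")
    )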