The Lab | Alt Shift Lab

What it is

Monocular depth estimation is the trick of recovering relative depth from a single image — without stereo cameras, without LiDAR, without motion. The model has learned, from millions of images of the world, what kinds of pixels are usually closer than what other kinds of pixels. A railing in front of a wall. A face in front of a hallway. A hand reaching toward the lens. The output is a per-pixel grayscale map at the same resolution as the input, where the value at each pixel encodes 'how close to the camera this is, relative to the rest of the image'.

The convention to know: this is inverse depth, not metric depth. A pixel value of 200 does not mean '200 metres'; it means 'relatively close compared to a pixel value of 50'. The map is consistent within an image and useful for ordering ('what is in front of what?') and for continuous gradients ('how much closer is A than B?'). It is not useful for measurements. If you need 'this object is 2.4 metres from the camera', you need a calibrated stereo rig or a depth sensor; if you need 'this object is in the foreground and that one is in the background', monocular depth is exactly the right tool.
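The convention is easiest to see on a toy array. The following is a minimal sketch, assuming NumPy and a map where larger values mean closer; `normalize_inverse_depth` and `is_in_front` are hypothetical helper names for illustration, not part of Gateway or of any depth model's API.

```python
import numpy as np

def normalize_inverse_depth(raw: np.ndarray) -> np.ndarray:
    """Rescale a raw inverse-depth map to [0, 1], where 1.0 is the nearest pixel."""
    values = raw.astype(np.float64)
    lo, hi = values.min(), values.max()
    if hi == lo:
        # A perfectly flat map carries no ordering information.
        return np.zeros_like(values)
    return (values - lo) / (hi - lo)

def is_in_front(depth01: np.ndarray, a: tuple, b: tuple) -> bool:
    """True if pixel a=(row, col) is closer to the camera than pixel b."""
    return bool(depth01[a] > depth01[b])

# Toy 2x2 map using the 8-bit values from the text: 200 is near, 50 is far.
raw = np.array([[200, 50],
                [120, 50]], dtype=np.uint8)
depth01 = normalize_inverse_depth(raw)
near_first = is_in_front(depth01, (0, 0), (0, 1))
```

The normalised values support comparisons within this one image only; nothing here converts them to metres.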

Progress in this corner of computer vision has been quietly extraordinary. MiDaS in 2019 was the first widely used model to generalise across arbitrary scenes. DPT in 2021 made the output sharper. ZoeDepth in 2023 added partial metric calibration. Depth-Anything-v2 in 2024, the model Gateway uses, reports higher accuracy with fewer parameters than the models before it, and runs in a few seconds on a small managed endpoint. What used to require expensive hardware and a calibration session now costs less than a cent and arrives in about five seconds.

Why it matters here

The depth map is the most-reached-for primitive in Pilot 1. Every spatial modulation pattern — atmospheric perspective, subject isolation, dream halo, depth-band stylisation, photo-to-drawing transition — uses it. The reason is structural: depth is what gives a 2D image a continuous third dimension to modulate against. Without it, your only spatial knob is 2D distance from a point in the image plane, which collapses foreground and background into the same circle. With it, you can decay an effect with depth, peak it at a specific depth band, or invert it entirely — keep the back of the scene sharp and let the foreground dissolve.
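The three shapes named above (decay with depth, peak at a band, invert) can be sketched as weight fields over a normalised depth map. This is an illustration only, assuming NumPy and a `depth01` array in [0, 1] with 1.0 nearest; the function names and default parameters are hypothetical and are not Pilot 1's actual modulation code.

```python
import numpy as np

def falloff_weight(depth01, rate=3.0):
    """Strongest at the camera (depth01 == 1), decaying exponentially with distance."""
    return np.exp(-rate * (1.0 - depth01))

def band_weight(depth01, center=0.5, width=0.1):
    """Gaussian bump that peaks where depth01 == center."""
    return np.exp(-0.5 * ((depth01 - center) / width) ** 2)

def invert_weight(weight):
    """Flip a weight field so the effect lands where it was previously absent."""
    return 1.0 - weight

def blend(image, effected, weight):
    """Per-pixel mix of the original and the fully effected image."""
    w = weight[..., np.newaxis]  # broadcast over the colour channels
    return (1.0 - w) * image + w * effected

# Toy data: a 1x3 strip going from far (0.0) to near (1.0).
depth01 = np.array([[0.0, 0.5, 1.0]])
image = np.ones((1, 3, 3))      # white RGB strip
effected = np.zeros((1, 3, 3))  # stand-in for the processed version (black)
out = blend(image, effected, falloff_weight(depth01))
```

Every weight is a continuous function of depth, so swapping `falloff_weight` for `band_weight` or wrapping either in `invert_weight` changes where the effect lands without touching the blend step.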

Beyond modulation, the depth map is reusable across the rest of the polyglot bundle in ways that compound. A ControlNet conditioning pass takes it as input and generates a new image with the same spatial layout but a different style. A face anchor uses it to know which face is furthest back. A normal map can be derived from it. A point cloud can be lifted from it. The same five-second, sub-cent call powers half a dozen downstream Nodes, which is exactly the property that makes a primitive worth treating as a first-class slot in the bundle rather than something each Node fetches for itself.
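As one example of that reuse, a normal map can be approximated from depth alone by treating the map as a height field and taking finite differences. The sketch below assumes NumPy; `normals_from_depth` and its `strength` parameter are hypothetical names, and because the input is relative rather than metric depth, the result is only qualitatively right (good for relighting-style effects, not for geometry).

```python
import numpy as np

def normals_from_depth(depth01, strength=1.0):
    """Approximate unit surface normals by treating the depth map as a height field."""
    height = depth01.astype(np.float64)
    # np.gradient returns derivatives along rows (y) first, then columns (x).
    dz_dy, dz_dx = np.gradient(height)
    normals = np.stack(
        [-strength * dz_dx, -strength * dz_dy, np.ones_like(height)], axis=-1
    )
    normals /= np.linalg.norm(normals, axis=-1, keepdims=True)
    return normals

# A flat region should face the camera; a left-to-right ramp should tilt in x.
flat = np.full((4, 4), 0.5)
ramp = np.tile(np.linspace(0.0, 1.0, 4), (4, 1))
flat_normals = normals_from_depth(flat)
ramp_normals = normals_from_depth(ramp)
```

Raising `strength` exaggerates the tilt, which is a common way to compensate for the compressed range of a normalised depth map.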

Where depth lives

Different communities have arrived at depth from different sensors and different needs — robotics, AR, photography, film. Each has its own dialect.

Depth-Anything-v2

The 2024 monocular SOTA. Three model sizes (Small / Base / Large) trading latency for detail. What Gateway uses.

MiDaS / DPT

The classics. MiDaS opened the field in 2019; DPT made it sharper. Still in many tutorials and many production pipelines.

ZoeDepth

Adds partial metric calibration on top of relative depth. Useful when you want both ordering and approximate metres.

Marigold

Diffusion-based depth estimation. Slower than Depth-Anything-v2 but produces unusually clean edges. Worth the extra latency for hero shots.

Apple LiDAR (iPhone Pro)

True hardware depth on every iPhone Pro since the iPhone 12 Pro in 2020. A photographer's secret weapon for portrait mode and 3D scanning apps.

Stereo / RealSense / Kinect

Calibrated multi-camera setups. Metric depth, useful for robotics and volumetric capture; not relevant for working from a single found image.

What this teaches Producer

A continuous field is more powerful than a discrete mask. Depth gives you 'how much' instead of 'in or out'. Once you have a continuous field per pixel, every effect becomes a function of position rather than a binary application: you stop choosing which regions get the effect and start choosing how the effect varies smoothly across the frame. Producer's modulation vocabulary should default to continuous fields and treat masks as a special case (a step function), not the other way around.

Inverse-depth convention is a subtle gotcha. 'Lighter = closer' is the opposite of what most engineers initially write down ('depth is distance from camera, so far away = bigger number'). A large share of the bugs in early depth pipelines come from this sign error. The right move is to centralise the convention in one place, the depth slot's documented contract, and have every downstream Node trust it. The alternative (each Node deciding) is how a codebase ends up with three Nodes that disagree about whether 0.0 is the camera or the horizon.

Edges align approximately, not exactly. The depth map's transitions are smoothed over a few pixels, so the boundary of a face in the depth map is not exactly the boundary of the face in the photograph. If a downstream Node needs crisp per-object depth, it should combine the depth map with a segmentation mask: take the mean depth value across the masked region rather than reading the depth at a single pixel. The two slots together are sharper than either alone.

Decompose once, modulate many. The two-phase Graph pattern in Pilot 1 (run depth, segmentation, and face detection once, save the bundle to disk, then run dozens of modulation experiments without paying for the decomposition again) only works because depth is reusable. Once a primitive's output is a typed slot in the bundle, it becomes free to reach for from every downstream variant. The unit cost of creative iteration drops to near zero. This is the loop a polyglot patch system makes possible and a one-shot pipeline forecloses.
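The mask-plus-depth combination described above can be sketched in a few lines. This assumes NumPy, a `depth01` map in [0, 1] with 1.0 nearest, and a segmentation mask of the same shape; `region_depth` is a hypothetical helper, not a Gateway Node.

```python
import numpy as np

def region_depth(depth01, mask):
    """Mean depth over the pixels selected by a boolean segmentation mask."""
    selected = np.asarray(mask, dtype=bool)
    if selected.shape != depth01.shape:
        raise ValueError("mask and depth map must have the same shape")
    if not selected.any():
        raise ValueError("mask selects no pixels")
    return float(depth01[selected].mean())

# Toy 1x6 row: a near object (about 0.9) whose edge is smeared into a far wall (0.1).
depth01 = np.array([[0.9, 0.9, 0.9, 0.6, 0.3, 0.1]])
mask = np.array([[1, 1, 1, 0, 0, 0]])  # segmentation says the object is the first 3 pixels
object_depth = region_depth(depth01, mask)
edge_sample = float(depth01[0, 3])  # a single read at the blurred boundary
```

Averaging inside the mask returns the object's depth, while the single-pixel read at the boundary returns a blend of object and background that belongs to neither.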

Try it yourself

huggingface.co/spaces/depth-anything/Depth-Anything-V2 — Official browser demo from the Depth-Anything-v2 authors. Drop in any image, see the map back in seconds.
replicate.com/chenxwh/depth-anything-v2 — The endpoint Gateway routes through. Has a playground; ~3–6 seconds and ~$0.005 per image.
depth-anything-v2.github.io — Project page. Worth reading the comparison gallery against MiDaS and the model architecture page if you want to know why this generation jumped so far.

In Gateway, the depth pass is one Node call:

Graph().add(depth_anything, model_size="Large")
# bundle.depth now holds a Reference to the grayscale PNG —
# reusable across every modulation, conditioning, and anchoring step downstream

Composes with

Cnt Lid Nm

Other cells in the table — most still to be walked into in coming sessions.
