The Lab | Alt Shift Lab

What it is

A segmentation model takes one image and returns N masks — each a single-channel image the same size as the input, where white pixels are inside the region and black pixels are outside. A modern automatic model like SAM 2 will return between five and fifteen masks for a typical photograph, one per 'thing' it considers a coherent object: a face, a hat, a hand, a chair, a wall, a window. The masks are not exclusive — they may overlap, nest, or share boundaries.
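To make "overlap, nest, or share boundaries" concrete, here is a minimal sketch that treats each mask as a boolean NumPy array (True = inside the region). The three toy masks and the helper names `overlap` and `is_nested` are illustrative assumptions, not part of SAM 2 or Gateway.

```python
import numpy as np

# Hypothetical stand-ins for model output: one boolean array per mask.
h, w = 6, 8
person = np.zeros((h, w), dtype=bool)
person[1:5, 2:6] = True
hat = np.zeros((h, w), dtype=bool)
hat[1:2, 3:5] = True        # sits entirely inside `person`
chair = np.zeros((h, w), dtype=bool)
chair[3:6, 5:8] = True      # shares a few pixels with `person`

def overlap(a, b):
    """Number of pixels claimed by both masks."""
    return int(np.logical_and(a, b).sum())

def is_nested(inner, outer):
    """True when every pixel of `inner` is also a pixel of `outer`."""
    return bool(np.all(outer[inner]))

print(overlap(person, chair))
print(is_nested(hat, person))
```

Because masks are not exclusive, any code that sums or stacks them should decide up front what happens to pixels claimed more than once.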

What is missing from the output is also instructive. SAM 2 does not return labels. The model knows there is a region; it does not know what the region is. If you need 'the person', you have to identify it yourself — by spatial heuristics, by running a separate vision-language model over each mask's bounding box, or by switching to a text-promptable variant like GroundedSAM that takes 'a person' as input. The unlabelled-mask convention is a feature, not a limitation: it keeps the model fast, keeps the output language-agnostic, and lets the rest of the pipeline decide what counts as identity.
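As a rough sketch of the "run a namer over each mask's bounding box" route, the helpers below derive a box from a boolean mask and crop the image to it. The function names `mask_bbox` and `crop_to_mask` are hypothetical, and the naming model itself is left out because the source does not specify one.

```python
import numpy as np

def mask_bbox(mask):
    """Return (x0, y0, x1, y1) with exclusive far edges, or None for an empty mask."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1

def crop_to_mask(image, mask):
    """Crop an H x W x C image to the mask's bounding box, ready to hand to a namer."""
    box = mask_bbox(mask)
    if box is None:
        return None
    x0, y0, x1, y1 = box
    return image[y0:y1, x0:x1]

# Toy data: a 6 x 8 RGB image and a mask covering rows 2-3, columns 1-4.
image = np.zeros((6, 8, 3), dtype=np.uint8)
mask = np.zeros((6, 8), dtype=bool)
mask[2:4, 1:5] = True
crop = crop_to_mask(image, mask)
```

The crop can then be passed to whichever labelling step the pipeline prefers, which keeps detection and naming as separate, swappable stages.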

The history runs from edge-based segmentation in the 1980s, through watershed and graph-cut methods in the 1990s and 2000s, through the deep-learning U-Net in 2015, to Segment Anything (SAM) in 2023 and SAM 2 in 2024. SAM 2 was the moment everything changed: a single foundation model that segments anything, from any image or video, with or without prompts, in a few seconds on a managed endpoint. The cost of a mask collapsed by an order of magnitude. The mental shift it enables is bigger than the latency improvement.

Why it matters here

Segmentation is the primitive that turns 'an image' into 'a graph of regions'. Every operation downstream of it can address a specific area instead of the whole frame. That is the difference between painting anywhere on the canvas and painting inside a selection, and it is the same kind of leap when it shows up in a polyglot pipeline. The polyglot bundle now carries a `mask` slot for exactly this reason: once an image has been segmented, every downstream Node can decide whether to treat each region differently, which is the basis of compositing, per-object editing, layered generation, and addressable layout.

But segmentation is also where Pilot 1 found its first hard lesson. Iterating SAM regions and applying one effect strength per region produced two failure modes at once. The first was an 88% pixel coverage gap: SAM in automatic mode does not tile the image; it returns the most 'obvious' objects and leaves the rest untouched. The second was visible banding at mask boundaries, because the strength jumped discretely between regions. The fix was to abandon per-region iteration and treat strength as a continuous field defined per pixel by depth and 2D distance to an anchor. Segmentation still mattered; it just stopped being the carrier of the modulation.
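A minimal sketch of what "strength as a continuous field" can look like. The text only says the field depends on depth and 2D distance to an anchor; the linear falloff, the `radius` and `depth_tolerance` parameters, and the function name `strength_field` are assumptions made for illustration, not the formula Pilot 1 used.

```python
import numpy as np

def strength_field(depth, anchor_xy, radius, depth_tolerance):
    """Per-pixel strength in [0, 1]: 1 at the anchor, fading with 2D distance
    and with depth difference from the anchor pixel (assumed linear falloff)."""
    h, w = depth.shape
    ax, ay = anchor_xy
    ys, xs = np.mgrid[0:h, 0:w]

    # 2D term: 1 at the anchor, 0 at `radius` pixels away or further.
    dist = np.hypot(xs - ax, ys - ay)
    spatial = np.clip(1.0 - dist / radius, 0.0, 1.0)

    # Depth term: 1 at the anchor's depth, 0 once the difference exceeds the tolerance.
    depth_delta = np.abs(depth - depth[ay, ax])
    depth_term = np.clip(1.0 - depth_delta / depth_tolerance, 0.0, 1.0)

    return spatial * depth_term

# Toy usage: a flat 5 x 5 depth map with the anchor at the centre.
flat = np.zeros((5, 5), dtype=float)
field = strength_field(flat, anchor_xy=(2, 2), radius=4.0, depth_tolerance=1.0)
```

Every pixel gets a defined value, so there is no coverage gap, and the value changes smoothly, so there is no step at a mask boundary to show up as banding.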

Where segmentation lives

Segmentation has more dialects than most extracted signals — semantic, instance, panoptic, interactive, text-promptable. Each community optimises for a different thing.

SAM 2 (Meta)

The 2024 foundation model. Automatic or prompted, image or video, on a managed endpoint. What Gateway uses for general segmentation.

GroundedSAM

SAM + a text-promptable detector. Ask for 'a person' or 'all the chairs' and get masks back. The bridge from language to region.

Mask2Former / Mask DINO

Universal segmentation networks that handle semantic, instance, and panoptic in one architecture. The research-lab default.

DeepLab v3+ / U-Net

The pre-foundation-model classics. Still used wherever medical imaging or industrial inspection lives.

MediaPipe Selfie Segmentation

On-device single-class segmentation for video calls. The 'blur the background of your Zoom' model.

Photoshop Select Subject / Magic Wand

The interactive UX lineage. SAM 2's prompted mode is the spiritual successor — click once, get the right mask.

What this teaches Producer

Automatic mode does not tile the image. SAM 2 in automatic mode is asking 'what are the obvious objects here?' not 'cover every pixel'. If the pipeline assumes total coverage, it will silently leave the majority of the frame untouched and the artefact will look broken in a way that takes a while to diagnose. The lesson: always check coverage explicitly, and design effects so that 'no mask covers this pixel' has a defined behaviour rather than a hidden one.
Discrete masks make poor carriers of continuous modulation. If you want an effect to fade with distance, the mask is the wrong abstraction — it only tells you 'in or out'. The right abstraction is a continuous field (a depth map, a distance transform, a strength map). Use segmentation to say 'this area exists', and a separate field to say 'how strongly does the effect apply here'. The two together are far more expressive than either alone.
Mask order is unstable; treat the list as a set. SAM 2 may return masks in a different order on a second run of the same image. Anything that keys behaviour on `masks[0]` will silently break. The right convention is to operate on the list as an unordered set and select what you want by spatial criteria — the largest, the most central, the one closest to a known point. That way the pipeline is robust to model-version changes and to the model's own non-determinism.
Detection without identity is the right default. SAM 2 returns regions, not labels. That separation is healthy — let one Node detect, let another (a vision LLM, or a heuristic, or a human in the loop) name. When a creative pipeline wants 'the person', it should compose a detection step with a naming step, not reach for a single oracle. The composability is what keeps the system extensible: a different namer (CLIP, Florence-2, Claude vision) can swap in without touching the segmentation.
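The first and third lessons can be sketched in a few lines, assuming masks arrive as a list of boolean NumPy arrays. The helper names `coverage`, `largest` and `nearest_to` are hypothetical; they only illustrate checking coverage explicitly and selecting by spatial criteria instead of by list position.

```python
import numpy as np

def coverage(masks, shape):
    """Fraction of pixels covered by at least one mask."""
    union = np.zeros(shape, dtype=bool)
    for m in masks:
        union |= m.astype(bool)
    return float(union.mean())

def largest(masks):
    """Mask with the most pixels. Exact ties still fall back to list order."""
    return max(masks, key=lambda m: int(m.sum()))

def nearest_to(masks, point_xy):
    """Non-empty mask whose centroid is closest to the given (x, y) point."""
    px, py = point_xy

    def centroid_distance(m):
        ys, xs = np.nonzero(m)
        return float(np.hypot(xs.mean() - px, ys.mean() - py))

    return min((m for m in masks if m.any()), key=centroid_distance)

# Toy data: two masks on a 4 x 4 frame.
a = np.zeros((4, 4), dtype=bool)
a[0:2, 0:2] = True
b = np.zeros((4, 4), dtype=bool)
b[3:4, 3:4] = True
masks = [a, b]

print(coverage(masks, (4, 4)))
```

A pipeline can log or assert on the coverage value before applying any effect, and the two selectors return the same mask whichever order the model happened to emit the list in.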

Try it yourself

segment-anything.com/demo — Meta's official browser demo. Upload an image, click anywhere, get a mask. The fastest way to feel what this primitive does.
replicate.com/meta/sam-2 — The endpoint Gateway routes through. Has a playground; ~5 seconds and ~$0.01 per image.
GroundedSAM — If you need text prompts ('a hat', 'all the windows'), this is the canonical pairing of SAM with a Grounding DINO detector.

In Gateway, segmenting an image is one Node call:

Graph().add(sam2_segment, points_per_side=16)
# bundle.mask now carries N MaskRegions, each with bbox + Reference to the PNG

Composes with

Lyt Cnt Fd Dp

Other cells in the table — most still to be walked into in coming sessions.

Segmentation Mask