ODSC East 2026

ODSC AI East 2026

The Art of Clustering

The Good. The Bad. The Beautiful.

Seth Levine

Run clustering twice.

Get two different answers.

Clustering doesn't discover structure.
It creates it.
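A minimal sketch of "run it twice, get two answers", assuming plain k-means as the clusterer (the slide does not name one). Explicit starting centers stand in for two random seeds, and the four points are made up for illustration.

```python
import math

def kmeans(points, centers, iters=20):
    """Plain Lloyd's k-means from explicit starting centers; returns one label per point."""
    labels = []
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Move each center to the mean of its members (keep it if it has none).
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels

# Four corners of a unit square: two equally valid ways to split them in two.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
run_a = kmeans(points, [(0.0, 0.5), (1.0, 0.5)])  # converges to a left/right split
run_b = kmeans(points, [(0.5, 0.0), (0.5, 1.0)])  # converges to a bottom/top split
print(run_a, run_b)
```

Both runs converge, and neither is wrong. The starting point decided which structure appeared.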

The map

A pipeline of decisions.

raw data → encoding → dim reduction → clustering → labels → actions

Every step is a decision about what structure you're allowed to see.

Stage · raw data

What signal exists?

The Good

5,000 films from TMDB

Top by vote count. 120 years of cinema. Mixed languages, long-tail metadata.

→ enough breadth for real clusters to form.

The Bad

Junk in, junk out

Sparse metadata, missing overviews, broken IDs. The pipeline can't fix what isn't there.

→ clean before you cluster, not after.

The Beautiful

Two views, one film

Overviews and posters. Same dataset, two modalities — clusterable independently or fused.

→ multimodal raw data is a competitive moat.

Stage · encoding

What "similar" means.

The Good

Sentence transformers

all-MiniLM-L6-v2 · 384-D. Sentence meaning, not bag of words.

→ everything downstream inherits this lift.

The Bad

TF-IDF / bag of words

Surface tokens, not meaning. Hitman and assassin share zero words — TF-IDF can't see they mean the same thing.

→ what you encoded was never what you meant.
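A small sketch of the blind spot, using a hand-rolled TF-IDF with smoothed IDF (one common variant, not necessarily the one used in the talk). The two overviews are invented for illustration and share no tokens.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse {token: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1 for tok, df in doc_freq.items()}
    return [{tok: count * idf[tok] for tok, count in Counter(toks).items()}
            for toks in tokenized]

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical overviews: same meaning, zero shared words.
docs = ["retired hitman seeks revenge", "former assassin wants vengeance"]
vec_hitman, vec_assassin = tfidf_vectors(docs)
similarity = cosine(vec_hitman, vec_assassin)
print(similarity)
```

With no overlapping tokens the dot product has nothing to multiply, so the similarity is exactly zero however close the meanings are.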

The Beautiful

Multimodal in one space

CLIP encodes text and images into the same 512-D space. Average them per film → fused embedding.

→ a third lens that resolves text/image disagreement.
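One way the per-film averaging could look, sketched on toy 3-D vectors rather than real 512-D CLIP output. Normalizing before and after the average is an assumption; the slide only says "average them".

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def fuse(text_vec, image_vec):
    """Average a film's text and image embeddings into one fused vector."""
    if len(text_vec) != len(image_vec):
        raise ValueError("text and image embeddings must share a dimension")
    # Normalize first so neither modality dominates through sheer magnitude.
    t, i = l2_normalize(text_vec), l2_normalize(image_vec)
    return l2_normalize([(a + b) / 2 for a, b in zip(t, i)])

# Toy stand-ins for one film's overview and poster embeddings.
fused = fuse([2.0, 0.0, 0.0], [0.0, 3.0, 0.0])
print(fused)
```

The fused vector sits halfway between the two views, which is why it can act as a tiebreaker when the text and poster clusters disagree.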

Stage · dim reduction · The Good

Two knobs that matter.

n_neighbors · 2 → 60 → 2 · local ↔ global
min_dist · 0.0 → 0.8 → 0.0 · tight ↔ spread

Sweep — never set them by memory.
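A sketch of what a sweep might look like, assuming the umap-learn package. The endpoints (2 to 60, 0.0 to 0.8) come from the slide; the intermediate grid values, `n_components=2`, and `random_state=42` are illustrative choices.

```python
from itertools import product

# Endpoints match the slide; the values in between are assumptions.
N_NEIGHBORS = [2, 5, 15, 30, 60]
MIN_DIST = [0.0, 0.1, 0.4, 0.8]

def param_grid():
    """Every (n_neighbors, min_dist) combination to try."""
    return [{"n_neighbors": n, "min_dist": d} for n, d in product(N_NEIGHBORS, MIN_DIST)]

def sweep(X, n_components=2, random_state=42):
    """Fit one UMAP projection per grid cell; returns {(n_neighbors, min_dist): coords}."""
    import umap  # imported lazily so the grid can be inspected without umap-learn installed

    results = {}
    for params in param_grid():
        reducer = umap.UMAP(n_components=n_components, random_state=random_state, **params)
        results[(params["n_neighbors"], params["min_dist"])] = reducer.fit_transform(X)
    return results

print(len(param_grid()), "projections to compare")
```

Plot the projections side by side and pick with your eyes open, rather than reusing last project's defaults.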

Stage · dim reduction · The Bad
0.47
raw 384-D · nearest/farthest distance ratio
0.05
UMAP 5-D · same ratio

Skip dim reduction → density-based clustering has nothing to grip.
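A sketch of one way to compute a nearest/farthest ratio, assuming it means each point's nearest-neighbor distance divided by its farthest-neighbor distance, averaged over points (the talk's exact definition is not shown). It runs on synthetic Gaussian clouds, not film embeddings or UMAP output, so the values will not match the slide.

```python
import math
import random

def nearest_farthest_ratio(points):
    """Mean over points of (distance to nearest other point) / (distance to farthest)."""
    ratios = []
    for i, p in enumerate(points):
        dists = [math.dist(p, q) for j, q in enumerate(points) if j != i]
        ratios.append(min(dists) / max(dists))
    return sum(ratios) / len(ratios)

rng = random.Random(0)

def gaussian_cloud(n, dim):
    """Synthetic stand-in for an embedding matrix."""
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]

high = nearest_farthest_ratio(gaussian_cloud(100, 384))
low = nearest_farthest_ratio(gaussian_cloud(100, 5))
print(f"384-D: {high:.2f}   5-D: {low:.2f}")
```

A ratio near 1 means every point is roughly equally far from every other, so "dense region" stops meaning anything. A low ratio leaves contrast for a density-based method to use.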

Stage · clustering · The Good

No k. Density-based.

47
natural clusters
48.5%
noise — labelled −1 by design

Half the catalog doesn't fit a tight theme.
That's honest, not a failure.
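A small helper for reading a density-based result, assuming the common convention (HDBSCAN and similar) that noise points carry the label −1. The label array below is made up.

```python
def summarize(labels):
    """Return (number of clusters, fraction of points labelled as noise)."""
    n_noise = sum(1 for label in labels if label == -1)
    n_clusters = len({label for label in labels if label != -1})
    return n_clusters, n_noise / len(labels)

# Hypothetical output from a density-based clusterer.
labels = [0, 0, 1, -1, -1, 2, -1, 1]
n_clusters, noise_fraction = summarize(labels)
print(n_clusters, noise_fraction)
```

Report the noise fraction next to the cluster count. Hiding it makes the clusters look more complete than they are.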

Stage · clustering · The Bad

Inception's neighbors.

k = 5

  • Shawshank Redemption
  • Django Unchained
  • Deadpool

k = 80

  • Fight Club
  • Kingsman
  • Matrix Reloaded

Same data. Different knob. Different story.
If the story changes with a knob turn, the story isn't about the data.
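One way to put a number on "the story changed": compare a film's cluster-mates across two runs with Jaccard overlap. The titles come from the slide; the label arrays are invented so that they reproduce the two neighbor lists shown.

```python
def cluster_mates(labels, titles, target):
    """Titles that share a cluster with `target`, excluding the target itself."""
    target_label = labels[titles.index(target)]
    return {t for t, l in zip(titles, labels) if l == target_label and t != target}

def jaccard(a, b):
    """Overlap between two sets: 1.0 means identical, 0.0 means disjoint."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

titles = ["Inception", "Shawshank Redemption", "Django Unchained", "Deadpool",
          "Fight Club", "Kingsman", "Matrix Reloaded"]
labels_k5 = [0, 0, 0, 0, 1, 1, 1]    # hypothetical assignment at k = 5
labels_k80 = [7, 3, 4, 5, 7, 7, 7]   # hypothetical assignment at k = 80

stability = jaccard(cluster_mates(labels_k5, titles, "Inception"),
                    cluster_mates(labels_k80, titles, "Inception"))
print(stability)
```

Averaged over many films and many knob settings, this gives a rough stability score to check before trusting any single run.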

Stage · clustering · over time

Topics evolve.

BERTopic ships a temporal view: topic prevalence across decades.

Hover the line for "WWII & Nazi Germany." Where does it start?

Stage · clustering · The Beautiful

Claude labelled this cluster
"WWII & Nazi Germany."

A few of its 1,063 films
  • Dunkirk (2017)
  • Schindler's List (1993)
  • The Bridge on the River Kwai (1957)
  • Casablanca (1942)

Looks right.

Stage · clustering · The Beautiful · reveal

Now look at the rest of it.

Same cluster · earlier films
  • Grand Illusion (1937)
  • Triumph of the Will (1935)
  • Westfront 1918 (1930)
  • Wings (1927)
  • The Big Parade (1925)
  • I Accuse (1919)
  • Red Cross Ambulance on Battlefield (1900)
134
released before WWII
97
before Hitler took power
22
before WWI even began

It isn't a WWII cluster. It's a war-shaped cluster.

Clustering captures meaning — not metadata.
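The check behind those counts can be sketched as a filter on release year. The sample below holds only the eleven films named on the slides, not the full 1,063, and the cutoffs (1939, 1933, 1914) with a strict "earlier than" comparison are an assumption about how the counts were made.

```python
# Only the films named on the slides; the real cluster is far larger.
films = [
    ("Dunkirk", 2017), ("Schindler's List", 1993),
    ("The Bridge on the River Kwai", 1957), ("Casablanca", 1942),
    ("Grand Illusion", 1937), ("Triumph of the Will", 1935),
    ("Westfront 1918", 1930), ("Wings", 1927), ("The Big Parade", 1925),
    ("I Accuse", 1919), ("Red Cross Ambulance on Battlefield", 1900),
]

# Year each event began; a film counts if it was released in an earlier year.
cutoffs = {"before WWII": 1939, "before Hitler took power": 1933, "before WWI": 1914}

counts = {name: sum(1 for _, year in films if year < cutoff)
          for name, cutoff in cutoffs.items()}
print(counts)
```

A metadata check like this is cheap and is often the fastest way to find out that a label describes less than the cluster contains.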

The critical distinction

Are you clustering to understand your data,
or to generate labels for a system?

Clustering → discover structure.
Classification → scale it.

Stage · labels · the pivot

What BERTopic shipped for years:

0_planet_earth_space_alien
3_queen_prince_princess_king
4_heist_bank_drug_police

The same clusters, re-labelled by Claude:

Space sci-fi
Royalty & fairy tales
Neo-noir crime

Nothing in the pipeline got smarter.
The labelling layer did.
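A sketch of the relabelling step, assuming the default labels follow the `id_keyword_keyword…` pattern shown above. The prompt wording is hypothetical and the call to the model is left out; this is not the prompt used in the talk.

```python
def parse_default_label(label):
    """Split a label like '0_planet_earth_space_alien' into (topic_id, keywords)."""
    # Assumes keywords themselves contain no underscores.
    topic_id, *keywords = label.split("_")
    return int(topic_id), keywords

def build_prompt(keywords):
    """Hypothetical prompt asking an LLM for a short, human-readable cluster name."""
    return (
        "Give a short, human-readable name for a cluster of films.\n"
        f"Top keywords: {', '.join(keywords)}\n"
        "Answer with the name only."
    )

topic_id, keywords = parse_default_label("0_planet_earth_space_alien")
prompt = build_prompt(keywords)
print(prompt)
```

The clusters going in are unchanged. Only the name a human reads is different.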

Stage · labels · the lift at scale

5 zoom levels · search any title · click any point to open on TMDB.

Stage · actions · The Good

Does it carry signal?

0.38
genre purity · our clusters
0.19
random baseline

~2× the signal of chance — and these clusters were never told what genre is.
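A sketch of a purity score with a shuffled baseline, assuming purity means the share of films whose genre matches their cluster's majority genre. The talk's exact definition is not shown, and the six-film example is made up.

```python
import random
from collections import Counter

def purity(cluster_labels, genres):
    """Fraction of items whose genre is the majority genre of their cluster."""
    by_cluster = {}
    for cluster, genre in zip(cluster_labels, genres):
        by_cluster.setdefault(cluster, []).append(genre)
    majority_total = sum(Counter(gs).most_common(1)[0][1] for gs in by_cluster.values())
    return majority_total / len(genres)

def shuffled_baseline(cluster_labels, genres, trials=200, seed=0):
    """Average purity after randomly reassigning genres to films."""
    rng = random.Random(seed)
    shuffled = list(genres)
    scores = []
    for _ in range(trials):
        rng.shuffle(shuffled)
        scores.append(purity(cluster_labels, shuffled))
    return sum(scores) / trials

# Toy data: two perfectly genre-aligned clusters.
clusters = [0, 0, 0, 1, 1, 1]
genres = ["war", "war", "war", "comedy", "comedy", "comedy"]
observed = purity(clusters, genres)
baseline = shuffled_baseline(clusters, genres)
print(observed, baseline)
```

Shuffling keeps the cluster sizes and genre mix fixed, so any gap between the two numbers comes from how films were grouped.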

Stage · actions · The Beautiful

Hand off to scale.

Clustering finds the structure.
Classification — trained on the labels you wrote — applies it to the next 10 million rows.

Cluster to discover · classify to deploy.
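A minimal sketch of the hand-off, using a nearest-centroid classifier as a stand-in (the talk does not name a model). The 2-D vectors and label names are toy values, and noise points (−1) are left out of training.

```python
import math

def fit_centroids(vectors, labels):
    """Learn one centroid per cluster label, skipping noise points (-1)."""
    groups = {}
    for vec, label in zip(vectors, labels):
        if label == -1:
            continue
        groups.setdefault(label, []).append(vec)
    return {label: [sum(col) / len(vecs) for col in zip(*vecs)]
            for label, vecs in groups.items()}

def predict(centroids, vector):
    """Assign a new row to the label of the closest centroid."""
    return min(centroids, key=lambda label: math.dist(vector, centroids[label]))

# Toy embeddings with cluster labels produced by an earlier clustering pass.
train_vectors = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [4.9, 5.0], [9.0, 0.0]]
train_labels = ["war", "war", "heist", "heist", -1]

centroids = fit_centroids(train_vectors, train_labels)
print(predict(centroids, [0.1, 0.2]), predict(centroids, [5.2, 4.8]))
```

Clustering runs once on a sample. After that, new rows only need a lookup against the learned centroids, which is what makes the next 10 million rows affordable.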

Same pipeline · CLIP instead of MiniLM

The poster constellation.

Each point is the actual poster at its UMAP coordinate.

Same pipeline · in 3D

The poster cosmos.

1,000 films · auto-rotates until you grab the camera.

You're not discovering structure.

You're designing a perspective.

Take these home

  1. Clusters ≠ truth
  2. Instability is signal
  3. Exploration ≠ labeling
  4. The goal is decisions

Thank you · github.com/splevine/clustering-good-bad-beautiful
