ODSC AI East 2026
The Art of Clustering
The Good. The Bad. The Beautiful.
Seth Levine
Run clustering twice.
Get two different answers.
Clustering doesn't discover structure.
It creates it.
Every step is a decision about what structure you're allowed to see.
Top by vote count. 120 years of cinema. Mixed languages, long-tail metadata.
Sparse metadata, missing overviews, broken IDs. The pipeline can't fix what isn't there.
Overviews and posters. Same dataset, two modalities — clusterable independently or fused.
all-MiniLM-L6-v2 · 384-D. Sentence meaning, not bag of words.
Surface tokens, not meaning. Hitman and assassin share zero words — TF-IDF can't see they mean the same thing.
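A minimal sketch of why token overlap misses synonyms. It uses raw word counts rather than full TF-IDF, which is a simplification: IDF weighting only rescales terms that already overlap, so two texts with no shared tokens still score zero. The two example overviews are invented for illustration.

```python
import math
from collections import Counter

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between raw token-count vectors of two strings."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    # Only tokens present in both texts contribute to the dot product.
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Invented overviews: same idea, no shared words.
hitman = "retired hitman seeks revenge"
assassin = "former assassin hunts killers"

print(bow_cosine(hitman, assassin))                  # no shared tokens
print(bow_cosine(hitman, "retired hitman returns"))  # two shared tokens
```

A sentence-embedding model would place the first pair close together. The token-count view cannot, because it has nothing to compare.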
CLIP encodes text and images into the same 512-D space. Average them per film → fused embedding.
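One way the per-film averaging could look, as a sketch. The slide only says "average them"; normalizing each vector before and after averaging is an assumption here, added so that neither modality dominates by magnitude. The 4-D vectors are toy stand-ins for real 512-D CLIP outputs, and `fuse` is a hypothetical helper name.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def fuse(text_vec, image_vec):
    """Average two unit-normalized vectors, then renormalize the result."""
    if len(text_vec) != len(image_vec):
        raise ValueError("text and image vectors must share one dimension")
    t, i = l2_normalize(text_vec), l2_normalize(image_vec)
    return l2_normalize([(a + b) / 2 for a, b in zip(t, i)])

# Toy 4-D stand-ins for one film's text and poster embeddings.
text_vec = [1.0, 0.0, 0.0, 0.0]
image_vec = [0.0, 2.0, 0.0, 0.0]

fused = fuse(text_vec, image_vec)
print(fused)
```

In practice the same arithmetic would run on NumPy arrays over the whole catalog at once. Plain lists are used here only to keep the sketch dependency-free.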
n_neighbors · 2 → 60 → 2 · local ↔ global
min_dist · 0.0 → 0.8 → 0.0 · tight ↔ spread
Sweep. Never set them from memory.
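A sketch of what sweeping the two knobs might look like. The endpoints (2 to 60, 0.0 to 0.8) come from the slide; the interior grid values, the `sweep` helper, and the `fake_embed` stand-in are assumptions made so the example runs without extra packages.

```python
from itertools import product

# Endpoints come from the slide; interior values are assumptions.
N_NEIGHBORS = [2, 5, 15, 30, 60]
MIN_DIST = [0.0, 0.1, 0.4, 0.8]

def sweep(embed_fn):
    """Call embed_fn once per (n_neighbors, min_dist) pair and keep every layout."""
    return {
        (n, d): embed_fn(n_neighbors=n, min_dist=d)
        for n, d in product(N_NEIGHBORS, MIN_DIST)
    }

# Hypothetical stand-in for a real projection step.
# With umap-learn installed, embed_fn would wrap something like
# umap.UMAP(n_neighbors=n, min_dist=d).fit_transform(X).
def fake_embed(n_neighbors, min_dist):
    return f"layout(n_neighbors={n_neighbors}, min_dist={min_dist})"

layouts = sweep(fake_embed)
print(len(layouts))  # 5 x 4 = 20 configurations
```

Keeping every layout, rather than only a "best" one, makes it possible to compare how the picture changes across settings.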
Skip dim reduction → density-based clustering has nothing to grip.
Half the catalog doesn't fit a tight theme.
That's honest, not a failure.
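A small sketch of how the unclustered share could be measured. It assumes the common convention in density-based clusterers such as HDBSCAN and DBSCAN, where points that belong to no cluster get the label -1. The label list is invented for illustration.

```python
def noise_fraction(labels):
    """Share of points labelled -1 (no cluster) by a density-based clusterer."""
    if not labels:
        return 0.0
    return sum(1 for label in labels if label == -1) / len(labels)

# Invented labels: four films land in clusters, four are left as noise.
labels = [0, 0, 1, -1, -1, 2, -1, -1]
print(noise_fraction(labels))  # 0.5
```

Reporting this number alongside the clusters keeps the outliers visible instead of forcing them into a theme.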
Same data. Different knob. Different story.
If the story changes with a knob turn, the story isn't about the data.
BERTopic ships a temporal view: topic prevalence across decades.
Hover the line for "WWII & Nazi Germany." Where does it start?
Looks right.
It isn't a WWII cluster. It's a war-shaped cluster.
Clustering captures meaning — not metadata.
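The temporal view mentioned above boils down to counting topic assignments per time bin. The sketch below illustrates that idea in plain Python; it is not BERTopic's implementation, and the years and topic names are invented.

```python
from collections import Counter, defaultdict

def prevalence_by_decade(years, topics):
    """For each decade, return the share of films assigned to each topic."""
    totals = Counter()
    counts = defaultdict(Counter)
    for year, topic in zip(years, topics):
        decade = (year // 10) * 10
        totals[decade] += 1
        counts[decade][topic] += 1
    return {
        decade: {t: c / totals[decade] for t, c in counts[decade].items()}
        for decade in sorted(totals)
    }

# Invented release years and topic assignments.
years = [1942, 1945, 1948, 1961, 1967, 1969]
topics = ["war", "war", "romance", "war", "crime", "crime"]

result = prevalence_by_decade(years, topics)
print(result[1940]["war"], result[1960]["war"])
```

A topic labelled "WWII" that shows up in decades before the war is a hint that the cluster is tracking war-shaped language rather than the historical event.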
Are you clustering to understand your data,
or to generate labels for a system?
Clustering → discover structure.
Classification → scale it.
What BERTopic shipped for years:
0_planet_earth_space_alien
3_queen_prince_princess_king
4_heist_bank_drug_police
The same clusters, re-labelled by Claude:
Space sci-fi
Royalty & fairy tales
Neo-noir crime
Nothing in the pipeline got smarter.
The labelling layer did.
5 zoom levels · search any title · click any point to open on TMDB.
~2× the signal of chance — and these clusters were never told what genre is.
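The slide does not say how "signal over chance" was computed. One plausible reading, shown here purely as an assumption, is cluster purity divided by a majority-class baseline: how often a film shares its cluster's most common genre, compared with how often it would share the single most common genre overall. `purity_lift` is a hypothetical helper and the data is invented.

```python
from collections import Counter, defaultdict

def purity_lift(cluster_ids, genres):
    """Cluster purity relative to the majority-genre baseline (assumed metric)."""
    by_cluster = defaultdict(Counter)
    for cid, genre in zip(cluster_ids, genres):
        by_cluster[cid][genre] += 1
    n = len(genres)
    # Purity: fraction of films matching their cluster's most common genre.
    purity = sum(max(c.values()) for c in by_cluster.values()) / n
    # Baseline: the same fraction if everything sat in one big cluster.
    baseline = max(Counter(genres).values()) / n
    return purity / baseline

# Invented assignments: two clusters of four films each.
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 1]
genres = ["scifi", "scifi", "scifi", "drama", "drama", "drama", "drama", "scifi"]

print(purity_lift(cluster_ids, genres))  # 0.75 / 0.5 = 1.5
```

A value above 1 means the clusters line up with genre more than a single undifferentiated group would, even though genre was never an input.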
Clustering finds the structure.
Classification — trained on the labels you wrote — applies it to the next 10 million rows.
Cluster to discover · → · classify to deploy.
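A sketch of the hand-off from clustering to classification. A nearest-centroid rule stands in for whatever classifier would actually be trained; the choice of model, the helper names, and the 2-D toy vectors are all assumptions. The labels reuse the human-readable cluster names from earlier in the deck.

```python
import math
from collections import defaultdict

def fit_centroids(vectors, labels):
    """Average the vectors belonging to each label into one centroid per label."""
    sums, counts = {}, defaultdict(int)
    for vec, label in zip(vectors, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + x for s, x in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}

def predict(centroids, vec):
    """Assign a new vector to the label with the nearest centroid."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vec))

# Toy 2-D embeddings with labels that came out of the clustering step.
vectors = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
labels = ["Space sci-fi", "Space sci-fi", "Neo-noir crime", "Neo-noir crime"]

centroids = fit_centroids(vectors, labels)
print(predict(centroids, [5.5, 4.0]))  # Neo-noir crime
print(predict(centroids, [0.2, 0.3]))  # Space sci-fi
```

Clustering runs once, on a sample, to produce the labels. The classifier is the cheap part that then runs on every new row.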
Each point is the actual poster at its UMAP coordinate.
1,000 films · auto-rotates until you grab the camera.
You're not discovering structure.
You're designing a perspective.
Take these home
Thank you · github.com/splevine/clustering-good-bad-beautiful