The open-source stack behind the talk
Every library doing visible work in the demos, grouped by where it sits in the pipeline.
Pipeline cheat-sheet
requests, pandas, pyarrow, tqdmsentence-transformers, CLIPumap-learnhdbscan, evoc, scikit-learnbertopic, anthropicdatamapplot, matplotlib, imageio-ffmpeguv, jupyterlab, ColabEncoding & embeddings
sentence-transformersText encoder
Sentence and image embeddings on top of Hugging Face. Runs all-MiniLM-L6-v2 on the overviews and clip-ViT-B-32 on the posters.
01_the_good.ipynb, 03_the_beautiful.ipynb, scripts/embed_overviews.py
CLIP (ViT-B/32)Image encoder
Contrastive Language-Image Pretraining. Same pipeline, different encoder — the multimodal pivot in The Beautiful.
Used in:03_the_beautiful.ipynb
Dimensionality reduction
UMAPManifold
Leland McInnes's uniform manifold approximation. Two outputs from the same fit: 5-D for the clusterer to breathe in, 2-D for the maps.
Used in: all three notebooks,scripts/render_animations.py (the n_neighbors / min_dist sweeps)
Clustering
HDBSCANDensity
Hierarchical density-based clustering. No k, noise labelled -1 by design — the whole reason The Good works.
01_the_good.ipynb, transitively through BERTopic
EVoCDensity-tree
Embedding Vector Clustering — newer density-tree clusterer from the McInnes lab. Builds its own nearest-neighbor graph; used on poster embeddings in The Beautiful.
Used in:03_the_beautiful.ipynb
scikit-learnBaselines
The canonical ML toolkit. Used for k-means baselines, ARI/silhouette metrics, and the sensitivity/structureless-data demos in The Bad.
Used in:02_the_bad.ipynb
Topic modeling & labels
BERTopicTopics
Maarten Grootendorst's topic-modeling framework — wraps UMAP + HDBSCAN + c-TF-IDF into one pipeline. The "before LLMs" labels (0_planet_earth_space_alien) come from here.
01_the_good.ipynb, scripts/label_hierarchy.py, scripts/bertopic_visuals.py, scripts/topics_over_time.py
Claude (Anthropic)LLM labels
Frontier LLM used to relabel BERTopic clusters from keyword strings into human-readable themes (Space sci-fi, Neo-noir crime). Ran on Claude Haiku — ~155 labels for under $0.50.
Used in:scripts/label_hierarchy.py
Visualization
datamapplotMaps
The payoff renderer — interactive scatter maps with hover, search, zoom-dependent labels and the poster-thumbnail mode. Ships both the 5K and 100K maps.
Used in:scripts/static_map.py, poster constellation, 100K map
MatplotlibStatics
The static renderer — UMAP parameter-sweep frames and the 100K static map.
Used in:scripts/render_animations.py, scripts/static_map.py
imageio-ffmpegVideo
FFmpeg bridge that stitches the matplotlib frames into the MP4/GIF UMAP sweeps.
Used in:scripts/render_animations.py
Data & infrastructure
TMDBDataset
The Movie Database — source of the top 5,000 (and 100K) films by vote count: metadata, overviews, poster URLs. Free API, requires a Read Access Token.
Used in:scripts/fetch_movies.py, scripts/fetch_posters.py
pandas + PyArrowDataframes
Every intermediate artifact is a Parquet file — movies, embeddings, topic assignments. pandas for manipulation, PyArrow for the on-disk format.
Used everywhereuvEnv manager
Astral's fast Python package manager. One uv sync and you have the exact environment that rendered every artifact on the site.
pyproject.toml, uv.lock
JupyterLab + ColabNotebooks
JupyterLab locally, Colab for attendees — every notebook ships a Colab badge so the talk is one click from a running environment.
Used in: all notebooks