Stack · libraries & tooling

The open-source stack behind the talk

Every library doing visible work in the demos, grouped by where it sits in the pipeline.

← back to the main page

Pipeline cheat-sheet

raw datarequests, pandas, pyarrow, tqdm
encodingsentence-transformers, CLIP
dim reductionumap-learn
clusteringhdbscan, evoc, scikit-learn
topic modeling & labelsbertopic, anthropic
visualizationdatamapplot, matplotlib, imageio-ffmpeg
environmentuv, jupyterlab, Colab

Encoding & embeddings

sentence-transformersText encoder

Sentence and image embeddings on top of Hugging Face. Runs all-MiniLM-L6-v2 on the overviews and clip-ViT-B-32 on the posters.

Used in: 01_the_good.ipynb, 03_the_beautiful.ipynb, scripts/embed_overviews.py

CLIP (ViT-B/32)Image encoder

Contrastive Language-Image Pretraining. Same pipeline, different encoder — the multimodal pivot in The Beautiful.

Used in: 03_the_beautiful.ipynb

Dimensionality reduction

UMAPManifold

Leland McInnes's uniform manifold approximation. Two outputs from the same fit: 5-D for the clusterer to breathe in, 2-D for the maps.

Used in: all three notebooks, scripts/render_animations.py (the n_neighbors / min_dist sweeps)

Clustering

HDBSCANDensity

Hierarchical density-based clustering. No k, noise labelled -1 by design — the whole reason The Good works.

Used in: 01_the_good.ipynb, transitively through BERTopic

EVoCDensity-tree

Embedding Vector Clustering — newer density-tree clusterer from the McInnes lab. Builds its own nearest-neighbor graph; used on poster embeddings in The Beautiful.

Used in: 03_the_beautiful.ipynb

scikit-learnBaselines

The canonical ML toolkit. Used for k-means baselines, ARI/silhouette metrics, and the sensitivity/structureless-data demos in The Bad.

Used in: 02_the_bad.ipynb

Topic modeling & labels

BERTopicTopics

Maarten Grootendorst's topic-modeling framework — wraps UMAP + HDBSCAN + c-TF-IDF into one pipeline. The "before LLMs" labels (0_planet_earth_space_alien) come from here.

Used in: 01_the_good.ipynb, scripts/label_hierarchy.py, scripts/bertopic_visuals.py, scripts/topics_over_time.py

Claude (Anthropic)LLM labels

Frontier LLM used to relabel BERTopic clusters from keyword strings into human-readable themes (Space sci-fi, Neo-noir crime). Ran on Claude Haiku — ~155 labels for under $0.50.

Used in: scripts/label_hierarchy.py

Visualization

datamapplotMaps

The payoff renderer — interactive scatter maps with hover, search, zoom-dependent labels and the poster-thumbnail mode. Ships both the 5K and 100K maps.

Used in: scripts/static_map.py, poster constellation, 100K map

MatplotlibStatics

The static renderer — UMAP parameter-sweep frames and the 100K static map.

Used in: scripts/render_animations.py, scripts/static_map.py

imageio-ffmpegVideo

FFmpeg bridge that stitches the matplotlib frames into the MP4/GIF UMAP sweeps.

Used in: scripts/render_animations.py

PlotlyBERTopic charts

Backs BERTopic's native charts (visualize_topics_over_time, _heatmap, _barchart, _hierarchy) — the whole extras page.

Used transitively through BERTopic

Data & infrastructure

TMDBDataset

The Movie Database — source of the top 5,000 (and 100K) films by vote count: metadata, overviews, poster URLs. Free API, requires a Read Access Token.

Used in: scripts/fetch_movies.py, scripts/fetch_posters.py

pandas + PyArrowDataframes

Every intermediate artifact is a Parquet file — movies, embeddings, topic assignments. pandas for manipulation, PyArrow for the on-disk format.

Used everywhere

uvEnv manager

Astral's fast Python package manager. One uv sync and you have the exact environment that rendered every artifact on the site.

Used in: pyproject.toml, uv.lock

JupyterLab + ColabNotebooks

JupyterLab locally, Colab for attendees — every notebook ships a Colab badge so the talk is one click from a running environment.

Used in: all notebooks