Stack · libraries & tooling

The open-source stack behind the talk

Every library doing visible work in the demos, grouped by where it sits in the pipeline.

← back to the main page

Pipeline cheat-sheet

raw data — requests, pandas, pyarrow, tqdm

encoding — sentence-transformers, CLIP

dim reduction — umap-learn

clustering — hdbscan, evoc, scikit-learn

topic modeling & labels — bertopic, anthropic

visualization — datamapplot, bokeh, deck.gl, three.js, matplotlib, imageio-ffmpeg

environment — uv, jupyterlab, Colab

Encoding & embeddings

sentence-transformersText encoder

Sentence and image embeddings on top of Hugging Face. Runs all-MiniLM-L6-v2 on the overviews and clip-ViT-B-32 on the posters.

Used in: 01_the_good.ipynb, 03_the_beautiful.ipynb, scripts/embed_overviews.py

CLIP (ViT-B/32)Image encoder

Contrastive Language-Image Pretraining. Same pipeline, different encoder — the multimodal pivot in The Beautiful.

Used in: 03_the_beautiful.ipynb

Dimensionality reduction

UMAPManifold

Leland McInnes's uniform manifold approximation. Two outputs from the same fit: 5-D for the clusterer to breathe in, 2-D for the maps.

Used in: all three notebooks, scripts/render_animations.py (the n_neighbors / min_dist sweeps) paper (McInnes et al. 2018) · source

Approximate nearest neighbors

pyNNDescentANN graph

Fast approximate nearest-neighbor search. The invisible workhorse — UMAP and EVoC both lean on it to build the k-NN graph that the rest of the stack runs on top of.

Used transitively through UMAP & EVoC source · NN-Descent paper (Dong et al. 2011)

Clustering

HDBSCANDensity

Hierarchical density-based clustering. No k, noise labelled -1 by design — the whole reason The Good works.

Used in: 01_the_good.ipynb, transitively through BERTopic original paper (Campello et al. 2013) · source · scikit-learn built-in (since 1.3)

EVoCDensity-tree

Embedding Vector Clustering — newer density-tree clusterer from the McInnes lab. Builds its own nearest-neighbor graph; used on poster embeddings in The Beautiful.

Used in: 03_the_beautiful.ipynb source & examples · Tutte Institute (same lab as UMAP & HDBSCAN)

scikit-learnBaselines

The canonical ML toolkit. Used for k-means baselines, ARI/silhouette metrics, and the sensitivity/structureless-data demos in The Bad.

Used in: 02_the_bad.ipynb

Topic modeling & labels

BERTopicTopics

Maarten Grootendorst's topic-modeling framework — wraps UMAP + HDBSCAN + c-TF-IDF into one pipeline. The "before LLMs" labels (0_planet_earth_space_alien) come from here.

Used in: 01_the_good.ipynb, scripts/label_hierarchy.py, scripts/bertopic_visuals.py, scripts/topics_over_time.py paper (Grootendorst 2022) · source

Claude (Anthropic)LLM labels

Frontier LLM used to relabel BERTopic clusters from keyword strings into human-readable themes (Space sci-fi, Neo-noir crime). Ran on Claude Haiku — ~155 labels for under $0.50.

Used in: scripts/label_hierarchy.py

Visualization

datamapplotMaps

The payoff renderer — interactive scatter maps with hover, search, zoom-dependent labels and the poster-thumbnail mode. Ships both the 5K and 100K maps.

Used in: scripts/static_map.py, poster constellation, 100K map source · talk: McInnes, "Data Mapping for Data Exploration" (PyData Seattle 2023)

MatplotlibStatics

The static renderer — UMAP parameter-sweep frames and the 100K static map.

Used in: scripts/render_animations.py, scripts/static_map.py

imageio-ffmpegVideo

FFmpeg bridge that stitches the matplotlib frames into the MP4/GIF UMAP sweeps.

Used in: scripts/render_animations.py

PlotlyBERTopic charts

Backs BERTopic's native charts (visualize_topics_over_time, _heatmap, _barchart, _hierarchy) — the whole extras page.

Used transitively through BERTopic

Bokeh2D interactive

Browser-native interactive plotting. Powers the 2D poster constellation (map_posters.html) — each point a textured glyph with hover tooltips, served as a self-contained HTML bundle.

Used in: map_posters.html (Bokeh 3.9 served from CDN) docs · source

deck.glWebGL maps

WebGL layer renderer underneath datamapplot's interactive maps. Handles the smooth zoom and pan on the 5K and 100K thematic constellations.

Used transitively through datamapplot in map.html, map_100k.html docs · source

three.js3D scene

WebGL scene graph behind the 3D poster cosmos. Textured sprites at UMAP coordinates, orbit controls, the auto-rotating camera — the visual finale of the talk.

Used in: cosmos.html (three.js 0.160 + OrbitControls) docs · source

Data & infrastructure

TMDBDataset

The Movie Database — source of the top 5,000 (and 100K) films by vote count: metadata, overviews, poster URLs. Free API, requires a Read Access Token.

Used in: scripts/fetch_movies.py, scripts/fetch_posters.py

pandas + PyArrowDataframes

Every intermediate artifact is a Parquet file — movies, embeddings, topic assignments. pandas for manipulation, PyArrow for the on-disk format.

Used everywhere

uvEnv manager

Astral's fast Python package manager. One uv sync and you have the exact environment that rendered every artifact on the site.

Used in: pyproject.toml, uv.lock

JupyterLab + ColabNotebooks

JupyterLab locally, Colab for attendees — every notebook ships a Colab badge so the talk is one click from a running environment.

Used in: all notebooks