The Film Atlas

From movie metadata to semantic cartography.

The finished map looks simple: dots, territories, labels, search, and nearest neighbors. The hard part was making those pieces mean the same thing.

This project became less about drawing a pretty embedding and more about building a trustworthy semantic geography: a system where the data model, labels, borders, and user interface all describe the same underlying structure.

Start with a defensible data story

The first version used the official TMDb API to build controlled movie profiles from titles, overviews, genres, keywords, and short review-language snippets.

That made the project publishable: no Letterboxd scraping, no IMDb review scraping, no private datasets, and no raw review text exposed in the browser. But the early profiles were too thin. They produced some good neighbors, but not enough depth for the kind of vibe map the interface wanted.

Build semantic profiles, not keyword soup

The stronger version combined multiple public signals: TMDb metadata, MovieLens Tag Genome relevance scores, MPST plot tags and synopses, controlled synopsis texture, and light review language.

The key lesson was that more text was not automatically better. Heavy review text added noise. Title-heavy profiles created leakage. The best representation balanced plot, theme, tone, audience perception, and genre without letting any single source dominate.

Each film became a semantic profile: a compact description of what the movie is about, how it feels, and what viewers tend to associate with it.
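A minimal sketch of the balancing idea, assuming each source is capped by a word budget so that no single signal dominates. The source names, the budget values, and the choice to leave the title out of the profile text are illustrative assumptions, not the project's actual settings.

```python
from dataclasses import dataclass, field

# Hypothetical per-source word budgets; the real weights are not given in the text.
SOURCE_BUDGETS = {
    "plot": 60,
    "themes": 25,
    "tone": 15,
    "tags": 25,
    "genres": 10,
    "reviews": 15,  # kept small on purpose: heavy review text added noise
}


@dataclass
class FilmSignals:
    title: str
    sources: dict[str, str] = field(default_factory=dict)


def truncate_words(text: str, limit: int) -> str:
    """Keep at most `limit` whitespace-separated words."""
    return " ".join(text.split()[:limit])


def build_profile(film: FilmSignals) -> str:
    """Assemble a profile with a capped contribution from each source.

    Assumption: the title is left out of the profile text so that films
    are not matched on title words alone.
    """
    parts = []
    for source, budget in SOURCE_BUDGETS.items():
        text = film.sources.get(source, "").strip()
        if text:
            parts.append(f"{source}: {truncate_words(text, budget)}")
    return "\n".join(parts)
```

Capping each source is only one way to balance signals; weighting separate embeddings per source would be another.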

Keep similarity in the full embedding space

Each semantic profile was embedded into a high-dimensional vector. Nearest neighbors are calculated there, where the semantic signal is strongest. The visible map is a projection. It is useful for exploration, but it is not treated as the source of truth.

That distinction mattered. A raw 2D projection can make good semantic neighbors appear far apart, or make nearby dots look more related than they really are. The atlas needed to be visually readable without pretending that screen distance explains everything.
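The gap between embedding similarity and screen distance can be illustrated with a small sketch. It assumes cosine similarity as the neighbor metric, which the text does not specify, and the toy vectors and map positions are invented for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_neighbors(embeddings, film_id, k=5):
    """Rank the other films by similarity in the full embedding space."""
    query = embeddings[film_id]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in embeddings.items()
        if other != film_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def screen_distance(positions, a, b):
    """Euclidean distance between two films on the 2D map."""
    return math.dist(positions[a], positions[b])


# Toy data (hypothetical): "b" is the true semantic neighbor of "a",
# but "c" happens to be drawn closer to "a" on the map.
embeddings = {
    "a": [1.0, 0.0, 0.0, 0.1],
    "b": [0.9, 0.1, 0.0, 0.1],
    "c": [0.0, 1.0, 0.0, 0.0],
}
positions = {"a": (0.0, 0.0), "b": (5.0, 5.0), "c": (1.0, 0.0)}
```

In this toy case the neighbor list for "a" starts with "b", even though "c" is the closest dot on screen. That is the kind of mismatch the interface has to avoid hiding.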

Make the hierarchy real

The biggest early failure was structural. Macro regions, neighborhoods, and micro clusters originally behaved like separate layers that looked nested in the UI but were not always truly nested in the data.

The final system uses strict hierarchical clustering: cluster the full set into macro regions, cluster within each macro region into neighborhoods, then cluster within each neighborhood into micro clusters. That means every film has one real path through the atlas: macro region -> neighborhood -> micro cluster.

The geography is not just a visual effect. It is enforced by the data model.
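The nesting described above can be sketched as a recursive split: cluster the whole set, then cluster only inside each resulting group. The tiny k-means below is a stand-in chosen to keep the example dependency-free; the text does not say which clustering algorithm or branching factors the atlas uses, so `kmeans`, `build_paths`, and `branching` are all assumptions.

```python
import math


def kmeans(points, k, iterations=20):
    """Tiny deterministic k-means; returns one cluster index per point."""
    k = min(k, len(points))
    # Farthest-point initialisation, starting from the first point.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    assignment = [0] * len(points)
    for _ in range(iterations):
        assignment = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment


def build_paths(embeddings, branching=(2, 2, 2)):
    """Give every film one path (macro, neighborhood, micro).

    Each level is clustered only within its parent cluster, so the
    nesting holds by construction rather than by visual convention.
    """
    paths = {film_id: () for film_id in embeddings}

    def split(film_ids, depth):
        if depth == len(branching) or not film_ids:
            return
        labels = kmeans([embeddings[f] for f in film_ids], branching[depth])
        groups = {}
        for film_id, label in zip(film_ids, labels):
            paths[film_id] += (label,)
            groups.setdefault(label, []).append(film_id)
        for members in groups.values():
            split(members, depth + 1)

    split(list(embeddings), 0)
    return paths
```

Cluster indices here are local to their parent, so a micro cluster is identified by its full path, not by its last index alone.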

Make labels poetic but accountable

The labels are the product surface. They are how people understand the map.

I used LLM-assisted labeling to generate names for regions, neighborhoods, and micro clusters, but the goal was not to blindly accept whatever sounded clever. Labels had to be vivid enough to feel human and specific enough to be useful.

The guiding phrase became: poetic but accountable. A good label should be memorable, but it should also be supported by representative films, tags, cluster evidence, and nearest-neighbor behavior.
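One way to make "accountable" concrete is to store each label next to its evidence and flag labels whose supporting terms barely appear in the cluster. This is a sketch under assumptions: the `ClusterLabel` fields, the thresholds, and the idea of explicit "claimed terms" are hypothetical and not taken from the project's pipeline.

```python
from dataclasses import dataclass


@dataclass
class ClusterLabel:
    name: str
    claimed_terms: list[str]         # what the label is supposed to evoke (hypothetical field)
    representative_films: list[str]  # films a reader can inspect
    top_tags: dict[str, float]       # tag -> relevance within the cluster


def audit_label(label, min_films=3, min_relevance=0.5):
    """Return a list of problems; an empty list means the label passes."""
    problems = []
    if len(label.representative_films) < min_films:
        problems.append(
            f"only {len(label.representative_films)} representative films"
        )
    for term in label.claimed_terms:
        if label.top_tags.get(term, 0.0) < min_relevance:
            problems.append(f"claimed term '{term}' is weakly supported")
    return problems
```

A check like this does not judge whether a label is vivid. It only catches labels that promise more than the cluster evidence shows, which is the failure a purely poetic name tends to produce.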

Turn the frontend into a QA surface

The interface exposed problems the pipeline alone could hide. Some labels overpromised. Some borders implied relationships that were not true. Some nearest neighbors were semantically good but visually surprising. Some map experiments looked beautiful but were less honest.

Search, selected-film panels, expandable cluster cards, and nearest-neighbor links became more than UI features. They became audit tools.

The frontend became a cartography problem: how do you show high-dimensional semantic relationships in a way that feels readable, stable, and honest?

Area	Tradeoff	Final choice
Data	Richer sources improved semantic texture, but scraped review ecosystems would weaken the public data story.	Use TMDb plus public MovieLens and MPST signals; avoid Letterboxd and IMDb scraping.
Profiles	More text was not automatically better; title-heavy or review-heavy profiles could create leakage and noise.	Build balanced semantic profiles from plot, theme, tone, tags, and controlled review/synopsis texture.
Clustering	Flat clusters were easier to generate, but made the UI imply a hierarchy the data did not fully support.	Use a strict macro region -> neighborhood -> micro cluster hierarchy so each film has one real atlas path.
Methods	A beautiful 2D projection can hide or distort high-dimensional relationships.	Compute neighbors and clusters in full embedding space; use the map as an explorable projection.
Labels	Purely literal labels felt flat; purely poetic labels could overstate the evidence.	Use LLM-assisted labels that are vivid, inspectable, and accountable to cluster evidence.
Validation	Semantic quality is partly subjective, so relying on a single metric would give false confidence.	Combine structural scans, spot checks, large audits, failure-set rechecks, and browser QA.
UI	More explanation in the map surface added cognitive weight.	Keep the interaction lightweight and move deeper methodology lower on the page.
Frontend	A backend could make exploration more dynamic, but would add operational and privacy surface area.	Ship a static Astro + Canvas experience backed by sanitized JSON and lazy neighbor shards.
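The "sanitized JSON and lazy neighbor shards" choice in the last row can be sketched on the offline side. The shard count, the file naming, and the CRC-based shard key below are assumptions for illustration; the text only says that neighbor data is sharded and loaded lazily.

```python
import json
import zlib
from pathlib import Path


def shard_index(film_id: str, shard_count: int) -> int:
    """Stable shard key, so the browser can compute the same index."""
    return zlib.crc32(film_id.encode("utf-8")) % shard_count


def build_shards(neighbors, shard_count=4):
    """Group precomputed neighbor lists into shards.

    Only ids and rounded scores are kept; no profile or review text
    is written to the public files.
    """
    shards = {i: {} for i in range(shard_count)}
    for film_id, ranked in neighbors.items():
        shards[shard_index(film_id, shard_count)][film_id] = [
            {"id": other, "score": round(score, 3)} for other, score in ranked
        ]
    return shards


def write_shards(shards, out_dir):
    """Write one JSON file per shard and return the file names."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for index, payload in shards.items():
        path = out / f"neighbors-{index:02d}.json"
        path.write_text(json.dumps(payload, sort_keys=True), encoding="utf-8")
    return sorted(p.name for p in out.glob("neighbors-*.json"))
```

With this layout the static site would fetch a single small file when a film is selected, instead of shipping every neighbor list up front.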
