How I built it
From movie metadata to semantic cartography.
The finished map looks simple: dots, territories, labels, search, and nearest neighbors. The hard part was making those pieces mean the same thing.
This project became less about drawing a pretty embedding and more about building a trustworthy semantic geography: a system where the data model, labels, borders, and user interface all describe the same underlying structure.
01
Start with a defensible data story
The first version used the official TMDb API to build controlled movie profiles from titles, overviews, genres, keywords, and short review-language snippets.
That made the project viable as a public portfolio piece: no Letterboxd scraping, no IMDb review scraping, no private datasets, and no raw review text exposed in the browser. But the early profiles were too thin: they produced some good neighbors, but not enough depth for the kind of vibe map the interface called for.
02
Build semantic profiles, not keyword soup
The stronger version combined multiple public signals: TMDb metadata, MovieLens Tag Genome relevance scores, MPST plot tags and synopses, controlled synopsis texture, and light review language.
The key lesson was that more text was not automatically better. Heavy review text added noise. Title-heavy profiles created leakage. The best representation balanced plot, theme, tone, audience perception, and genre without letting any single source dominate.
Each film became a semantic profile: a compact description of what the movie is about, how it feels, and what viewers tend to associate with it.
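One way to keep any single source from dominating is to cap how much text each source may contribute before the profile is embedded. The sketch below is illustrative only: the source names, the word budgets, and the `build_profile` helper are assumptions, not the project's actual pipeline. It also leaves the title out entirely, which is just one possible guard against title leakage.

```python
from typing import Dict

# Hypothetical per-source word budgets. The real balance is not stated in
# this write-up; these numbers only illustrate the capping idea.
SOURCE_BUDGETS: Dict[str, int] = {
    "overview": 60,
    "genres": 10,
    "keywords": 20,
    "tag_genome": 25,
    "plot_tags": 15,
    "review_snippets": 15,
}


def build_profile(sources: Dict[str, str],
                  budgets: Dict[str, int] = SOURCE_BUDGETS) -> str:
    """Join a capped excerpt from each source into one profile string.

    Only sources listed in `budgets` are used, so a "title" entry in
    `sources` is ignored (an assumption made here to avoid title leakage).
    """
    parts = []
    for name, budget in budgets.items():
        words = sources.get(name, "").split()
        if not words:
            continue  # skip sources that are missing or empty for this film
        parts.append(f"{name}: {' '.join(words[:budget])}")
    return "\n".join(parts)
```

With caps like these, a film with hundreds of words of review text still contributes at most 15 review words, so plot and theme signals are not drowned out.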
03
Keep similarity in the full embedding space
Each semantic profile was embedded into a high-dimensional vector. Nearest neighbors are calculated in that full space, before any projection can discard signal. The visible map is a projection. It is useful for exploration, but it is not treated as the source of truth.
That distinction mattered. A raw 2D projection can make good semantic neighbors appear far apart, or make nearby dots look more related than they really are. The atlas needed to be visually readable without pretending that screen distance explains everything.
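A minimal sketch of neighbor lookup in the full embedding space, independent of any 2D coordinates. Cosine similarity is an assumption here, since the write-up does not name the metric, and `nearest_neighbors` is a hypothetical helper that does a brute-force scan rather than using an index.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbors(embeddings: Dict[str, Sequence[float]],
                      film_id: str, k: int = 5) -> List[str]:
    """Return the k most similar films, ranked in the full embedding space."""
    query = embeddings[film_id]
    scored = [
        (cosine_similarity(query, vector), other_id)
        for other_id, vector in embeddings.items()
        if other_id != film_id
    ]
    # Highest similarity first; ties are broken by id so results are stable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other_id for _, other_id in scored[:k]]
```

Nothing in this function reads screen positions, which is the point: the projection can be re-run or restyled without changing which films count as neighbors.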
04
Make the hierarchy real
The biggest early failure was structural. Macro regions, neighborhoods, and microclusters originally behaved like separate layers that looked nested in the UI but were not always truly nested in the data.
The final system uses strict hierarchical clustering: cluster the full set into macro regions, cluster within each macro region into neighborhoods, then cluster within each neighborhood into microclusters. That means every film has one real path through the atlas: macro region -> neighborhood -> microcluster.
The geography is not just a visual effect. It is enforced by the data model.
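The nesting can be sketched as repeated flat clustering inside each parent group. The example below is a rough illustration under stated assumptions: it uses a tiny deterministic k-means as the flat clusterer (the write-up does not say which algorithm was used), and the `kmeans` and `nested_paths` helpers and the cluster counts are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Vector], k: int, iters: int = 20) -> List[int]:
    """Tiny deterministic k-means; returns one cluster index per point."""
    k = min(k, len(points))
    # Farthest-point initialisation keeps the toy example reproducible.
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def nested_paths(embeddings: Dict[str, Vector],
                 ks: Tuple[int, ...] = (2, 2, 2)) -> Dict[str, Tuple[int, ...]]:
    """Give each film one path (macro, neighborhood, micro).

    Each level is clustered only within its parent group, so a child
    cluster can never span two parents. Labels are local to the parent;
    the full tuple is what identifies a cluster.
    """
    paths: Dict[str, Tuple[int, ...]] = {film: () for film in embeddings}
    groups = [list(embeddings)]  # start with one group holding every film
    for k in ks:
        next_groups = []
        for group in groups:
            labels = kmeans([embeddings[f] for f in group], k)
            buckets: Dict[int, List[str]] = {}
            for film, lab in zip(group, labels):
                paths[film] = paths[film] + (lab,)
                buckets.setdefault(lab, []).append(film)
            next_groups.extend(buckets.values())
        groups = next_groups
    return paths
```

Because each film's path is built by appending one label per level, two films that share a microcluster necessarily share its neighborhood and macro region as well.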
05
Make labels poetic but accountable
The labels are the product surface. They are how people understand the map.
I used LLM-assisted labeling to generate names for regions, neighborhoods, and microclusters, but the goal was not to blindly accept whatever sounded clever. Labels had to be vivid enough to feel human and specific enough to be useful.
The guiding phrase became: poetic but accountable. A good label should be memorable, but it should also be supported by representative films, tags, cluster evidence, and nearest-neighbor behavior.
06
Turn the frontend into a QA surface
The interface exposed problems the pipeline alone could hide. Some labels overpromised. Some borders implied relationships that were not true. Some nearest neighbors were semantically good but visually surprising. Some map experiments looked beautiful but were less honest.
Search, selected-film panels, expandable cluster cards, and nearest-neighbor links became more than UI features. They became audit tools.
The frontend became a cartography problem: how do you show high-dimensional semantic relationships in a way that feels readable, stable, and honest?
Validation
Subjective quality became measurable enough to improve.
A semantic map is partly subjective, so validation could not be purely mathematical. I used a mix of known-film spot checks, cluster-label review, structural scans, large LLM-assisted audits, and targeted repairs.
The technical checks focused on whether the hierarchy was real, not merely visual: every movie needed one valid macro region, one neighborhood inside that macro region, and one microcluster inside that neighborhood. The export was also scanned for duplicate path labels, missing cluster assignments, malformed neighbor lists, and accidental leakage of private artifacts.
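Some of these structural checks can be expressed as a small scan over the export. The sketch below assumes a hypothetical export layout (a `films` mapping with `macro`, `neighborhood`, `micro`, and `neighbors` fields, plus a `parent_of` mapping between cluster ids); the project's real schema is not shown in this write-up. It covers nesting, missing assignments, and malformed neighbor lists, not the label or private-artifact scans.

```python
from typing import Dict, List


def validate_export(films: Dict[str, dict],
                    parent_of: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means these checks pass.

    films:     film id -> {"macro", "neighborhood", "micro", "neighbors"}
    parent_of: child cluster id -> parent cluster id
    Both layouts are assumptions made for illustration.
    """
    problems: List[str] = []
    for film_id, record in films.items():
        macro = record.get("macro")
        hood = record.get("neighborhood")
        micro = record.get("micro")
        if not (macro and hood and micro):
            problems.append(f"{film_id}: missing cluster assignment")
        else:
            # The hierarchy is only real if each child sits inside its parent.
            if parent_of.get(hood) != macro:
                problems.append(f"{film_id}: neighborhood {hood} is not inside {macro}")
            if parent_of.get(micro) != hood:
                problems.append(f"{film_id}: microcluster {micro} is not inside {hood}")

        neighbors = record.get("neighbors")
        if not isinstance(neighbors, list) or not all(
                isinstance(n, str) for n in neighbors):
            problems.append(f"{film_id}: malformed neighbor list")
            continue
        if film_id in neighbors:
            problems.append(f"{film_id}: lists itself as a neighbor")
        if len(set(neighbors)) != len(neighbors):
            problems.append(f"{film_id}: duplicate neighbors")
        unknown = [n for n in neighbors if n not in films]
        if unknown:
            problems.append(f"{film_id}: unknown neighbors {unknown}")
    return problems
```

Returning a list of messages rather than raising on the first failure makes it easier to see every broken path in one pass before doing targeted repairs.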
Nearest-neighbor quality was checked separately from screen position. That mattered because the map is a projection: a point can look distant on the canvas while still being close in the full embedding space. Validation therefore compared selected films against their semantic neighbors, cluster memberships, labels, and visible territory placement as separate signals.
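One way to keep semantic neighbors and screen position as separate signals is to measure how much they agree per film. The sketch below is an assumption about how such a check could look, not the project's actual audit: `neighbor_overlap` and its inputs are hypothetical, and the semantic neighbor lists are taken as already computed in the full embedding space.

```python
import math
from typing import Dict, List, Tuple


def screen_neighbors(positions: Dict[str, Tuple[float, float]],
                     film_id: str, k: int) -> List[str]:
    """The k closest dots to a film on the 2D canvas (Euclidean distance)."""
    x0, y0 = positions[film_id]
    others = [f for f in positions if f != film_id]
    # Ties are broken by id so the result is stable.
    others.sort(key=lambda f: (math.hypot(positions[f][0] - x0,
                                          positions[f][1] - y0), f))
    return others[:k]


def neighbor_overlap(semantic: Dict[str, List[str]],
                     positions: Dict[str, Tuple[float, float]],
                     k: int = 5) -> Dict[str, float]:
    """Fraction of each film's top-k semantic neighbors that are also
    among its nearest dots on screen (1.0 = full agreement)."""
    report: Dict[str, float] = {}
    for film_id, neighbors in semantic.items():
        top = neighbors[:k]
        if not top:
            continue  # nothing to compare for films without neighbors
        on_screen = set(screen_neighbors(positions, film_id, len(top)))
        report[film_id] = len(on_screen & set(top)) / len(top)
    return report
```

A low score is not an error in itself, since the projection is expected to distort some relationships. It is a prompt to review that film's neighbors, labels, and territory placement by hand.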
Browser QA became part of the validation loop too. Search, zoom, selected-film panels, expandable cluster cards, labels, territory boundaries, and console health were all inspected against the sanitized static export so the public page stayed aligned with the offline pipeline.
What changed
The project moved from "more data will fix it" to "the structure has to be true." It moved from flat cluster labels to telescoping region names, from a raw scatterplot to a semantic atlas, and from one-off model output to an audited public product.
What worked
Official and public data sources, balanced semantic profiles, full-space nearest-neighbor computation, strict hierarchy, human-reviewable labels, targeted repairs, and browser QA made the system legible without requiring a backend.
What did not
Heavy review text, title-heavy embeddings, non-nested clustering, purely quantitative cluster selection, and decorative map borders all created attractive but less trustworthy versions of the atlas.