Film Microgenres Mapped

The Film Atlas

An interactive map of 10,000 of the most popular films on TMDB, released from 1980 onward and arranged by experiential similarity: story, mood, texture, theme, and audience perception, not genre alone.

Movies: 10,000
Built: June 2026

How to read it

A map of movie meaning, not just movie categories.

Every dot is one film. At the widest zoom, the atlas shows broad cinematic territories: crime worlds, family animation, dystopian futures, romance and becoming, survival dread, mythic spectacle, and more.

Zoom in and those territories split into neighborhoods. Zoom further and neighborhoods reveal microclusters: smaller groups of films with unusually similar themes, tone, story mechanics, or audience vibe.

Search for a film, click a dot, or follow a nearest neighbor to move through the map. The side panel shows the selected movie's macro region, neighborhood, microcluster, and closest semantic neighbors.

Visual distance is a guide, not the whole truth. Nearest neighbors are calculated in the full semantic embedding space; the map is a browser-rendered projection designed to make that high-dimensional structure explorable.

Inspiration

What if movies had microgenres like music?

The idea started with the pleasure of Spotify microgenres: those oddly specific labels that make taste feel less like a dropdown menu and more like a living cultural map.

Movies already have genres, but genre is blunt. A film can be comedy, horror, romance, or sci-fi and still belong to a stranger experiential family: lonely future intimacy, corporate satire, cozy mystery, survival dread, mythic spectacle, suburban paranoia, or slacker noir.

The Film Atlas asks whether cinema can be organized the way people often discover music: by texture, mood, story shape, and felt similarity. It is not a replacement for genre. It is a second map underneath it.

How I built it

From movie metadata to semantic cartography.

The finished map looks simple: dots, territories, labels, search, and nearest neighbors. The hard part was making those pieces mean the same thing.

This project became less about drawing a pretty embedding and more about building a trustworthy semantic geography: a system where the data model, labels, borders, and user interface all describe the same underlying structure.

01

Start with a defensible data story

The first version used the official TMDB API to build controlled movie profiles from titles, overviews, genres, keywords, and short review-language snippets.

That made the project viable as a public portfolio piece: no Letterboxd scraping, no IMDb review scraping, no private datasets, and no raw review text exposed in the browser. But the early profiles were too thin. They produced some good neighbors, but not enough depth for the kind of vibe map the interface wanted.

02

Build semantic profiles, not keyword soup

The stronger version combined multiple public signals: TMDB metadata, MovieLens Tag Genome relevance scores, MPST plot tags and synopses, controlled synopsis texture, and light review language.

The key lesson was that more text was not automatically better. Heavy review text added noise. Title-heavy profiles created leakage. The best representation balanced plot, theme, tone, audience perception, and genre without letting any single source dominate.

Each film became a semantic profile: a compact description of what the movie is about, how it feels, and what viewers tend to associate with it.
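One way to picture "balanced" is a per-source word budget, so that no single signal can crowd out the others and the title never enters the text at all. The sketch below is illustrative only: the source names, the budgets, and the build_profile helper are assumptions, not the weights or code the atlas actually uses.

```python
from typing import Dict

# Hypothetical per-source word budgets; the real balance is not stated in the text.
BUDGETS: Dict[str, int] = {
    "plot": 60,
    "themes": 25,
    "tone": 25,
    "audience": 25,
    "genres": 10,
}


def build_profile(sources: Dict[str, str], budgets: Dict[str, int] = BUDGETS) -> str:
    """Join the budgeted sources into one profile, capping each at its word budget.

    Keys that are not in `budgets` (for example "title") are ignored entirely,
    which is one simple way to avoid title leakage.
    """
    parts = []
    for name, budget in budgets.items():
        words = sources.get(name, "").split()
        if words:
            parts.append(f"{name}: {' '.join(words[:budget])}")
    return "\n".join(parts)


film = {
    "title": "Example Film",  # deliberately unused
    "plot": "a lonely engineer falls for an operating system " * 20,
    "tone": "melancholy tender quiet",
    "genres": "romance sci-fi",
}
profile = build_profile(film)
```

Here the long plot text is cut to 60 words, while the shorter tone and genre fields pass through unchanged.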

03

Keep similarity in the full embedding space

Each semantic profile was embedded into a high-dimensional vector. Nearest neighbors are calculated there, where the semantic signal is strongest. The visible map is a projection. It is useful for exploration, but it is not treated as the source of truth.

That distinction mattered. A raw 2D projection can make good semantic neighbors appear far apart, or make nearby dots look more related than they really are. The atlas needed to be visually readable without pretending that screen distance explains everything.
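A minimal sketch of that separation: neighbors are ranked from the full vectors only, and screen coordinates never enter the calculation. Cosine similarity, the tiny four-dimensional vectors, and the nearest_neighbors helper are all assumptions for illustration; the text does not say which metric or embedding model the atlas uses.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_neighbors(
    film_id: str, embeddings: Dict[str, Sequence[float]], k: int = 2
) -> List[str]:
    """Rank every other film by similarity in the full embedding space."""
    query = embeddings[film_id]
    scored = [
        (cosine(query, vec), other)
        for other, vec in embeddings.items()
        if other != film_id
    ]
    # Highest similarity first; ties are broken by id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:k]]


# Toy stand-ins for high-dimensional profile embeddings.
embeddings = {
    "film_a": [1.0, 0.0, 0.0, 0.2],
    "film_b": [0.9, 0.1, 0.0, 0.3],
    "film_c": [0.0, 1.0, 0.0, 0.0],
    "film_d": [0.0, 0.9, 0.1, 0.0],
}
neighbors = nearest_neighbors("film_a", embeddings, k=2)
```

A brute-force scan like this is fine for a sketch. At 10,000 films the neighbor lists would more plausibly be precomputed offline and shipped with the static export.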

04

Make the hierarchy real

The biggest early failure was structural. Macro regions, neighborhoods, and microclusters originally behaved like separate layers that looked nested in the UI but were not always truly nested in the data.

The final system uses strict hierarchical clustering: cluster the full set into macro regions, cluster within each macro region into neighborhoods, then cluster within each neighborhood into microclusters. That means every film has one real path through the atlas: macro region -> neighborhood -> microcluster.

The geography is not just a visual effect. It is enforced by the data model.
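The nesting can be sketched as a recursive split: cluster a set of films, then cluster only within each resulting group, appending one label per level to each film's path. The tiny_kmeans stand-in, the ks cluster counts, and the nested_paths helper below are hypothetical; the text does not name the clustering algorithm the atlas uses.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]
ClusterFn = Callable[[List[Vector], int], List[int]]


def tiny_kmeans(points: List[Vector], k: int, iters: int = 10) -> List[int]:
    """Deterministic stand-in clusterer (not the atlas's actual algorithm)."""
    k = min(k, len(points))
    # Seed centers from evenly spaced points so results are reproducible.
    centers = [list(points[i * len(points) // k]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            for p in points
        ]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def nested_paths(
    vectors: Dict[str, Vector],
    ks: Tuple[int, ...] = (2, 2, 2),
    cluster: ClusterFn = tiny_kmeans,
) -> Dict[str, Tuple[int, ...]]:
    """Give each film exactly one (macro, neighborhood, microcluster) path."""
    paths: Dict[str, Tuple[int, ...]] = {fid: () for fid in vectors}

    def split(ids: List[str], depth: int) -> None:
        if depth == len(ks) or not ids:
            return
        labels = cluster([vectors[i] for i in ids], ks[depth])
        groups: Dict[int, List[str]] = {}
        for fid, label in zip(ids, labels):
            paths[fid] += (label,)
            groups.setdefault(label, []).append(fid)
        # Recurse only inside each group, so children never cross a parent border.
        for members in groups.values():
            split(members, depth + 1)

    split(sorted(vectors), 0)
    return paths


xs = [0.0, 0.3, 2.0, 2.3, 10.0, 10.3, 12.0, 12.3]
vectors = {f"film_{i}": [x, 0.0] for i, x in enumerate(xs)}
paths = nested_paths(vectors)
```

Labels are local to their parent, so the full tuple identifies a cluster. Two microclusters labeled 0 under different neighborhoods are different clusters.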

05

Make labels poetic but accountable

The labels are the product surface. They are how people understand the map.

I used LLM-assisted labeling to generate names for regions, neighborhoods, and microclusters, but the goal was not to blindly accept whatever sounded clever. Labels had to be vivid enough to feel human and specific enough to be useful.

The guiding phrase became: poetic but accountable. A good label should be memorable, but it should also be supported by representative films, tags, cluster evidence, and nearest-neighbor behavior.

06

Turn the frontend into a QA surface

The interface exposed problems the pipeline alone could hide. Some labels overpromised. Some borders implied relationships that were not true. Some nearest neighbors were semantically good but visually surprising. Some map experiments looked beautiful but were less honest.

Search, selected-film panels, expandable cluster cards, and nearest-neighbor links became more than UI features. They became audit tools.

The frontend became a cartography problem: how do you show high-dimensional semantic relationships in a way that feels readable, stable, and honest?

Validation

Subjective quality became measurable enough to improve.

A semantic map is partly subjective, so validation could not be purely mathematical. I used a mix of known-film spot checks, cluster-label review, structural scans, large LLM-assisted audits, and targeted repairs.

The technical checks focused on whether the hierarchy was real, not merely visual: every movie needed one valid macro region, one neighborhood inside that macro region, and one microcluster inside that neighborhood. The export was also scanned for duplicate path labels, missing cluster assignments, malformed neighbor lists, and accidental leakage of private artifacts.
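A structural scan of this kind might look like the sketch below. The record fields (id, macro, neighborhood, microcluster, neighbors) and the validate_export helper are assumed names for illustration; the actual export schema is not described in the text.

```python
from typing import Dict, List, Tuple


def validate_export(films: List[dict]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the scan passed."""
    errors: List[str] = []
    ids = {film.get("id") for film in films}
    # (level index, child cluster id) -> parent cluster id seen first
    parent_of: Dict[Tuple[int, str], str] = {}

    for film in films:
        fid = film.get("id")
        path = [film.get(level) for level in ("macro", "neighborhood", "microcluster")]
        if any(step is None for step in path):
            errors.append(f"{fid}: missing cluster assignment")
            continue

        # Every child cluster must sit under exactly one parent.
        for level, (parent, child) in enumerate(zip(path, path[1:])):
            if parent_of.setdefault((level, child), parent) != parent:
                errors.append(f"{fid}: {child} appears under more than one parent")

        neighbors = film.get("neighbors")
        if not isinstance(neighbors, list) or len(set(neighbors)) != len(neighbors):
            errors.append(f"{fid}: malformed neighbor list")
        elif fid in neighbors or any(n not in ids for n in neighbors):
            errors.append(f"{fid}: neighbor list has self or unknown id")

    return errors


films = [
    {"id": "a", "macro": "M1", "neighborhood": "N1", "microcluster": "C1", "neighbors": ["b"]},
    {"id": "b", "macro": "M1", "neighborhood": "N1", "microcluster": "C1", "neighbors": ["a"]},
    # Deliberately broken: N1 reused under a second macro, and a duplicated neighbor.
    {"id": "c", "macro": "M2", "neighborhood": "N1", "microcluster": "C2", "neighbors": ["a", "a"]},
]
problems = validate_export(films)
```

The first two records pass cleanly. The third triggers both a nesting error and a neighbor-list error.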

Nearest-neighbor quality was checked separately from screen position. That mattered because the map is a projection: a point can look distant on the canvas while still being close in the full embedding space. Validation therefore compared selected films against their semantic neighbors, cluster memberships, labels, and visible territory placement as separate signals.
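One simple way to quantify that gap is neighbor overlap: for a given film, what fraction of its k nearest neighbors in the full space are also among its k nearest on screen? The sketch below assumes Euclidean distance and toy coordinates; the knn and neighbor_overlap helpers are hypothetical, not the audit the atlas actually ran.

```python
import math
from typing import Dict, List, Sequence


def knn(target: str, points: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Ids of the k points closest to `target` by Euclidean distance."""
    others = [
        (math.dist(points[target], vec), name)
        for name, vec in points.items()
        if name != target
    ]
    return [name for _, name in sorted(others)[:k]]


def neighbor_overlap(
    target: str,
    full_space: Dict[str, Sequence[float]],
    screen: Dict[str, Sequence[float]],
    k: int = 2,
) -> float:
    """Share of full-space neighbors that are also screen neighbors (0.0 to 1.0)."""
    full = set(knn(target, full_space, k))
    flat = set(knn(target, screen, k))
    return len(full & flat) / k


# Toy data: "c" is close to "a" in the full space but lands far away on screen.
full_space = {"a": [0.0, 0.0, 0.0], "b": [0.1, 0.0, 0.0], "c": [0.0, 0.2, 0.0], "d": [5.0, 5.0, 5.0]}
screen = {"a": [0.0, 0.0], "b": [0.1, 0.0], "c": [3.0, 3.0], "d": [0.2, 0.1]}
overlap = neighbor_overlap("a", full_space, screen, k=2)
```

A low overlap does not mean the neighbors are wrong. It flags films where the side panel and the canvas will appear to disagree, which is exactly where a reviewer should look.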

Browser QA became part of the validation loop too. Search, zoom, selected-film panels, expandable cluster cards, labels, territory boundaries, and console health were all inspected against the sanitized static export so the public page stayed aligned with the offline pipeline.

What changed

The project moved from "more data will fix it" to "the structure has to be true." It moved from flat cluster labels to telescoping region names, from a raw scatterplot to a semantic atlas, and from one-off model output to an audited public product.

What worked

Official and public data sources, balanced semantic profiles, full-space nearest-neighbor computation, strict hierarchy, human-reviewable labels, targeted repairs, and browser QA made the system legible without requiring a backend.

What did not

Heavy review text, title-heavy embeddings, non-nested clustering, purely quantitative cluster selection, and decorative map borders all created attractive but less trustworthy versions of the atlas.

Movie metadata is sourced from TMDB. This product uses the TMDB API but is not endorsed or certified by TMDB.