Pipeline — Gateway Artists Between Music Genres

Organized, phase-by-phase copy of every script used to produce the results in the final report (latex/final/final.tex). The folders follow the report's narrative order, so a reader who has only seen the report can open the matching folder and find the scripts that produced each section.

This is a reference / documentation layout. The scripts are copies of the originals in code/ and code2/; data files, including the Parquet and GraphML outputs, are not copied (they are large and live next to the originals). Paths inside the scripts still point at the original working directories.

The two-graph idea (so the phases make sense)

The whole project is built on two graphs over the same set of artists (and, secondarily, albums):

Similarity graph (undirected): edge weight = how much two artists share an audience. Used to approximate genres via community detection.
Flow graph (directed): edge A → B = a user listened to B right after A. Used to measure listeners crossing between those genres — the gateway signal.
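
To make the distinction concrete, here is a minimal sketch that builds both graphs from the same toy listening histories. It is not the pipeline's code: the `histories` data, the function names, the choice to skip immediate repeats of the same artist, and the reading of "cosine-log" as cosine similarity over log(1 + play count) vectors are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data: user -> ordered list of artists played.
histories = {
    "u1": ["A", "B", "A", "C"],
    "u2": ["A", "B", "B"],
    "u3": ["C", "D"],
}

def build_flow_graph(histories):
    """Directed edge (a, b) counts how often b was played right after a."""
    flow = Counter()
    for plays in histories.values():
        for a, b in zip(plays, plays[1:]):
            if a != b:  # assumption: ignore back-to-back plays of one artist
                flow[(a, b)] += 1
    return flow

def build_similarity_graph(histories):
    """Undirected edge weight = cosine similarity of log-scaled play counts.

    Assumption: each artist is a vector over users with entries log(1 + plays).
    """
    vectors = defaultdict(dict)  # artist -> {user: log(1 + plays)}
    for user, plays in histories.items():
        for artist, n in Counter(plays).items():
            vectors[artist][user] = math.log1p(n)
    norms = {a: math.sqrt(sum(v * v for v in vec.values()))
             for a, vec in vectors.items()}
    sim = {}
    artists = sorted(vectors)
    for i, a in enumerate(artists):
        for b in artists[i + 1:]:
            shared = vectors[a].keys() & vectors[b].keys()
            dot = sum(vectors[a][u] * vectors[b][u] for u in shared)
            if dot > 0:  # only artists with a shared listener get an edge
                sim[(a, b)] = dot / (norms[a] * norms[b])
    return sim

flow = build_flow_graph(histories)
sim = build_similarity_graph(histories)
```

In this toy data A and D never share a listener, so they have no similarity edge, while the flow graph records A → B twice and B → A once: the similarity graph is symmetric by construction, the flow graph is not.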

Phases

| Folder | Report section | What happens here |
| --- | --- | --- |
| `01_data_collection/` | §2.1 | Download the MLHD+ dataset and inventory users. |
| `02_metadata_resolution/` | §2.2 | Resolve/canonicalize MBIDs to names via a local MusicBrainz mirror. |
| `03_data_cleaning/` | §2.1 | Filter bots/low-density users, sample 10% → 52,800 users. |
| `04_graph_construction/` | §3 | Build the similarity graph (cosine-log) and the directed flow graph. |
| `05_community_detection/` | §4, §6 | Sparsify (L-Spar+SK), fit the nested DC-SBM, compare alternatives, name genres. |
| `06_gateway_scoring/` | §5 | Score gateway artists on the flow graph via source-genre entropy. |
| `07_visualization/` | §6.4 | Interactive Cytoscape.js gateway map (DC-SBM ℓ1 artist genres). |
| `utils/` | — | Standalone inspection/debug helpers used during development. |

Each folder has its own README.md describing every script in it and pointing at the relevant report section, algorithm, table or figure.

End-to-end flow

01  downloader.py ........... fetch MLHD+ tarballs
02  build_lookup_parquets.py  MBID -> canonical name lookups (local MB mirror)
03  filter_users.py + sample_filtered_users.py  -> 52,800 users
04  gpu_build_graph_coslog.py  similarity graph   (Alg. 1)
    flow-graph.py ............ directed flow graph (§3.2)
05  lspar_leiden.py .......... L-Spar + Sinkhorn-Knopp sparsification (Alg. 2)
    dcsbm.py (run_dcsbm.sh) .. nested DC-SBM genres (Alg. 3)  <- chosen method
06  flow_prep.py ............. flow parquet -> int edge list
    gateway_score.py ......... gateway entropy scoring (Alg. 4)
    map_gateways.py .......... directional "entry doors" per genre (§5.3)
07  gateway_viz/ ............. interactive map (build_viz_data.py + index.html)
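
Step 06 scores gateways by source-genre entropy. The sketch below shows one plausible reading of that idea, not the contents of gateway_score.py: for each artist, take the incoming flow, group it by the genre of the source artist, and compute the Shannon entropy of that distribution. The function name, the toy edges and the genre labels are hypothetical, and the report's Alg. 4 may weight or normalize differently.

```python
import math
from collections import defaultdict

def gateway_entropy(flow_edges, genre_of):
    """Shannon entropy (in bits) of source genres over each artist's incoming flow.

    flow_edges: {(src, dst): weight}, genre_of: {artist: genre label}.
    Assumption: every source genre counts, including the artist's own genre.
    """
    incoming = defaultdict(lambda: defaultdict(float))
    for (src, dst), weight in flow_edges.items():
        incoming[dst][genre_of[src]] += weight

    scores = {}
    for dst, by_genre in incoming.items():
        total = sum(by_genre.values())
        scores[dst] = -sum((w / total) * math.log2(w / total)
                           for w in by_genre.values())
    return scores

# Hypothetical toy flow graph: "x" receives listeners from two genres equally.
flow_edges = {
    ("r1", "x"): 5,
    ("j1", "x"): 5,
    ("r1", "r2"): 8,
    ("r2", "r1"): 3,
}
genre_of = {"r1": "rock", "r2": "rock", "j1": "jazz", "x": "rock"}

scores = gateway_entropy(flow_edges, genre_of)
```

Under this reading, an artist whose listeners all arrive from one genre scores zero, and the score grows as arrivals spread more evenly across genres, which is what makes it usable as a gateway signal.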

Chosen vs. explored

The report is explicit that several methods were tried and rejected. Those scripts are kept (they produced the comparisons in §4.1 and the negative experiment in §6), but the chosen path is: cosine-log similarity → L-Spar+SK → nested DC-SBM → entropy-based gateway scoring. Rejected/alternative scripts are flagged as [explored] in each folder's README.
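
For readers unfamiliar with the "SK" in the chosen path, the following is a generic sketch of Sinkhorn-Knopp balancing: alternately rescale rows and columns of a non-negative matrix until both sum to one. How lspar_leiden.py combines this with L-Spar is not described in this README, so the function below, its iteration count and the toy weight matrix are illustrative assumptions only.

```python
def sinkhorn_knopp(matrix, iterations=100):
    """Alternately normalize rows and columns of a square non-negative matrix."""
    m = [row[:] for row in matrix]  # work on a copy
    n = len(m)
    for _ in range(iterations):
        # Row step: make every row sum to 1.
        for i in range(n):
            s = sum(m[i])
            if s > 0:
                m[i] = [v / s for v in m[i]]
        # Column step: make every column sum to 1.
        for j in range(n):
            s = sum(m[i][j] for i in range(n))
            if s > 0:
                for i in range(n):
                    m[i][j] /= s
    return m

# Hypothetical symmetric edge-weight matrix for three artists.
weights = [
    [1.0, 4.0, 1.0],
    [4.0, 2.0, 1.0],
    [1.0, 1.0, 6.0],
]
balanced = sinkhorn_knopp(weights)
row_sums = [sum(row) for row in balanced]
col_sums = [sum(balanced[i][j] for i in range(3)) for j in range(3)]
```

Balancing of this kind puts edge weights on a comparable scale across high-degree and low-degree nodes, which is a common reason to apply it before thresholding or sparsifying a graph.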
