Organized, phase-by-phase copy of every script used to produce the results in the
final report (latex/final/final.tex). The folders follow the report's narrative
order, so a reader who has only seen the report can open the matching folder and
find the scripts that produced each section.
This is a reference / documentation layout. The scripts are copies of the originals in
code/ and code2/; data files, parquets and graphmls are not copied (they are large and live next to the originals). Paths inside the scripts still point at the original working directories, so the copies are meant for reading and will not necessarily run from here.
The whole project is built on two graphs over the same set of artists (and, secondarily, albums):
- Similarity graph (undirected): edge weight = how much two artists share an audience. Used to approximate genres via community detection.
- Flow graph (directed): edge A → B = a user listened to B right after A. Used to measure listeners crossing between those genres, which is the gateway signal.
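As a rough illustration of the two graphs, here is a minimal sketch on toy data. It is not the code in 04_graph_construction/. The function names, the `sessions` structure, and the reading of "cosine-log" as cosine similarity over log(1 + play count) vectors are assumptions made for this example, as is the choice to ignore immediate repeats of the same artist.

```python
import math
from collections import Counter, defaultdict


def flow_edges(sessions):
    """Count directed edges A -> B: B was played right after A by the same user."""
    edges = Counter()
    for plays in sessions.values():
        for a, b in zip(plays, plays[1:]):
            if a != b:  # assumption: immediate repeats are not transitions
                edges[(a, b)] += 1
    return edges


def coslog_similarity(sessions):
    """Undirected weights: cosine similarity of per-user log(1 + plays) vectors.

    Assumption: this is one plausible reading of "cosine-log", not
    necessarily the exact weighting used in the report.
    """
    vecs = defaultdict(dict)  # artist -> {user: log-scaled play count}
    for user, plays in sessions.items():
        for artist, n in Counter(plays).items():
            vecs[artist][user] = math.log1p(n)
    norms = {a: math.sqrt(sum(w * w for w in v.values())) for a, v in vecs.items()}

    sims = {}
    artists = sorted(vecs)
    for i, a in enumerate(artists):
        for b in artists[i + 1:]:
            dot = sum(w * vecs[b][u] for u, w in vecs[a].items() if u in vecs[b])
            if dot > 0:  # only artists with a shared audience get an edge
                sims[(a, b)] = dot / (norms[a] * norms[b])
    return sims


# Toy listening histories (hypothetical users and artists).
sessions = {
    "u1": ["A", "B", "A", "C"],
    "u2": ["A", "B"],
    "u3": ["C", "D"],
}
flow = flow_edges(sessions)
sim = coslog_similarity(sessions)
```

In this toy data, A and B share both of their listeners and so get a high similarity weight, while A and D share none and get no edge. The flow graph keeps direction, so A → B and B → A are counted separately.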
| Folder | Report section | What happens here |
|---|---|---|
| 01_data_collection/ | §2.1 | Download the MLHD+ dataset and inventory users. |
| 02_metadata_resolution/ | §2.2 | Resolve/canonicalize MBIDs to names via a local MusicBrainz mirror. |
| 03_data_cleaning/ | §2.1 | Filter bots/low-density users, sample 10% → 52,800 users. |
| 04_graph_construction/ | §3 | Build the similarity graph (cosine-log) and the directed flow graph. |
| 05_community_detection/ | §4, §6 | Sparsify (L-Spar+SK), fit the nested DC-SBM, compare alternatives, name genres. |
| 06_gateway_scoring/ | §5 | Score gateway artists on the flow graph via source-genre entropy. |
| 07_visualization/ | §6.4 | Interactive Cytoscape.js gateway map (DC-SBM ℓ1 artist genres). |
| utils/ | n/a | Standalone inspection/debug helpers used during development. |
Each folder has its own README.md describing every script in it and pointing at
the relevant report section, algorithm, table or figure.
01 downloader.py ........... fetch MLHD+ tarballs
02 build_lookup_parquets.py MBID -> canonical name lookups (local MB mirror)
03 filter_users.py + sample_filtered_users.py -> 52,800 users
04 gpu_build_graph_coslog.py similarity graph (Alg. 1)
   flow-graph.py ............ directed flow graph (§3.2)
05 lspar_leiden.py .......... L-Spar + Sinkhorn-Knopp sparsification (Alg. 2)
   dcsbm.py (run_dcsbm.sh) .. nested DC-SBM genres (Alg. 3) <- chosen method
06 flow_prep.py ............. flow parquet -> int edge list
   gateway_score.py ......... gateway entropy scoring (Alg. 4)
   map_gateways.py .......... directional "entry doors" per genre (§5.3)
07 gateway_viz/ ............. interactive map (build_viz_data.py + index.html)
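For readers unfamiliar with the Sinkhorn-Knopp step named in phase 05, the sketch below shows the generic algorithm only: alternately rescale rows and columns of a non-negative matrix until both sum to 1. How lspar_leiden.py combines it with L-Spar is not shown here. The function name, the fixed iteration count, and the dense list-of-lists matrix are assumptions for illustration; the real script presumably works on a large sparse graph.

```python
def sinkhorn_knopp(matrix, iters=200):
    """Alternately normalize rows and columns of a strictly positive square matrix.

    Generic Sinkhorn-Knopp iteration; returns a new, approximately doubly
    stochastic matrix. Illustrative sketch, not the project's implementation.
    """
    m = [row[:] for row in matrix]  # work on a copy
    n = len(m)
    for _ in range(iters):
        # Row step: make every row sum to 1.
        for i in range(n):
            s = sum(m[i])
            m[i] = [x / s for x in m[i]]
        # Column step: make every column sum to 1.
        for j in range(n):
            s = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= s
    return m


# Toy symmetric similarity matrix (hypothetical weights).
weights = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.3],
    [0.1, 0.3, 1.0],
]
balanced = sinkhorn_knopp(weights)
row_sums = [sum(row) for row in balanced]
col_sums = [sum(balanced[i][j] for i in range(3)) for j in range(3)]
```

After balancing, no artist dominates simply because it has a large total edge weight, which is the usual motivation for normalizing before a sparsifier picks which edges to keep.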
The report is explicit that several methods were tried and rejected. Those scripts
are kept (they produced the comparisons in §4.1 and the negative experiment in §6),
but the chosen path is: cosine-log similarity → L-Spar+SK → nested DC-SBM →
entropy-based gateway scoring. Rejected/alternative scripts are flagged as
[explored] in each folder's README.
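The last step of the chosen path, entropy-based gateway scoring, can be sketched as follows. This is a simplified illustration, not the contents of gateway_score.py. It assumes the score of an artist is the Shannon entropy (in bits) of the genres of the artists its incoming flow comes from, counting all incoming edges. The names `source_genre_entropy`, `genre_of`, and the toy edge weights are hypothetical.

```python
import math
from collections import Counter, defaultdict


def source_genre_entropy(flow_edges, genre_of):
    """Entropy of source genres over each artist's incoming flow.

    flow_edges: {(src_artist, dst_artist): weight} from the directed flow graph.
    genre_of:   {artist: genre label}, e.g. from community detection.
    Assumption: higher entropy means listeners arrive from a more even
    mix of genres, which is read here as a stronger gateway signal.
    """
    incoming = defaultdict(Counter)  # dst artist -> {source genre: total weight}
    for (src, dst), weight in flow_edges.items():
        incoming[dst][genre_of[src]] += weight

    scores = {}
    for artist, by_genre in incoming.items():
        total = sum(by_genre.values())
        scores[artist] = -sum(
            (w / total) * math.log2(w / total) for w in by_genre.values()
        )
    return scores


# Toy genres and flow counts (hypothetical).
genre_of = {"A": "rock", "B": "rock", "C": "jazz", "D": "jazz", "G": "pop"}
toy_flow = {
    ("A", "G"): 5,   # rock -> G
    ("C", "G"): 5,   # jazz -> G
    ("A", "B"): 10,  # rock -> B only
    ("C", "D"): 6,   # jazz -> D
    ("B", "D"): 2,   # rock -> D
}
scores = source_genre_entropy(toy_flow, genre_of)
```

Here G receives equal flow from two genres and gets the maximum two-genre entropy of 1 bit, B is reached only from rock and scores 0, and D falls in between.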