nelliDev/music-genre-gateways

Pipeline — Gateway Artists Between Music Genres

This directory is an organized, phase-by-phase copy of every script used to produce the results in the final report (latex/final/final.tex). The folders follow the report's narrative order, so a reader who has only seen the report can open the matching folder and find the scripts behind each section.

This is a reference / documentation layout. The scripts are copies of the originals in code/ and code2/. Data files (Parquet and GraphML) are not copied: they are large and stay next to the originals. Paths inside the scripts still point at the original working directories, so the copies are meant for reading rather than running in place.

The two-graph idea (so the phases make sense)

The whole project is built on two graphs over the same set of artists (and, secondarily, albums):

  • Similarity graph (undirected): edge weight = how much two artists share an audience. Used to approximate genres via community detection.
  • Flow graph (directed): edge A → B = a user listened to B right after A. Used to measure listeners crossing between those genres — the gateway signal.
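
To make the two definitions above concrete, here is a minimal sketch on toy data. It is not the project's code: the `histories` dictionary, both function names, the choice to skip immediate repeats, and the exact "cosine-log" formula (cosine similarity over `log(1 + plays)` per user) are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy data: user -> chronological list of artists played.
histories = {
    "u1": ["A", "B", "A", "C"],
    "u2": ["A", "B", "B"],
    "u3": ["C", "D"],
}

def build_flow_graph(histories):
    """Directed edge counts: (a, b) -> times b was played right after a."""
    flow = Counter()
    for plays in histories.values():
        for a, b in zip(plays, plays[1:]):
            if a != b:  # assumption: immediate repeats of one artist are ignored
                flow[(a, b)] += 1
    return flow

def build_similarity_graph(histories):
    """Undirected cosine similarity over log-damped per-user play counts."""
    vectors = defaultdict(dict)  # artist -> {user: log(1 + plays)}
    for user, plays in histories.items():
        for artist, n in Counter(plays).items():
            vectors[artist][user] = math.log1p(n)
    norms = {a: math.sqrt(sum(w * w for w in v.values()))
             for a, v in vectors.items()}
    sim = {}
    for a, b in combinations(sorted(vectors), 2):
        # Only users who played both artists contribute to the dot product.
        dot = sum(w * vectors[b][u] for u, w in vectors[a].items()
                  if u in vectors[b])
        if dot > 0:
            sim[(a, b)] = dot / (norms[a] * norms[b])
    return sim

flow = build_flow_graph(histories)
sim = build_similarity_graph(histories)
```

In this toy example `flow` keeps direction (A → B and B → A are separate edges), while `sim` stores each unordered pair once and has no edge between artists with no shared listeners.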

Phases

| Folder | Report section | What happens here |
| --- | --- | --- |
| 01_data_collection/ | §2.1 | Download the MLHD+ dataset and inventory users. |
| 02_metadata_resolution/ | §2.2 | Resolve/canonicalize MBIDs to names via a local MusicBrainz mirror. |
| 03_data_cleaning/ | §2.1 | Filter bots/low-density users, sample 10% → 52,800 users. |
| 04_graph_construction/ | §3 | Build the similarity graph (cosine-log) and the directed flow graph. |
| 05_community_detection/ | §4, §6 | Sparsify (L-Spar+SK), fit the nested DC-SBM, compare alternatives, name genres. |
| 06_gateway_scoring/ | §5 | Score gateway artists on the flow graph via source-genre entropy. |
| 07_visualization/ | §6.4 | Interactive Cytoscape.js gateway map (DC-SBM ℓ1 artist genres). |
| utils/ | | Standalone inspection/debug helpers used during development. |

Each folder has its own README.md describing every script in it and pointing at the relevant report section, algorithm, table or figure.
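
The "source-genre entropy" idea in phase 06 can be sketched as follows. This is an illustration under assumptions, not the report's Alg. 4: it assumes the score of an artist is the Shannon entropy (in bits) of the genres of the artists its incoming flow comes from, and the names `source_genre_entropy`, `flow_edges` and `genre_of` are hypothetical.

```python
import math
from collections import defaultdict

def source_genre_entropy(flow_edges, genre_of):
    """Return {artist: entropy in bits of the genres its incoming flow came from}."""
    incoming = defaultdict(lambda: defaultdict(float))  # target -> genre -> weight
    for (src, dst), weight in flow_edges.items():
        incoming[dst][genre_of[src]] += weight
    scores = {}
    for artist, by_genre in incoming.items():
        total = sum(by_genre.values())
        scores[artist] = -sum((w / total) * math.log2(w / total)
                              for w in by_genre.values())
    return scores

# Hypothetical toy input: directed flow counts and a genre label per artist.
flow_edges = {("a1", "x"): 5, ("b1", "x"): 5, ("a1", "y"): 9, ("a2", "y"): 1}
genre_of = {"a1": "rock", "a2": "rock", "b1": "jazz", "x": "rock", "y": "rock"}

scores = source_genre_entropy(flow_edges, genre_of)
```

Here artist `x` receives listeners equally from two genres and scores 1 bit, while `y` is reached only from its own genre and scores 0, which matches the intuition that a gateway draws from a mix of source genres.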

End-to-end flow

01  downloader.py ................ fetch MLHD+ tarballs
02  build_lookup_parquets.py ..... MBID -> canonical name lookups (local MB mirror)
03  filter_users.py + sample_filtered_users.py  -> 52,800 users
04  gpu_build_graph_coslog.py .... similarity graph (Alg. 1)
    flow-graph.py ................ directed flow graph (§3.2)
05  lspar_leiden.py .............. L-Spar + Sinkhorn-Knopp sparsification (Alg. 2)
    dcsbm.py (run_dcsbm.sh) ...... nested DC-SBM genres (Alg. 3)  <- chosen method
06  flow_prep.py ................. flow parquet -> int edge list
    gateway_score.py ............. gateway entropy scoring (Alg. 4)
    map_gateways.py .............. directional "entry doors" per genre (§5.3)
07  gateway_viz/ ................. interactive map (build_viz_data.py + index.html)
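
The "flow parquet -> int edge list" step in phase 06 amounts to replacing string identifiers with contiguous integers. A minimal sketch, assuming the flow table has already been read into `(source, target, count)` rows; the function name, the row layout and the sample identifiers are hypothetical and not taken from flow_prep.py.

```python
def to_int_edge_list(rows):
    """Map string node ids to contiguous ints, in order of first appearance."""
    index = {}   # original id -> integer id
    edges = []   # (source_int, target_int, count)
    for src, dst, count in rows:
        s = index.setdefault(src, len(index))
        d = index.setdefault(dst, len(index))
        edges.append((s, d, count))
    return edges, index

# Hypothetical rows standing in for the flow table.
rows = [("mbid-a", "mbid-b", 3), ("mbid-b", "mbid-c", 1), ("mbid-a", "mbid-c", 2)]
edges, index = to_int_edge_list(rows)
```

Keeping `index` alongside the edge list lets later steps translate integer ids back to the original identifiers.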

Chosen vs. explored

The report is explicit that several methods were tried and rejected. Those scripts are kept (they produced the comparisons in §4.1 and the negative experiment in §6), but the chosen path is: cosine-log similarity → L-Spar+SK → nested DC-SBM → entropy-based gateway scoring. Rejected/alternative scripts are flagged as [explored] in each folder's README.
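
For readers unfamiliar with the "SK" part of the chosen path, the sketch below shows plain Sinkhorn-Knopp balancing on a small dense matrix: rows and columns are rescaled alternately until both sum to 1. How the pipeline combines this with L-Spar is not described here, so the example covers only the balancing step. The function name, the iteration count and the toy `weights` matrix are assumptions, and the input is assumed to be square with strictly positive entries.

```python
def sinkhorn_knopp(matrix, iterations=200):
    """Alternately normalize rows and columns of a positive square matrix."""
    m = [row[:] for row in matrix]  # work on a copy
    n = len(m)
    for _ in range(iterations):
        # Row step: make every row sum to 1.
        for i in range(n):
            row_sum = sum(m[i])
            m[i] = [v / row_sum for v in m[i]]
        # Column step: make every column sum to 1.
        for j in range(n):
            col_sum = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= col_sum
    return m

# Hypothetical symmetric similarity weights between three artists.
weights = [
    [1.0, 2.0, 1.0],
    [2.0, 1.0, 3.0],
    [1.0, 3.0, 1.0],
]
balanced = sinkhorn_knopp(weights)
```

After balancing, no single high-degree node dominates the edge weights, which is one common reason to normalize a similarity graph before choosing which edges to keep.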
