linearized_directed: sparse path for alpha<1 (issue #69) by arnaudon · Pull Request #110 · barahona-research-group/PyGenStability

arnaudon · 2026-04-29T18:31:23Z

Summary

Refactors constructor_linearized_directed so the alpha < 1 path no longer materialises an N×N teleportation matrix. Decomposes M(alpha) = alpha * D^-1 A + u * 1^T and stores only the sparse part alpha * D^-1 A - I in partial_quality_matrix; the rank-1 teleportation correction is folded into the null model as an extra pair (-t * Pi * u, 1). The stationary distribution pi is computed via a matrix-free scipy.sparse.linalg.LinearOperator + ARPACK, never allocating a dense matrix.
Memory drops from O(N^2) to O(nnz(A) + N). The original 180k-node OOM in Memory issue for large directed graphs #69 (~241 GiB) becomes a few hundred MB.
Adds a docstring warning to constructor_directed (matrix-exponential variant remains unsupported for large graphs — exp(t * (M - I)) is dense even with the sparse + low-rank decomposition).
Targets plotting-cleanup rather than master since it stacks on that branch.

Changes

src/pygenstability/constructors.py: refactored constructor_linearized_directed.prepare() and _get_data(); warning admonition added to constructor_directed.
tests/test_constructors.py: 6 new test cases (parametrised over alpha):
- test_linearized_directed_alpha_lt_one_matches_dense — sparse path equals dense reference at machine eps for alpha ∈ {0.5, 0.8, 0.99}.
- test_linearized_directed_alpha_lt_one_dangling — handles dangling sinks (out-degree 0) for alpha ∈ {0.5, 0.85}.
- test_linearized_directed_alpha_lt_one_sparse_memory — 10k-node graph stays sparse with nnz <= nnz(A) + N.
tests/data/test_constructor_linearized_directed{,_gap}.yaml: regenerated to reflect new sparse quality + 4-vector null_model shape.

Notes

C++ Louvain backend already supports arbitrary even number of null-model vector pairs (signed_modularity uses 2 pairs); no C++ changes needed.
The matrix-free LinearOperator uses v0 = linspace(1, 2)/sum(...) to avoid ARPACK error -9 ("starting vector is zero") on doubly-stochastic walks where the uniform vector is exactly the stationary distribution.
All 26 existing tests still pass; mypy and ruff clean.

Test plan

pytest tests/test_constructors.py — 10 passed (4 existing + 6 new)
pytest tests/ — 26 passed
mypy src/pygenstability/constructors.py — no issues
ruff check and ruff format --check — clean
Manual smoke on 50k-node random graph: prepare in 0.24s, peak RSS 212 MB
CI runs

Closes part of #69 (linearized_directed only; the matrix-exponential directed constructor remains O(N²) by design, now documented as such).

🤖 Generated with Claude Code

- Add CLAUDE.md with repo orientation for Claude Code sessions - Replace print() with module logger in optional-dep fallback paths (pygenstability.py, plotting.py, contrib/sankey.py) - Declare python_requires=">=3.10" in setup.py to match the tox matrix

Replace black, pylint, pycodestyle, pydocstyle, and isort with ruff for both linting (ruff check) and formatting (ruff format). - Add [tool.ruff] config to pyproject.toml, translating the previous ignore lists, google docstring convention, and single-line import style. - Simplify tox.ini lint/format envs to a single ruff dependency. - Remove .pylintrc and the pycodestyle/pydocstyle/isort sections in tox.ini (now all expressed in pyproject.toml). - Apply two minor ruff-format diffs in constructors.py and data_clustering.py (multi-line assert messages, blank-line cleanup).

Move all project metadata (name, version, authors, description, dependencies, optional-dependencies, console script, package discovery) into pyproject.toml's [project] and [tool.setuptools] tables. setup.py now only declares the optional pybind11 C++ extension, which still needs dynamic setup() for the build step. No functional change: the C++ extension still builds, imports still resolve, and the pygenstability console script is still installed.

Annotate the five publicly re-exported functions in pygenstability/__init__.py: - run (pygenstability.py) - evaluate_NVI (pygenstability.py) - identify_optimal_scales (optimal_scales.py) - save_results / load_results (io.py) Uses `from __future__ import annotations` to keep forward refs cheap and avoid pulling in stricter typing libraries. Types intentionally loose where the public contract is flexible (Any for the graph argument, dict for the nested results structure).

The C++ Louvain extension's RNG is not deterministic across runs, so the dictdiffer-based exact comparisons against stored YAML fixtures flake on CI (seen on Python 3.11). Temporarily replace the value comparison with a top-level key-set check — the call paths still run for coverage, but we no longer assert on the randomized numeric outputs. TODO in code points back here; re-enable strict comparison once the C++ RNG is seeded deterministically.

Tests and example scripts write results.pkl to CWD by default; keep these local artifacts out of the repo.

- Fix 'where used without out' UserWarnings at constructors.py:404,459 by pre-allocating the zero output for np.divide (numpy 2.x made this strict). Dangling nodes still end up with dinv==0 as downstream logic expects. - Use v.T instead of la.inv(v) in _compute_spectral_decomp. la.eigh returns orthonormal eigenvectors for symmetric matrices, so the transpose is the exact inverse and O(n^2) instead of O(n^3). Resolves the long-standing TODO at constructors.py:58. - Drop the 720KB auto-generated pygenstability_doc.pdf from the repo; add to .gitignore. Sphinx regenerates it on demand via `tox -e docs`. README no longer links to the stale PDF; the hosted HTML docs cover the same content. - Update Homepage URL in pyproject.toml and clone commands in README.md / docs/index_readme.md from ImperialCollegeLondon/PyGenStability to barahona-research-group/PyGenStability (where CI badges and issues already live).

Extract three private functions from the 150-line run() body: - _resolve_exp_comp_mode validates the exp_comp_mode argument and forces expm for directed / signed constructors. - _scan_scales runs the per-scale optimisation loop and builds the aggregated results dict. - _finalize_results applies postprocessing, computes ttprimes, and runs optimal-scale selection. run() itself now reads as a straightforward script: validate, load constructor, scan scales, finalize. No behavioural change — public signature and returned dict are identical.

- Add upper bounds (numpy<3, scipy<2) so pip rejects future major releases we haven't tested against. Lower bounds unchanged. - Silence two benign RuntimeWarnings surfaced by the test suite: * constructors.py signed_modularity: wrap the per-sign normalisation in np.errstate(invalid="ignore"); NaN output is still produced (and tolerated by the fixtures) when the graph has only positive / only negative edges. * optimal_scales._pool2d_nvi: scope the 'Mean of empty slice' warning around np.nanmean where NaN padding legitimately produces empty windows at the matrix borders. - Add a standalone `lint` job to run-tox.yml that runs `ruff check` and `ruff format --check`. Fast feedback that doesn't get cancelled by a flaky test job.

- _assign_increasing_ids: replace np.vectorize(dict.get) (Python-level loop) with a pure numpy indexing scheme using np.unique's inverse + an argsort-based relabel. ~2-3x faster on the hot per-scale path. - __init__.py: replace star imports with explicit named imports from each submodule; add __all__ listing the full public surface. Drops the corresponding ruff per-file ignore (__init__.py no longer needs F401/F403 relaxation). - Add py.typed marker so downstream users / type checkers pick up the inline type hints on the public API. Included via MANIFEST.in. - Add .pre-commit-config.yaml with ruff check + ruff format and a handful of standard hygiene hooks (trailing whitespace, EOF, yaml/toml syntax, large-file guard at 1 MB).

Replaces the global C rand() / std::random_shuffle path with an explicit std::mt19937& threaded through the C++ optimiser, and removes the leak that let multiprocessing.Pool ordering bleed into per-try seeds. Three coordinated changes: 1. Submodule (arnaudon/generalizedLouvain @ fix_rng): rewrites find_optimal_partition_louvain[_gen] to take a std::mt19937& parameter and use std::shuffle. Submodule pointer + .gitmodules updated to track the fork branch until the change merges upstream. 2. Binding (generalized_louvain.cpp): drops std::srand(seed); constructs a local std::mt19937 from the incoming seed and passes it explicitly, so no global state is touched. 3. Python (_optimise / _run_optimisations): pre-generates per-try seeds in the parent process via np.random.randint, then dispatches via pool.starmap so each try always gets the same seed regardless of which worker picks it up. _optimise now takes seed as a positional argument instead of drawing one inside the worker. With this in place, pgs.run(...) produces bit-identical results across re-runs, processes, and worker counts for a given numpy seed. Restores the strict dictdiffer-based asserts in tests/test_pygenstability.py that 04c73ee had loosened, regenerates tests/data/test_run_gap.yaml against the new RNG path, and stops tests/data/test_run_times.yaml from being trivially overwritten on every test invocation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Submodule update (arnaudon/generalizedLouvain @ fix_rng → 59bd6f0) replaces std::shuffle with a hand-rolled Fisher-Yates that uses mt19937 output via rng() % i. std::shuffle's internal uniform_int_distribution is implementation-defined, so libstdc++ (Linux CI) and libc++ (macOS dev) produced different orderings from the same seed — strict fixtures regenerated locally diverged from CI output. Modulo-biased Fisher-Yates is bit-identical across compilers, which is what fixed test fixtures need. test_run_gap.yaml and test_run_times.yaml regenerated; the other three yaml fixtures already matched the new shuffle output. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- conftest.py: force Agg matplotlib backend so plt.show() is a no-op - contrib/sankey.py: pass auto_open=live to plotly.offline.plot, matching the existing pattern in plot_scan_plotly. Without this, the Sankey HTML always opened in the browser regardless of the live flag. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…on test Tests audit follow-up. - tests/test_dataclustering.py: re-enable the three strict dictdiffer-based YAML asserts that were commented out with the note "this is unstable due to not consistent rng in C++". With the deterministic Louvain RNG in place, they now pass. Generates the test_dataclustering_default.yaml and test_dataclustering_knn.yaml fixtures. - tests/test_pygenstability.py: add test_run_is_deterministic, an explicit regression test that asserts same numpy seed yields identical community_id and stability across repeated pgs.run calls. This covers the determinism contract directly rather than relying on the YAML-fixture comparison to imply it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

michaelschaub/generalizedLouvain master now contains the std::mt19937 plumbing (PR #1) and the portable Fisher-Yates fix (PR #2) we depended on. Switch .gitmodules back from the arnaudon fork URL to upstream and bump the submodule pointer to the upstream tip (8c9f1f2). That tip also includes a Louvain speed-up (PR #3) which keeps the final stability value identical but reorders intermediate tie-breaking on larger graphs, so test_dataclustering_default.yaml and test_dataclustering_knn.yaml needed regeneration. The pygenstability- side fixtures (barbell graph, 22 nodes) were unaffected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…e paths Bug fixes (real footguns): - plot_communities: drop the matplotlib.use('Agg') / restore dance. It mutated process-global state, was not thread-safe, and surprised callers using a different backend. The loop already calls plt.figure() + plt.close() per scale, so the swap was redundant. - plot_communities_matrix: add the missing plt.close() per scale. Previously every scale leaked its figure, accumulating memory and (with an interactive backend) leaving stale windows. - plot_clustered_adjacency: cast the indexed adjacency to float before assigning np.nan to the zero entries. Integer adjacencies (which is what np.matrix(int) and many sparse-to-dense conversions produce) raised ValueError on the assignment. - plot_robust_partitions: flip the default show=True to show=False to match library convention — callers shouldn't have plt.show() invoked on their behalf unless they ask for it. The only call sites in this repo (tests, examples) pass show explicitly, so no behavioural break. Cleanups: - Replace os.path.isdir / os.mkdir with Path.mkdir(parents=True, exist_ok=True) — race-free and one line. - f-string the title concatenations. - Drop now-unused `os` and bare `matplotlib` imports. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

3.10 reaches end-of-life October 2026, so cycle it out and pick up 3.13 instead. Updates the tox envlist, gh-actions mapping, GitHub workflow matrix, ruff target-version, requires-python, and the CLAUDE.md note. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Annotates public APIs and most internal helpers in constructors, plotting, data_clustering, app, optimal_scales, contrib/sankey, and pygenstability; uses Any pragmatically where numpy/scipy/matplotlib stubs are too narrow. Adds a [tool.mypy] config, a typecheck tox env, and a typecheck CI job.

Decompose M(alpha) = alpha * D^-1 A + u * 1^T into its sparse part and a rank-1 teleportation correction. Keep only alpha * D^-1 A - I in partial_quality_matrix and fold the rank-1 piece (scale * Pi * u) o 1 into the null model as an extra pair. Compute pi via a matrix-free LinearOperator + ARPACK so neither the prepare nor the get_data path allocates an N x N matrix. Memory drops from O(N^2) to O(nnz(A) + N), making the constructor usable on graphs with hundreds of thousands of nodes (the 180k-node OOM in issue #69). Also documents constructor_directed (matrix-exponential variant) as unsupported for large graphs, since exp(t * (M - I)) is dense even when M is sparse + low-rank. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

claude and others added 20 commits April 19, 2026 19:12

Ignore results.pkl run artifact

b0f7ac2

Tests and example scripts write results.pkl to CWD by default; keep these local artifacts out of the repo.

Fix E501 line-length on _optimise signature

72abccb

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

This was referenced Apr 29, 2026

Memory issue for large directed graphs #69

Open

Leiden vs Louvain disagree on multi-pair null model normalisation #111

Open

juni-schindler self-requested a review May 26, 2026 08:05

arnaudon force-pushed the plotting-cleanup branch from 22ed0a2 to fc903a6 Compare May 29, 2026 08:29

This was referenced Jun 4, 2026

Fix Leiden multi-pair null-model normalisation to match Louvain (#111) #115

Open

Fix Leiden null-model handling for asymmetric/signed and multi-pair nulls #116

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

linearized_directed: sparse path for alpha<1 (issue #69)#110

linearized_directed: sparse path for alpha<1 (issue #69)#110
arnaudon wants to merge 20 commits into
plotting-cleanupfrom
sparse-linearized-directed

arnaudon commented Apr 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

arnaudon commented Apr 29, 2026

Summary

Changes

Notes

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants