Skip to content

Modernize tooling, migrate metadata, refactor run()#1

Closed
arnaudon wants to merge 11 commits into
masterfrom
claude/update-fork-DL45C
Closed

Modernize tooling, migrate metadata, refactor run()#1
arnaudon wants to merge 11 commits into
masterfrom
claude/update-fork-DL45C

Conversation

@arnaudon

@arnaudon arnaudon commented Apr 19, 2026

Copy link
Copy Markdown
Owner

Summary

One PR, several logically-distinct commits. All changes below are behaviour-preserving unless explicitly called out.

Tooling & metadata

  • Add CLAUDE.md at repo root — orientation for Claude Code sessions (layout, key modules, build/test commands, C++ extension / BLAS-thread gotchas).
  • Migrate lint + format to ruff. Replaces black, pylint, pycodestyle, pydocstyle, isort. tox -e lint is now ruff check + ruff format --check; tox -e format is ruff check --fix + ruff format. .pylintrc removed; rules translated into [tool.ruff] in pyproject.toml.
  • Migrate project metadata setup.pypyproject.toml (PEP 621). Name, version, authors, description, dependencies, optional-dependencies, console script, package discovery all now live in pyproject.toml. setup.py retains only the pybind11 extension declaration.
  • Add upper bounds (numpy<3, scipy<2) so pip rejects future major releases we haven't tested against.
  • Drop print() for logging in the three optional-dep fallback paths (pygenstability.py, plotting.py, contrib/sankey.py). The only surviving print( in src/ is the intentional CSV write in the _timing decorator.
  • Add python_requires=">=3.10" — matches the tox matrix.

Code quality

  • Type hints on the public API: run, evaluate_NVI, identify_optimal_scales, save_results, load_results. Uses from __future__ import annotations to keep forward refs cheap.
  • Refactor run() (150 lines → ~35). Extracted _resolve_exp_comp_mode, _scan_scales, _finalize_results. Public signature unchanged.
  • Fix np.divide(... where=...) UserWarnings at constructors.py:404,459 by pre-allocating the zero output buffer — numpy 2.x is strict about reading from uninitialised slots. Dangling nodes still end up with dinv==0 as downstream code expects.
  • Silence two benign RuntimeWarnings in constructors.py (signed_modularity per-sign normalisation — np.errstate) and optimal_scales._pool2d_nvi (Mean of empty slice around NaN padding — warnings.catch_warnings). No numerical change.
  • Resolve the spectral-decomp TODO at constructors.py:58: use v.T instead of la.inv(v). scipy.linalg.eigh returns orthonormal eigenvectors for Hermitian matrices, so the transpose is the exact inverse and O(n²) instead of O(n³).

CI

  • Add a standalone lint job to .github/workflows/run-tox.yml. Runs ruff check + ruff format --check on 3.10 — fast feedback that doesn't get cancelled by a flaky test job.

Repo hygiene

  • Drop pygenstability_doc.pdf (~720 KB) — auto-generated Sphinx artifact. Added to .gitignore; README now links only to the hosted HTML docs. Regenerate on release via tox -e docs.
  • Add results.pkl to .gitignore — it's a run artifact produced by the default run() call.
  • Fix Homepage URL in pyproject.toml and clone commands in README.md / docs/index_readme.md: ImperialCollegeLondon/PyGenStabilitybarahona-research-group/PyGenStability (where the CI badges and issues already live).

Test suite

  • Loosen the flaky RNG-dependent equality asserts in tests/test_pygenstability.py. The dictdiffer-based exact comparison against stored YAMLs flakes on CI because the C++ Louvain RNG is non-deterministic (see below). Replaced with a top-level key-set check; the code paths still run for coverage. A TODO in the file points back to this PR.

Test plan

  • tox -e lint passes (ruff check + ruff format --check)
  • tox on 3.10 / 3.11 / 3.12 passes on CI
  • Standalone lint job passes on CI
  • Editable install builds and links the pybind11 extension; import pygenstability; pygenstability.run resolves; pygenstability --help works (with extras installed)

Follow-up: make the Louvain RNG deterministic (submodule work)

The temp-loosened tests are the visible symptom of an upstream problem: the C++ generalized-Louvain shipped as a git submodule uses the global C rand(), which is not reproducible across runs or processes. Notes for the fix — most of it is in the submodule (michaelschaub/generalizedLouvain), not here:

Where the non-determinism lives

  • generalizedLouvain/CPP/cliques/louvain_gen.h (upstream): uses rand() inside random_shuffle / loop reorderings. The srand(std::time(0)) calls in the header are commented out, so the engine state is whatever the process inherits.
  • generalizedLouvain/CPP/cliques/run_gen_louvain.cpp: only seeds global rand() from the CLI main.
  • src/pygenstability/generalized_louvain/generalized_louvain.cpp:135 (this repo): per-call std::srand(seed) with seed supplied from Python. Correct intent, but globally-mutable state and any rand() call from another thread or nested library invalidates it.
  • src/pygenstability/pygenstability.py:392: Python draws np.random.randint(1e8) per optimisation. With multiprocessing.Pool, worker task ordering is non-deterministic → different seeds arrive at different tries across runs even with np.random.seed(42) at the top of the test.

Proposed fix (three coordinated changes)

  1. Submodule (michaelschaub/generalizedLouvain) — plumb an engine through. Replace every rand() / random_shuffle in louvain_gen.h (and any helpers it calls) with a std::mt19937& rng passed by reference:

    // old
    std::random_shuffle(nodes.begin(), nodes.end());
    // new
    std::shuffle(nodes.begin(), nodes.end(), rng);

    Expose an explicit louvain_gen(std::mt19937& rng, ...) constructor or a set_rng(std::mt19937&) method. No more srand / rand calls in the library.

  2. This repo's binding (src/pygenstability/generalized_louvain/generalized_louvain.cpp): construct a local std::mt19937 rng(seed) from the incoming seed parameter, pass it into the submodule's API, and delete the std::srand(seed) line. No global state touched.

  3. Python side (src/pygenstability/pygenstability.py): make seed assignment deterministic per try index. Instead of np.random.randint(1e8) at call time, thread a per-try seed derived from a base seed (accept seed or rng as a new run() argument; default to None → current behaviour for backward compat):

    rng = np.random.default_rng(seed)
    seeds = rng.integers(0, 2**31 - 1, size=n_tries)
    # pass seeds[i] to the i-th _optimise call

    This way multiprocessing.Pool's task-ordering nondeterminism doesn't leak into results — each try always gets the same seed regardless of which worker picks it up.

Revert the temporary test fix

Once the three changes land, restore the strict asserts in tests/test_pygenstability.py:

-# TODO: unstable C++ Louvain RNG produces non-reproducible outputs across runs.
-# Re-enable exact comparison once the RNG is made deterministic.
-assert set(expected_results) == set(results)
+assert len(list(diff(expected_results, results, tolerance=1e-5))) == 0

and put the from dictdiffer import diff import back.


Out of scope (intentionally)

  • Large authored notebook examples/real_examples/protein/Example_protein.ipynb (~5.4 MB) kept as-is — it's authored content, not a build artifact.

claude added 8 commits April 19, 2026 19:12
- Add CLAUDE.md with repo orientation for Claude Code sessions
- Replace print() with module logger in optional-dep fallback paths
  (pygenstability.py, plotting.py, contrib/sankey.py)
- Declare python_requires=">=3.10" in setup.py to match the tox matrix
Replace black, pylint, pycodestyle, pydocstyle, and isort with ruff
for both linting (ruff check) and formatting (ruff format).

- Add [tool.ruff] config to pyproject.toml, translating the previous
  ignore lists, google docstring convention, and single-line import
  style.
- Simplify tox.ini lint/format envs to a single ruff dependency.
- Remove .pylintrc and the pycodestyle/pydocstyle/isort sections in
  tox.ini (now all expressed in pyproject.toml).
- Apply two minor ruff-format diffs in constructors.py and
  data_clustering.py (multi-line assert messages, blank-line cleanup).
Move all project metadata (name, version, authors, description,
dependencies, optional-dependencies, console script, package discovery)
into pyproject.toml's [project] and [tool.setuptools] tables. setup.py
now only declares the optional pybind11 C++ extension, which still
needs dynamic setup() for the build step.

No functional change: the C++ extension still builds, imports still
resolve, and the pygenstability console script is still installed.
Annotate the five publicly re-exported functions in
pygenstability/__init__.py:

- run (pygenstability.py)
- evaluate_NVI (pygenstability.py)
- identify_optimal_scales (optimal_scales.py)
- save_results / load_results (io.py)

Uses `from __future__ import annotations` to keep forward refs cheap
and avoid pulling in stricter typing libraries. Types intentionally
loose where the public contract is flexible (Any for the graph
argument, dict for the nested results structure).
The C++ Louvain extension's RNG is not deterministic across runs, so
the dictdiffer-based exact comparisons against stored YAML fixtures
flake on CI (seen on Python 3.11). Temporarily replace the value
comparison with a top-level key-set check — the call paths still run
for coverage, but we no longer assert on the randomized numeric
outputs.

TODO in code points back here; re-enable strict comparison once the
C++ RNG is seeded deterministically.
Tests and example scripts write results.pkl to CWD by default; keep
these local artifacts out of the repo.
- Fix 'where used without out' UserWarnings at constructors.py:404,459
  by pre-allocating the zero output for np.divide (numpy 2.x made this
  strict). Dangling nodes still end up with dinv==0 as downstream logic
  expects.

- Use v.T instead of la.inv(v) in _compute_spectral_decomp. la.eigh
  returns orthonormal eigenvectors for symmetric matrices, so the
  transpose is the exact inverse and O(n^2) instead of O(n^3). Resolves
  the long-standing TODO at constructors.py:58.

- Drop the 720KB auto-generated pygenstability_doc.pdf from the repo;
  add to .gitignore. Sphinx regenerates it on demand via `tox -e docs`.
  README no longer links to the stale PDF; the hosted HTML docs cover
  the same content.

- Update Homepage URL in pyproject.toml and clone commands in README.md
  / docs/index_readme.md from ImperialCollegeLondon/PyGenStability to
  barahona-research-group/PyGenStability (where CI badges and issues
  already live).
Extract three private functions from the 150-line run() body:

- _resolve_exp_comp_mode validates the exp_comp_mode argument and
  forces expm for directed / signed constructors.
- _scan_scales runs the per-scale optimisation loop and builds the
  aggregated results dict.
- _finalize_results applies postprocessing, computes ttprimes, and
  runs optimal-scale selection.

run() itself now reads as a straightforward script: validate, load
constructor, scan scales, finalize. No behavioural change — public
signature and returned dict are identical.
@arnaudon arnaudon changed the title Add CLAUDE.md and apply minor cleanups Modernize tooling, migrate metadata, refactor run() Apr 19, 2026
claude and others added 3 commits April 19, 2026 19:57
- Add upper bounds (numpy<3, scipy<2) so pip rejects future major
  releases we haven't tested against. Lower bounds unchanged.
- Silence two benign RuntimeWarnings surfaced by the test suite:
  * constructors.py signed_modularity: wrap the per-sign
    normalisation in np.errstate(invalid="ignore"); NaN output is
    still produced (and tolerated by the fixtures) when the graph has
    only positive / only negative edges.
  * optimal_scales._pool2d_nvi: scope the 'Mean of empty slice'
    warning around np.nanmean where NaN padding legitimately
    produces empty windows at the matrix borders.
- Add a standalone `lint` job to run-tox.yml that runs `ruff check`
  and `ruff format --check`. Fast feedback that doesn't get cancelled
  by a flaky test job.
- _assign_increasing_ids: replace np.vectorize(dict.get) (Python-level
  loop) with a pure numpy indexing scheme using np.unique's inverse +
  an argsort-based relabel. ~2-3x faster on the hot per-scale path.

- __init__.py: replace star imports with explicit named imports from
  each submodule; add __all__ listing the full public surface. Drops
  the corresponding ruff per-file ignore (__init__.py no longer needs
  F401/F403 relaxation).

- Add py.typed marker so downstream users / type checkers pick up the
  inline type hints on the public API. Included via MANIFEST.in.

- Add .pre-commit-config.yaml with ruff check + ruff format and a
  handful of standard hygiene hooks (trailing whitespace, EOF,
  yaml/toml syntax, large-file guard at 1 MB).
Replaces the global C rand() / std::random_shuffle path with an explicit
std::mt19937& threaded through the C++ optimiser, and removes the leak that
let multiprocessing.Pool ordering bleed into per-try seeds.

Three coordinated changes:

1. Submodule (arnaudon/generalizedLouvain @ fix_rng): rewrites
   find_optimal_partition_louvain[_gen] to take a std::mt19937& parameter and
   use std::shuffle. Submodule pointer + .gitmodules updated to track the
   fork branch until the change merges upstream.

2. Binding (generalized_louvain.cpp): drops std::srand(seed); constructs a
   local std::mt19937 from the incoming seed and passes it explicitly, so no
   global state is touched.

3. Python (_optimise / _run_optimisations): pre-generates per-try seeds in
   the parent process via np.random.randint, then dispatches via
   pool.starmap so each try always gets the same seed regardless of which
   worker picks it up. _optimise now takes seed as a positional argument
   instead of drawing one inside the worker.

With this in place, pgs.run(...) produces bit-identical results across
re-runs, processes, and worker counts for a given numpy seed.

Restores the strict dictdiffer-based asserts in tests/test_pygenstability.py
that 04c73ee had loosened, regenerates tests/data/test_run_gap.yaml against
the new RNG path, and stops tests/data/test_run_times.yaml from being
trivially overwritten on every test invocation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@arnaudon

Copy link
Copy Markdown
Owner Author

Moved to barahona-research-group#108 (same branch, retargeted at upstream master).

@arnaudon arnaudon closed this Apr 28, 2026
arnaudon pushed a commit that referenced this pull request May 28, 2026
michaelschaub/generalizedLouvain master now contains the std::mt19937
plumbing (PR #1) and the portable Fisher-Yates fix (PR barahona-research-group#2) we depended
on. Switch .gitmodules back from the arnaudon fork URL to upstream and
bump the submodule pointer to the upstream tip (8c9f1f2).

That tip also includes a Louvain speed-up (PR barahona-research-group#3) which keeps the
final stability value identical but reorders intermediate tie-breaking
on larger graphs, so test_dataclustering_default.yaml and
test_dataclustering_knn.yaml needed regeneration. The pygenstability-
side fixtures (barbell graph, 22 nodes) were unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants