Modernize tooling, migrate metadata, refactor run()#1
Closed
arnaudon wants to merge 11 commits into
Closed
Conversation
- Add CLAUDE.md with repo orientation for Claude Code sessions - Replace print() with module logger in optional-dep fallback paths (pygenstability.py, plotting.py, contrib/sankey.py) - Declare python_requires=">=3.10" in setup.py to match the tox matrix
Replace black, pylint, pycodestyle, pydocstyle, and isort with ruff for both linting (ruff check) and formatting (ruff format). - Add [tool.ruff] config to pyproject.toml, translating the previous ignore lists, google docstring convention, and single-line import style. - Simplify tox.ini lint/format envs to a single ruff dependency. - Remove .pylintrc and the pycodestyle/pydocstyle/isort sections in tox.ini (now all expressed in pyproject.toml). - Apply two minor ruff-format diffs in constructors.py and data_clustering.py (multi-line assert messages, blank-line cleanup).
Move all project metadata (name, version, authors, description, dependencies, optional-dependencies, console script, package discovery) into pyproject.toml's [project] and [tool.setuptools] tables. setup.py now only declares the optional pybind11 C++ extension, which still needs dynamic setup() for the build step. No functional change: the C++ extension still builds, imports still resolve, and the pygenstability console script is still installed.
Annotate the five publicly re-exported functions in pygenstability/__init__.py: - run (pygenstability.py) - evaluate_NVI (pygenstability.py) - identify_optimal_scales (optimal_scales.py) - save_results / load_results (io.py) Uses `from __future__ import annotations` to keep forward refs cheap and avoid pulling in stricter typing libraries. Types intentionally loose where the public contract is flexible (Any for the graph argument, dict for the nested results structure).
The C++ Louvain extension's RNG is not deterministic across runs, so the dictdiffer-based exact comparisons against stored YAML fixtures flake on CI (seen on Python 3.11). Temporarily replace the value comparison with a top-level key-set check — the call paths still run for coverage, but we no longer assert on the randomized numeric outputs. TODO in code points back here; re-enable strict comparison once the C++ RNG is seeded deterministically.
Tests and example scripts write results.pkl to CWD by default; keep these local artifacts out of the repo.
- Fix 'where used without out' UserWarnings at constructors.py:404,459 by pre-allocating the zero output for np.divide (numpy 2.x made this strict). Dangling nodes still end up with dinv==0 as downstream logic expects. - Use v.T instead of la.inv(v) in _compute_spectral_decomp. la.eigh returns orthonormal eigenvectors for symmetric matrices, so the transpose is the exact inverse and O(n^2) instead of O(n^3). Resolves the long-standing TODO at constructors.py:58. - Drop the 720KB auto-generated pygenstability_doc.pdf from the repo; add to .gitignore. Sphinx regenerates it on demand via `tox -e docs`. README no longer links to the stale PDF; the hosted HTML docs cover the same content. - Update Homepage URL in pyproject.toml and clone commands in README.md / docs/index_readme.md from ImperialCollegeLondon/PyGenStability to barahona-research-group/PyGenStability (where CI badges and issues already live).
Extract three private functions from the 150-line run() body: - _resolve_exp_comp_mode validates the exp_comp_mode argument and forces expm for directed / signed constructors. - _scan_scales runs the per-scale optimisation loop and builds the aggregated results dict. - _finalize_results applies postprocessing, computes ttprimes, and runs optimal-scale selection. run() itself now reads as a straightforward script: validate, load constructor, scan scales, finalize. No behavioural change — public signature and returned dict are identical.
- Add upper bounds (numpy<3, scipy<2) so pip rejects future major
releases we haven't tested against. Lower bounds unchanged.
- Silence two benign RuntimeWarnings surfaced by the test suite:
* constructors.py signed_modularity: wrap the per-sign
normalisation in np.errstate(invalid="ignore"); NaN output is
still produced (and tolerated by the fixtures) when the graph has
only positive / only negative edges.
* optimal_scales._pool2d_nvi: scope the 'Mean of empty slice'
warning around np.nanmean where NaN padding legitimately
produces empty windows at the matrix borders.
- Add a standalone `lint` job to run-tox.yml that runs `ruff check`
and `ruff format --check`. Fast feedback that doesn't get cancelled
by a flaky test job.
- _assign_increasing_ids: replace np.vectorize(dict.get) (Python-level loop) with a pure numpy indexing scheme using np.unique's inverse + an argsort-based relabel. ~2-3x faster on the hot per-scale path. - __init__.py: replace star imports with explicit named imports from each submodule; add __all__ listing the full public surface. Drops the corresponding ruff per-file ignore (__init__.py no longer needs F401/F403 relaxation). - Add py.typed marker so downstream users / type checkers pick up the inline type hints on the public API. Included via MANIFEST.in. - Add .pre-commit-config.yaml with ruff check + ruff format and a handful of standard hygiene hooks (trailing whitespace, EOF, yaml/toml syntax, large-file guard at 1 MB).
Replaces the global C rand() / std::random_shuffle path with an explicit std::mt19937& threaded through the C++ optimiser, and removes the leak that let multiprocessing.Pool ordering bleed into per-try seeds. Three coordinated changes: 1. Submodule (arnaudon/generalizedLouvain @ fix_rng): rewrites find_optimal_partition_louvain[_gen] to take a std::mt19937& parameter and use std::shuffle. Submodule pointer + .gitmodules updated to track the fork branch until the change merges upstream. 2. Binding (generalized_louvain.cpp): drops std::srand(seed); constructs a local std::mt19937 from the incoming seed and passes it explicitly, so no global state is touched. 3. Python (_optimise / _run_optimisations): pre-generates per-try seeds in the parent process via np.random.randint, then dispatches via pool.starmap so each try always gets the same seed regardless of which worker picks it up. _optimise now takes seed as a positional argument instead of drawing one inside the worker. With this in place, pgs.run(...) produces bit-identical results across re-runs, processes, and worker counts for a given numpy seed. Restores the strict dictdiffer-based asserts in tests/test_pygenstability.py that 04c73ee had loosened, regenerates tests/data/test_run_gap.yaml against the new RNG path, and stops tests/data/test_run_times.yaml from being trivially overwritten on every test invocation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5 tasks
Owner
Author
|
Moved to barahona-research-group#108 (same branch, retargeted at upstream master). |
arnaudon
pushed a commit
that referenced
this pull request
May 28, 2026
michaelschaub/generalizedLouvain master now contains the std::mt19937 plumbing (PR #1) and the portable Fisher-Yates fix (PR barahona-research-group#2) we depended on. Switch .gitmodules back from the arnaudon fork URL to upstream and bump the submodule pointer to the upstream tip (8c9f1f2). That tip also includes a Louvain speed-up (PR barahona-research-group#3) which keeps the final stability value identical but reorders intermediate tie-breaking on larger graphs, so test_dataclustering_default.yaml and test_dataclustering_knn.yaml needed regeneration. The pygenstability- side fixtures (barbell graph, 22 nodes) were unaffected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
One PR, several logically-distinct commits. All changes below are behaviour-preserving unless explicitly called out.
Tooling & metadata
CLAUDE.mdat repo root — orientation for Claude Code sessions (layout, key modules, build/test commands, C++ extension / BLAS-thread gotchas).ruff. Replacesblack,pylint,pycodestyle,pydocstyle,isort.tox -e lintis nowruff check+ruff format --check;tox -e formatisruff check --fix+ruff format..pylintrcremoved; rules translated into[tool.ruff]inpyproject.toml.setup.py→pyproject.toml(PEP 621). Name, version, authors, description, dependencies, optional-dependencies, console script, package discovery all now live inpyproject.toml.setup.pyretains only the pybind11 extension declaration.numpy<3,scipy<2) so pip rejects future major releases we haven't tested against.print()forloggingin the three optional-dep fallback paths (pygenstability.py,plotting.py,contrib/sankey.py). The only survivingprint(insrc/is the intentional CSV write in the_timingdecorator.python_requires=">=3.10"— matches the tox matrix.Code quality
run,evaluate_NVI,identify_optimal_scales,save_results,load_results. Usesfrom __future__ import annotationsto keep forward refs cheap.run()(150 lines → ~35). Extracted_resolve_exp_comp_mode,_scan_scales,_finalize_results. Public signature unchanged.np.divide(... where=...)UserWarnings atconstructors.py:404,459by pre-allocating the zero output buffer — numpy 2.x is strict about reading from uninitialised slots. Dangling nodes still end up withdinv==0as downstream code expects.constructors.py(signed_modularityper-sign normalisation —np.errstate) andoptimal_scales._pool2d_nvi(Mean of empty slicearound NaN padding —warnings.catch_warnings). No numerical change.constructors.py:58: usev.Tinstead ofla.inv(v).scipy.linalg.eighreturns orthonormal eigenvectors for Hermitian matrices, so the transpose is the exact inverse and O(n²) instead of O(n³).CI
lintjob to.github/workflows/run-tox.yml. Runsruff check+ruff format --checkon 3.10 — fast feedback that doesn't get cancelled by a flaky test job.Repo hygiene
pygenstability_doc.pdf(~720 KB) — auto-generated Sphinx artifact. Added to.gitignore; README now links only to the hosted HTML docs. Regenerate on release viatox -e docs.results.pklto.gitignore— it's a run artifact produced by the defaultrun()call.pyproject.tomland clone commands inREADME.md/docs/index_readme.md:ImperialCollegeLondon/PyGenStability→barahona-research-group/PyGenStability(where the CI badges and issues already live).Test suite
tests/test_pygenstability.py. Thedictdiffer-based exact comparison against stored YAMLs flakes on CI because the C++ Louvain RNG is non-deterministic (see below). Replaced with a top-level key-set check; the code paths still run for coverage. ATODOin the file points back to this PR.Test plan
tox -e lintpasses (ruff check+ruff format --check)toxon 3.10 / 3.11 / 3.12 passes on CIimport pygenstability; pygenstability.runresolves;pygenstability --helpworks (with extras installed)Follow-up: make the Louvain RNG deterministic (submodule work)
The temp-loosened tests are the visible symptom of an upstream problem: the C++ generalized-Louvain shipped as a git submodule uses the global C
rand(), which is not reproducible across runs or processes. Notes for the fix — most of it is in the submodule (michaelschaub/generalizedLouvain), not here:Where the non-determinism lives
generalizedLouvain/CPP/cliques/louvain_gen.h(upstream): usesrand()insiderandom_shuffle/ loop reorderings. Thesrand(std::time(0))calls in the header are commented out, so the engine state is whatever the process inherits.generalizedLouvain/CPP/cliques/run_gen_louvain.cpp: only seeds globalrand()from the CLI main.src/pygenstability/generalized_louvain/generalized_louvain.cpp:135(this repo): per-callstd::srand(seed)withseedsupplied from Python. Correct intent, but globally-mutable state and anyrand()call from another thread or nested library invalidates it.src/pygenstability/pygenstability.py:392: Python drawsnp.random.randint(1e8)per optimisation. Withmultiprocessing.Pool, worker task ordering is non-deterministic → different seeds arrive at different tries across runs even withnp.random.seed(42)at the top of the test.Proposed fix (three coordinated changes)
Submodule (
michaelschaub/generalizedLouvain) — plumb an engine through. Replace everyrand()/random_shuffleinlouvain_gen.h(and any helpers it calls) with astd::mt19937& rngpassed by reference:Expose an
explicit louvain_gen(std::mt19937& rng, ...)constructor or aset_rng(std::mt19937&)method. No moresrand/randcalls in the library.This repo's binding (
src/pygenstability/generalized_louvain/generalized_louvain.cpp): construct a localstd::mt19937 rng(seed)from the incomingseedparameter, pass it into the submodule's API, and delete thestd::srand(seed)line. No global state touched.Python side (
src/pygenstability/pygenstability.py): make seed assignment deterministic per try index. Instead ofnp.random.randint(1e8)at call time, thread a per-try seed derived from a base seed (acceptseedorrngas a newrun()argument; default toNone→ current behaviour for backward compat):This way
multiprocessing.Pool's task-ordering nondeterminism doesn't leak into results — each try always gets the same seed regardless of which worker picks it up.Revert the temporary test fix
Once the three changes land, restore the strict asserts in
tests/test_pygenstability.py:and put the
from dictdiffer import diffimport back.Out of scope (intentionally)
examples/real_examples/protein/Example_protein.ipynb(~5.4 MB) kept as-is — it's authored content, not a build artifact.