GSoC-2026: GWPCA Prototyped #125

Open
FirePheonix wants to merge 11 commits into pysal:main from FirePheonix:string-type-array-to-string

@FirePheonix FirePheonix commented Jun 6, 2026

Contributor
  • This pull request is the proposed core prototype of GWPCA.
  • As suggested by @martinfleis, I'm putting up a working version of GWPCA with a sample notebook, tested on my local machine.
  • This PR completes the rough proposed work (I've completed up to week 3 of the proposal). Timeline-wise we're on week 2 (the timeline can be adjusted at any time), but I believe it would be great to discuss improvements and documentation alongside the code changes over the next 10 days (suggestion -> I improve -> suggestion -> I improve).
  • Link to project details: https://summerofcode.withgoogle.com/programs/2026/projects/m0yTcdhT
  • Link to proposal: https://docs.google.com/document/d/1GYpOwPdoAJVFKlnCuoizQbmdBygtZnIWIoijEeaI-04/edit?usp=sharing
  • I would appreciate opinions on whether the covariance matrix code should go in libpysal or gwlearn, since it might not be needed by any other library (or maybe covariance matrix building already exists in some library and I could import it directly).
  • Current status: verify whether the proposed code logic is suitable, and whether the proposed tests and .ipynb notebook give correct results.
  • I hope to make enough progress that this can be discussed at Thursday's PySAL dev meeting on Discord.

@codecov

codecov Bot commented Jun 6, 2026


Codecov Report

❌ Patch coverage is 83.65759% with 42 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.55%. Comparing base (a0694d7) to head (97cfb8c).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| spatialml/decomposition/pca.py | 73.17% | 33 Missing ⚠️ |
| spatialml/decomposition/_base.py | 92.85% | 6 Missing ⚠️ |
| spatialml/base.py | 91.66% | 2 Missing ⚠️ |
| spatialml/search.py | 95.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #125      +/-   ##
==========================================
- Coverage   92.39%   90.55%   -1.84%     
==========================================
  Files           6        9       +3     
  Lines         881     1112     +231     
==========================================
+ Hits          814     1007     +193     
- Misses         67      105      +38     

@FirePheonix

Contributor Author

I'll fix the pre-commit failures and add some proper documentation in the meantime.

@martinfleis

Member

Thanks! Wait with the documentation until the implementation is stable to avoid additional work.

I recommend installing the pre-commit hook to avoid the linting issues.

I've skimmed it and found some places where this could be simplified, but I will need to do a proper pass when time permits.

@martinfleis martinfleis left a comment

Member

The first batch of suggestions/comments.

  1. I would like to understand whether we need the weighted_covariance function at all. You know that I was hesitant about it during the call last week. Seeing it now, I am wondering if we could just rely on numpy.cov, drop the entire function, and use groupby.apply directly within the gwpca implementation. The less code we maintain the better, and if we don't have to reimplement these basic components, let's not do so.
  2. There are many tests, none of which tests actual correctness. If I replaced the weighted_covariance function with a random array, all tests would pass. Make sure there are tests that verify numerical stability, so that any change to the code is tested against expected values, not just shapes etc. I tried to quickly use numpy.cov, but without manual verification I get no feedback from the test suite.
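
A minimal sketch of both points, assuming the local covariance is normalised by the sum of the weights (the PR's weighted_covariance may normalise differently). numpy.cov accepts per-observation weights through aweights, and a small fixture that can be checked on paper pins the result to expected values rather than shapes. The helper name weighted_cov and the toy data are illustrative, not taken from the PR.

```python
import numpy as np


def weighted_cov(X: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Weighted covariance of the columns of X, normalised by sum(w)."""
    # aweights applies one weight per observation; ddof=0 divides by sum(w).
    return np.cov(X, rowvar=False, aweights=w, ddof=0)


# Fixture small enough to verify by hand: the weighted mean is (2, 1),
# so the centred rows are (-2, -1), (0, 1), (2, -1).
X = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 0.0]])
w = np.array([1.0, 2.0, 1.0])
expected = np.array([[2.0, 0.0], [0.0, 1.0]])

result = weighted_cov(X, w)
np.testing.assert_allclose(result, expected, atol=1e-12)
```

A test written this way fails as soon as the covariance routine is swapped for something that only has the right shape.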

Comment thread gwlearn/base.py Outdated
if self.batch_size:
training_output = []
- num_groups = len(y)
+ num_groups = len(y) if y is not None else len(X)

Member

Suggested change
- num_groups = len(y) if y is not None else len(X)
+ num_groups = len(X)

The condition is pointless here.

Comment thread spatialml/base.py
supervised baseline to fit.
"""
if y is None:
return

Member

Why? Is there a reason we don't fit a global PCA in a similar way to how we fit global estimators?
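
For reference, a global baseline of the kind asked about could be as small as the following sketch. It is an illustration under assumptions, not the package's code: global_pca is a hypothetical helper, the data are random, and the components are returned in the (n_components, n_features) orientation that global PCA uses.

```python
import numpy as np


def global_pca(X: np.ndarray, n_components: int):
    """Unweighted PCA via eigendecomposition of the sample covariance."""
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]  # largest first
    components = eigvecs[:, order].T  # shape (n_components, n_features)
    ratio = eigvals[order] / eigvals.sum()  # explained variance ratio
    return components, ratio


# Toy data with very different variances per feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)) * np.array([3.0, 1.0, 0.1])
components, ratio = global_pca(X, n_components=2)
```

Fitting this once on the full data would give every local result a global reference to be compared against.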

Comment thread gwlearn/base.py Outdated
"""
# Length checks
- if len(X) != len(y):
+ if y is not None and len(X) != len(y):

Member

This is insufficient. We should check y based not on its value but on what the model needs. So if we're using an estimator, we should verify y even if the user passes None by mistake. This is a lazy shortcut :).

Contributor Author

hmmm. that's actually a whole new edge case, i'll improve the code
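
One way to check y "based on the needs" is a class-level flag that says whether the model is supervised. The sketch below is hypothetical: the class names and the _validate method are made up for illustration, and the flag name _requires_y is an assumption.

```python
class SketchBase:
    # Hypothetical class-level flag: does this model need a target?
    _requires_y = True

    def _validate(self, X, y=None):
        if not self._requires_y:
            return  # decompositions never look at y
        if y is None:
            # Caught even when the user passes None by mistake.
            raise ValueError(f"{type(self).__name__} requires y, got None.")
        if len(X) != len(y):
            raise ValueError("X and y must have the same length.")


class SketchRegressor(SketchBase):
    _requires_y = True


class SketchDecomposition(SketchBase):
    _requires_y = False
```

With this shape, the validation follows the kind of model rather than whatever value happened to be passed for y.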

Comment thread gwlearn/base.py Outdated
verbose: bool = False,
**kwargs,
):
# No wrapped supervised model: pass ``None`` — decomposition has no ``y`` to fit against.

Member

I don't understand this comment in relation to the code.

Comment thread gwlearn/base.py Outdated
**kwargs,
):
# No wrapped supervised model: pass ``None`` — decomposition has no ``y`` to fit against.
kwargs.pop("strict", None)

Member

Why are you popping strict? Where does the assumption that it is there come from?

Comment thread gwlearn/base.py Outdated
return np.array(results)


class BaseDecomposition(TransformerMixin, _BaseModel):

Member

I would prefer to have this somewhere else. Possibly in the decomposition module directly. You can turn decomposition into a folder with base and pca submodules.

Comment thread gwlearn/search.py Outdated
"""IC metric names included automatically when the model supports them."""
- return ["aicc", "aic", "bic"] if self._supports_ic else []
+ metrics = ["aicc", "aic", "bic"] if self._supports_ic else []
+ if self.criterion == "cv_score":

Member

ic_metrics stands for information criterion. The CV score is not one, so we should not deal with it here.

Comment thread gwlearn/search.py Outdated
met = self._ic_metrics.copy()
if self.metrics is not None:
met += self.metrics
if self.criterion == "cv_score" and "cv_score" not in met:

Member

You deal with it here, so the code above is not needed at all, is it?
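
The separation asked for in these two comments could look roughly like this. It is a stand-alone sketch: SketchSearch and _metrics_to_report are hypothetical names, and the append on the last branch is assumed, since the quoted hunk is cut off before that line.

```python
class SketchSearch:
    def __init__(self, criterion="aicc", metrics=None, supports_ic=True):
        self.criterion = criterion
        self.metrics = metrics
        self._supports_ic = supports_ic

    @property
    def _ic_metrics(self):
        """Information criteria only; nothing about cross-validation."""
        return ["aicc", "aic", "bic"] if self._supports_ic else []

    def _metrics_to_report(self):
        met = self._ic_metrics.copy()
        if self.metrics is not None:
            met += self.metrics
        # The CV score is handled once, here, outside the IC property.
        if self.criterion == "cv_score" and "cv_score" not in met:
            met.append("cv_score")
        return met
```

The property keeps its original one-line body, and the criterion check lives in a single place.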

Comment thread spatialml/search.py
X=X,
y=y,
geometry=self.geometry,
**({"cv": True} if "cv_score" in met else {}),

Member

This is a very opaque line. At least add a comment explaining it.
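
A less opaque spelling of the same idea is to build the keyword arguments in a named dictionary and add the optional key with an ordinary if. The helper below is a hypothetical illustration; only the X, y, geometry and cv keywords come from the quoted hunk.

```python
def build_call_kwargs(X, y, geometry, met):
    """Collect keyword arguments for the model call (illustrative helper)."""
    call_kwargs = {"X": X, "y": y, "geometry": geometry}
    # Cross-validation is only requested when a CV score has to be reported.
    if "cv_score" in met:
        call_kwargs["cv"] = True
    return call_kwargs


with_cv = build_call_kwargs("X", "y", "geom", ["aicc", "cv_score"])
without_cv = build_call_kwargs("X", "y", "geom", ["aicc"])
```

The resulting dictionary can then be unpacked with ** at the call site, which behaves the same as the inline conditional expression.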

@FirePheonix

Contributor Author

Alright, I'll make the suggested changes and get back.

@martinfleis martinfleis left a comment

Member

  1. I don't like the format of GWPCA.components_. For any dataset with an index other than RangeIndex, it makes it cumbersome to link local components to their focal geometry. Also, indexing like gwpca.components_[:, 2, 0] is a pain, as you constantly have to keep a mental model of what those unlabelled dimensions of that numpy array mean. Also, the orientation is different from that of global PCA, despite your documentation. Global PCA uses (n_components, n_features), you used (n_samples, n_features, n_components), but the notebook claims (n_samples, n_components, n_features) (the docstring is correct). It is all just too confusing. I am not sure what it should look like though; probably something to discuss tomorrow (cc @sjsrey). Some way of shaping this as pandas objects would likely be preferable.

  2. explained_variance_ratio_ is better, as the array is simple, but given that the package is GeoPandas-oriented and all estimators return pandas objects, I think this should be a properly labelled DataFrame.

  3. Once again, it is unclear to me why we don't fit a global model baseline as we do in regressions.
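
To make points 1 and 2 concrete, here is one possible labelled layout, offered only as a starting point for the discussion. Everything in it is an assumption: the index, the feature names, the PC1/PC2 labels and the random values are toy stand-ins, and the raw array uses the (n_samples, n_features, n_components) layout described above.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for a fitted model (not the PR's data).
index = pd.Index(["a", "b", "c", "d"], name="location")
features = ["f1", "f2", "f3"]
pcs = ["PC1", "PC2"]

rng = np.random.default_rng(0)
raw = rng.normal(size=(len(index), len(features), len(pcs)))

# Reorder each local block to (n_components, n_features), as in global PCA.
by_component = raw.transpose(0, 2, 1)

rows = pd.MultiIndex.from_product([index, pcs], names=["location", "component"])
components = pd.DataFrame(
    by_component.reshape(-1, len(features)), index=rows, columns=features
)

# Labelled access replaces positional indexing such as raw[:, 1, 0].
pc1 = components.xs("PC1", level="component")  # one row per location
local_b = components.loc["b"]  # (n_components, n_features) at location "b"

# Made-up ratios, only to show the labelled shape of the second attribute.
explained = pd.DataFrame(
    np.array([[0.7, 0.2], [0.6, 0.3], [0.8, 0.1], [0.5, 0.4]]),
    index=index,
    columns=pcs,
)
```

Because the original index is carried through, each block of rows can be joined back to its focal geometry by label.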

I did not have enough time to check the bandwidth selection code, but that is secondary anyway.

One other note - I'd like to compare our results to those from the {GWmodel::gwpca} R implementation. It is a reference we should match.

A side note - when pushing a bunch of commits, it would be good if you added a short comment summarising what you have done.

Comment thread spatialml/decomposition/_base.py Outdated
Comment on lines +109 to +112
# ``strict`` is accepted so that BandwidthSearch (which passes it to
# every model it creates) does not raise a TypeError. Decompositions
# have no notion of invariant y, so the value is always ignored.
strict: bool | None = False, # noqa: ARG002

Member

You should deal with this within BandwidthSearch, not here. search should detect that it is used for decomposition and ignore strict keyword there.
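
A sketch of what handling this on the search side might look like, assuming models expose a class-level flag that says whether they are supervised. The flag name _requires_y, the helper model_kwargs and both classes are hypothetical.

```python
def model_kwargs(model_cls, strict=False, **kwargs):
    """Build constructor kwargs, forwarding strict only to supervised models."""
    # strict concerns invariant y, so it only makes sense when y is used.
    if getattr(model_cls, "_requires_y", True):
        kwargs["strict"] = strict
    return kwargs


class SupervisedSketch:
    _requires_y = True


class DecompositionSketch:
    _requires_y = False


supervised = model_kwargs(SupervisedSketch, strict=True, bandwidth=5)
decomposition = model_kwargs(DecompositionSketch, strict=True, bandwidth=5)
```

The decomposition class then never sees the keyword, so it needs neither a dummy parameter nor a pop.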

Comment thread spatialml/decomposition/_base.py Outdated
Comment on lines +76 to +77
scores_ : numpy.ndarray
Focal-point projections, shape ``(n_locations, n_components)``.

Member

I get it that this terminology comes from the R implementation but we should also use the sklearn terminology here, which uses "Transformed values."

Comment thread spatialml/decomposition/_base.py Outdated
Comment on lines +78 to +79
local_means_ : numpy.ndarray
Weighted local means, shape ``(n_locations, n_features)``.

Member

It is unclear from the description what this is.

Member

I understand from the code that it is the weighted local mean of X. Why are we reporting it?

Comment thread spatialml/decomposition/_base.py Outdated
X: pd.DataFrame,
geometry: gpd.GeoSeries | None = None,
) -> np.ndarray:
"""Project ``X`` onto local components via nearest-neighbour lookup.

Member

I think this should take the logic of predict in estimators, not just a nearest-neighbor. Eventually.
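
For context, the nearest-neighbour lookup named in the docstring can be sketched as follows. This is a plain NumPy illustration under assumptions, not the PR's implementation: the function name, the argument names and the (n_locations, n_components, n_features) layout of the components are chosen for the example.

```python
import numpy as np


def nn_transform(X_new, coords_new, coords_fit, local_means, components):
    """Project each new row with the model of its nearest fitted location.

    components is assumed to have shape (n_locations, n_components, n_features).
    """
    # Squared distances between every new point and every fitted location.
    d2 = ((coords_new[:, None, :] - coords_fit[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    # Centre against the local mean of the matched location.
    centred = X_new - local_means[nearest]
    # out[i, k] = sum over f of centred[i, f] * components[nearest[i], k, f]
    return np.einsum("if,ikf->ik", centred, components[nearest])


coords_fit = np.array([[0.0, 0.0], [10.0, 0.0]])
local_means = np.array([[1.0, 1.0], [5.0, 5.0]])
components = np.array([[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]])
X_new = np.array([[2.0, 3.0], [7.0, 6.0]])
coords_new = np.array([[1.0, 0.0], [9.0, 1.0]])

projected = nn_transform(X_new, coords_new, coords_fit, local_means, components)
```

Each new observation borrows exactly one local model here, which is the limitation the comment points at.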

Comment thread spatialml/decomposition/pca.py Outdated
Comment thread spatialml/ensemble.py Outdated
- self, X: pd.DataFrame, y: pd.Series, geometry: gpd.GeoSeries | None = None
+ self,
+ X: pd.DataFrame,
+ y: pd.Series | None = None,

Member

Suggested change
- y: pd.Series | None = None,
+ y: pd.Series,

Comment thread spatialml/ensemble.py Outdated
- self, X: pd.DataFrame, y: pd.Series, geometry: gpd.GeoSeries | None = None
+ self,
+ X: pd.DataFrame,
+ y: pd.Series | None = None,

Member

Suggested change
- y: pd.Series | None = None,
+ y: pd.Series,

Comment thread spatialml/ensemble.py Outdated
- self, X: pd.DataFrame, y: pd.Series, geometry: gpd.GeoSeries | None = None
+ self,
+ X: pd.DataFrame,
+ y: pd.Series | None = None,

Member

Suggested change
- y: pd.Series | None = None,
+ y: pd.Series,

Comment thread spatialml/linear_model.py Outdated
def fit(
self,
X: pd.DataFrame,
y: pd.Series | None = None,

Member

Suggested change
- y: pd.Series | None = None,
+ y: pd.Series,

Comment thread spatialml/linear_model.py Outdated
def fit(
self,
X: pd.DataFrame,
y: pd.Series | None = None,

Member

Suggested change
- y: pd.Series | None = None,
+ y: pd.Series,

@FirePheonix

Contributor Author

Hey @martinfleis, I'm actually in the middle of refactoring all these places 😅
I was just looking at some weird comments I left during development and at the None = None removals.

"One other note - I'd like to compare our results to those from {GWmodel::gwpca} R implementation. It is a reference we should match."

Absolutely, I've thought of that too.
I'm also looking right now at the tests in that package specifically, since they would help directly.

Per reviewer feedback, remove the y: pd.Series | None = None
default from all concrete supervised estimators.  y is always
required for classification and regression; the None default was
wrong.  Also drop the now-redundant assert y is not None guard
in BaseClassifier.fit that was only needed while y was typed as
optional.

Affected classes:
- BaseClassifier.fit / BaseRegressor.fit (base.py)
- GWRandomForestClassifier, GWGradientBoostingClassifier,
  GWRandomForestRegressor, GWGradientBoostingRegressor (ensemble.py)
- GWLogisticRegression, GWLinearRegression (linear_model.py)
@FirePheonix

Contributor Author

8dfa9f6 — Enforce y: pd.Series in supervised fit() signatures

  • Removed | None = None from y in all 8 concrete supervised fit() methods
  • Dropped the redundant assert y is not None in BaseClassifier.fit

97cfb8c — Clean up decomposition API

  • Renamed _is_decomposition to _requires_y (generic: True for supervised models, False for decompositions)
  • Moved strict handling into BandwidthSearch._score; decompositions no longer accept it
  • Fixed _BaseModel.fit() stub: y: pd.Series (not optional)
  • Replaced assert y is not None in search.py with ValueError
  • Improved scores_ docstring: sklearn terminology ("Transformed values")
  • Clarified local_means_ docstring: weighted local mean of X, needed by transform()
  • Added .. note:: to transform() on nearest-neighbour limitation + future plan

Comment on lines +79 to +83
local_means_ : numpy.ndarray
Weighted local mean of ``X`` per focal location,
shape ``(n_locations, n_features)``. Stored so that
:meth:`transform` can centre new observations against the same
local mean before projecting onto the local components.

Member

That sounds more like a private thing than something exposed to a user.

Contributor Author

I agree.
I'm planning to refactor the docstrings across all the files I've changed right now.
And let's discuss:

I don't like the format of GWPCA.components_. For any dataset with an index other than RangeIndex, it makes it cumbersome to link local components to their focal geometry. Also, indexing like gwpca.components_[:, 2, 0] is a pain, as you constantly have to keep a mental model of what those unlabelled dimensions of that numpy array mean. Also, the orientation is different from that of global PCA, despite your documentation. Global PCA uses (n_components, n_features), you used (n_samples, n_features, n_components), but the notebook claims (n_samples, n_components, n_features) (the docstring is correct). It is all just too confusing. I am not sure what it should look like though; probably something to discuss tomorrow (cc sjsrey). Some way of shaping this as pandas objects would likely be preferable.

this in today's meeting. What I can think of: the output of the corresponding R package (and of the sklearn API too) is a 2D array, not 3D. Maybe I should multi-index it? I'm thinking of THIS...
