GSoC-2026: GWPCA Prototyped by FirePheonix · Pull Request #125 · pysal/spml

FirePheonix · 2026-06-06T20:43:01Z

This pull request is the proposed core prototype of GWPCA.
As suggested by @martinfleis, I'm putting up a working version of GWPCA with a sample working notebook, tested on my local machine.

This PR completes the rough proposed work (I've completed up to week 3 of the proposal). Timeline-wise we're on week 2 (I can adjust the timeline at any time), but I believe it would be great if we could discuss improvements and documentation alongside the code changes, gradually over the next 10 days (so: suggestion -> I improve -> suggestion -> improve).
Link to project details: https://summerofcode.withgoogle.com/programs/2026/projects/m0yTcdhT
Link to proposal: https://docs.google.com/document/d/1GYpOwPdoAJVFKlnCuoizQbmdBygtZnIWIoijEeaI-04/edit?usp=sharing
I would appreciate opinions on whether the covariance matrix code should go in libpysal or gwlearn, since it might not be required by any other library (or maybe covariance matrix building already exists in some library and I could import it directly).
Current status: verify whether the proposed code logic is suitable, and whether the proposed tests and .ipynb notebook give correct results.
I hope I can make enough progress on this that it can be discussed at Thursday's PySAL dev meeting on Discord.

codecov · 2026-06-06T20:45:26Z

Codecov Report

❌ Patch coverage is 83.65759% with 42 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.55%. Comparing base (a0694d7) to head (97cfb8c).
⚠️ Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
spatialml/decomposition/pca.py	73.17%	33 Missing ⚠️
spatialml/decomposition/_base.py	92.85%	6 Missing ⚠️
spatialml/base.py	91.66%	2 Missing ⚠️
spatialml/search.py	95.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #125      +/-   ##
==========================================
- Coverage   92.39%   90.55%   -1.84%     
==========================================
  Files           6        9       +3     
  Lines         881     1112     +231     
==========================================
+ Hits          814     1007     +193     
- Misses         67      105      +38

FirePheonix · 2026-06-06T20:46:41Z

I'll fix the pre-commit issues and add some proper documentation in the meantime.

martinfleis · 2026-06-07T12:53:04Z

Thanks! Wait with the documentation until the implementation is stable to avoid additional work.

I recommend installing the pre-commit hook to avoid the linting issues.

I've skimmed it and found some places where this could be simplified, but I will need to do a proper pass when time permits.

martinfleis

The first batch of suggestions/comments.

I would like to understand whether we need the weighted_covariance function at all. You know that I was hesitant about it during the call last week. Seeing it now, I am wondering if we could just rely on numpy.cov, drop the entire function, and use groupby.apply directly within the gwpca implementation. The less code we maintain the better, and if we don't have to reimplement these basic components, let's not do so.
There are many tests, none of which test actual correctness. If I replaced the weighted_covariance function with a random array, all tests would pass. Make sure there are tests that verify numerical stability so that any change to the code is tested against expected values, not just shapes etc. I tried to quickly use numpy.cov, but without manual verification I get no feedback from the test suite.
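
A value-based test of the kind requested above could look like the following sketch. It assumes the weighted covariance is normalised by the sum of the weights (which corresponds to `ddof=0` in `numpy.cov`); `weighted_cov` here is a hypothetical stand-in, not the PR's `weighted_covariance`, whose exact normalisation is not shown in this thread.

```python
import numpy as np


def weighted_cov(X, w):
    """Hypothetical stand-in: covariance of the columns of X under weights w."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                    # normalise weights to sum to one
    mean = w @ X                       # weighted mean of each column
    centred = X - mean
    return (centred * w[:, None]).T @ centred


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w = rng.uniform(0.1, 1.0, size=50)     # stand-in for kernel weights at one focal point

# Reference values from numpy itself; ddof=0 divides by the sum of the weights.
expected = np.cov(X, rowvar=False, aweights=w, ddof=0)

# A random array of the right shape would fail this check.
np.testing.assert_allclose(weighted_cov(X, w), expected)
```

If the two agree for the normalisation GWPCA needs, this also suggests `numpy.cov(..., aweights=...)` could replace the custom function outright.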

martinfleis · 2026-06-10T07:20:36Z

        if self.batch_size:
            training_output = []
-            num_groups = len(y)
+            num_groups = len(y) if y is not None else len(X)


Suggested change

-            num_groups = len(y) if y is not None else len(X)
+            num_groups = len(X)

The condition is pointless here.

martinfleis · 2026-06-10T07:21:40Z

+        supervised baseline to fit.
+        """
+        if y is None:
+            return


Why? Is there a reason we don't fit a global PCA in a similar way to how we fit global estimators?

martinfleis · 2026-06-10T07:23:02Z

        """
        # Length checks
-        if len(X) != len(y):
+        if y is not None and len(X) != len(y):


This is insufficient. We should check y based not on its value but on what the model needs. So if we're using an estimator, we should verify y even if the user passes None by mistake. This is a lazy shortcut :).

Hmm, that's actually a whole new edge case. I'll improve the code.
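
One way to read this suggestion, as a minimal sketch: the check is driven by a flag describing whether the model is supervised, not by whether `y` happens to be `None`. The function name `check_lengths` and the `requires_y` flag are hypothetical and do not come from the PR.

```python
import pandas as pd


def check_lengths(X, y, requires_y):
    """Validate y according to what the model needs (hypothetical helper)."""
    if requires_y:
        # Supervised models: a missing y is an error, not a reason to skip the check.
        if y is None:
            raise ValueError("y is required for this model but None was passed.")
        if len(X) != len(y):
            raise ValueError("X and y must have the same length.")
    # Unsupervised models (e.g. a decomposition) do not use y, so nothing to check.


X = pd.DataFrame({"a": [1.0, 2.0, 3.0]})
check_lengths(X, None, requires_y=False)                      # passes: y is not needed
check_lengths(X, pd.Series([0, 1, 0]), requires_y=True)       # passes: lengths match
```

With this shape, passing `y=None` to an estimator by mistake raises immediately instead of silently skipping validation.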

martinfleis · 2026-06-10T07:23:55Z

+        verbose: bool = False,
+        **kwargs,
+    ):
+        # No wrapped supervised model: pass ``None`` — decomposition has no ``y`` to fit against.


I don't understand this comment in relation to the code.

martinfleis · 2026-06-10T07:24:25Z

+        **kwargs,
+    ):
+        # No wrapped supervised model: pass ``None`` — decomposition has no ``y`` to fit against.
+        kwargs.pop("strict", None)


Why are you popping strict? Where does the assumption that it is there come from?

martinfleis · 2026-06-10T07:26:03Z

        return np.array(results)


+class BaseDecomposition(TransformerMixin, _BaseModel):


I would prefer to have this somewhere else. Possibly in the decomposition module directly. You can turn decomposition into a folder with base and pca submodules.

martinfleis · 2026-06-10T07:29:38Z

        """IC metric names included automatically when the model supports them."""
-        return ["aicc", "aic", "bic"] if self._supports_ic else []
+        metrics = ["aicc", "aic", "bic"] if self._supports_ic else []
+        if self.criterion == "cv_score":


ic_metrics stands for information criterion. cv score is not one, we should not deal with this here.

martinfleis · 2026-06-10T07:29:59Z

        met = self._ic_metrics.copy()
        if self.metrics is not None:
            met += self.metrics
+        if self.criterion == "cv_score" and "cv_score" not in met:


You deal with it here, so the code above is not needed at all, is it?

martinfleis · 2026-06-10T07:30:51Z

+            X=X,
+            y=y,
+            geometry=self.geometry,
+            **({"cv": True} if "cv_score" in met else {}),


This is a very opaque line. At least add a comment explaining it.
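
For illustration, the inline conditional unpacking can be spelled out as a named dictionary with a comment. This is only a sketch of an equivalent form; `build_extra_kwargs` is a hypothetical helper, and the meaning of `cv=True` is inferred from the diff, not confirmed by it.

```python
def build_extra_kwargs(met):
    """Return optional keyword arguments for the model fit (hypothetical helper)."""
    extra = {}
    # Only ask the model for cross-validation when a CV score is among the
    # requested metrics; otherwise pass nothing and keep the model's default.
    if "cv_score" in met:
        extra["cv"] = True
    return extra


with_cv = build_extra_kwargs(["aicc", "cv_score"])
without_cv = build_extra_kwargs(["aicc", "aic", "bic"])
```

The call site would then read `**extra`, which behaves the same as the one-liner but states the intent.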

FirePheonix · 2026-06-10T09:47:58Z

Alright, I'll make the suggested changes and get back.

…andwidth_ typing

…ornia Housing

martinfleis

I don't like the format of GWPCA.components_. For any dataset with an index other than RangeIndex, it is cumbersome to link local components to their focal geometry. Also, indexing like gwpca.components_[:, 2, 0] is a pain, as you constantly have to keep a mental model of what the unlabelled dimensions of that numpy array mean. The orientation is also different from that of global PCA, despite your documentation. Global PCA uses (n_components, n_features), you used (n_samples, n_features, n_components), but the notebook claims (n_samples, n_components, n_features) (the docstring is correct). It is all just too confusing. I am not sure what it should look like though; probably something to discuss tomorrow (cc @sjsrey). Some way of shaping this as pandas objects would likely be preferable.
explained_variance_ratio_ is better, as the array is simple, but given that the package is GeoPandas-oriented and all estimators return pandas objects, I think this should be a properly labelled DataFrame.
Once again, it is unclear to me why we don't fit a global model baseline as we do in regressions.
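
As one possible starting point for that discussion (a sketch only, not a decided format), the 3-D array could be reshaped into a DataFrame with a `(focal, component)` MultiIndex and one column per feature. The function name `components_to_frame` and the `PC1`, `PC2`, ... labels are hypothetical; the input orientation `(n_samples, n_features, n_components)` is the one described above.

```python
import numpy as np
import pandas as pd


def components_to_frame(components, index, feature_names):
    """Label local components by focal index, component and feature."""
    n_samples, n_features, n_components = components.shape
    labels = [f"PC{i + 1}" for i in range(n_components)]
    rows = pd.MultiIndex.from_product([index, labels], names=["focal", "component"])
    # Put the component axis before the feature axis so that each focal block
    # has the global-PCA orientation (n_components, n_features).
    values = components.transpose(0, 2, 1).reshape(n_samples * n_components, n_features)
    return pd.DataFrame(values, index=rows, columns=feature_names)


index = pd.Index(["a", "b", "c"])                      # non-RangeIndex focal labels
features = ["f0", "f1", "f2", "f3"]
components = np.arange(24, dtype=float).reshape(3, 4, 2)

frame = components_to_frame(components, index, features)
local_b = frame.loc["b"]                               # (n_components, n_features) block
loading = frame.loc[("b", "PC2"), "f3"]                # same value as components[1, 3, 1]
```

Selecting by label (`frame.loc["b"]`) then replaces positional indexing such as `components_[:, 2, 0]`, and each focal block matches the orientation of global PCA.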

I did not have enough time to check the bandwidth selection code, but that is secondary anyway.

One other note - I'd like to compare our results to those from {GWmodel::gwpca} R implementation. It is a reference we should match.

A side note - when pushing a bunch of commits, it would be good if you added a short comment summarising what you have done.

martinfleis · 2026-06-22T07:49:19Z

+        # ``strict`` is accepted so that BandwidthSearch (which passes it to
+        # every model it creates) does not raise a TypeError.  Decompositions
+        # have no notion of invariant y, so the value is always ignored.
+        strict: bool | None = False,  # noqa: ARG002


You should deal with this within BandwidthSearch, not here. search should detect that it is used for decomposition and ignore strict keyword there.

martinfleis · 2026-06-24T09:27:02Z

+    scores_ : numpy.ndarray
+        Focal-point projections, shape ``(n_locations, n_components)``.


I get that this terminology comes from the R implementation, but we should also use the sklearn terminology here, which is "Transformed values."

martinfleis · 2026-06-24T09:29:33Z

+    local_means_ : numpy.ndarray
+        Weighted local means, shape ``(n_locations, n_features)``.


It is unclear from the description what this is.

I understand from the code that it is weighted local mean of X. Why are we reporting it?

martinfleis · 2026-06-24T09:31:20Z

+        X: pd.DataFrame,
+        geometry: gpd.GeoSeries | None = None,
+    ) -> np.ndarray:
+        """Project ``X`` onto local components via nearest-neighbour lookup.


I think this should take the logic of predict in estimators, not just a nearest-neighbor. Eventually.

martinfleis · 2026-06-24T10:27:04Z

-        self, X: pd.DataFrame, y: pd.Series, geometry: gpd.GeoSeries | None = None
+        self,
+        X: pd.DataFrame,
+        y: pd.Series | None = None,