GSoC-2026: GWPCA Prototyped #125
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #125 +/- ##
==========================================
- Coverage 92.39% 90.55% -1.84%
==========================================
Files 6 9 +3
Lines 881 1112 +231
==========================================
+ Hits 814 1007 +193
- Misses 67 105 +38 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
i'll fi the pre commits + add some proper documentations till then.. |
|
Thanks! Wait with the documentation until the implementation is stable to avoid additional work. I recommend install the pre-commit hook to avoid the listing issues. I've skimmed it and found some places where this could be simplified but will need to do a proper pass when the time permits. |
martinfleis
left a comment
There was a problem hiding this comment.
The first batch of suggestions/comments.
- I would like to understand whether we need
weighted_covariancefunction at all. You know that I was hesitant about it during the call last week. Seeing it now, I am wondering if we could just rely onnumpy.cov, drop the entire function and usegroupby.applydirectly within gwpca implementation. The less code we maintain the better and if we don't have to reimplemented these basic components, let's not do so. - There are many tests, none of which tests the actual correctness. If I replaced the weighted_covariance function with a random array, all tests would pass. Make sure there are tests that verify numerical stability so that any change of the code is tested against expected values, not just shapes etc. I tried to quickly use
numpy.covbut without manual verification, I have no feedback from the test suite.
| if self.batch_size: | ||
| training_output = [] | ||
| num_groups = len(y) | ||
| num_groups = len(y) if y is not None else len(X) |
There was a problem hiding this comment.
| num_groups = len(y) if y is not None else len(X) | |
| num_groups = len(X) |
The condition is pointless here.
| supervised baseline to fit. | ||
| """ | ||
| if y is None: | ||
| return |
There was a problem hiding this comment.
Why? Is there a reason we don't fit global PCA in the similar way we fit global estimators?
| """ | ||
| # Length checks | ||
| if len(X) != len(y): | ||
| if y is not None and len(X) != len(y): |
There was a problem hiding this comment.
This is insufficient. We should check for y not based on its value but based on the needs. So if we're using esitmator, we should verify y, even if user passes None by mistake. This is a lazy shortcut :).
There was a problem hiding this comment.
hmmm. that's actually a whole new edge case, i'll improve the code
| verbose: bool = False, | ||
| **kwargs, | ||
| ): | ||
| # No wrapped supervised model: pass ``None`` — decomposition has no ``y`` to fit against. |
There was a problem hiding this comment.
I don't understand this comment in relation to the code.
| **kwargs, | ||
| ): | ||
| # No wrapped supervised model: pass ``None`` — decomposition has no ``y`` to fit against. | ||
| kwargs.pop("strict", None) |
There was a problem hiding this comment.
Why are you popping strict? Where does the assumption that it is there comes from?
| return np.array(results) | ||
|
|
||
|
|
||
| class BaseDecomposition(TransformerMixin, _BaseModel): |
There was a problem hiding this comment.
I would prefer to have this somewhere else. Possibly in the decomposition module directly. You can turn decomposition into a folder with base and pca submodules.
| """IC metric names included automatically when the model supports them.""" | ||
| return ["aicc", "aic", "bic"] if self._supports_ic else [] | ||
| metrics = ["aicc", "aic", "bic"] if self._supports_ic else [] | ||
| if self.criterion == "cv_score": |
There was a problem hiding this comment.
ic_metrics stands for information criterion. cv score is not one, we should not deal with this here.
| met = self._ic_metrics.copy() | ||
| if self.metrics is not None: | ||
| met += self.metrics | ||
| if self.criterion == "cv_score" and "cv_score" not in met: |
There was a problem hiding this comment.
You deal with it here, so the code above is not needed at all, is it?
| X=X, | ||
| y=y, | ||
| geometry=self.geometry, | ||
| **({"cv": True} if "cv_score" in met else {}), |
There was a problem hiding this comment.
This is a very opaque line. At least add a comment explaining it.
|
alrightt.. i'll do the suggested changes and get back. |
There was a problem hiding this comment.
-
I don't like the format of
GWPCA.components_. For any dataset with other than RangeIndex, it makes it cumbersome to link local components to their focal geometry. Also, indexing likegwpca.components_[:, 2, 0]is a pain as you consistently have to keep a mental model of what those unlabelled dimensions of that numpy array mean. Also, the orientation is different than in global PCA, despite your documentation. Global PCA uses(n_components, n_features), you used(n_samples, n_features, n_components,)but the notebook claims(n_samples, n_components, n_features)(the docstring is correct). It is just all too confusing. I am not sure how should it look like though, probably something to discuss tomorrow (cc @sjsrey). Some way of shaping this as pandas objects would be likely preferable. -
explained_variance_ratio_is better as the array is simple but given the package is GeoPandas-oriented and all estimators return pandas objects, I think this should be a properly labelled DataFrame. -
Once again, it is unclear to me why don't we fit a global model baseline as we do in regressions.
I did not have enough time to check the bandwidth selection code, but that is secondary anyway.
One other note - I'd like to compare our results to those from {GWmodel::gwpca} R implementation. It is a reference we should match.
A side note - when pushing a bunch of commits, it would be good if you added a short comment summarising what have you done.
| # ``strict`` is accepted so that BandwidthSearch (which passes it to | ||
| # every model it creates) does not raise a TypeError. Decompositions | ||
| # have no notion of invariant y, so the value is always ignored. | ||
| strict: bool | None = False, # noqa: ARG002 |
There was a problem hiding this comment.
You should deal with this within BandwidthSearch, not here. search should detect that it is used for decomposition and ignore strict keyword there.
| scores_ : numpy.ndarray | ||
| Focal-point projections, shape ``(n_locations, n_components)``. |
There was a problem hiding this comment.
I get it that this terminology comes from the R implementation but we should also use the sklearn terminology here, which uses "Transformed values."
| local_means_ : numpy.ndarray | ||
| Weighted local means, shape ``(n_locations, n_features)``. |
There was a problem hiding this comment.
It is unclear from the description what this is.
There was a problem hiding this comment.
I understand from the code that it is weighted local mean of X. Why are we reporting it?
| X: pd.DataFrame, | ||
| geometry: gpd.GeoSeries | None = None, | ||
| ) -> np.ndarray: | ||
| """Project ``X`` onto local components via nearest-neighbour lookup. |
There was a problem hiding this comment.
I think this should take the logic of predict in estimators, not just a nearest-neighbor. Eventually.
| self, X: pd.DataFrame, y: pd.Series, geometry: gpd.GeoSeries | None = None | ||
| self, | ||
| X: pd.DataFrame, | ||
| y: pd.Series | None = None, |
There was a problem hiding this comment.
| y: pd.Series | None = None, | |
| y: pd.Series, |
| self, X: pd.DataFrame, y: pd.Series, geometry: gpd.GeoSeries | None = None | ||
| self, | ||
| X: pd.DataFrame, | ||
| y: pd.Series | None = None, |
There was a problem hiding this comment.
| y: pd.Series | None = None, | |
| y: pd.Series, |
| self, X: pd.DataFrame, y: pd.Series, geometry: gpd.GeoSeries | None = None | ||
| self, | ||
| X: pd.DataFrame, | ||
| y: pd.Series | None = None, |
There was a problem hiding this comment.
| y: pd.Series | None = None, | |
| y: pd.Series, |
| def fit( | ||
| self, | ||
| X: pd.DataFrame, | ||
| y: pd.Series | None = None, |
There was a problem hiding this comment.
| y: pd.Series | None = None, | |
| y: pd.Series, |
| def fit( | ||
| self, | ||
| X: pd.DataFrame, | ||
| y: pd.Series | None = None, |
There was a problem hiding this comment.
| y: pd.Series | None = None, | |
| y: pd.Series, |
|
hey @martinfleis i actually am in the middle of refactoring all these places😅 "One other note - I'd like to compare our results to those from {GWmodel::gwpca} R implementation. It is a reference we should match." Absolutely, I've thought of that too. |
Per reviewer feedback, remove the y: pd.Series | None = None default from all concrete supervised estimators. y is always required for classification and regression; the None default was wrong. Also drop the now-redundant �ssert y is not None guard in BaseClassifier.fit that was only needed while y was typed as optional. Affected classes: - BaseClassifier.fit / BaseRegressor.fit (base.py) - GWRandomForestClassifier, GWGradientBoostingClassifier, GWRandomForestRegressor, GWGradientBoostingRegressor (ensemble.py) - GWLogisticRegression, GWLinearRegression (linear_model.py)
|
| local_means_ : numpy.ndarray | ||
| Weighted local mean of ``X`` per focal location, | ||
| shape ``(n_locations, n_features)``. Stored so that | ||
| :meth:`transform` can centre new observations against the same | ||
| local mean before projecting onto the local components. |
There was a problem hiding this comment.
That more sounds like a private thing than something exposed to a user.
There was a problem hiding this comment.
i agreee...
i'm planning the refactor the doc strings for the whole of the files i've changed right now.
and let's discuss:
I don't like the format of GWPCA.components_. For any dataset with other than RangeIndex, it makes it cumbersome to link local components to their focal geometry. Also, indexing like gwpca.components_[:, 2, 0] is a pain as you consistently have to keep a mental model of what those unlabelled dimensions of that numpy array mean. Also, the orientation is different than in global PCA, despite your documentation. Global PCA uses (n_components, n_features), you used (n_samples, n_features, n_components,) but the notebook claims (n_samples, n_components, n_features) (the docstring is correct). It is just all too confusing. I am not sure how should it look like though, probably something to discuss tomorrow (cc sjsrey). Some way of shaping this as pandas objects would be likely preferable.
this in today's meet. what i can think of: the corresponding R package (and Sklearn api too) is a 2D array and not 3D. maybe i should multi index it? i'm thinking of THIS...
Uh oh!
There was an error while loading. Please reload this page.