Create client-side ordinations to avoid server-memory issues #4

grovduck · 2023-10-31T00:43:12Z

This PR addresses #3 by leveraging sknnr to provide estimators for this package. As noted in #3, running the estimators client-side will "interrupt" the GEE server-side flow and thus run training only once for all targets. It also removes the duplication of implementing these ordination estimators in more than one repository.

This is still very much a work in progress and I will be addressing the following issues:

~~Create a base class for all estimators to reduce duplicated code across estimators~~
Create client-side proxies of GEE feature collection objects so that the estimators can be run client-side
Following the design of sknnr, make Raw the base class
Introduce a derived class of Raw (currently called Transformed) to be the base class for all estimators that use transformations
Use sknnr transformation objects to handle client-side ordination
Move predict and predict_fc to Transformed and remove from derived classes
Create @abstractmethod methods in Transformed to handle transformations and implement in derived classes
Run tests
Cleanup (more tightening and class renaming)

As this PR turned out to be very ambitious, there are a few tasks that I decided not to tackle with this PR and turn into separate issues:

Rework Colocation class so that user doesn't need to instantiate
Test with other n_components and spp_transform values on GNN to ensure correctness
Add mode keyword on feature collections to get distances and and tests
Recreate spatial surfaces using sknnr-spatial and this training data to see if we can increase the number of expected matches of pixels in the test surfaces
Fail quickly if id_field does not store an integer type

* Added method (__init__) and property (k_nearest) used across subclasses.

@aazuspan

In order to run ordinations client-side, there needs to be a mechanism for translating `ee.FeatureCollections` to python objects (in this case, pydantic models). Based on work from @aazuspan, these classes provide multi-threaded functionality to convert feature collections in a lower-memory context. `Feature` and `FeatureCollection` have a few helper functions to extract properties as values and lists.

grovduck · 2023-10-31T17:16:35Z

@aazuspan, a stupid question. I thought using:

from __future__ import annotations

would allow the use of list and dict (lowercase) as type hints (i.e., from PEP 585). I'm not sure I remember how you got this to work in sknnr, but my tests are failing here and I think this is the reason. I'm using this with pydantic.BaseModel derived classes like this:

from __future__ import annotations
from typing import Any
from pydantic import BaseModel


class Geometry(BaseModel):
    """Client-side proxy for ee.Geometry object"""

    type: str
    coordinates: list[Any]

Any help would be very much appreciated!

aazuspan · 2023-10-31T17:39:45Z

Interesting! If I'm remembering right, the __future__ import works by effectively stringifying the type annotations to avoid the interpreter trying to evaluate incompatible ones. Looking at the failed test, I think pydantic is forcing those to be evaluated anyways, which is breaking the workaround.

A little short on details, but pydantic/pydantic#2112 seems to be describing the same issue. On the opposite end of the spectrum, pydantic/pydantic#2678 has a lot of detail, although I haven't read deeply enough into that discussion to figure out what the conclusion was or whether it's relevant to backwards compatibility...

grovduck · 2023-10-31T18:58:19Z

@aazuspan , thanks for your help finding those issues. I'm not sure I understand much better than I did before other than it seems I'll have issues using list and dict in pydantic classes and trying to get 3.8 to pass. If I go back to using typing.List and typing.Dict, I get all kinds of complaints from the flake8-future-annotations check. Do you have any opinion on either:

dropping support for 3.8
using dataclasses instead of pydantic BaseModel
disabling the "FA" check

Typing is great to have, but kind of a pain as well!

aazuspan · 2023-10-31T19:34:32Z

I'd be tempted to drop 3.8... I know we considered that with sknnr and held off, but now we're 6 months past Numpy's drop date, and it would sure simplify some things. With that said, option 3 would also be quick fix, so I wouldn't hesitate to do that if you have reservations about limiting Python version support.

Typing is great to have, but kind of a pain as well!

💯

This commit implements client-side fitting of four of the five estimators (excluding Raw) and implements them using sknnr transformers associated with each method. For this commit only, we have introduced a new method in each estimator called `train_client` whilst retaining the original `train` which is a server-side train/fit. The general workflow is to convert the training data into a client-side feature collection (class FeatureCollection), fit the model using the transformer, and convert all needed data objects back into server-side objects to work with `predict` and `predict_fc`. For all but GNN, `predict` and `predict_fc` are unchanged and tests are passing for both client-side and server-side training. For GNN, there were slight modifications to `predict` and `predict_fc` in the transformations such that only client-side fitting is working. There are expected test failures in this commit. In addition, we have dropped support for python 3.8 in this commit as types `list` and `dict` were not compatible with pydantic BaseModel-derived classes for 3.8. Still, much refactoring needs to be done to remove duplication across the methods.

grovduck · 2023-10-31T21:29:47Z

I'd be tempted to drop 3.8... I know we considered that with sknnr and held off, but now we're 6 months past Numpy's drop date, and it would sure simplify some things

Great, thanks for your input. I've dropped 3.8 (and added 3.12) to this commit.

This commit changes the class hierachy to reflect that of `sknnr`. The `GeeKnnClassifier` class has been renamed to `Raw`, a new abstract class called `Transformed` derives from `Raw` and all other estimators (`Eucliean`, `Mahalanobis`, `MSN`, and `GNN`) are all subclasses of `Transformed`. `Transformed` does most of the work in training and predicting the estimators and four abstract methods delegate the responsiblity of defining the transformer to the subclasses. Additionally, `cca.py` and `ccora.py` are no longer needed as all model fitting is now done through `sknnr`.

grovduck · 2023-11-02T20:52:17Z

@aazuspan, if you have the time (and can stomach it), I'd love for you to take a quick look at the changes here. The general thrust of this PR is to bring in the sknnr transformers ¹ into this package to do client-side fitting (training). I didn't at all hew to the best-practices of nice, small, atomic commits so it's a bit to wade through, but the general pattern follows closely to what you've done with sknnr.

Raw is the main base class and implements train, predict (spatial prediction), and predict_fc (feature collection [point] prediction).
Transformed inherits from Raw but is an abstract class. Its abstract methods set the protocol for doing the transformations which are delegated to the subclasses (Euclidean, Mahalanobis, MSN, and GNN). It also implements transform_image and transform_fc which are called from predict and predict_fc, respectively. Finally, it overrides train to do the fitting and the relevant objects are saved as EE server-side objects.

There are a few issues I know of at this point:

Class names are bad (as they were originally in sknnr). But given that the core neighbor finder is called ee.Classifier.minimumDistance, I felt a bit weird about calling these regressors again. But maybe it's better to keep them consistent with sknnr rather than with GEE.
I have a couple of no-op methods in Raw (transform_image and transform_fc) which just pass back the input argument. Because I'm calling these methods from Raw.predict and Raw.predict_fc, mypy complained that these methods weren't originally defined in Raw. I can't make those methods abstract without making all of Raw abstract, which would defeat the purpose. Do you have a better way of defining these or changing predict/predict_fc so that it avoids this issue?
You'll see some weird values in the test_model_spatial.py file in ESTIMATOR_PARAMETERS (the last value in the tuple). These represent the minimum number of expected matches between some check images that I created a while ago and the neighbor images created through the test, 360,000 being the maximum for a 600x600 image. I honestly can't remember how I created those test images because pynnmap didn't have the capacity to run the different methods, so I'm guessing I did it with yaImpute. But I think those check images might be worth recreating with sknnr_spatial to see if we can get higher values.

I know it's not in your nature, but please be brutally honest if you see some funky stuff and additional opportunities for refactoring. The good thing is that we're currently passing tests!

We only need the transformers as GEE's ee.Classifier.minimumDistance is reponsible for neighbor finding. The transformers are used to set the ordination space. ↩

aazuspan · 2023-11-02T21:40:11Z

Absolutely, I'm excited to take a look and get a better idea of how this works! I tried running tests and it looks like I can access the table assets, but not the images. Do you mind sharing those?

FAILED tests/test_model_spatial.py::test_image_match[5-raw] - ee.ee_exception.EEException: Image.load: Image asset 'users/gregorma/gee-knn/test-check/raw_neighbors_600' not found (does not exist or caller does not have access).
FAILED tests/test_model_spatial.py::test_image_match[5-euc] - ee.ee_exception.EEException: Image.load: Image asset 'users/gregorma/gee-knn/test-check/euc_neighbors_600' not found (does not exist or caller does not have access).
FAILED tests/test_model_spatial.py::test_image_match[5-mah] - ee.ee_exception.EEException: Image.load: Image asset 'users/gregorma/gee-knn/test-check/mah_neighbors_600' not found (does not exist or caller does not have access).
FAILED tests/test_model_spatial.py::test_image_match[5-msn] - ee.ee_exception.EEException: Image.load: Image asset 'users/gregorma/gee-knn/test-check/msn_neighbors_600' not found (does not exist or caller does not have access).
FAILED tests/test_model_spatial.py::test_image_match[5-gnn] - ee.ee_exception.EEException: Image.load: Image asset 'users/gregorma/gee-knn/test-check/gnn_neighbors_600' not found (does not exist or caller does not have access).

grovduck · 2023-11-02T22:07:50Z

Do you mind sharing those?

Sorry about that. Should be shared with your gmail account now. Let me know if you have further issues.

aazuspan · 2023-11-02T22:20:21Z

All tests passing now, thanks!

aazuspan

There's a lot of really cool stuff going on here! I'm still trying to grok how it all fits together, so let me know if any of my comments are totally off base, but I made a first pass at least.

I felt a bit weird about calling these regressors again. But maybe it's better to keep them consistent with sknnr rather than with GEE.

Choosing names remains the hardest part of programming! I'm not sure that staying consistent with sknnr is necessarily the perfect choice, but it does seem like a defensible one at least, and would make it easy for someone familiar with one package to adapt to the other (although if you ever tried using them in a single script there might be some namespace headaches...). I think that's where I would lean by default, just because I don't have any better ideas.

tests/setup.py

tests/test_model_spatial.py

tests/setup.py

src/geeknn/ordination/utils.py

src/geeknn/ordination/gnn.py

src/geeknn/ordination/euclidean.py

src/geeknn/ordination/_base.py

grovduck · 2023-11-06T18:25:31Z

but I made a first pass at least

@aazuspan, thanks for such a thorough review. I'll be picking through it over the next couple of days (then back to sknnr!)

grovduck

@aazuspan, I think I had started a review myself, but had never submitted it. Hopefully you can see my responses to your comments.

src/geeknn/ordination/_base.py

src/geeknn/ordination/gnn.py

aazuspan · 2023-11-07T20:46:14Z

Hopefully you can see my responses to your comments.

I do see a couple, but I got an email notification about a bunch of comments that aren't showing up for me, e.g. your answer about retile or the int IDs... Not quite sure what's going on there.

grovduck · 2023-11-07T21:03:41Z

I do see a couple, but I got an email notification about a bunch of comments that aren't showing up for me, e.g. your answer about retile or the int IDs... Not quite sure what's going on there.

I really messed everything up. But I think here is what happened (chronologically):

I started a review right after I initiated the PR but didn't submit it.
You provided your review.
I started responding to your comments, but they were in the context of my ongoing review.
I finally realized that I hadn't yet submitted my initial review.
All my responses to you got duplicated in my review and had no context (e.g., your original comment) in my review
I deleted all my duplicated comments, making sure that they stayed when in response to your comments.
Upon refresh, all my responses disappeared.

So, I'll try to go back and put in what I had written (I don't think deleted comments can be retrieved), but probably best to use the emails that were sent as the definitive source. I think the crux is that one shouldn't respond when an active review is ongoing. Sorry about this!

aazuspan · 2023-11-07T21:18:20Z

Oh no, that's a hassle to have to re-enter all your responses. I think I have the text of all your responses in the notification emails, I just don't necessarily know what they were responses to (although I can probably guess in most cases). Do you have those, or would it be helpful if I forwarded the email?

Github is a pretty smooth experience overall, but figuring out where/if review comments are going to show up seems like it's always a little bit of a gamble, so you're not alone!

grovduck · 2023-11-07T21:35:18Z

I think I have the text of all your responses in the notification emails, I just don't necessarily know what they were responses to (although I can probably guess in most cases). Do you have those, or would it be helpful if I forwarded the email?

If you have those and it would be easy enough to send, that would be great. I don't get a copy of changes I make.

- Remove no-op `Raw.transform_image` and `Raw.transform_fc` - Convert `Raw.predict`, `Transformed.predict`, and `Transformed.transform` into `singledispatchmethod`s so that image and feature collection transformation and prediction comes from `transform` and `predict`, respectively.

grovduck · 2023-11-09T18:54:09Z

Choosing names remains the hardest part of programming! I'm not sure that staying consistent with sknnr is necessarily the perfect choice, but it does seem like a defensible one at least, and would make it easy for someone familiar with one package to adapt to the other (although if you ever tried using them in a single script there might be some namespace headaches...). I think that's where I would lean by default, just because I don't have any better ideas.

I know it's not what you suggested, but I'm considering the following name changes:

Raw -> RawKNNClassifier
Euclidean -> EuclideanKNNClassifier
Mahalanobis -> MahalanobisKNNClassifier
MSN -> MSNClassifier
GNN -> GNNClassifier

I'm thinking of this from the following perspective:

With sknnr, we chose our naming off the main estimator we were deriving - KNeighborsRegressors. With this package, I'm worried if we call them regressors, users will be more confused by that terminology rather than the more familiar Classifier terminology. The term isn't completely accurate, but GEE calls their estimators classifiers even if run in regression mode.
If we ever did use the two packages together (something like speed testing), we'd avoid the naming conflicts that you mention.

Thoughts?

aazuspan · 2023-11-09T21:09:10Z

The term isn't completely accurate, but GEE calls their estimators classifiers even if run in regression mode.

Good point!

I think your logic makes sense, and I'm happy to go with the Classifier naming scheme. As you suggested, that should be a more familiar interface for users with some Earth Engine ML experience, who are probably the most likely to find and use this.

…medKNNClassifier.predict`

grovduck · 2023-11-10T00:51:05Z

@aazuspan, you've been incredibly gracious with your time on reviewing this PR. Thank you so much. I'm running out of steam a bit and will migrate these issues you've identified below into separate issues.

Rework Colocation class so that user doesn't need to instantiate
Test with other n_components and spp_transform values on GNN to ensure correctness
Add mode keyword on feature collections to get distances and and tests
Recreate spatial surfaces using sknnr-spatial and this training data to see if we can increase the number of expected matches of pixels in the test surfaces
Fail quickly if id_field does not store an integer type

There is still a bit of work to do, but I'd like to merge in this PR now, mostly because I need to test with the larger SERVIR workflow this coming week. I've learned that this was much too ambitious of a PR on its own!

aazuspan · 2023-11-10T21:33:39Z

Sounds great @grovduck! Thanks for helping me get up to speed on this. I'm excited to start exploring it with some real data!

grovduck · 2023-11-13T21:31:25Z

Sounds great @grovduck! Thanks for helping me get up to speed on this. I'm excited to start exploring it with some real data!

@aazuspan, even though there are still many things to do on this, I wouldn't have gotten nearly as far without your helpful reviews and fixes. I appreciate it!

grovduck added 2 commits October 23, 2023 15:13

New base class (GeeKnnClassifier)

cabd0d6

* Added method (__init__) and property (k_nearest) used across subclasses.

grovduck self-assigned this Oct 31, 2023

grovduck added the bug Something isn't working label Oct 31, 2023

Change lists of env_columns and spp_columns to be client-side objects

5aaf496

grovduck added 2 commits October 31, 2023 13:42

Fix sklean repository protocol

e003b97

grovduck requested a review from aazuspan November 2, 2023 20:52

aazuspan reviewed Nov 3, 2023

View reviewed changes

grovduck added 3 commits November 6, 2023 11:57

Rename setup.py to conftest.py and move test fixtures into it.

43e06f6

Remove tileScale from test_model_spatial::run_method.

85f6867

Fix client-side objects to accept null fields for some properties.

dec7372

grovduck commented Nov 7, 2023

View reviewed changes

grovduck added 3 commits November 7, 2023 11:10

Change argument signature of GNN.__init__ and rename attributes

ce16af4

Remove get_ids methods and inline the functionality where it's called

3e47fd3

Make clf local in Raw.predict rather than overwriting self

c7bd348

Change Geometry | None to Optional[Geometry] for pydantic

354145a

grovduck added 6 commits November 7, 2023 13:42

Change all instances of | to Optional[] for pydantic

5e2b1d7

Add scikit-learn as a dependency`

ccf9f24

Rename return_k_neighbors to get_k_neighbors

1ae6735

Comments and docstrings cleanup

bba1c75

Initialize minimumDistance classifier in __init__ with later training

52c1c56

grovduck added 5 commits November 9, 2023 15:05

Add lru_cache on FeatureCollection.from_ee_feature_collection

c19c568

Change y_transform to take Callable instead of str

1f50dbe

Rename all estimator to specify that they are NN classifiers

70f84bc

Move all modules out of the ordination subpackage and into the root

f296cfe

Copy identical docstring from RawKNNClassifier.predict to `Transfor…

6f632d0

…medKNNClassifier.predict`

grovduck merged commit 1798a50 into fb-add-methods Nov 13, 2023
10 checks passed

grovduck deleted the sknnr-integration branch November 13, 2023 21:49

This was referenced Nov 13, 2023

Server-side code to create ordinations in GEE has memory issues #3

Closed

Merge fb-add-methods branch into main #5

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Create client-side ordinations to avoid server-memory issues #4

Create client-side ordinations to avoid server-memory issues #4

grovduck commented Oct 31, 2023 •

edited

grovduck commented Oct 31, 2023

aazuspan commented Oct 31, 2023

grovduck commented Oct 31, 2023

aazuspan commented Oct 31, 2023

grovduck commented Oct 31, 2023

grovduck commented Nov 2, 2023 •

edited

aazuspan commented Nov 2, 2023

grovduck commented Nov 2, 2023

aazuspan commented Nov 2, 2023

aazuspan left a comment

grovduck commented Nov 6, 2023

grovduck left a comment

aazuspan commented Nov 7, 2023

grovduck commented Nov 7, 2023

aazuspan commented Nov 7, 2023 •

edited

grovduck commented Nov 7, 2023

grovduck commented Nov 9, 2023

aazuspan commented Nov 9, 2023

grovduck commented Nov 10, 2023

aazuspan commented Nov 10, 2023

grovduck commented Nov 13, 2023

Create client-side ordinations to avoid server-memory issues #4

Create client-side ordinations to avoid server-memory issues #4

Conversation

grovduck commented Oct 31, 2023 • edited

grovduck commented Oct 31, 2023

aazuspan commented Oct 31, 2023

grovduck commented Oct 31, 2023

aazuspan commented Oct 31, 2023

grovduck commented Oct 31, 2023

grovduck commented Nov 2, 2023 • edited

Footnotes

aazuspan commented Nov 2, 2023

grovduck commented Nov 2, 2023

aazuspan commented Nov 2, 2023

aazuspan left a comment

Choose a reason for hiding this comment

grovduck commented Nov 6, 2023

grovduck left a comment

Choose a reason for hiding this comment

aazuspan commented Nov 7, 2023

grovduck commented Nov 7, 2023

aazuspan commented Nov 7, 2023 • edited

grovduck commented Nov 7, 2023

grovduck commented Nov 9, 2023

aazuspan commented Nov 9, 2023

grovduck commented Nov 10, 2023

aazuspan commented Nov 10, 2023

grovduck commented Nov 13, 2023

grovduck commented Oct 31, 2023 •

edited

grovduck commented Nov 2, 2023 •

edited

aazuspan commented Nov 7, 2023 •

edited