Create client-side ordinations to avoid server-memory issues #4
* Added method (__init__) and property (k_nearest) used across subclasses.
In order to run ordinations client-side, there needs to be a mechanism for translating `ee.FeatureCollection` objects to Python objects (in this case, pydantic models). Based on work from @aazuspan, these classes provide multi-threaded functionality to convert feature collections in a lower-memory context. `Feature` and `FeatureCollection` have a few helper functions to extract properties as values and lists.
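A minimal sketch of the idea described in this commit, under several assumptions: it uses standard-library dataclasses in place of pydantic models so it runs without dependencies, the helper names (`get_values`, `property_as_list`, `properties_as_rows`, `fetch_collection`) are hypothetical rather than the ones in this PR, and `fake_fetch_page` stands in for whatever Earth Engine call returns one page of features.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Feature:
    """One client-side feature: an id plus a flat property dictionary."""

    id: str
    properties: dict[str, Any] = field(default_factory=dict)

    def get_values(self, names: list[str]) -> list[Any]:
        """Return the values of the requested properties, in order."""
        return [self.properties[name] for name in names]


@dataclass
class FeatureCollection:
    """A client-side list of features with small extraction helpers."""

    features: list[Feature] = field(default_factory=list)

    def property_as_list(self, name: str) -> list[Any]:
        """Return one property across all features."""
        return [feature.properties[name] for feature in self.features]

    def properties_as_rows(self, names: list[str]) -> list[list[Any]]:
        """Return one row of property values per feature."""
        return [feature.get_values(names) for feature in self.features]


def fetch_collection(
    fetch_page: Callable[[int, int], list[dict]],
    n_features: int,
    chunk_size: int = 2,
    max_workers: int = 4,
) -> FeatureCollection:
    """Fetch small pages concurrently so no single request is large."""
    offsets = range(0, n_features, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the order of `offsets`.
        pages = list(pool.map(lambda offset: fetch_page(offset, chunk_size), offsets))
    features = [
        Feature(id=raw["id"], properties=raw["properties"])
        for page in pages
        for raw in page
    ]
    return FeatureCollection(features=features)


# Stand-in for the server: a real fetch_page would request one page of
# features from Earth Engine instead of slicing a local list (assumption).
_SERVER = [
    {"id": str(i), "properties": {"PLOT_ID": i, "ELEV": 100.0 * i}}
    for i in range(5)
]


def fake_fetch_page(offset: int, count: int) -> list[dict]:
    return _SERVER[offset : offset + count]


fc = fetch_collection(fake_fetch_page, n_features=len(_SERVER))
```

Fetching in small chunks keeps each response small, which is the lower-memory aspect; the thread pool only hides request latency and does not change what is fetched.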
@aazuspan, a stupid question. I thought using:
would allow the use of
Any help would be very much appreciated!
Interesting! If I'm remembering right, the […] A little short on details, but pydantic/pydantic#2112 seems to be describing the same issue. On the opposite end of the spectrum, pydantic/pydantic#2678 has a lot of detail, although I haven't read deeply enough into that discussion to figure out what the conclusion was or whether it's relevant to backwards compatibility...
@aazuspan, thanks for your help finding those issues. I'm not sure I understand much better than I did before other than it seems I'll have issues using
Typing is great to have, but kind of a pain as well!
I'd be tempted to drop
💯
This commit implements client-side fitting of four of the five estimators (excluding Raw) and implements them using sknnr transformers associated with each method. For this commit only, we have introduced a new method in each estimator called `train_client` whilst retaining the original `train` which is a server-side train/fit.

The general workflow is to convert the training data into a client-side feature collection (class FeatureCollection), fit the model using the transformer, and convert all needed data objects back into server-side objects to work with `predict` and `predict_fc`. For all but GNN, `predict` and `predict_fc` are unchanged and tests are passing for both client-side and server-side training. For GNN, there were slight modifications to `predict` and `predict_fc` in the transformations such that only client-side fitting is working. There are expected test failures in this commit.

In addition, we have dropped support for Python 3.8 in this commit as types `list` and `dict` were not compatible with pydantic BaseModel-derived classes for 3.8. Still, much refactoring needs to be done to remove duplication across the methods.
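The three-step workflow in this commit message (pull training data to the client, fit once, hand parameters back for prediction) can be sketched roughly as follows. This is an illustration only: a plain per-column standardizer is swapped in for the sknnr transformers, the function names are hypothetical, and the final step stops at plain Python lists rather than building real server-side objects.

```python
from statistics import fmean, pstdev


def fit_standardizer(rows: list[list[float]]) -> dict[str, list[float]]:
    """Fit per-column mean and population standard deviation on the client."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # Constant columns have zero spread; fall back to 1.0 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return {"means": means, "stds": stds}


def apply_standardizer(row: list[float], params: dict[str, list[float]]) -> list[float]:
    """Apply the fitted parameters to one row of feature values."""
    return [
        (value - mean) / std
        for value, mean, std in zip(row, params["means"], params["stds"])
    ]


# Step 1: training data already converted to client-side values (stand-in data).
training = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]

# Step 2: fit once on the client, independent of how many targets are predicted.
params = fit_standardizer(training)

# Step 3: `params` holds only plain lists, so it could be converted back into
# server-side objects for prediction (the conversion itself is not shown here).
centered = apply_standardizer([2.0, 20.0], params)
```

Because the fitted state reduces to a few lists of numbers, fitting happens once on the client and only those numbers need to travel back to the server.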
Great, thanks for your input. I've dropped 3.8 (and added 3.12) to this commit.
This commit changes the class hierarchy to reflect that of `sknnr`. The `GeeKnnClassifier` class has been renamed to `Raw`; a new abstract class called `Transformed` derives from `Raw`, and all other estimators (`Euclidean`, `Mahalanobis`, `MSN`, and `GNN`) are subclasses of `Transformed`. `Transformed` does most of the work in training and predicting the estimators, and four abstract methods delegate the responsibility of defining the transformer to the subclasses. Additionally, `cca.py` and `ccora.py` are no longer needed as all model fitting is now done through `sknnr`.
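As a rough illustration of the hierarchy described in this commit, the sketch below shows a concrete `Raw`, an abstract `Transformed` that derives from it, and one subclass that supplies the transformer. It makes several simplifying assumptions: the commit mentions four abstract methods, while this sketch uses a single hypothetical one (`_make_transformer`); the scaling in `Euclidean` is a stand-in, not sknnr's transformer; and everything runs on plain Python lists rather than Earth Engine objects.

```python
from abc import ABC, abstractmethod
from typing import Callable

Vector = list[float]


class Raw:
    """Nearest-neighbor lookup in the untransformed feature space."""

    def __init__(self, k: int = 1) -> None:
        self.k = k

    def _transform(self, x: Vector) -> Vector:
        # Raw applies no transformation.
        return x

    def fit(self, X: list[Vector], ids: list[int]) -> "Raw":
        self._X = [self._transform(x) for x in X]
        self._ids = ids
        return self

    def predict(self, x: Vector) -> list[int]:
        """Return the ids of the k nearest training rows."""
        query = self._transform(x)
        dists = [sum((a - b) ** 2 for a, b in zip(query, row)) for row in self._X]
        order = sorted(range(len(dists)), key=dists.__getitem__)
        return [self._ids[i] for i in order[: self.k]]


class Transformed(Raw, ABC):
    """Shared fit/predict logic; subclasses only define the transformer."""

    @abstractmethod
    def _make_transformer(self, X: list[Vector]) -> Callable[[Vector], Vector]:
        """Fit and return a function mapping a raw vector to the new space."""

    def fit(self, X: list[Vector], ids: list[int]) -> "Transformed":
        self._transformer = self._make_transformer(X)
        return super().fit(X, ids)

    def _transform(self, x: Vector) -> Vector:
        return self._transformer(x)


class Euclidean(Transformed):
    """Stand-in subclass: standardize each column before measuring distance."""

    def _make_transformer(self, X: list[Vector]) -> Callable[[Vector], Vector]:
        columns = list(zip(*X))
        means = [sum(col) / len(col) for col in columns]
        stds = [
            (sum((v - m) ** 2 for v in col) / len(col)) ** 0.5 or 1.0
            for col, m in zip(columns, means)
        ]
        return lambda x: [(v - m) / s for v, m, s in zip(x, means, stds)]


X = [[0.0, 0.0], [1.0, 100.0], [3.0, 10.0]]
ids = [10, 20, 30]
query = [1.0, 20.0]

raw_neighbors = Raw(k=1).fit(X, ids).predict(query)
euc_neighbors = Euclidean(k=1).fit(X, ids).predict(query)
```

With this layout, a new estimator only has to describe how its transformer is built; the neighbor search itself lives in one place.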
@aazuspan, if you have the time (and can stomach it), I'd love for you to take a quick look at the changes here. The general thrust of this PR is to bring in the
There are a few issues I know of at this point:
I know it's not in your nature, but please be brutally honest if you see some funky stuff and additional opportunities for refactoring. The good thing is that we're currently passing tests!
Absolutely, I'm excited to take a look and get a better idea of how this works! I tried running tests and it looks like I can access the table assets, but not the images. Do you mind sharing those?
FAILED tests/test_model_spatial.py::test_image_match[5-raw] - ee.ee_exception.EEException: Image.load: Image asset 'users/gregorma/gee-knn/test-check/raw_neighbors_600' not found (does not exist or caller does not have access).
FAILED tests/test_model_spatial.py::test_image_match[5-euc] - ee.ee_exception.EEException: Image.load: Image asset 'users/gregorma/gee-knn/test-check/euc_neighbors_600' not found (does not exist or caller does not have access).
FAILED tests/test_model_spatial.py::test_image_match[5-mah] - ee.ee_exception.EEException: Image.load: Image asset 'users/gregorma/gee-knn/test-check/mah_neighbors_600' not found (does not exist or caller does not have access).
FAILED tests/test_model_spatial.py::test_image_match[5-msn] - ee.ee_exception.EEException: Image.load: Image asset 'users/gregorma/gee-knn/test-check/msn_neighbors_600' not found (does not exist or caller does not have access).
FAILED tests/test_model_spatial.py::test_image_match[5-gnn] - ee.ee_exception.EEException: Image.load: Image asset 'users/gregorma/gee-knn/test-check/gnn_neighbors_600' not found (does not exist or caller does not have access).
Sorry about that. Should be shared with your gmail account now. Let me know if you have further issues.
All tests passing now, thanks!
There's a lot of really cool stuff going on here! I'm still trying to grok how it all fits together, so let me know if any of my comments are totally off base, but I made a first pass at least.
I felt a bit weird about calling these regressors again. But maybe it's better to keep them consistent with sknnr rather than with GEE.
Choosing names remains the hardest part of programming! I'm not sure that staying consistent with `sknnr` is necessarily the perfect choice, but it does seem like a defensible one at least, and would make it easy for someone familiar with one package to adapt to the other (although if you ever tried using them in a single script there might be some namespace headaches...). I think that's where I would lean by default, just because I don't have any better ideas.
@aazuspan, thanks for such a thorough review. I'll be picking through it over the next couple of days (then back to |
@aazuspan, I think I had started a review myself, but had never submitted it. Hopefully you can see my responses to your comments.
I do see a couple, but I got an email notification about a bunch of comments that aren't showing up for me, e.g. your answer about |
I really messed everything up. But I think here is what happened (chronologically):
So, I'll try to go back and put in what I had written (I don't think deleted comments can be retrieved), but probably best to use the emails that were sent as the definitive source. I think the crux is that one shouldn't respond when an active review is ongoing. Sorry about this!
Oh no, that's a hassle to have to re-enter all your responses. I think I have the text of all your responses in the notification emails, I just don't necessarily know what they were responses to (although I can probably guess in most cases). Do you have those, or would it be helpful if I forwarded the email? GitHub is a pretty smooth experience overall, but figuring out where/if review comments are going to show up seems like it's always a little bit of a gamble, so you're not alone!
If you have those and it would be easy enough to send, that would be great. I don't get a copy of changes I make.
- Remove no-op `Raw.transform_image` and `Raw.transform_fc`
- Convert `Raw.predict`, `Transformed.predict`, and `Transformed.transform` into `singledispatchmethod`s so that image and feature collection transformation and prediction come from `transform` and `predict`, respectively.
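For readers unfamiliar with `functools.singledispatchmethod`, here is a small sketch of the pattern this commit describes: one `predict` entry point that dispatches on the type of its argument. The `Image` and `FeatureCollection` classes below are local stand-ins for the Earth Engine types, and the `Estimator` class and its return strings are purely illustrative, not the code in this PR.

```python
from dataclasses import dataclass
from functools import singledispatchmethod


@dataclass
class Image:
    """Stand-in for an Earth Engine image."""

    name: str


@dataclass
class FeatureCollection:
    """Stand-in for an Earth Engine feature collection."""

    name: str


class Estimator:
    @singledispatchmethod
    def predict(self, data):
        # Fallback for types with no registered implementation.
        raise NotImplementedError(f"Cannot predict on {type(data).__name__}")

    @predict.register(Image)
    def _(self, data: Image) -> str:
        return f"neighbor image from {data.name}"

    @predict.register(FeatureCollection)
    def _(self, data: FeatureCollection) -> str:
        return f"neighbor table from {data.name}"


estimator = Estimator()
image_result = estimator.predict(Image("covariates"))
table_result = estimator.predict(FeatureCollection("plots"))
```

Dispatch happens on the first argument after `self`, so callers use a single method name while each input type keeps its own implementation.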
I know it's not what you suggested, but I'm considering the following name changes:
I'm thinking of this from the following perspective:
Thoughts?
Good point! I think your logic makes sense, and I'm happy to go with the |
…medKNNClassifier.predict`
@aazuspan, you've been incredibly gracious with your time on reviewing this PR. Thank you so much. I'm running out of steam a bit and will migrate these issues you've identified below into separate issues.
There is still a bit of work to do, but I'd like to merge in this PR now, mostly because I need to test with the larger SERVIR workflow this coming week. I've learned that this was much too ambitious of a PR on its own!
Sounds great @grovduck! Thanks for helping me get up to speed on this. I'm excited to start exploring it with some real data!
This PR addresses #3 by leveraging `sknnr` to provide estimators for this package. As noted in #3, running the estimators client-side will "interrupt" the GEE server-side flow and thus run training only once for all targets. It also removes the duplication of implementing these ordination estimators in more than one repository.

This is still very much a work in progress and I will be addressing the following issues:

- Create a base class for all estimators to reduce duplicated code across estimators
- […] `sknnr`, make `Raw` the base class
- […] `Raw` (currently called `Transformed`) to be the base class for all estimators that use transformations
- […] `sknnr` transformation objects to handle client-side ordination
- […] `predict` and `predict_fc` to `Transformed` and remove from derived classes
- […] `@abstractmethod` methods in `Transformed` to handle transformations and implement in derived classes

As this PR turned out to be very ambitious, there are a few tasks that I decided not to tackle with this PR and turn into separate issues:

- […] `Colocation` class so that user doesn't need to instantiate
- […] `n_components` and `spp_transform` values on `GNN` to ensure correctness
- […] `mode` keyword on feature collections to get distances and tests
- […] `sknnr-spatial` and this training data to see if we can increase the number of expected matches of pixels in the test surfaces
- […] `id_field` does not store an integer type