Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add area averaged mean aggfunc to dissolve #3229

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

nicholas-ys-tan
Copy link
Contributor

Resolves #3028.

A proposed approach to do an area averaged mean (specific aggfunc argument TBD).

This approach allows us to use the existing pandas aggfunc so the kwargs still take effect.

Have refactored to process geometries first, allowing processing of spatial_average_mean in one if statement, instead of having one at the start during non-geometry processing, and one after geometry processing (ref first commit before refactoring).

Haven't fully tested the kwargs, nor written unit testing yet, nor documentation. Seeking prelim feedback before progressing further.

@nicholas-ys-tan
Copy link
Contributor Author

nicholas-ys-tan commented Mar 30, 2024

Have a question on desired behaviour. In the event a polygon and a 1-dimensional geometry type (Point or LineString) are dissolved together. Would we want the area weighted mean to:

  1. raise an error
  2. ignore the point data and just operate on the 2d data
  3. return nan in any rows that involved a 1-dimensional geometry (current behaviour)

@martinfleis
Copy link
Member

I would raise an error as area-weighted mean is not defined for other than polygonal geometries.

@nicholas-ys-tan
Copy link
Contributor Author

nicholas-ys-tan commented Apr 1, 2024

Latest update greatly improves readability by using a user-defined function for aggfunc but have run into some issues around the numeric_only keyword when running with a dataframe that has non-numeric columns. Using numeric_only argument gets fed into the UDF for area-weighted mean.

Seems to be a well-documented but perhaps an incompletely resolved issue at this time - pandas-dev/pandas#50538. Continuing to look into it to determine best approach.

@nicholas-ys-tan
Copy link
Contributor Author

nicholas-ys-tan commented Apr 1, 2024

I seem to have opened a can of worms by opting to feed a function to aggfunc to get the area-weighted-mean. I'm summarising below to keep track of the relevant information and where it stands today.

  • This is an existing issue on pandas side where the behaviour of numeric_only seems to work, it works fine when using pandas functions: 'first', 'last', 'min', 'max', 'sum', 'mean', 'median'. But not when fed with a function e.g. np.mean. It tries to feed the kwarg numeric_only into the function np.mean.
  • Feeding a function works fine if we don't use any kwargs. The issue with this arises if using a dataframe with non-numeric datatypes. An error will raise when trying the find the np.mean of strings for example.
  • This makes my current set-up untenable if trying to find the area-weighted-mean of a dataframe with non-numeric datatypes. I can make the area-weighted lambda function take in **kwargs** but then it will still error out when trying to operate on the string data.
  • There seems to have been a PR merged to 1.5.x-dev on pandas that would have provided more informative error messages but for some reason don't seem to be able to find it in the main branch. REGR: NumPy func warning when dropping nuisance in agg, apply, transform pandas-dev/pandas#50627
  • As it stands it is currently on the pandas to do list: API: Consistently support numeric_only in groupby ops pandas-dev/pandas#56946

My view on what we can do next:

  • Put this ticket on ice until numeric_only compatibility with user-defined functions is addressed upstream (We would continue to have issues when feeding UDF aggfuncs to dissolve where non-numeric dtypes exist.
  • Address with an "ugly patch" - something like what @jorisvandenbossche put forward: API: GroupBy.agg() numeric_only deprecation with custom function pandas-dev/pandas#50538 (comment) but on geopandas end in dissolve. So if numeric_only has been fed as a kwarg, we can pop it and filter out the non-numeric columns in our end before passing through to the dataframe.groupby().agg() function.
  • Put in an error message if the user is trying to run area_weighted_mean with a dataframe with non-numeric datatypes that they would need to manually drop the non-numeric columns first. Can also be done more generally where it should be used if a UDF and numeric_only are supplied.
  • By default, always operate only on numeric columns regardless of numeric_only kwarg

It may be worth noting the existing docstring may cause some confusion as it suggests numeric_only may be needed in pandas 2.0 for certain agg functions, but what is not noted is that it would also cause issues for other agg functions (UDFs in particular).

Apologies if I am off the mark in my investigation in this and wasted your time reading this wall of text.

@martinfleis
Copy link
Member

@nicholas-ys-tan is this ready for review? Asking given the draft status of the PR.

@nicholas-ys-tan
Copy link
Contributor Author

nicholas-ys-tan commented May 21, 2024

I have submitted this PR to support using area_weighted_mean as an aggregate function in dissolve.

There is a major limitation to this in that the GeoDataFrame must be numeric only (with the exception of the geometry column and any columns listed in by for grouping.

In theory, we should have been able to get around this by using the numeric_only kwarg. However, there is an upstream bug in pandas where the numeric_only kwarg gets passed into any user-definied functions (UDF). This kwarg only works with the built-in methods such as 'sum', 'mean' etc. This is an existing issue where aggfunctions such as np.sum and np.mean` may cause issues.

I attempted to submit a PR to address this upstream, but ran into further bugs that made this challenging. I have suspended work there as it was interfering with the work they are already putting in to fix.
pandas-dev/pandas#58146

Support of numeric_only for UDF is on their radar too.
pandas-dev/pandas#56946

In the mean time, if we wanted to proceed with this, I have added error catching to disallow any dataframes with non-numeric columns (except geometry and those listed in by) to be used with area_weighted_mean. Alternatively we can wait for the bugs to be fixed and remove those checks.

@nicholas-ys-tan
Copy link
Contributor Author

@nicholas-ys-tan is this ready for review? Asking given the draft status of the PR.

Thanks, yes, it is, I was just writing a comment to where this is now. Some upstream bugs detailed in the comments have enforced some limitations on what dataframes this can be used for, not sure if this is something we would still want to introduce at this time. But perhaps there is a better way to do this which I haven't thought of yet.

@nicholas-ys-tan nicholas-ys-tan marked this pull request as ready for review May 21, 2024 06:36
@martinfleis
Copy link
Member

One way around these limitations is to ensure that we intercept "area_weighted_mean" in the dict input, so users can specify which columns shall be aggregated using area_weighted_mean and which shall use some other method. Then we can revise as soon as pandas provides better support.

Intercepting list input would also be useful but that won't solve any of the numeric_only issues.

@nicholas-ys-tan
Copy link
Contributor Author

One way around these limitations is to ensure that we intercept "area_weighted_mean" in the dict input, so users can specify which columns shall be aggregated using area_weighted_mean and which shall use some other method. Then we can revise as soon as pandas provides better support.

Intercepting list input would also be useful but that won't solve any of the numeric_only issues.

Thanks @martinfleis , I had actually missed that aggfunc also accepts list and dictionary inputs. I have updated with the latest commit to process list and dict inputs for aggfuncs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: support area-weighted mean in dissolve
2 participants