regression in a custom groupby.agg when split_out is not None #8611

martinfleis · 2022-01-23T15:50:24Z

What happened:

It seems that the changes in #8468 broke custom aggregation over in dask_geopandas, where we implement our own function to be able to aggregate geometries. When specifying split_out in the .agg method, it currently raises an error. With split_out=None, it is all fine, so I think that this is an issue on the dask side, not on dask_geopandas side.

What you expected to happen:

I expect that the aggregation will work no matter if I specify split_out or not.

Minimal Complete Verifiable Example:

This example requires geopandas and dask_geopandas, I wasn't able to craft one with pure dask.

import geopandas
import dask_geopandas

world = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
ddf = dask_geopandas.from_geopandas(world, npartitions=4)


def union(block):
    merged_geom = block.unary_union
    return merged_geom

merge_geometries = dd.Aggregation(
    "merge_geometries", lambda s: s.agg(union), lambda s0: s0.agg(union)
)
data_agg = {col: aggfunc for col in ddf.columns.drop([ddf.geometry.name])}
data_agg[ddf.geometry.name] = merge_geometries

# this works
# aggregated = ddf.groupby(by="continent").agg(data_agg)

# this doesn't
aggregated = ddf.groupby(by="continent").agg(data_agg, split_out=2)

Traceback:

```py --------------------------------------------------------------------------- KeyError Traceback (most recent call last) ~/mambaforge/envs/geo_dev/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance) 3360 try: -> 3361 return self._engine.get_loc(casted_key) 3362 except KeyError as err:

~/mambaforge/envs/geo_dev/lib/python3.9/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'geometry'

The above exception was the direct cause of the following exception:

KeyError Traceback (most recent call last)
~/Git/geopandas/geopandas/geodataframe.py in set_geometry(self, col, drop, inplace, crs)
310 try:
--> 311 level = frame[col]
312 except KeyError:

~/Git/geopandas/geopandas/geodataframe.py in getitem(self, key)
1349 """
-> 1350 result = super().getitem(key)
1351 geo_col = self._geometry_column_name

~/mambaforge/envs/geo_dev/lib/python3.9/site-packages/pandas/core/frame.py in getitem(self, key)
3457 return self._getitem_multilevel(key)
-> 3458 indexer = self.columns.get_loc(key)
3459 if is_integer(indexer):

~/mambaforge/envs/geo_dev/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3362 except KeyError as err:
-> 3363 raise KeyError(key) from err
3364

KeyError: 'geometry'

During handling of the above exception, another exception occurred:

ValueError Traceback (most recent call last)
~/mambaforge/envs/geo_dev/lib/python3.9/site-packages/dask/dataframe/utils.py in raise_on_meta_error(funcname, udf)
174 try:
--> 175 yield
176 except Exception as e:

~/mambaforge/envs/geo_dev/lib/python3.9/site-packages/dask/dataframe/core.py in _emulate(func, udf, args, **kwargs)
5965 with raise_on_meta_error(funcname(func), udf=udf):
-> 5966 return func(_extract_meta(args, True), **_extract_meta(kwargs, True))
5967

~/mambaforge/envs/geo_dev/lib/python3.9/site-packages/dask/dataframe/core.py in _extract_meta(x, nonempty)
5945 elif isinstance(x, tuple):
-> 5946 return tuple(_extract_meta(_x, nonempty) for _x in x)
5947 elif isinstance(x, dict):

~/mambaforge/envs/geo_dev/lib/python3.9/site-packages/dask/dataframe/core.py in (.0)
5945 elif isinstance(x, tuple):
-> 5946 return tuple(_extract_meta(_x, nonempty) for _x in x)
5947 elif isinstance(x, dict):

~/mambaforge/envs/geo_dev/lib/python3.9/site-packages/dask/dataframe/core.py in _extract_meta(x, nonempty)
5941 if isinstance(x, (Scalar, _Frame)):
-> 5942 return x._meta_nonempty if nonempty else x._meta
5943 elif isinstance(x, list):

~/mambaforge/envs/geo_dev/lib/python3.9/site-packages/dask/dataframe/core.py in _meta_nonempty(self)
430 """A non-empty version of _meta with fake data."""
--> 431 return meta_nonempty(self._meta)
432

~/mambaforge/envs/geo_dev/lib/python3.9/site-packages/dask/utils.py in call(self, arg, *args, **kwargs)
623 meth = self.dispatch(type(arg))
--> 624 return meth(arg, *args, **kwargs)
625

~/Git/dask-geopandas/dask_geopandas/backends.py in _nonempty_geodataframe(x)
59 df = meta_nonempty_dataframe(x)
---> 60 return geopandas.GeoDataFrame(df, geometry=x._geometry_column_name, crs=x.crs)
61

~/Git/geopandas/geopandas/geodataframe.py in init(self, data, geometry, crs, *args, **kwargs)
203 # TODO: raise error in 0.9 or 0.10.
--> 204 self.set_geometry(geometry, inplace=True)
205

~/Git/geopandas/geopandas/geodataframe.py in set_geometry(self, col, drop, inplace, crs)
312 except KeyError:
--> 313 raise ValueError("Unknown column %s" % col)
314 except Exception:

ValueError: Unknown column geometry

The above exception was the direct cause of the following exception:

ValueError Traceback (most recent call last)
/var/folders/2f/fhks6w_d0k556plcv3rfmshw0000gn/T/ipykernel_44916/2101873994.py in
12 data_agg = {col: aggfunc for col in ddf.columns.drop([ddf.geometry.name])}
13 data_agg[ddf.geometry.name] = merge_geometries
---> 14 aggregated = ddf.groupby(by="continent").agg(data_agg, split_out=2)

~/mambaforge/envs/geo_dev/lib/python3.9/site-packages/dask/dataframe/groupby.py in agg(self, arg, split_every, split_out)
2006 @derived_from(pd.core.groupby.DataFrameGroupBy)
2007 def agg(self, arg, split_every=None, split_out=1):
-> 2008 return self.aggregate(arg, split_every=split_every, split_out=split_out)
2009
2010

~/mambaforge/envs/geo_dev/lib/python3.9/site-packages/dask/dataframe/groupby.py in aggregate(self, arg, split_every, split_out)
2002 return self.size()
2003
-> 2004 return super().aggregate(arg, split_every=split_every, split_out=split_out)
2005
2006 @derived_from(pd.core.groupby.DataFrameGroupBy)

~/mambaforge/envs/geo_dev/lib/python3.9/site-packages/dask/dataframe/groupby.py in aggregate(self, arg, split_every, split_out)
1619 )
1620
-> 1621 return aca(
1622 chunk_args,
1623 chunk=_groupby_apply_funcs,

~/mambaforge/envs/geo_dev/lib/python3.9/site-packages/dask/dataframe/core.py in apply_concat_apply(args, chunk, aggregate, combine, meta, token, chunk_kwargs, aggregate_kwargs, combine_kwargs, split_every, split_out, split_out_setup, split_out_setup_kwargs, sort, ignore_index, **kwargs)
5878 # Blockwise Split Layer
5879 if split_out and split_out > 1:
-> 5880 chunked = chunked.map_partitions(
5881 hash_shard,
5882 split_out,

~/mambaforge/envs/geo_dev/lib/python3.9/site-packages/dask/dataframe/core.py in map_partitions(self, func, *args, **kwargs)
769 None as the division.
770 """
--> 771 return map_partitions(func, self, *args, **kwargs)
772
773 @insert_meta_param_description(pad=12)

~/mambaforge/envs/geo_dev/lib/python3.9/site-packages/dask/dataframe/core.py in map_partitions(func, meta, enforce_metadata, transform_divisions, align_dataframes, *args, **kwargs)
6033 # Use non-normalized kwargs here, as we want the real values (not
6034 # delayed values)
-> 6035 meta = _emulate(func, *args, udf=True, **kwargs)
6036 else:
6037 meta = make_meta(meta, index=meta_index, parent_meta=parent_meta)

~/mambaforge/envs/geo_dev/lib/python3.9/site-packages/dask/dataframe/core.py in _emulate(func, udf, args, **kwargs)
5964 """
5965 with raise_on_meta_error(funcname(func), udf=udf):
-> 5966 return func(_extract_meta(args, True), **_extract_meta(kwargs, True))
5967
5968

~/mambaforge/envs/geo_dev/lib/python3.9/contextlib.py in exit(self, typ, value, traceback)
135 value = typ()
136 try:
--> 137 self.gen.throw(typ, value, traceback)
138 except StopIteration as exc:
139 # Suppress StopIteration unless it's the same exception that

~/mambaforge/envs/geo_dev/lib/python3.9/site-packages/dask/dataframe/utils.py in raise_on_meta_error(funcname, udf)
194 )
195 msg = msg.format(f" in {funcname}" if funcname else "", repr(e), tb)
--> 196 raise ValueError(msg) from e
197
198

ValueError: Metadata inference failed in hash_shard.

You have supplied a custom function and Dask is unable to
determine the type of output that that function returns.

To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.

Original error is below:

ValueError('Unknown column geometry')

Traceback:

File "/Users/martin/mambaforge/envs/geo_dev/lib/python3.9/site-packages/dask/dataframe/utils.py", line 175, in raise_on_meta_error
yield
File "/Users/martin/mambaforge/envs/geo_dev/lib/python3.9/site-packages/dask/dataframe/core.py", line 5966, in _emulate
return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
File "/Users/martin/mambaforge/envs/geo_dev/lib/python3.9/site-packages/dask/dataframe/core.py", line 5946, in _extract_meta
return tuple(_extract_meta(_x, nonempty) for _x in x)
File "/Users/martin/mambaforge/envs/geo_dev/lib/python3.9/site-packages/dask/dataframe/core.py", line 5946, in
return tuple(_extract_meta(_x, nonempty) for _x in x)
File "/Users/martin/mambaforge/envs/geo_dev/lib/python3.9/site-packages/dask/dataframe/core.py", line 5942, in _extract_meta
return x._meta_nonempty if nonempty else x._meta
File "/Users/martin/mambaforge/envs/geo_dev/lib/python3.9/site-packages/dask/dataframe/core.py", line 431, in _meta_nonempty
return meta_nonempty(self._meta)
File "/Users/martin/mambaforge/envs/geo_dev/lib/python3.9/site-packages/dask/utils.py", line 624, in call
return meth(arg, *args, **kwargs)
File "/Users/martin/Git/dask-geopandas/dask_geopandas/backends.py", line 60, in _nonempty_geodataframe
return geopandas.GeoDataFrame(df, geometry=x._geometry_column_name, crs=x.crs)
File "/Users/martin/Git/geopandas/geopandas/geodataframe.py", line 204, in init
self.set_geometry(geometry, inplace=True)
File "/Users/martin/Git/geopandas/geopandas/geodataframe.py", line 313, in set_geometry
raise ValueError("Unknown column %s" % col)

</details>

**Anything else we need to know?**:

**Environment**:

- Dask version: '2022.01.0+17.g08eee6d6'
- Python version: 3.9.7
- Operating System: macOS, ubuntu
- Install method (conda, pip, source): pip + git

<!-- If you are reporting an issue such as scale stability, cluster deadlock.
Please provide a cluster dump state with this issue, by running client.dump_cluster_state()

https://distributed.dask.org/en/stable/api.html?highlight=dump_cluster_state#distributed.Client.dump_cluster_state

-->

<details>
<summary>Cluster Dump State:</summary>

</details>

The text was updated successfully, but these errors were encountered:

jorisvandenbossche · 2022-01-23T16:57:53Z

It's probably not a regression in dask, but an existing issue with the interaction with geopandas.GeoDataFrame being a special sublcass.

What changed in #8468, is that apply_concat_apply starts to use map_partitions. And with split_out specified, it calls map_partitions a second time, which I think is what triggers the direct error. But I think it's actually already caused by the first call of map_partitions.
In our case of the dissolve method, this is basically a wrapper around groupby. And then, the function being mapped is _groupby_apply_funcs, which has this part in the code:

dask/dask/dataframe/groupby.py

Lines 929 to 930 in 08eee6d

    
           if is_dataframe_like(df): 
        
               return df.__class__(result)

where result is a dictionary of columns, but with renamed column names.

A very simple example that reproduces the underlying issue (without involving groupby / apply_concat_apply):

import geopandas
import dask_geopandas

df = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
ddf = dask_geopandas.from_geopandas(df, npartitions=4)

def func(df):
    result = {}
    for col in df.columns:
        result["prefix-" + col] = df[col]
    # naively recreating the class, without accounting for the renamed geometry column
    return df.__class__(result)


>>> res = ddf.map_partitions(func)
>>> res
Dask GeoDataFrame Structure:
              prefix-pop_est prefix-continent prefix-name prefix-iso_a3 prefix-gdp_md_est prefix-geometry
npartitions=4                                                                                            
0                      int64           object      object        object           float64        geometry
45                       ...              ...         ...           ...               ...             ...
90                       ...              ...         ...           ...               ...             ...
135                      ...              ...         ...           ...               ...             ...
176                      ...              ...         ...           ...               ...             ...
Dask Name: func, 8 tasks

>>> res._meta
Empty GeoDataFrame
Columns: [prefix-pop_est, prefix-continent, prefix-name, prefix-iso_a3, prefix-gdp_md_est, prefix-geometry]
Index: []

>>> res._meta._geometry_column_name
'geometry'

So we get a dask_geopandas GeoDataFrame where the _meta is an "invalid" GeoDataFrame, because the geometry column name is not correctly set.

Since dask cannot know that it should handle GeoDataFrame construction in a special way inside _groupby_apply_funcs, this is probably something to discuss on the geopandas side on how we can deal with this.

rjzamora · 2022-01-24T14:34:39Z

Thanks for investigating this @jorisvandenbossche! Feel free to ping me if we need changes on the Dask side

jorisvandenbossche · 2022-01-24T14:45:36Z

Maybe a general question about the groupby logic in dask: why is it needed to rename the columns (with prefix about the task and hash suffix) in the intermediary objects?
(it's this renaming of columns that causes issues for geopandas)

jorisvandenbossche · 2022-01-27T09:06:26Z

The renaming seems to happen at

dask/dask/dataframe/groupby.py

Lines 574 to 575 in 08eee6d

    
           def _make_agg_id(func, column): 
        
               return f"{func!s}-{column!s}-{tokenize(func, column)}"

If I naively change that to return the original column, the example in the top post works again (but there are also a whole bunch of groupby tests that fail). Now, I suppose there is a good reason for this renaming of the aggregated columns, but I don't directly understand it.

martinfleis · 2022-01-27T21:54:26Z

We have discussed this issue with @jorisvandenbossche and it is unlikely that we will be able to fix it on geopandas or dask-geopandas side.

geopandas.GeoDataFrame defines a .geometry property that takes care of all geospatial operations (interfacing shapely and pygeos). .geometry always points to one of the columns of the GeoDataFrame based on its name ("geometry" by default but can be any). When dask renames the columns in groupby, this link is broken and any geospatial operation called within groupby (like the unary_union above) fails because geopandas doesn't know which column is the active geometry anymore as the name it expects does no longer exist in the GeoDataFrame. Since there can be multiple geometry columns (GeoSeries), we cannot infer this active geometry from the dtype and using positional index also does not seem to be feasible.

I am not certain what has changed in dask now, but it seems that until #8468 the renaming either did not happen or happened in a way that did not break this link column<->.geometry. But I think that we'll need your help on the dask side to fix this issue.

gjoseph92 · 2022-01-27T22:30:28Z

Is this a regression we need to fix or revert before releasing dask/community#216? cc @jrbourbeau

@martinfleis @jorisvandenbossche do you have a sense of what the fix would be? Not renaming the columns in groupby seems like one idea—I haven't checked yet to understand why we do that. But I get the sense that the more robust fix would be to use something more advanced than:

dask/dask/dataframe/groupby.py

Lines 929 to 930 in 08eee6d

    
           if is_dataframe_like(df): 
        
               return df.__class__(result)

Even if we stopped renaming the columns, if a user was using a geometry column that wasn't named geometry, wouldn't this still break? It seems to me like we need some interface to get extra arguments into the dataframe constructor (or to specify a custom constructor function, which you could functools.partial as needed).

martinfleis · 2022-01-27T22:53:25Z

Is this a regression we need to fix or revert before releasing dask/community#216?

It breaks dissolve() in dask-geopandas when split_out is specified but we haven't released this yet (will do in a week or two, hopefully). So if there's a perspective that this will get fixed in on of the upcoming versions there's no need to rush for 2022.01.1.

Even if we stopped renaming the columns, if a user was using a geometry column that wasn't named geometry, wouldn't this still break?

It probably would. When the GeoDataFrame is constructed from a dict as in this case, geopandas either checks for "geometry" or needs the geometry column to be specified. I think that we would need to pass a custom constructor that would get the information on the geometry column from _meta? I am not really sure, my insight into dask internals is quite limited. Maybe @jorisvandenbossche got some ideas.

rjzamora · 2022-01-28T20:23:04Z

Hmm - I am having a hard time figuring out how #8468 (or setting split_out) ended up changing behavior like this. It really seems like _groupby_apply_funcs has always initialized a new dataframe-like object with modified column names. You are correct that we do not need to rename columns in all cases, but we certainly do need to create new names if we are performing more than one type of aggregation on the same column (pretty common). For example, we may want the "min" and "max" of the same column, or we may need the count and the sum to calculate mean in the final output.

I'm really hoping to find exactly what changed in #8468, but it also seems like the assumption that this geometry parameter is always defined may be a bit tricky in the case of a ACA. Is it possible to point this property to a fallback function that raises an error if the user tries to do anything that requires it? I realize you would typically want to let the user know right away when they initialize an invalid object, so no worries if this seems unattractive. What would be the ideal approach in the case that you are performing multiple aggregations on the "geometry" column here? Would you want both to be referenced by geometry?

martinfleis · 2022-01-28T22:24:27Z

@rjzamora It had to be some commit merged on 22/1 and #8468 seems to be the only candidate.

Is it possible to point this property to a fallback function that raises an error if the user tries to do anything that requires it?

I am not sure what do you mean but we need to use it, we can't just raise an informative error. The whole point of dask-geopandas is that is enables geospatial operations and the needs to include aggregations. The dissolve example we are talking about here is one of the most common use cases.

The invalid object is not created by user, it is created by dask and user has no way of influencing it as far as I can tell.

What would be the ideal approach in the case that you are performing multiple aggregations on the "geometry" column here? Would you want both to be referenced by geometry?

The actual aggregation of the column is not necessarily problematic. If you get a GeoSeries, you no longer have the issue with unset geometry, and the geometric operation just works. So they can be called whatever you need them to be called and it should not affect the aggregation. In the example above, the error does not come from calling the geometric unary_union but from recreating meta if I understand the error correctly. This is where the issue is raised:

@meta_nonempty.register(geopandas.GeoDataFrame)
def _nonempty_geodataframe(x):
    df = meta_nonempty_dataframe(x)
    return geopandas.GeoDataFrame(df, geometry=x._geometry_column_name, crs=x.crs)

Once Dask calls this after it renames the columns, x._geometry_column_name which is a string, usually "geometry" cannot find a column of the same name and raises a ValueError.

rjzamora · 2022-01-28T23:37:23Z

if I understand the error correctly. This is where the issue is raised:

 Ah - I was completely focused on the return df.__class__(result) discussion above. However, I now realize that GeoDataFrame(result) will work fine when result is a dict without a “geometry” field. All of my questions/suggestions above were assuming that something like that will fail. The only time that “geometry” constraint shows up is when you try to create a GeoDataFrame metadata object. Now it makes perfect sense why moving to map_partitions in #8468 is changing behavior.

rjzamora · 2022-01-29T00:31:21Z

It seems like the problem may only show up when a "non-empty" metadata object needs to be generated by geopandas. I experimented with some quick things in ACA, and it seems like this branch (which avoids the need to non-empty metadata in thehas_shard operation relevent for split_out>1) may resolve this issue.

martinfleis · 2022-01-29T09:45:19Z

@rjzamora dask-geopandas CI passes with this branch, so it may be a solution. Thanks for looking into that!

jorisvandenbossche · 2022-01-29T09:53:33Z

Hmm - I am having a hard time figuring out how #8468 (or setting split_out) ended up changing behavior like this. It really seems like _groupby_apply_funcs has always initialized a new dataframe-like object with modified column names.

Given your last comments it might now be clear to you, but just to reiterate in general, my understanding is that:

The groupby implementation renames columns and recreates a GeoDataFrame (the return df.__class__(result) where result is a dict with renamed keys). This in itself is not a problem (geopandas will let that pass fine, although you get an "invalid" geodataframe, that's not a direct problem because the columns get renamed back to the original ones before the end result is passed to the user). But, it causes a problem in _meta_nonempty if this is called on the invalid GeoDataFrame (so if it is called before the columns are renamed back to the original).

Move DataFrame ACA aggregations to HLG #8468 started to use map_partitions in the aca implementation that is used in groupby.agg(..), and map_partitions calls _meta_nonempty (though _emulate:

dask/dask/dataframe/core.py

Lines 6041 to 6044 in 04ade86

    
           if meta is no_default: 
        
               # Use non-normalized kwargs here, as we want the real values (not 
        
               # delayed values) 
        
               meta = _emulate(func, *args, udf=True, **kwargs)

, which always calls _extract_meta with nonempty=True).
So indirectly, #8468 started to call _meta_nonempty.

With the default groupby.agg(..)/aca, map_partitions gets only called once, and at that point the _meta_nonempty gets constructed before the columns are renamed in the _groupby_apply_funcs.
But if using split_out, map_partitions gets called a second time, and then _meta_nonemtpy gets constructed again, and now with the renamed columns, giving the error in constructing the nonempty GeoDataFrame.

So indeed if we could avoid the use of _meta_nonempty in map_partitions, that would solve this issue.

While writing this up, I also thought of a potential solution on the dask-geopandas side in being a bit more flexible in creating the meta-nonempty:

--- a/dask_geopandas/backends.py
+++ b/dask_geopandas/backends.py
@@ -56,8 +56,12 @@ def _nonempty_geoseries(x, idx=None):
 
 @meta_nonempty.register(geopandas.GeoDataFrame)
 def _nonempty_geodataframe(x):
     df = meta_nonempty_dataframe(x)
-    return geopandas.GeoDataFrame(df, geometry=x._geometry_column_name, crs=x.crs)
+    try:
+        return geopandas.GeoDataFrame(df, geometry=x._geometry_column_name, crs=x.crs)
+    except:
+        return geopandas.GeoDataFrame(df)

This also "solves" the issue for this specific case. I am only not sure if it won't give problems in other cases.
The reason this fixes the issue here for groupby, is because this will not error when creating the _meta_nonempty the second time with the renamed columns, and because the columns are renamed back to the original ones before returning the result to the user, the end result is a proper GeoDataFrame (this will also only work with the default "geometry" column name).

martinfleis · 2022-01-29T10:12:36Z

@jorisvandenbossche I think that the solution avoiding call to meta_nonempty is better here because, as you pointed out, this would work only for the geometry column named "geometry". It could allow a workaround but if we can make it generalisable for any name, that would be optimal.

rjzamora · 2022-01-31T15:39:46Z

@jorisvandenbossche - As far as I can tell, your understanding is 100% correct. I'm sorry I didn't get a chance to explain in detail, and appreciate you writing everything up!

I think it makes sense to "fix" this issue by either avoiding the call to meta_nonempty when the output will be invalid, or by making the _nonempty_geodataframe more flexible. However, since the hash_shard operation is not technically producing a public DataFrame collection here, I also think the simple change in #8641 may be best (avoid metadata emulation altogether (not just nonempty emulation), since we really don't care that the final result is a valid DataFrame object).

martinfleis mentioned this issue Jan 23, 2022

Failing tests with geopandas and dask main geopandas/dask-geopandas#135

Closed

ian-r-rose self-assigned this Jan 26, 2022

martinfleis mentioned this issue Jan 27, 2022

add warning about split_out issue in dissolve geopandas/dask-geopandas#140

Closed

jorisvandenbossche mentioned this issue Jan 31, 2022

TST: skip dissolve split_out tests for dask 2022.01.1 geopandas/dask-geopandas#150

Merged

rjzamora mentioned this issue Jan 31, 2022

Avoid invalid metadata emulation in DataFrame ACA #8641

Closed

3 tasks

rjzamora mentioned this issue Jan 31, 2022

Pass place-holder metadata to map_partitions in ACA code path #8643

Merged

4 tasks

rjzamora closed this as completed in #8643 Feb 2, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

regression in a custom groupby.agg when split_out is not None #8611

regression in a custom groupby.agg when split_out is not None #8611

martinfleis commented Jan 23, 2022

Original error is below:

Traceback:

jorisvandenbossche commented Jan 23, 2022

rjzamora commented Jan 24, 2022

jorisvandenbossche commented Jan 24, 2022

jorisvandenbossche commented Jan 27, 2022

martinfleis commented Jan 27, 2022

gjoseph92 commented Jan 27, 2022 •

edited

martinfleis commented Jan 27, 2022

rjzamora commented Jan 28, 2022

martinfleis commented Jan 28, 2022

rjzamora commented Jan 28, 2022

rjzamora commented Jan 29, 2022

martinfleis commented Jan 29, 2022

jorisvandenbossche commented Jan 29, 2022 •

edited

martinfleis commented Jan 29, 2022

rjzamora commented Jan 31, 2022 •

edited

regression in a custom groupby.agg when split_out is not None #8611

regression in a custom groupby.agg when split_out is not None #8611

Comments

martinfleis commented Jan 23, 2022

Original error is below:

Traceback:

jorisvandenbossche commented Jan 23, 2022

rjzamora commented Jan 24, 2022

jorisvandenbossche commented Jan 24, 2022

jorisvandenbossche commented Jan 27, 2022

martinfleis commented Jan 27, 2022

gjoseph92 commented Jan 27, 2022 • edited

martinfleis commented Jan 27, 2022

rjzamora commented Jan 28, 2022

martinfleis commented Jan 28, 2022

rjzamora commented Jan 28, 2022

rjzamora commented Jan 29, 2022

martinfleis commented Jan 29, 2022

jorisvandenbossche commented Jan 29, 2022 • edited

martinfleis commented Jan 29, 2022

rjzamora commented Jan 31, 2022 • edited

gjoseph92 commented Jan 27, 2022 •

edited

jorisvandenbossche commented Jan 29, 2022 •

edited

rjzamora commented Jan 31, 2022 •

edited