swifter.groupby() does not support dropna=False #202

Open
yangyxt opened this issue Sep 28, 2022 · 5 comments
Assignees
Labels
bug Something isn't working

Comments

@yangyxt

yangyxt commented Sep 28, 2022

The swifter groupby-apply chain raises an error while sorting the index when I set dropna=False in the groupby step.

Here is the error log:
Traceback (most recent call last):
  File "/paedyl01/disk1/yangyxt/ngs_scripts/acmg_automated_anno.py", line 76, in wrapper
    result = func(*args, **kwargs)
  File "/paedyl01/disk1/yangyxt/ngs_scripts/acmg_automated_anno.py", line 484, in BP2_PM3_compound_with_patho
    return df.swifter.groupby([gene_col], as_index=False, dropna=False).apply(check_compound_per_gene,
  File "/home/yangyxt/anaconda3/envs/dask/lib/python3.9/site-packages/swifter/swifter.py", line 661, in apply
    return self._ray_apply(func, *args, **kwds)
  File "/home/yangyxt/anaconda3/envs/dask/lib/python3.9/site-packages/swifter/swifter.py", line 650, in _ray_apply
    return pd.concat(ray.get(apply_chunks), axis=self._axis).sort_index()
  File "/home/yangyxt/anaconda3/envs/dask/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/yangyxt/anaconda3/envs/dask/lib/python3.9/site-packages/pandas/core/frame.py", line 6447, in sort_index
    return super().sort_index(
  File "/home/yangyxt/anaconda3/envs/dask/lib/python3.9/site-packages/pandas/core/generic.py", line 4685, in sort_index
    indexer = get_indexer_indexer(
  File "/home/yangyxt/anaconda3/envs/dask/lib/python3.9/site-packages/pandas/core/sorting.py", line 94, in get_indexer_indexer
    indexer = nargsort(
  File "/home/yangyxt/anaconda3/envs/dask/lib/python3.9/site-packages/pandas/core/sorting.py", line 417, in nargsort
    indexer = non_nan_idx[non_nans.argsort(kind=kind)]
TypeError: '<' not supported between instances of 'int' and 'tuple'
ERROR:2022-09-28 13:40:29,310:wrapper:83:Exception raised in main_anno_process. exception: '<' not supported between instances of 'int' and 'tuple'

The dataframe passed to swifter.groupby() has a plain numerical index, from 0 to len(df) - 1.
The groupby column may contain NA values in some rows, and I want to keep them. I guess that is why this issue happens. I'm not sure whether this can be fixed or optimized. Please take a look.
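
The traceback ends in sort_index() on the concatenated result, so the int and tuple being compared are index labels. A minimal sketch, in plain pandas without swifter or ray, of how an index that mixes integer labels with tuple labels raises this kind of TypeError. The mixed index is built by hand here as an assumption about what the concatenated result might look like; nothing in this thread confirms how such an index arises.

```python
import pandas as pd

# Hypothetical index mixing flat integer labels with a tuple label,
# such as might result from concatenating pieces whose indexes differ in shape.
mixed_index = pd.Index([0, 1, ("a", 2)], dtype=object, tupleize_cols=False)
series = pd.Series([10, 20, 30], index=mixed_index)

try:
    series.sort_index()
    error_message = None
except TypeError as exc:
    # Sorting must compare an int label with a tuple label, which Python 3 rejects.
    error_message = str(exc)

print(error_message)
```

If this is what happens inside swifter, the NA values themselves would not be the direct cause; the differing index shapes would be.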

@jmcarpenter2
Owner

jmcarpenter2 commented Sep 28, 2022

Hey @yangyxt

Thanks for raising this issue. I tried to look into it and test with a synthetic dataframe. I included a NaN in the groups and didn't encounter this issue.
[Screenshot: Screen Shot 2022-09-28 at 12 37 30 PM]

Looking more closely at your error message, it looks as though you may have a tuple in your groupby column.

TypeError: '<' not supported between instances of 'int' and 'tuple'

Can you check whether the column gene_col contains only values of type int?
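
One way to run such a check is to count the Python types present in the groupby column, and in the index as well, since the traceback fails inside sort_index(). This is only a sketch on a small made-up dataframe: the column name "gene" and all values are placeholders, not the reporter's data.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the reporter's dataframe (placeholder names and values).
gene_col = "gene"
df = pd.DataFrame({gene_col: ["g1", np.nan, "g2", "g1"], "score": [1, 2, 3, 4]})

# Count how many values of each Python type the groupby column holds.
# Missing values stored as NaN show up as float.
column_types = df[gene_col].map(type).value_counts()

# Do the same for the index labels; a default integer index yields only int.
index_types = pd.Series([type(label) for label in df.index]).value_counts()

print(column_types)
print(index_types)
```

Any tuple entry in either count would point to where the mixed comparison comes from.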

@fiskus2

fiskus2 commented Jan 27, 2023

Hi @jmcarpenter2
I have the same issue, but it is unrelated to dropna in my case. After lots of debugging I can confirm that this error occurs under the following circumstances:

  • The dataframe has more than 5000 rows
  • The dataframe is grouped by more than one column
  • The applied function sorts the group passed to it
  • The dataframe has a column of type datetime (it must be present, but there is no need to do anything with it)
  • One group has fewer than 17 rows

Some of these requirements seem very arbitrary, so it may just be a sporadic error. Below is a script that produces the error. I have tested it on two different machines. However, I have also had other scripts that produced the error on one machine, but not the other.

import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import swifter
import platform
import ray
import psutil
import multiprocessing

print(pd.__version__)
print(swifter.__version__)
print(ray.__version__)
print(platform.python_version())
print(platform.platform())
print(psutil.virtual_memory().total/1000000000, 'GB')
print(multiprocessing.cpu_count())

def foo(group):
    group = group.sort_values('sort_col')
    return group

data = []
row1 = ['a', 1, 1, datetime(2023, 1, 1)]
row2 = ['b', 2, 2, datetime(2023, 1, 1)]
cols = ['group_col1', 'group_col2', 'sort_col', 'timestamp_col']

data = [row1]*1 + [row2]*5000   #This works: [row1]*17 + [row2]*5000
df = pd.DataFrame(data, columns=cols)

df.swifter.groupby(['group_col1', 'group_col2']).apply(foo)

Output:

1.3.5
1.3.4
2.1.0
3.7.5
Windows-10-10.0.19041-SP0
34.358714368 GB
8
  0%|                                                                                            | 0/2 [00:00<?, ?it/s]
2023-01-27 16:15:27,963 INFO worker.py:1528 -- Started a local Ray instance.
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:07<00:00,  3.54s/it]
Traceback (most recent call last):
  File ".\swifter_error.py", line 30, in <module>
    df.swifter.groupby(['group_col1', 'group_col2']).apply(foo)
  File "C:\Users\qxz2a5z\AppData\Roaming\Python\Python37\site-packages\swifter\swifter.py", line 661, in apply
    return self._ray_apply(func, *args, **kwds)
  File "C:\Users\qxz2a5z\AppData\Roaming\Python\Python37\site-packages\swifter\swifter.py", line 650, in _ray_apply
    return pd.concat(ray.get(apply_chunks), axis=self._axis).sort_index()
  File "C:\Users\qxz2a5z\AppData\Roaming\Python\Python37\site-packages\pandas\util\_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\qxz2a5z\AppData\Roaming\Python\Python37\site-packages\pandas\core\frame.py", line 6402, in sort_index
    key,
  File "C:\Users\qxz2a5z\AppData\Roaming\Python\Python37\site-packages\pandas\core\generic.py", line 4545, in sort_index
    target, level, ascending, kind, na_position, sort_remaining, key
  File "C:\Users\qxz2a5z\AppData\Roaming\Python\Python37\site-packages\pandas\core\sorting.py", line 92, in get_indexer_indexer
    target, kind=kind, ascending=ascending, na_position=na_position
  File "C:\Users\qxz2a5z\AppData\Roaming\Python\Python37\site-packages\pandas\core\sorting.py", line 391, in nargsort
    return items.argsort(ascending=ascending, kind=kind, na_position=na_position)
  File "C:\Users\qxz2a5z\AppData\Roaming\Python\Python37\site-packages\pandas\core\arrays\base.py", line 633, in argsort
    mask=np.asarray(self.isna()),
  File "C:\Users\qxz2a5z\AppData\Roaming\Python\Python37\site-packages\pandas\core\sorting.py", line 403, in nargsort
    indexer = non_nan_idx[non_nans.argsort(kind=kind)]
TypeError: '<' not supported between instances of 'tuple' and 'int'

@jmcarpenter2
Owner

jmcarpenter2 commented Mar 24, 2023

Thank you for this very clear and reproducible code and logging! I will look into this shortly.

@jmcarpenter2 jmcarpenter2 self-assigned this Mar 24, 2023
@jmcarpenter2 jmcarpenter2 added the bug Something isn't working label Mar 24, 2023
@jmcarpenter2
Owner

I tried running this code locally and did not run into the issue. The only major difference I can see between our environments is that yours is Windows. I am going to start testing this code on Windows machines as part of my CI. This is also related to #175 and #148, and potentially #176.

[Screenshot: Screen Shot 2023-03-24 at 11 19 10 AM]

@jmcarpenter2
Owner

Added Windows CI, but it didn't uncover anything :/

Projects
None yet
Development

No branches or pull requests

3 participants