LabelEncoder doesn't handle missing values in *dask* series of strings #954

Open
phobson opened this issue Dec 15, 2022 · 3 comments

phobson commented Dec 15, 2022

Describe the issue:

Fitting dask-ml's LabelEncoder on a dask series of strings that contains missing values (np.nan) raises a TypeError, because "<" is not supported between str and float.

scikit-learn's encoder seems to handle this well for both pandas and dask series, and dask-ml's encoder seems to handle it well for a pandas series.

Minimal Complete Verifiable Example:

import dask.dataframe as dd
from dask_ml.preprocessing import LabelEncoder as dask_le
from sklearn.preprocessing import LabelEncoder as skl_le
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": list("aaaabbbcccdddeeefffgggg")
})

df.loc[[0, 2, 5, 10, 21], "A"]  = np.nan

ddf = dd.from_pandas(df, npartitions=3)

# works
lenc = skl_le().fit(df["A"])
lenc = skl_le().fit(ddf["A"])
lenc = dask_le().fit(df["A"])

# fails
lenc = dask_le().fit(ddf["A"])

# but also works
lenc = dask_le().fit(ddf["A"].fillna(""))

Full Traceback:

➜ python label_encoder_repro.py
Traceback (most recent call last):
  File "/Users/paul/work/sources/dask-engineering/example-pipelines/criteo-HPO/label_encoder_repro.py", line 21, in <module>
    lenc = dask_le().fit(ddf["A"])
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask_ml/preprocessing/label.py", line 119, in fit
    self.classes_ = classes_.compute()
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/base.py", line 315, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/base.py", line 600, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/threaded.py", line 89, in get
    results = get_async(
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/local.py", line 511, in get_async
    raise_exception(exc, tb)
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/local.py", line 319, in reraise
    raise exc
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/local.py", line 224, in execute_task
    result = _execute_task(task, data)
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/optimization.py", line 990, in __call__
    return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/core.py", line 149, in get
    result = _execute_task(task, cache)
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/core.py", line 119, in <genexpr>
    return func(*(_execute_task(a, cache) for a in args))
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/utils.py", line 71, in apply
    return func(*args, **kwargs)
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask/array/routines.py", line 1626, in _unique_internal
    u = np.unique(ar)
  File "<__array_function__ internals>", line 180, in unique
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/numpy/lib/arraysetops.py", line 274, in unique
    ret = _unique1d(ar, return_index, return_inverse, return_counts, 
  File "/Users/paul/mambaforge/envs/ml-example/lib/python3.10/site-packages/numpy/lib/arraysetops.py", line 336, in _unique1d
    ar.sort()
TypeError: '<' not supported between instances of 'str' and 'float'
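
The last two frames suggest the failure comes from sorting an object-dtype block that mixes `str` values with `np.nan` (a `float`). Below is a minimal pure-Python sketch of that comparison failure, followed by one possible null-aware way to compute sorted classes. The helper names `is_missing` and `unique_with_missing_last` are hypothetical and are not part of dask-ml, dask, NumPy, or scikit-learn. They only illustrate the idea of separating missing values out before sorting.

```python
import math


def is_missing(value):
    """Return True for None or a float NaN (common missing-value markers)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def unique_with_missing_last(values):
    """Hypothetical helper: sorted unique non-missing values, then one NaN
    at the end if any value was missing."""
    present = sorted({v for v in values if not is_missing(v)})
    if any(is_missing(v) for v in values):
        present.append(float("nan"))
    return present


values = ["a", float("nan"), "b", "a", float("nan"), "c"]

# Sorting str and float together raises the same kind of TypeError
# that appears at the bottom of the traceback.
error_message = None
try:
    sorted(values)
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Separating the missing values first avoids the mixed-type comparison.
classes = unique_with_missing_last(values)
print(classes)
```

Whether missing values should become their own class, be dropped, or raise an error is a design decision for the encoder. The sketch picks the first option purely for illustration.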

Environment:

  • Dask version: 2022.12.0
  • Python version: 3.10
  • Operating System: M1 Mac
  • Install method (conda, pip, source): conda
@DuanBoomer

Tags: @phobson
Hello, may I work on this issue, "LabelEncoder doesn't handle missing values in dask series of strings #954"?

phobson commented Dec 19, 2022

@DuanBoomer I'd be happy to review a PR. Thanks for volunteering. Note that I'll be largely away from my computer this week through the New Year. So if my response time is slow, I haven't forgotten about you.

@DuanBoomer

@phobson The PR will be submitted by Sunday if that's okay with you. Today is Monday.
