Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The log_loss-function crashes when using mixed types #976

Open
rsundqvist opened this issue Aug 15, 2023 · 0 comments
Open

The log_loss-function crashes when using mixed types #976

rsundqvist opened this issue Aug 15, 2023 · 0 comments

Comments

@rsundqvist
Copy link

Describe the issue:
When calling dask_ml.metrics.log_loss with mixed types, a ValueError is raised:

  File "<python-home>/lib/python3.11/site-packages/dask_ml/metrics/classification.py", line 106, in _log_loss_inner
    [sklearn.metrics.log_loss(x, y, sample_weight=sample_weight, **kwargs)]
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<python-home>/lib/python3.11/site-packages/sklearn/utils/_param_validation.py", line 211, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "<python-home>/lib/python3.11/site-packages/sklearn/metrics/_classification.py", line 2854, in log_loss
    check_consistent_length(y_pred, y_true, sample_weight)
  File "<python-home>/lib/python3.11/site-packages/sklearn/utils/validation.py", line 409, in check_consistent_length
    raise ValueError(
ValueError: Found input variables with inconsistent numbers of samples: [4, 2]

This works using "vanilla" sklearn/pandas/numpy functions and types.

Minimal Complete Verifiable Example:

from dask import array as da, dataframe as dd
from dask_ml.metrics import log_loss

y_true = dd.DataFrame.from_dict({"y_true": [True, False, True, False]}, npartitions=2)["y_true"]
y_pred = da.from_array([0, 1, 0, 1], chunks=2)

print(f"y_true is array: {log_loss(y_true.to_dask_array(), y_pred).compute()=}")
print(f"y_true is series: {log_loss(y_true, y_pred).compute()=}")

Anything else we need to know?:

  • Same issue is seen when the types of y_true and y_pred are switched.
  • A similar message is seen with identical types but differing npartitions and chunks-arguments.

Crashing on different partitioning probably makes sense, but I think it would be nice if the same-type requirement was documented, at least. Maybe it is and I just missed it.

Environment:

  • Dask version: 2023.8.0 (dask-ml: 2023.3.24)
  • Python version: 3.11.3
  • Operating System: Ubuntu 22.04.2 LTS. Running on Workstation Player 17 if that matters.
  • Install method (conda, pip, source): pip
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant