Automatically handle missing values in OrdinalEncoder #28892

oasidorshin · 2024-04-25T14:19:47Z

Describe the workflow you want to enable

Currently, NaN values in OrdinalEncoder are either passed through as NaN or encoded as a user-specified value.

It would be nice to have a third option: treat NaN as another category and map it to num_categories + 1 or some other value.

Describe your proposed solution

Add another encoded_missing_value option, "auto", that encodes missing values as an additional category.

Describe alternatives you've considered, if relevant

No response

Additional context

There is also some confusion about the user-specified value: if, for example, I set it to 0, will it interfere with the category that was encoded as 0 during fit? Or will all categories be shifted accordingly?

Manually setting some other value, such as 1000000 or -1, is usually incompatible with common categorical-feature interfaces, e.g. nn.Embedding from PyTorch.


ogrisel · 2024-04-29T12:58:09Z

I think that the request is well motivated. We did not think about integration with torch.nn.Embedding when suggesting to use -1 as a fixed category in our docstring example.

I would be +1 for a PR to implement the "auto" strategy described above (encode non-missing values from 0 to num_categories - 1 and use num_categories for missing values).

However I suppose we would need to do something similar for unknown_value (when handle_unknown="use_encoded_value").

Any opinion on how to handle hparam interactions with a sane naming scheme?

/cc @thomasjpfan
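To make the proposed mapping concrete, here is a small pure-Python sketch of the strategy discussed above: non-missing categories get codes 0 to num_categories - 1 and missing values get num_categories. The helper names (is_missing, fit_categories, encode_missing_as_cardinality) are hypothetical and are not part of scikit-learn; this is only an illustration of the encoding, not a proposed implementation.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fit_categories(values):
    """Collect the sorted non-missing categories seen during 'fit'."""
    return sorted({v for v in values if not is_missing(v)})


def encode_missing_as_cardinality(values, categories):
    """Map known categories to 0..n-1 and missing values to n."""
    index = {cat: code for code, cat in enumerate(categories)}
    missing_code = len(categories)
    return [missing_code if is_missing(v) else index[v] for v in values]


train = ["b", "a", float("nan"), "a"]
categories = fit_categories(train)
codes = encode_missing_as_cardinality(train, categories)
```

With this scheme all codes are contiguous and non-negative, so a downstream embedding table would need num_categories + 1 rows.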

thomasjpfan · 2024-05-04T15:09:16Z

> I would be +1 for a PR to implement the "auto" strategy described above (encode non-missing from 0 to num_categories - 1 and use num_categories for missing).

I am +1 on the strategy. I would give it a more descriptive name such as: encoded_missing_value="num_categories" or "cardinality".

> However I suppose we would need to do something similar for unknown_value

Are you thinking of encoding unknown values as num_categories as well?

oasidorshin · 2024-05-04T17:41:23Z

@thomasjpfan @ogrisel

> Are you thinking of encoding unknown values as num_categories as well?

I would advise against that, or at least make it configurable. In many cases, the absence of a feature (i.e. NaN) carries a lot of signal (for example, it might mean "no data from a certain source"), and this signal is different from "has some data, but it hasn't been seen before".

Another option would be to encode rare categories as the most common category, so that we have the option to either map automatically to the most common category or map to a user-set category.
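One way to keep the two signals apart, sketched here in plain Python, is to reserve separate codes: num_categories for missing values and num_categories + 1 for values not seen during fit. The function name and the choice of codes are hypothetical, used only to illustrate the distinction, and do not reflect any existing scikit-learn option.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def encode_with_reserved_codes(values, categories):
    """Known -> 0..n-1, missing -> n, unseen at fit time -> n + 1."""
    index = {cat: code for code, cat in enumerate(categories)}
    missing_code = len(categories)
    unknown_code = len(categories) + 1
    codes = []
    for v in values:
        if is_missing(v):
            codes.append(missing_code)
        else:
            # Fall back to the reserved "unknown" code for unseen values.
            codes.append(index.get(v, unknown_code))
    return codes


fitted_categories = ["a", "b"]
reserved_codes = encode_with_reserved_codes(
    ["a", float("nan"), "z", "b"], fitted_categories
)
```

Here "z" was not in the fitted categories, so it gets a different code from the NaN entry.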

oasidorshin added the Needs Triage and New Feature labels Apr 25, 2024

ogrisel added the API Needs Decision labels and removed the Needs Triage label Apr 29, 2024

thomasjpfan added the module:preprocessing label May 4, 2024
