Automatically handle missing values in OrdinalEncoder #28892

Open
oasidorshin opened this issue Apr 25, 2024 · 3 comments

Comments

@oasidorshin

Describe the workflow you want to enable

Currently, NaN values in OrdinalEncoder are either passed through as NaN or encoded as a user-specified value.

It would be nice to have a third option: consider NaN values as another category and map them into num_categories + 1 or some other value.

Describe your proposed solution

Add an "auto" option for encoded_missing_value that encodes missing values as an additional category.

Describe alternatives you've considered, if relevant

No response

Additional context

There is also some confusion about the user-specified value: if, for example, I set it to 0, will it collide with the category that was encoded as 0 during fit, or will all categories be shifted accordingly?

Manually setting some other value such as 1000000 or -1 is usually incompatible with common categorical-feature interfaces, e.g. nn.Embedding from PyTorch.

@oasidorshin oasidorshin added Needs Triage Issue requires triage New Feature labels Apr 25, 2024
@ogrisel ogrisel added API Needs Decision Requires decision and removed Needs Triage Issue requires triage labels Apr 29, 2024
@ogrisel
Member

ogrisel commented Apr 29, 2024

I think that the request is well motivated. We did not think about integration with torch.nn.Embedding when suggesting to use -1 as a fixed category in our docstring example.

I would be +1 for a PR to implement the "auto" strategy described above (encode non-missing values from 0 to num_categories - 1 and use num_categories for missing values).

However I suppose we would need to do something similar for unknown_value (when handle_unknown="use_encoded_value").

Any opinion on how to handle hparam interactions with a sane naming scheme?

/cc @thomasjpfan

@thomasjpfan
Member

I would be +1 for a PR to implement the "auto" strategy described above (encode non-missing values from 0 to num_categories - 1 and use num_categories for missing values).

I am +1 on the strategy. I would give it a more descriptive name such as: encoded_missing_value="num_categories" or "cardinality".

However I suppose we would need to do something similar for unknown_value

Are you thinking of encoding unknown values as num_categories as well?

@oasidorshin
Author

oasidorshin commented May 4, 2024

@thomasjpfan @ogrisel

Are you thinking of encoding unknown values as num_categories as well?

I would advise against that, or at least make it configurable. In many cases, the absence of a feature (i.e. NaN) carries a lot of signal (for example, it might mean "no data from a certain source"), and this signal is different from "has some data, but it hasn't been seen before".

Another option would be to encode rare categories as the most common category, so that we can either map automatically to the most common category or map to a user-set category.


3 participants