Describe the workflow you want to enable
Currently, NaN values in OrdinalEncoder are either passed through as NaN or encoded as a user-specified value.
It would be nice to have a third option: treat NaN as an additional category and map it to num_categories + 1 or some other value.
Describe your proposed solution
Add an "auto" option for encoded_missing_value that encodes missing values as an additional category.
Describe alternatives you've considered, if relevant
No response
Additional context
There is also some confusion about the user-specified value: if, for example, I set it to 0, will it collide with the category that was assigned code 0 during fit? Or will all categories be shifted accordingly?
Manually setting some other value, such as 1000000 or -1, is usually incompatible with common interfaces for categorical features, e.g. nn.Embedding from PyTorch.
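The incompatibility comes from the fact that an embedding table with N rows is indexed by the integers 0 to N - 1. The following dependency-free sketch illustrates that constraint only; required_num_embeddings is a hypothetical helper, not a PyTorch function.

```python
def required_num_embeddings(codes):
    """Smallest table size that can index every code, assuming valid
    indices are 0 .. size - 1 (hypothetical helper for illustration)."""
    if min(codes) < 0:
        raise ValueError("negative codes cannot be used as embedding indices")
    return max(codes) + 1


codes_dense = [0, 1, 2, 3]           # three categories plus one code for missing
codes_sentinel = [0, 1, 2, 1000000]  # same data, large sentinel for missing

print(required_num_embeddings(codes_dense))     # 4
print(required_num_embeddings(codes_sentinel))  # 1000001
```

A large sentinel forces a table with mostly unused rows, and a negative sentinel is not a usable index at all, whereas a contiguous code for missing values costs exactly one extra row.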
I think that the request is well motivated. We did not think about integration with torch.nn.Embedding when suggesting to use -1 as a fixed category in our docstring example.
I would be +1 for a PR to implement the "auto" strategy described above (encode non-missing values from 0 to num_categories - 1 and use num_categories for missing).
However I suppose we would need to do something similar for unknown_value (when handle_unknown="use_encoded_value").
Any opinion on how to handle hparam interactions with a sane naming scheme?
> I would be +1 for a PR to implement the "auto" strategy described above (encode non-missing values from 0 to num_categories - 1 and use num_categories for missing).
I am +1 on the strategy. I would give it a more descriptive name such as: encoded_missing_value="num_categories" or "cardinality".
> However I suppose we would need to do something similar for unknown_value
Are you thinking of encoding unknown values as num_categories as well?
> Are you thinking of encoding unknown values as num_categories as well?
I would advise against that, or at least make it configurable. In many cases, the absence of a feature (i.e. NaN) carries a lot of signal (for example, it might mean "no data from a certain source"), and this signal is different from "has some data, but it hasn't been seen before".
Another option would be to encode rare categories as the most common category, so that we can either map them automatically to the most common category or map them to a user-set category.
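To make the distinction between missing and unknown values concrete, here is a dependency-free sketch that reserves separate codes for the two cases. TinyOrdinalEncoder is a hypothetical single-column class, not scikit-learn's API, and the choice of num_categories for missing and num_categories + 1 for unknown is an assumption made for illustration.

```python
import math


def is_missing(value):
    # Treat None and float NaN as missing.
    return value is None or (isinstance(value, float) and math.isnan(value))


class TinyOrdinalEncoder:
    """Hypothetical single-column encoder; not scikit-learn's API."""

    def fit(self, values):
        known = sorted({v for v in values if not is_missing(v)})
        self.codes_ = {category: code for code, category in enumerate(known)}
        self.missing_code_ = len(known)      # num_categories
        self.unknown_code_ = len(known) + 1  # num_categories + 1
        return self

    def transform(self, values):
        encoded = []
        for v in values:
            if is_missing(v):
                encoded.append(self.missing_code_)
            else:
                # Categories not seen during fit fall back to the unknown code.
                encoded.append(self.codes_.get(v, self.unknown_code_))
        return encoded


enc = TinyOrdinalEncoder().fit(["red", "green", float("nan"), "red"])
print(enc.transform(["green", "red", float("nan"), "purple"]))  # [0, 1, 2, 3]
```

With this layout a downstream embedding table needs num_categories + 2 rows, and the model can learn different representations for "no data" and "unseen category".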