Easily retrieve mapping from OrdinalEncoder #28891

Open
oasidorshin opened this issue Apr 25, 2024 · 3 comments

@oasidorshin

Describe the workflow you want to enable

It would be nice to be able to easily retrieve the mapping in the form of a dictionary:

"category_a": 0,
"category_b": 1,
"category_infrequent": 2,
...

Currently, the .categories_ attribute only exposes the list of seen categories, without the mapping.

This becomes especially important with options to handle missing or infrequent values, which lead to questions such as "What value do infrequent categories map to?"

Describe your proposed solution

Add .categories_map_ attribute to OrdinalEncoder

Describe alternatives you've considered, if relevant

No response

Additional context

No response

oasidorshin added the Needs Triage and New Feature labels Apr 25, 2024
@ogrisel
Member

ogrisel commented Apr 29, 2024

It's possible to construct the category maps via list and dict comprehensions:

category_maps = [
    {cat_name: cat_idx for cat_idx, cat_name in enumerate(cat_list)}
    for cat_list in oe.categories_
]

or as a dict of dicts to leverage feature names when available:

category_maps = {
    feature_name: {cat_name: cat_idx for cat_idx, cat_name in enumerate(cat_list)}
    for feature_name, cat_list in zip(oe.feature_names_in_, oe.categories_)
}

Maybe we could expose a public property (or two, in the presence of the feature names) on the OrdinalEncoder but I also worry about the magic aspect of properties.

Precomputing those reverse mappings is also possible but redundant.

Another alternative would be to use methods (instead of properties) to generate those reverse mappings on demand, maybe with an optional argument to do it only for a specific set of features.
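
As a rough illustration of the method-based alternative, here is a minimal sketch written as a standalone function. The name `build_category_maps` and its `features` argument are hypothetical and not part of scikit-learn. It takes plain sequences so it runs without a fitted encoder, and it assumes that the position of a category in its list equals its encoded value:

```python
def build_category_maps(categories, feature_names=None, features=None):
    """Build {feature: {category: code}} maps on demand.

    Hypothetical helper, not a scikit-learn API.
    Assumes position in each category list == encoded value.
    """
    if feature_names is None:
        # Fall back to positional feature indices when no names are available.
        feature_names = range(len(categories))
    maps = {}
    for name, cat_list in zip(feature_names, categories):
        # Optionally restrict the output to a subset of features.
        if features is not None and name not in features:
            continue
        maps[name] = {cat: idx for idx, cat in enumerate(cat_list)}
    return maps


# Stand-ins for oe.categories_ and oe.feature_names_in_ of a fitted encoder.
categories = [["cat", "dog", "rabbit"], ["large", "small"]]
names = ["animal", "size"]

all_maps = build_category_maps(categories, names)
animal_only = build_category_maps(categories, names, features=["animal"])
```

With a fitted encoder, the call would presumably look like `build_category_maps(oe.categories_, oe.feature_names_in_)`.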

ogrisel added the API and Needs Decision labels and removed the Needs Triage label Apr 29, 2024
@ogrisel
Member

ogrisel commented Apr 29, 2024

This becomes especially important with options to handle missing or infrequent values, which lead to questions such as "What value do infrequent categories map to?"

I am not sure how providing encoding maps would help answer that question: what would be the "key" in the dict for infrequent or unseen categories?

But we could expose dedicated fitted attributes to make the resulting encoding of missing and infrequent/unseen categories explicit for each feature.

EDIT: such an attribute already exists for infrequent categories: OrdinalEncoder.infrequent_categories_

@oasidorshin
Author

@ogrisel

Thank you for your comments!

I will try to argue more about my motivation for each question:

  1. Possible to construct the mapping by hand - my main gripe with this approach is that the assumption "order of categories in this list == encoded labels" is currently not stated in the docs. So the solution would be either to describe the encoding algorithm (including missing/unseen values) in the docs, or to provide a method that produces the exact mapping (which would still be very useful for reducing boilerplate code in downstream applications)

2.1. What would be the "key" for unseen categories - I think it is either some predefined "_unseen_category_hope_not_in_user_data", which is ugly, or possibly some new attribute like .unseen_encoded_value

2.2. OrdinalEncoder.infrequent_categories_ - while it is very useful for checking which categories turned out to be infrequent, it still doesn't answer the question of which value they are encoded to.
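
To make point 2.2 concrete, here is a minimal sketch of how an explicit mapping could account for infrequent categories. The function `ordinal_map_with_infrequent` is hypothetical. It assumes that frequent categories keep their relative order and that all infrequent categories share a single code placed right after the frequent ones. That assumption should be checked against the output of `transform` on a fitted encoder, and missing or unseen values are not handled here:

```python
def ordinal_map_with_infrequent(categories, infrequent=None):
    """Map each category to a code, grouping infrequent ones.

    Hypothetical helper, not a scikit-learn API.
    Assumption: frequent categories are numbered 0..k-1 in list order,
    and every infrequent category shares the code k.
    """
    infrequent_set = set(infrequent) if infrequent is not None else set()
    mapping = {}
    next_code = 0
    # First pass: number the frequent categories in their listed order.
    for cat in categories:
        if cat not in infrequent_set:
            mapping[cat] = next_code
            next_code += 1
    # Second pass: all infrequent categories share the next free code.
    for cat in categories:
        if cat in infrequent_set:
            mapping[cat] = next_code
    return mapping


# Stand-ins for one feature's entries in categories_ and infrequent_categories_.
categories = ["cat", "dog", "rabbit", "snake"]
infrequent = ["dog", "snake"]

mapping = ordinal_map_with_infrequent(categories, infrequent)
plain = ordinal_map_with_infrequent(categories)
```

Under this assumption, a plain `enumerate` over the category list would give different codes for "rabbit", "dog" and "snake" than the grouped mapping, which is why the encoding rule needs to be documented or exposed.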
