Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH Improves error message for mixed types for feature names #25018

Merged

Conversation

thomasjpfan
Copy link
Member

This PR improves the error message for feature names that have mixed dtypes. This was issue was raised in https://gitter.im/scikit-learn/dev by @amueller

@thomasjpfan thomasjpfan added No Changelog Needed Quick Review For PRs that are quick to review labels Nov 23, 2022
@thomasjpfan thomasjpfan added this to the 1.2 milestone Nov 23, 2022
@@ -1884,8 +1884,9 @@ def _get_feature_names(X):
# mixed type of string and non-string is not supported
if len(types) > 1 and "str" in types:
raise TypeError(
"Feature names only support names that are all strings. "
f"Got feature names with dtypes: {types}."
"Feature names only support names that are all strings. Got feature names"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two thoughts:

  1. the error message sounds like the problem is that I mixed strings and not-strings for my feature names. Which makes me think that if I exclusively had not-strings then it would work. I don't think this is true though, is it? -> "Feature names have to be strings. Got feature names ..."
  2. Should we help out users by telling them that it is the names of the columns of their pandas DF that need fixing? -> change "feature names" to "column names"? In the context of a pandas DF there is no such thing as a "feature name", as a user the thing I set was a column name. A bit later in the error message we give a hint that column names is the problem, so why not do that at the start of the error message as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding 1., the following would work:

In [1]: from sklearn.datasets import make_regression
   ...: from sklearn.tree import DecisionTreeRegressor
   ...: import pandas as pd
   ...: 
   ...: X, y = make_regression()
   ...: X_df = pd.DataFrame(X)
   ...: 
   ...: tree = DecisionTreeRegressor().fit(X_df, y)

However, feature_names_in_ will not exist because X.columns are not strings (which is documented).

Since at this point, we only want to fit a tree and potentially we potentially have a user that does not care about the feature_names, I think it could be enough to mention "if you want support for feature names, convert it to str...".

"Feature names only support names that are all strings. Got feature names with"
f" dtypes: {dtypes}. Please convert to a common type, for example using"
" X.columns = X.columns.astype(str)"
"Feature names only support column names that are all strings, but got dtypes:"
Copy link
Member

@betatim betatim Nov 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
"Feature names only support column names that are all strings, but got dtypes:"
"Feature names only support column names that are strings, but got dtypes:"

The "all" is unnecessary and I find it confuses me more/makes for tricky reading/doesn't make things easier to understand.

(Only commenting once here, instead of on all tests and where the message comes from)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think w/o the all here, it could mean that only string feature names are supported and the rest can be ignored. The two don't mean the same thing to me, and the latter doesn't seem correct.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way I understood the feature is that if you specify a column name, it has to be a string. No other types are allowed. Is that right? If not ignore the below.

For me "column names that are all strings" is like "food made from all organic ingredients". The "all" is a statement about the ingredients, which in this case are exclusively organic. As a single column name can't consist of two different types it seems weird to say "all strings". "All column names have to be strings" is a statement about the column names, they all have to be strings.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's a logical fallacy there in your example. feature names are separate things, but food is one thing, therefore your example would be more like "food made from only organic ingredients", or a better example: "we can allow a group of people who are all above 18", and to me it makes senes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From reading the code I think the answer to my question is: no.

This means that after considering only the exception's message and the few lines of code around it I didn't understand what the problem was. Hence I don't have a good idea how to succinctly explain to a user what just happened and what they need to do to fix it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this ☝️ is an improvement over what we currently have

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I directly pushed the change to use Adrin's suggestion because it looks uncontroversially more clear and I think it's still the middle of the night for Thomas :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the last sentence about silencing the warning since it's not a warning actually, good catch from #25018 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should tell people how to disable it though. They don't have to pass feature names, this makes it look like they have to.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, I put it back but without mentioning the warning. Is it fine now ?

Copy link
Member

@betatim betatim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment but also LGTM

Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (english is not my native language, I let you decide about the "all" thing 😄)

sklearn/tests/test_base.py Outdated Show resolved Hide resolved
Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@ogrisel ogrisel merged commit f196344 into scikit-learn:main Nov 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants