New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH Improves error message for mixed types for feature names #25018
ENH Improves error message for mixed types for feature names #25018
Conversation
sklearn/utils/validation.py
Outdated
@@ -1884,8 +1884,9 @@ def _get_feature_names(X): | |||
# mixed type of string and non-string is not supported | |||
if len(types) > 1 and "str" in types: | |||
raise TypeError( | |||
"Feature names only support names that are all strings. " | |||
f"Got feature names with dtypes: {types}." | |||
"Feature names only support names that are all strings. Got feature names" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two thoughts:
- the error message sounds like the problem is that I mixed strings and not-strings for my feature names. Which makes me think that if I exclusively had not-strings then it would work. I don't think this is true though, is it? -> "Feature names have to be strings. Got feature names ..."
- Should we help out users by telling them that it is the names of the columns of their pandas DF that need fixing? -> change "feature names" to "column names"? In the context of a pandas DF there is no such thing as a "feature name", as a user the thing I set was a column name. A bit later in the error message we give a hint that column names is the problem, so why not do that at the start of the error message as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding 1., the following would work:
In [1]: from sklearn.datasets import make_regression
...: from sklearn.tree import DecisionTreeRegressor
...: import pandas as pd
...:
...: X, y = make_regression()
...: X_df = pd.DataFrame(X)
...:
...: tree = DecisionTreeRegressor().fit(X_df, y)
However, feature_names_in_
will not exist because X.columns
are not strings (which is documented).
Since at this point, we only want to fit
a tree and potentially we potentially have a user that does not care about the feature_names
, I think it could be enough to mention "if you want support for feature names, convert it to str...".
"Feature names only support names that are all strings. Got feature names with" | ||
f" dtypes: {dtypes}. Please convert to a common type, for example using" | ||
" X.columns = X.columns.astype(str)" | ||
"Feature names only support column names that are all strings, but got dtypes:" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
"Feature names only support column names that are all strings, but got dtypes:" | |
"Feature names only support column names that are strings, but got dtypes:" |
The "all" is unnecessary and I find it confuses me more/makes for tricky reading/doesn't make things easier to understand.
(Only commenting once here, instead of on all tests and where the message comes from)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think w/o the all
here, it could mean that only string feature names are supported and the rest can be ignored. The two don't mean the same thing to me, and the latter doesn't seem correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way I understood the feature is that if you specify a column name, it has to be a string. No other types are allowed. Is that right? If not ignore the below.
For me "column names that are all strings" is like "food made from all organic ingredients". The "all" is a statement about the ingredients, which in this case are exclusively organic. As a single column name can't consist of two different types it seems weird to say "all strings". "All column names have to be strings" is a statement about the column names, they all have to be strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's a logical fallacy there in your example. feature names are separate things, but food is one thing, therefore your example would be more like "food made from only organic ingredients", or a better example: "we can allow a group of people who are all above 18", and to me it makes senes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From reading the code I think the answer to my question is: no.
This means that after considering only the exception's message and the few lines of code around it I didn't understand what the problem was. Hence I don't have a good idea how to succinctly explain to a user what just happened and what they need to do to fix it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this ☝️ is an improvement over what we currently have
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I directly pushed the change to use Adrin's suggestion because it looks uncontroversially more clear and I think it's still the middle of the night for Thomas :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the last sentence about silencing the warning since it's not a warning actually, good catch from #25018 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should tell people how to disable it though. They don't have to pass feature names, this makes it look like they have to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, I put it back but without mentioning the warning. Is it fine now ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One comment but also LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM (english is not my native language, I let you decide about the "all" thing 😄)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
This PR improves the error message for feature names that have mixed dtypes. This was issue was raised in https://gitter.im/scikit-learn/dev by @amueller