Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH Improves error message for mixed types for feature names #25018

Merged
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
5 changes: 3 additions & 2 deletions sklearn/utils/tests/test_validation.py
Expand Up @@ -1675,8 +1675,9 @@ def test_get_feature_names_invalid_dtypes(names, dtypes):
X = pd.DataFrame([[1, 2], [4, 5], [5, 6]], columns=names)

msg = re.escape(
"Feature names only support names that are all strings. "
f"Got feature names with dtypes: {dtypes}."
"Feature names only support names that are all strings. Got feature names with"
f" dtypes: {dtypes}. Please convert to a common type, for example using"
" X.columns = X.columns.astype(str)"
)
with pytest.raises(TypeError, match=msg):
names = _get_feature_names(X)
Expand Down
5 changes: 3 additions & 2 deletions sklearn/utils/validation.py
Expand Up @@ -1884,8 +1884,9 @@ def _get_feature_names(X):
# mixed type of string and non-string is not supported
if len(types) > 1 and "str" in types:
raise TypeError(
"Feature names only support names that are all strings. "
f"Got feature names with dtypes: {types}."
"Feature names only support names that are all strings. Got feature names"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two thoughts:

  1. the error message sounds like the problem is that I mixed strings and not-strings for my feature names. Which makes me think that if I exclusively had not-strings then it would work. I don't think this is true though, is it? -> "Feature names have to be strings. Got feature names ..."
  2. Should we help out users by telling them that it is the names of the columns of their pandas DF that need fixing? -> change "feature names" to "column names"? In the context of a pandas DF there is no such thing as a "feature name", as a user the thing I set was a column name. A bit later in the error message we give a hint that column names is the problem, so why not do that at the start of the error message as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding 1., the following would work:

In [1]: from sklearn.datasets import make_regression
   ...: from sklearn.tree import DecisionTreeRegressor
   ...: import pandas as pd
   ...: 
   ...: X, y = make_regression()
   ...: X_df = pd.DataFrame(X)
   ...: 
   ...: tree = DecisionTreeRegressor().fit(X_df, y)

However, feature_names_in_ will not exist because X.columns are not strings (which is documented).

Since at this point, we only want to fit a tree and potentially we potentially have a user that does not care about the feature_names, I think it could be enough to mention "if you want support for feature names, convert it to str...".

f" with dtypes: {types}. Please convert to a common type, for example using"
" X.columns = X.columns.astype(str)"
)

# Only feature names of all strings are supported
Expand Down