Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH Improves error message for mixed types for feature names #25018

Merged
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
5 changes: 3 additions & 2 deletions sklearn/tests/test_base.py
Expand Up @@ -651,8 +651,9 @@ def transform(self, X):
df_mixed = pd.DataFrame(X_np, columns=["a", "b", 1, 2])
trans = NoOpTransformer()
msg = re.escape(
"Feature names only support names that are all strings. "
"Got feature names with dtypes: ['int', 'str']"
"Feature names only support column names that are all strings, but got dtypes:"
" ['int', 'str']. If you want support for feature names, convert the"
" column names to strings."
)
with pytest.raises(TypeError, match=msg):
trans.fit(df_mixed)
Expand Down
5 changes: 3 additions & 2 deletions sklearn/utils/tests/test_validation.py
Expand Up @@ -1675,8 +1675,9 @@ def test_get_feature_names_invalid_dtypes(names, dtypes):
X = pd.DataFrame([[1, 2], [4, 5], [5, 6]], columns=names)

msg = re.escape(
"Feature names only support names that are all strings. "
f"Got feature names with dtypes: {dtypes}."
"Feature names only support column names that are all strings, but got dtypes:"
Copy link
Member

@betatim betatim Nov 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
"Feature names only support column names that are all strings, but got dtypes:"
"Feature names only support column names that are strings, but got dtypes:"

The "all" is unnecessary and I find it confuses me more/makes for tricky reading/doesn't make things easier to understand.

(Only commenting once here, instead of on all tests and where the message comes from)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think w/o the all here, it could mean that only string feature names are supported and the rest can be ignored. The two don't mean the same thing to me, and the latter doesn't seem correct.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way I understood the feature is that if you specify a column name, it has to be a string. No other types are allowed. Is that right? If not ignore the below.

For me "column names that are all strings" is like "food made from all organic ingredients". The "all" is a statement about the ingredients, which in this case are exclusively organic. As a single column name can't consist of two different types it seems weird to say "all strings". "All column names have to be strings" is a statement about the column names, they all have to be strings.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's a logical fallacy there in your example. feature names are separate things, but food is one thing, therefore your example would be more like "food made from only organic ingredients", or a better example: "we can allow a group of people who are all above 18", and to me it makes senes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From reading the code I think the answer to my question is: no.

This means that after considering only the exception's message and the few lines of code around it I didn't understand what the problem was. Hence I don't have a good idea how to succinctly explain to a user what just happened and what they need to do to fix it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this ☝️ is an improvement over what we currently have

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I directly pushed the change to use Adrin's suggestion because it looks uncontroversially more clear and I think it's still the middle of the night for Thomas :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the last sentence about silencing the warning since it's not a warning actually, good catch from #25018 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should tell people how to disable it though. They don't have to pass feature names, this makes it look like they have to.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, I put it back but without mentioning the warning. Is it fine now ?

f" {dtypes}. If you want support for feature names, convert the"
" column names to strings."
)
with pytest.raises(TypeError, match=msg):
names = _get_feature_names(X)
Expand Down
6 changes: 4 additions & 2 deletions sklearn/utils/validation.py
Expand Up @@ -1884,8 +1884,10 @@ def _get_feature_names(X):
# mixed type of string and non-string is not supported
if len(types) > 1 and "str" in types:
raise TypeError(
"Feature names only support names that are all strings. "
f"Got feature names with dtypes: {types}."
"Feature names only support column names that are all strings, but got"
f" dtypes: {types}. If you want support for feature names, convert the"
" column names to strings. For example, by using X.columns ="
" X.columns.astype(str)."
)

# Only feature names of all strings are supported
Expand Down