FunctionTransformer need feature_names_out even if func returns DataFrame #28780

Open
fedorkobak opened this issue Apr 6, 2024 · 6 comments
@fedorkobak

fedorkobak commented Apr 6, 2024

Describe the bug

Calling transform on a FunctionTransformer configured with feature_names_out raises an error that advises using set_output(transform='pandas'). But setting this doesn't change anything.

Steps/Code to Reproduce

import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

my_transformer = FunctionTransformer(
    lambda X : pd.concat(
        [
            X[col].rename(f"{col} {str(power)}")**power
            for col in X
            for power in range(2,4)
        ],
        axis=1
    ),
    feature_names_out = (
        lambda transformer, input_features: [
            f"{feature} {power_str}"
            for feature in input_features
            for power_str in ["square", "cubic"]
        ]
    )
)
# I specified transform=pandas
my_transformer.set_output(transform='pandas')
sample_size = 10
X = pd.DataFrame({
    "feature 1" : [1,2,3,4,5],
    "feature 2" : [3,4,5,6,7]
})
my_transformer.fit(X)
my_transformer.transform(X)

Expected Results

A pandas.DataFrame like the following:

   feature 1 square  feature 1 cubic  feature 2 square  feature 2 cubic
0                 1                1                 9               27
1                 4                8                16               64
2                 9               27                25              125
3                16               64                36              216
4                25              125                49              343

Actual Results

ValueError: The output generated by `func` have different column names than the ones provided by `get_feature_names_out`. Got output with columns names: ['feature 1 2', 'feature 1 3', 'feature 2 2', 'feature 2 3'] and `get_feature_names_out` returned: ['feature 1 square', 'feature 1 cubic', 'feature 2 square', 'feature 2 cubic']. The column names can be overridden by setting `set_output(transform='pandas')` or `set_output(transform='polars')` such that the column names are set to the names provided by `get_feature_names_out`.
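The mismatch reported in this message can be reproduced without scikit-learn by comparing the two sets of names directly. The following is a minimal sketch that reuses the logic of the `func` and `feature_names_out` arguments from the reproduction; the names `func`, `names_out`, `func_columns`, and `declared_names` are illustrative only and are not part of any library API.

```python
import pandas as pd


def func(X):
    # Same logic as the `func` lambda in the reproduction:
    # columns are named "<col> 2" and "<col> 3".
    return pd.concat(
        [X[col].rename(f"{col} {power}") ** power for col in X for power in range(2, 4)],
        axis=1,
    )


def names_out(transformer, input_features):
    # Same logic as the `feature_names_out` lambda in the reproduction:
    # names are "<feature> square" and "<feature> cubic".
    return [
        f"{feature} {power_str}"
        for feature in input_features
        for power_str in ["square", "cubic"]
    ]


X = pd.DataFrame({"feature 1": [1, 2, 3, 4, 5], "feature 2": [3, 4, 5, 6, 7]})

func_columns = list(func(X).columns)
declared_names = names_out(None, list(X.columns))

print(func_columns)
print(declared_names)
print(func_columns == declared_names)
```

The two lists differ in every entry, which is the inconsistency the ValueError describes.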

Versions

System:
    python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
executable: /usr/bin/python3
   machine: Linux-6.5.0-14-generic-x86_64-with-glibc2.35

Python dependencies:
      sklearn: 1.4.1.post1
          pip: 24.0
   setuptools: 68.2.2
        numpy: 1.24.2
        scipy: 1.11.1
       Cython: None
       pandas: 2.2.1
   matplotlib: 3.7.1
       joblib: 1.3.1
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /home/fedor/.local/lib/python3.10/site-packages/numpy.libs/libopenblas64_p-r0-15028c96.3.21.so
        version: 0.3.21
threading_layer: pthreads
   architecture: Haswell
    num_threads: 12

       user_api: openmp
   internal_api: openmp
         prefix: libgomp
       filepath: /home/fedor/.local/lib/python3.10/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None
    num_threads: 12

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /home/fedor/.local/lib/python3.10/site-packages/scipy.libs/libopenblasp-r0-23e5df77.3.21.dev.so
        version: 0.3.21.dev
threading_layer: pthreads
   architecture: Haswell
    num_threads: 12
fedorkobak added the Bug and Needs Triage labels Apr 6, 2024
@lesteve
Member

lesteve commented Apr 8, 2024

There are at least two things:

  1. you want to fix your code: not setting feature_names_out would work. If you want to tweak the column names, I would suggest doing it in your FunctionTransformer func argument (i.e. the first positional argument).
  2. you are saying that the error message is not very helpful in your case. It does say that the column names don't match, but I agree that the part about "The column names can be overridden" seems to imply you can change the column names without worrying about whether they match.
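The first suggestion could look roughly like the sketch below: feature_names_out is left unset and the desired column names are produced inside func itself. The names `POWER_NAMES` and `add_powers` are hypothetical, chosen only for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Hypothetical mapping from exponent to the suffix used in the column name.
POWER_NAMES = {2: "square", 3: "cubic"}


def add_powers(X):
    # Build the output DataFrame with the final column names directly,
    # so no separate `feature_names_out` is needed for `transform`.
    return pd.DataFrame(
        {
            f"{col} {name}": X[col] ** power
            for col in X
            for power, name in POWER_NAMES.items()
        }
    )


transformer = FunctionTransformer(add_powers)  # feature_names_out not set

X = pd.DataFrame({"feature 1": [1, 2, 3, 4, 5], "feature 2": [3, 4, 5, 6, 7]})
out = transformer.fit_transform(X)
print(out)
```

This only covers the output of transform; it does not by itself make get_feature_names_out available on the transformer.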

@fedorkobak
Author

fedorkobak commented Apr 8, 2024

@lesteve you're right. If I change my code snippet as follows:

import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

my_transformer = FunctionTransformer(
    lambda X : pd.DataFrame(
        {
            f"{str(col)}^{power}" : X[col]**power
            for col in X
            for power in range(2,4)
        }
    ),
    feature_names_out = (
        lambda transformer, input_features: [
            f"{str(feature)}^{power}"
            for feature in input_features
            for power in range(2,4)
        ]
    )
)

my_transformer.set_output(transform='pandas')
sample_size = 10
X = pd.DataFrame({
    "feature 1" : [1,2,3,4,5],
    "feature 2" : [3,4,5,6,7]
})
my_transformer.fit(X)
my_transformer.transform(X)
my_transformer.get_feature_names_out()

So when the output columns of func are the same as the result of feature_names_out, everything works fine. Thank you.

But in my opinion it would be more intuitive if FunctionTransformer just used the result of feature_names_out, because you need to define it anyway if you want to build a pipeline that can provide information about feature names in later steps. Also, older versions of sklearn behaved this way: just try my first snippet in version 1.3.0 and everything works fine.

If that's the way it's intended, you can close this issue.

@lesteve
Member

lesteve commented Apr 9, 2024

This seems related to #28241 and #27801. cc @glemaitre since he has this in his brain cache more than me.

My naive (and apparently wrong) expectation would have been that if your func returns a DataFrame, you don't need to use the feature_names_out argument and the get_feature_names_out returns the columns of the DataFrame that func returns.

@fedorkobak small tip: you can use syntax highlighting in markdown to make code snippets more readable, see https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/creating-and-highlighting-code-blocks#syntax-highlighting for more details. I have edited your comment accordingly.

@glemaitre
Member

The error message might be missing a piece of information: if you use set_output, then you can remove feature_names_out. If both are set, then they need to be consistent. So we might further improve the error message.

@fedorkobak
Author

fedorkobak commented Apr 9, 2024

@glemaitre. Do you mean something like this?

from sklearn.preprocessing import FunctionTransformer
import pandas as pd

my_transformer = FunctionTransformer(
    lambda X : pd.DataFrame(
        {
            f"{str(col)}^{power}" : X[col]**power
            for col in X
            for power in range(2,4)
        }
    )
    # no feature_names_out
)
X = pd.DataFrame({
    "feature 1" : [1,2,3,4,5],
    "feature 2" : [3,4,5,6,7]
})
my_transformer.set_output(transform="pandas")
my_transformer.fit_transform(X)
# raises: AttributeError: This 'FunctionTransformer' has no attribute 'get_feature_names_out'
my_transformer.get_feature_names_out()

I called set_output(transform="pandas") on the transformer and didn't pass feature_names_out to the constructor. As far as I understand, get_feature_names_out should then return the columns of the output DataFrame. But it throws another error: AttributeError: This 'FunctionTransformer' has no attribute 'get_feature_names_out'.

@lesteve
Member

lesteve commented Apr 10, 2024

Quoting @glemaitre: "The error message might be missing a piece of information: if you use set_output, then you can remove feature_names_out"

I was expecting something similar, but it doesn't seem to work, as I hinted above and as the snippet in #28780 (comment) shows. You need to specify feature_names_out; otherwise .get_feature_names_out does not exist ...

lesteve removed the Needs Triage label Apr 10, 2024
lesteve changed the title from "FunctionTransformer ignores set_output(transform='pandas') which raises ValueError when setting columns for output" to "FunctionTransformer need feature_names_out even if func returns DataFrame" Apr 15, 2024