
Hypothesis examples are all the same #1579

Closed · tmcclintock opened this issue Apr 16, 2024 · 7 comments
Labels: bug (Something isn't working)

Comments

@tmcclintock (Contributor) commented Apr 16, 2024

Describe the bug
Calling schema.example() generates size number of identical rows. This is not desirable, since the whole purpose of pandera + hypothesis is to create rich examples with a lot of variety across rows.

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandera.
  • (optional) I have confirmed this bug exists on the master branch of pandera.

Code Sample, a copy-pastable example

import hypothesis
import pandera
from pandera import Check, Column, DataFrameSchema

print(hypothesis.__version__, pandera.__version__)

schema = DataFrameSchema(
    {
        "column1": Column(int, Check.ge(0)),
        "column2": Column(int, [Check.ge(1), Check.le(100)]),
        "column3": Column(float, Check.ge(0)),
        "column4": Column(str, Check.isin(["AAA", "BBB", "CCC"])),
    }
)

print(schema.example(size=20))

yields

6.100.1 0.18.3
    column1  column2  column3 column4
0         0        1      0.0     AAA
1         0        1      0.0     AAA
2         0        1      0.0     AAA
3         0        1      0.0     AAA
4         0        1      0.0     AAA
5         0        1      0.0     AAA
6         0        1      0.0     AAA
7         0        1      0.0     AAA
8         0        1      0.0     AAA
9         0        1      0.0     AAA
10        0        1      0.0     AAA
11        0        1      0.0     AAA
12        0        1      0.0     AAA
13        0        1      0.0     AAA
14        0        1      0.0     AAA
15        0        1      0.0     AAA
16        0        1      0.0     AAA
17        0        1      0.0     AAA
18        0        1      0.0     AAA
19        0        1      0.0     AAA

Note that both the pandera and hypothesis versions printed above are the latest releases.

Expected behavior

All 20 rows should look different from one another.

Desktop (please complete the following information):

  • OS: Mac 14.4.1 with M1 Pro
  • python --version --version yields
Python 3.9.17 (main, Jun 29 2023, 09:32:35) 
[Clang 14.0.3 (clang-1403.0.22.14.1)]
  • pandera 0.18.3
  • hypothesis 6.100.1

Additional context

I suspected this was related to #1503, but the issue persisted after doing pip install pandera==0.18.0 hypothesis, and again when I dropped down to pandera 0.17.2.

The same behavior also occurs when using DataFrameModel.

The only time I can get high-entropy rows is when checks is a single Check rather than a list[Check], like this:

schema = DataFrameSchema(
    {
        "column1": Column(int, Check.ge(0)),
        "column2": Column(int, Check.le(100)),
        "column3": Column(float, Check.ge(0)),
        "column4": Column(str, Check.isin(["AAA", "BBB", "CCC"])),
    }
)

The output of this is as expected:

              column1              column2       column3 column4
0               57854                   42  1.401298e-45     BBB
1               40006          -1198404347  1.192093e-07     BBB
2               44174                   55  2.220446e-16     CCC
3         12935430764 -4986092864707543051  1.000000e-05     AAA

However, this is far less useful than having all of the checks apply.

@tmcclintock added the bug (Something isn't working) label Apr 16, 2024
@cosmicBboy (Collaborator)

Looks like this is an issue with the way pandera strategies try to chain together multiple checks, e.g.

schema = DataFrameSchema(
    {
        "column1": Column(int, Check.ge(0)),
        "column2": Column(int, [Check.in_range(1, 100)]),  # 👈 use a single in_range check instead of ge and le
        "column3": Column(float, Check.ge(0)),
        "column4": Column(str, Check.isin(["AAA", "BBB", "CCC"])),
    }
)

produces

6.100.1 0.0.0+dev0
                column1  column2        column3 column4
0                     0        1   3.402823e+38     AAA
1                     0        1   2.882304e+16     CCC
2                     0        6   2.000010e+00     BBB
3                   247       47   9.999900e-01     BBB
4                 19526       50  1.390036e+164     AAA
5                 56223       63  2.225074e-308     AAA
6                    42       15   7.357397e+15     BBB
7                    97       62   9.999900e-01     CCC
8                     0       69   3.293796e+09     AAA
9   9216616637413720064        4   1.000000e+07     AAA
10    23090105669335094       14   5.397605e-78     CCC
11                    0       50   1.192093e-07     CCC
12           1260840409       98   1.500000e+00     AAA
13                21966       68   1.100000e+00     AAA
14                23289       21   3.333333e-01     CCC
15   912854047966763290       27   6.519203e+16     BBB
16  8876389219764502267        9  5.706631e-178     CCC
17                40004       40   1.500000e+00     CCC
18                  247       77   5.742309e+16     BBB
19                47285       17   1.175494e-38     AAA

@cosmicBboy (Collaborator) commented Apr 18, 2024

Okay, so it seems like generating smaller dataframes yields higher-entropy results:

print(schema.example(size=5))

# generates different datasets
               column1  column2        column3 column4
0                  152        1   9.007199e+15     BBB
1  9223372036854775807        1   1.192093e-07     CCC
2  4148323564460896226       56   6.189641e+16     BBB
3                  123       83   6.103516e-05     CCC
4                32240        2  1.112537e-308     BBB
print(schema.example(size=10))

# we see this consistently
   column1  column2  column3 column4
0    31078        1      0.0     AAA
1        0        1      0.0     AAA
2        0        1      0.0     AAA
3        0        1      0.0     AAA
4        0        1      0.0     AAA
5        0        1      0.0     AAA
6        0        1      0.0     AAA
7        0        1      0.0     AAA
8        0        1      0.0     AAA
9        0        1      0.0     AAA

@tmcclintock, my recommendations would be:

  • generate a bunch of smaller dataframes and concat them; dataframes of about size 5 seem to be the magic number (see the sketch below).
  • restrict your schemas to have only one check per column (this is pretty unreasonable, though).
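
A minimal sketch of the first workaround, assuming pandas is installed and reusing the schema from the original report (the variable names here are my own):

import pandas as pd

# Draw several small, independently generated examples and stack them;
# each example(size=5) call tends to produce diverse rows on its own.
df = pd.concat(
    [schema.example(size=5) for _ in range(4)],  # 4 x 5 = 20 rows
    ignore_index=True,
)
print(df)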

@Zac-HD any ideas on how to address this? On the pandera side, it would make sense to collect all the schema statistics and combine them all into a single element strategy so we don't have to rely on filter, but that'll require a larger refactoring project.
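
For context, a hand-rolled sketch of the two approaches (this is not pandera's internals, and note that hypothesis's filter-rewriting can sometimes recover bounds from simple predicates like these):

import hypothesis.strategies as st

# Check statistics applied as chained filters: draws that fail either
# predicate are discarded and retried.
filtered = st.integers().filter(lambda x: x >= 1).filter(lambda x: x <= 100)

# The same statistics collected up front into a single bounded strategy,
# so every draw is valid by construction.
combined = st.integers(min_value=1, max_value=100)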

@Zac-HD (Contributor) commented Apr 18, 2024

  1. Check whether you see more-diverse outputs if you actually run the test (see the sketch below)? Strategies' .example() method often biases towards simpler outputs (for complicated internal reasons), and dataframes are typically 'sparse' as well - so you might get a fill-value and then few-or-no other values.
  2. Eventually you're going to have to do that project, yeah. The filter-rewriting should be able to handle this case though, so I suspect that there's a simpler fix for this specific issue somewhere in Pandera.
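
A sketch of what running the strategy in an actual test could look like, using pandera's documented schema.strategy() API with the schema from the original report (the test name is made up):

import hypothesis

@hypothesis.given(schema.strategy(size=20))
def test_generated_frames(df):
    # hypothesis drives many draws through this body, so inspecting df
    # across runs reveals diversity that a single .example() call hides
    schema.validate(df)

test_generated_frames()  # @given-wrapped functions can be called directly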

@tmcclintock (Contributor, Author)

Thanks, both. Feel free to close this issue if you feel like it. FWIW, IMO the .example() API of pandera is one of its strongest features. I'd love for it to be performant one day!

@cosmicBboy (Collaborator)

It might make sense to bring back the warning that hypothesis raises with example(). It's really meant for interactive debugging and examining strategies, not for any serious production context. The intended use is as demonstrated here: https://pandera.readthedocs.io/en/stable/data_synthesis_strategies.html#usage-in-unit-tests.
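
A rough sketch of what re-surfacing that warning might look like; NonInteractiveExampleWarning is hypothesis's real warning class, but the wrapper below is hypothetical, not pandera's actual code:

import warnings
from hypothesis.errors import NonInteractiveExampleWarning

def example(self, size=None):
    # hypothetical wrapper: warn before delegating to the strategy
    warnings.warn(
        "example() is meant for interactively exploring strategies; "
        "prefer schema.strategy() with @hypothesis.given in tests.",
        NonInteractiveExampleWarning,
        stacklevel=2,
    )
    return self.strategy(size=size).example()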

@tmcclintock (Contributor, Author)

Yup, that's what I use it for. Pandera lets me mock entire machine learning pipelines. The issue is that I usually want more than 5 rows of mock data :).

@cosmicBboy (Collaborator)

Closing this issue. @tmcclintock, FYI I created #1625 to articulate what would be needed to improve the performance of pandera strategies overall.
