Hypothesis examples are all the same #1579
Comments
Looks like this is an issue with the way pandera strategies try to chain together multiple checks, e.g.
schema = DataFrameSchema(
    {
        "column1": Column(int, Check.ge(0)),
        "column2": Column(int, [Check.in_range(1, 100)]),  # 👈 use a single in_range check instead of ge and le
        "column3": Column(float, Check.ge(0)),
        "column4": Column(str, Check.isin(["AAA", "BBB", "CCC"])),
    }
)
produces
Okay, so it seems like generating smaller dataframes yields higher entropy results:
print(schema.example(size=5))
# generates different datasets
               column1  column2        column3  column4
0                  152        1   9.007199e+15      BBB
1  9223372036854775807        1   1.192093e-07      CCC
2  4148323564460896226       56   6.189641e+16      BBB
3                  123       83   6.103516e-05      CCC
4                32240        2  1.112537e-308      BBB
print(schema.example(size=10))
# we see this consistently
   column1  column2  column3  column4
0    31078        1      0.0      AAA
1        0        1      0.0      AAA
2        0        1      0.0      AAA
3        0        1      0.0      AAA
4        0        1      0.0      AAA
5        0        1      0.0      AAA
6        0        1      0.0      AAA
7        0        1      0.0      AAA
8        0        1      0.0      AAA
9        0        1      0.0      AAA
@tmcclintock recommendations would be:
@Zac-HD any ideas on how to address this? On the pandera side, it would make sense to collect all the schema statistics and combine them all into a single element strategy so we don't have to rely on |
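To make the "combine the statistics" idea concrete, here is a minimal standard-library sketch, not pandera's actual implementation. It assumes each range-style check can be reduced to an optional inclusive lower and upper bound; `Bound`, `merge_bounds`, and `draw_ints` are hypothetical names invented for illustration. Several checks are folded into one range, and values are then drawn directly from that range.

```python
import random
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Bound:
    """Hypothetical stand-in for the statistics of one range-style check."""

    min_value: Optional[int] = None  # inclusive lower bound, None = unbounded
    max_value: Optional[int] = None  # inclusive upper bound, None = unbounded


def merge_bounds(bounds: Iterable[Bound]) -> Bound:
    """Fold several bounds into the tightest single bound that satisfies all."""
    lo: Optional[int] = None
    hi: Optional[int] = None
    for b in bounds:
        if b.min_value is not None:
            lo = b.min_value if lo is None else max(lo, b.min_value)
        if b.max_value is not None:
            hi = b.max_value if hi is None else min(hi, b.max_value)
    if lo is not None and hi is not None and lo > hi:
        raise ValueError(f"checks are contradictory: min {lo} > max {hi}")
    return Bound(lo, hi)


def draw_ints(bound: Bound, size: int, seed: int = 0) -> List[int]:
    """Draw integers directly from a fully bounded range (no rejection step)."""
    if bound.min_value is None or bound.max_value is None:
        raise ValueError("this sketch only samples from fully bounded ranges")
    rng = random.Random(seed)
    return [rng.randint(bound.min_value, bound.max_value) for _ in range(size)]


# ge(1) and le(100), expressed as two separate bounds, collapse into one range.
merged = merge_bounds([Bound(min_value=1), Bound(max_value=100)])
sample = draw_ints(merged, size=10)
```

With hypothesis, the merged bound would presumably be passed to a single bounded strategy such as `hypothesis.strategies.integers(min_value=..., max_value=...)`, which is the same effect as writing one `in_range` check instead of separate `ge` and `le` checks.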
Thanks, both. Feel free to close this issue if you feel like it. FWIW, IMO the |
It might make sense to bring back the warning that |
Yup, that's what I use it for. Pandera lets me mock entire machine learning pipelines. The issue is, usually I want more than 5 rows of mock data :).
Closing this issue. @tmcclintock FYI, I created #1625 to articulate what would be needed to improve the performance of pandera strategies overall.
Describe the bug
Calling schema.example() generates size number of identical rows. This is not desirable, since the whole purpose of pandera + hypothesis is to create rich examples with a lot of variety in the rows.
Code Sample, a copy-pastable example
yields
Note that both the pandera and hypothesis versions are latest.
Expected behavior
All 20 columns look very different
Desktop (please complete the following information):
python --version --version
yields
Additional context
I suspected this is related to #1503, but after doing pip install pandera==0.18.0 hypothesis the issue still persisted. Same thing when I dropped down to pandera 0.17.2.
Also the same behavior occurs when using DataFrameModels.
The only time I can get high-entropy rows is when checks is a Check and not a list[Check] like this:
The output of this is as expected:
However this is wayyy less useful than having all the checks apply.