Parse Hypothesis strategies for schema inference, validation, and synthesis #399

Closed
crypdick opened this issue Feb 3, 2021 · 15 comments · Fixed by #410
Labels: enhancement (New feature or request)

@crypdick

crypdick commented Feb 3, 2021

Is your feature request related to a problem? Please describe.
Pandera's new synthesis feature is very attractive, but is broken if the constraints are non-trivial (e.g. pa.Check.str_matches(complicated_regex)). Related discussion here.

Describe the solution you'd like
Ideally, we could use complex schemas for both validation and synthesis. For that to be possible, Pandera needs to generate data more efficiently, which means avoiding rejection sampling. Luckily, Hypothesis has solved the efficient synthesis problem, but Hypothesis strategies can't be used for validation.

It would be great if Pandera could parse a Hypothesis data_frames strategy into Schemas for validation:

my_schema = pa.infer_schema(hypothesis.extra.pandas.data_frames(...))

The Schema, then, could use hypothesis for efficient data generation.
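
To make the rejection-sampling point concrete, here is a minimal standard-library sketch. It is not how Pandera or Hypothesis work internally; the pattern, alphabet, and function names are all hypothetical and much simpler than the real regex. It contrasts drawing unconstrained strings and discarding non-matches with building a match piece by piece.

```python
import random
import re
import string

# Hypothetical, heavily simplified pattern (not the regex from this issue).
PATTERN = re.compile(r"s3://[a-z0-9]+/v[0-9]/[0-9]+[.]png")
ALNUM = string.ascii_lowercase + string.digits
ALPHABET = ALNUM + ":/."


def rejection_sample(rng, max_tries):
    """Draw unconstrained strings and return the first match, if any."""
    for attempt in range(1, max_tries + 1):
        length = rng.randint(1, 30)
        candidate = "".join(rng.choice(ALPHABET) for _ in range(length))
        if PATTERN.fullmatch(candidate):
            return candidate, attempt
    return None, max_tries


def construct(rng):
    """Assemble a match piece by piece, so no draw is ever discarded."""
    bucket = "".join(rng.choice(ALNUM) for _ in range(rng.randint(1, 8)))
    version = rng.choice(string.digits)
    name = "".join(rng.choice(string.digits) for _ in range(rng.randint(1, 6)))
    return f"s3://{bucket}/v{version}/{name}.png"


rng = random.Random(0)
constructed = construct(rng)
sampled, tries = rejection_sample(rng, max_tries=2_000)
print(constructed, sampled, tries)
```

The constructed value always matches, while the rejection sampler will almost certainly exhaust its budget, because a random string has a vanishingly small chance of starting with `s3://` and ending with `.png`.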

Describe alternatives you've considered
I've tried disabling Hypothesis's health checks, but "Hypothesis literally ran out of random bytes to parse into your dataframe, and there's not really anything that [Hypothesis] can do about that".

Additional context
We were attracted to Pandera over Great Expectations due to the ability to create hypothesis strategies directly from schemas. This saves us the labor of maintaining separate validation schemas and Hypothesis strategies.

crypdick added the enhancement (New feature or request) label Feb 3, 2021
@cosmicBboy
Collaborator

hey @crypdick, yes getting data synthesis right is pretty challenging and I'd like to smooth out the rough edges. Just so I get a better sense of the issue, can you provide an example schema that's causing the health check issues?

@crypdick
Author

crypdick commented Feb 3, 2021

@cosmicBboy sure! I had to obfuscate sensitive details but hopefully this gives you a rough idea.

import re

import pandera as pa

# simplified for security
IMG_URL_REGEX = re.compile(
    r"""
    ^s3://   # prefix
    [-a-z0-9]+/
    [A-Za-z0-9-]+/
    20[1-2][0-9]/  # years 2010-2029
    [A-Za-z0-9-_,.]+/  # including _, commas, period
    v[0-9]/  # versions v0-v9
    2[4-5]/ 
    [0-9]+/
    [0-9]+  # file name
    [.]png$  # extension, escape dot
    """,
    re.X,
)


valid_ids = {str(i) for i in range(1, 1000)}   # placeholder for more complicated logic

is_valid_img_url = pa.Check.str_matches(IMG_URL_REGEX)
is_not_empty_series = pa.Check(lambda series_: len(series_) > 0, name="Series not empty")
is_valid_id = pa.Check.isin(valid_ids)

# class IDs joined by |
# this later gets filtered to ensure each element is a valid ID (invalid IDs throw Exceptions downstream)
numeric_str_delim_by_pipes = re.compile(
    r"""[0-9]{1,4}  # in reality, this is more complicated
        (\|[0-9]{1,4})*  # 0 or more IDs
        """,
    re.X,
)

grouped_by_s3uri_schema = pa.DataFrameSchema(
    columns={
        "img_url": pa.Column(
            pa.String,
            allow_duplicates=False,
            nullable=False,
            checks=[is_valid_img_url, is_not_empty_series],  # most restrictive check first
        ),
        "labels": pa.Column(
            pa.String,
            checks=[
                pa.Check.str_matches(numeric_str_delim_by_pipes),
                pa.Check(
                    lambda str_delim_pipes: all([id_ in valid_ids for id_ in str_delim_pipes.split("|")]),
                    element_wise=True,
                ),
            ],
        ),
    }
)

@cosmicBboy
Collaborator

cosmicBboy commented Feb 3, 2021

> It would be great if Pandera could parse a Hypothesis data_frames strategy into Schemas for validation:
>
> my_schema = pa.infer_schema(hypothesis.extra.pandas.data_frames(...))
>
> The Schema, then, could use hypothesis for efficient data generation.

This sounds complicated 😅 (I did sort of look into this when building out the data synthesis functionality but found it way over my head, I'm not sure if there's a straightforward way of parsing complex multi-part strategies)

Note that rejection sampling only occurs after the first check in the checks list, so inefficiencies in the first check would not be due to rejection sampling. Testing the code locally, it looks like the img_url checks are triggering the data_too_large and large_base_example health checks, and the labels checks are triggering filter_too_much.

I think it's up to pandera to make data generation as easy as possible and to provide users with an interface to customize data synthesis using the hypothesis API when built-in checks/strategies aren't up to snuff.

Another potential solution: custom checks and strategies

The extensions API was designed to give users access to the hypothesis API to make custom data synthesis strategies. This lets you register custom checks into the Check namespace and associate each one with a data-generation strategy. You can effectively combine multiple checks into one, or implement any strategy you want with the hypothesis API (with a little bit of ceremony imposed by pandera).

You can do something like:

import re
from typing import Optional

import hypothesis
import hypothesis.strategies as st
import pandas as pd
import pandera as pa
import pandera.extensions as extensions


# simplified for security
IMG_URL_REGEX = re.compile(
    r"""
    ^s3://   # prefix
    [-a-z0-9]+/
    [A-Za-z0-9-]+/
    20[1-2][0-9]/  # years 2010-2029
    [A-Za-z0-9-_,.]+/  # including _, commas, period
    v[0-9]/  # versions v0-v9
    2[4-5]/ 
    [0-9]+/
    [0-9]+  # file name
    [.]png$  # extension, escape dot
    """,
    re.X,
)

NUMERIC_STR_DELIM_REGEX = re.compile(
    r"""[0-9]{1,4}  # in reality, this is more complicated
        (\|[0-9]{1,4})*  # 0 or more IDs
        """,
    re.X,
)

VALID_IDS = [str(i) for i in range(1, 1000)]


# Define custom url strategy and check
def url_strategy(
    pandas_dtype: pa.PandasDtype,
    strategy: Optional[st.SearchStrategy] = None,
    *,
    url_regex,
):
    if strategy is None:
        # replace this with more efficient Hypothesis strategy if desired
        return st.from_regex(url_regex, fullmatch=True)
    raise pa.errors.BaseStrategyOnlyError(
        "'url_strategy' must be a base strategy"
    )


@extensions.register_check_method(
    statistics=["url_regex"],
    strategy=url_strategy,
    supported_types=pd.Series,
)
def valid_url(pandas_obj, *, url_regex):
    """Url regex check."""
    return pandas_obj.str.match(url_regex, na=False)


# Define custom label strategy and check
def labels_strategy(
    pandas_dtype: pa.PandasDtype,
    strategy: Optional[st.SearchStrategy] = None,
    *,
    valid_ids,
):
    if strategy is None:
        # replace this with more efficient Hypothesis strategy if desired;
        # sample from the valid_ids statistic passed in by the check
        return st.lists(
            st.sampled_from(valid_ids), unique=True, min_size=1
        ).map("|".join)
    raise pa.errors.BaseStrategyOnlyError(
        "'labels_strategy' must be a base strategy"
    )


@extensions.register_check_method(
    statistics=["valid_ids"],
    strategy=labels_strategy,
    supported_types=pd.Series,
)
def valid_labels(pandas_obj, *, valid_ids):
    # combines the regex match check and the valid_ids check
    valid_ids = set(valid_ids)
    return pandas_obj.str.match(NUMERIC_STR_DELIM_REGEX) & (
        pandas_obj.map(lambda x: all(id_ in valid_ids for id_ in x.split("|")))
    )


schema = pa.DataFrameSchema(
    columns={
        "img_url": pa.Column(
            pa.String,
            allow_duplicates=False,
            nullable=False,
            checks=[
                pa.Check.valid_url(url_regex=IMG_URL_REGEX),
                pa.Check(
                    lambda series_: len(series_) > 0, name="Series not empty"
                ),
            ],
        ),
        "labels": pa.Column(
            pa.String,
            checks=pa.Check.valid_labels(valid_ids=VALID_IDS),
        ),
    }
)


@hypothesis.given(schema.strategy(size=5))
def test_schema(df):
    print(df)
    # test something

Phew! Thanks for bearing with me. All that said, I did find some inefficiencies in the way the dataframe strategy was being constructed. PR #400 should make the img_url health checks go away, and the code as you had it in your sample snippet should then work. Feel free to pull the master branch and try it out locally!

For the label generation code I'd recommend using the labels_strategy + valid_labels approach that I implemented above, since it generates data by sampling from the set of valid ids instead of from the regex, which guarantees that the |-delimited strings contain only valid ids.
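
As a sanity check that does not involve pandera, the sampling idea can be exercised with Hypothesis alone. This is a minimal sketch: the constants are assumed stand-ins for the ones in the snippet above, and `max_size=10` plus the settings values are arbitrary choices made here to keep the run short.

```python
import re

import hypothesis.strategies as st
from hypothesis import given, settings

# Assumed stand-ins for the names used in the snippet above.
VALID_IDS = [str(i) for i in range(1, 1000)]
VALID_ID_SET = set(VALID_IDS)
NUMERIC_STR_DELIM_REGEX = re.compile(r"[0-9]{1,4}(\|[0-9]{1,4})*")

# Sample distinct valid IDs, then join them with "|".
labels = st.lists(
    st.sampled_from(VALID_IDS), unique=True, min_size=1, max_size=10
).map("|".join)


@settings(max_examples=50, deadline=None, database=None)
@given(labels)
def test_labels_are_valid(value):
    # Both checks hold by construction, so nothing needs to be filtered out.
    assert NUMERIC_STR_DELIM_REGEX.fullmatch(value)
    assert all(id_ in VALID_ID_SET for id_ in value.split("|"))


test_labels_are_valid()
```

Because every drawn element already comes from the valid set, the strategy never has to discard a candidate, which is what avoids the filter_too_much health check.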

@cosmicBboy
Collaborator

cosmicBboy commented Feb 3, 2021

Other types of solutions would be fairly heavy lifts, i.e. they would mean going into the guts of the pandera.strategies module:

  • supporting composite strategies as a way of chaining multiple checks together.
  • providing an interface for "multi-check resolution", which basically aggregates the statistics of multiple checks and constructs a single strategy from them.
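
A rough sketch of what "multi-check resolution" could look like for numeric bounds, in plain Python. `BoundCheck` and `resolve_bounds` are hypothetical names invented for illustration and are not part of pandera; the idea is only to fold the statistics of several checks into one tightest range before any data is generated.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class BoundCheck:
    """Hypothetical stand-in for a check that records its statistics."""

    min_value: Optional[int] = None
    max_value: Optional[int] = None


def resolve_bounds(
    checks: Iterable[BoundCheck],
) -> Tuple[Optional[int], Optional[int]]:
    """Fold many bound checks into the single tightest (min, max) pair."""
    lo: Optional[int] = None
    hi: Optional[int] = None
    for check in checks:
        if check.min_value is not None:
            lo = check.min_value if lo is None else max(lo, check.min_value)
        if check.max_value is not None:
            hi = check.max_value if hi is None else min(hi, check.max_value)
    if lo is not None and hi is not None and lo > hi:
        raise ValueError(f"checks are unsatisfiable: min {lo} > max {hi}")
    return lo, hi


checks = [BoundCheck(min_value=0), BoundCheck(max_value=100), BoundCheck(min_value=10)]
bounds = resolve_bounds(checks)
print(bounds)  # (10, 100)
```

The resolved pair could then be handed to a single base strategy, for example Hypothesis's `st.integers(min_value=lo, max_value=hi)`, instead of chaining one filter per check.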

cosmicBboy added this to the 0.6.2 release milestone Feb 5, 2021
@crypdick
Author

crypdick commented Feb 5, 2021

@cosmicBboy Tyvm for the detailed answer! When I copy-paste your code, I get a TypeError: url_strategy() missing 1 required keyword-only argument: 'check_regex'.

@cosmicBboy
Collaborator

woops! just edited the code snippet, should work now

@crypdick
Author

crypdick commented Feb 5, 2021

@cosmicBboy you beat me to it :) ty for pointing me to the extensions docs, I hadn't noticed this feature before

@Zac-HD
Contributor

Zac-HD commented Feb 6, 2021

> > It would be great if Pandera could parse a Hypothesis data_frames strategy into Schemas for validation:
> > my_schema = pa.infer_schema(hypothesis.extra.pandas.data_frames(...))
> > The Schema, then, could use hypothesis for efficient data generation.
>
> This sounds complicated 😅 (I did sort of look into this when building out the data synthesis functionality but found it way over my head, I'm not sure if there's a straightforward way of parsing complex multi-part strategies)

Ah, yeah. As a Hypothesis core dev I wouldn't try this; complex strategies like data_frames() often have all the relevant structure inside @composite strategy functions and pulling it back out of the closures is totally infeasible.

A better approach, at least for element-wise checks, would be to get our "efficient filter rewriting" work (HypothesisWorks/hypothesis#2701) done. I think it would be reasonably simple to support numeric bounds and string regex patterns for an initial pass, and that would probably cover a lot of your use-cases.

@cosmicBboy
Collaborator

Thanks for the pointers @Zac-HD! Also, just wanted to say I'm a big fan of hypothesis, it's made a huge difference in the way I test DS/ML code (indeed, pandera strategies simply wrap hypothesis for convenience).

The current implementation of check strategy chaining in pandera does rely heavily on filter, really as a first pass to offer this functionality. I'll look into using composite and map more to make strategy chaining more efficient.
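
For readers unfamiliar with the distinction, here is a minimal Hypothesis sketch of filter versus map and composite. It is not pandera's implementation; the strategies, bounds, and settings values are illustrative assumptions only.

```python
import hypothesis.strategies as st
from hypothesis import given, settings

# Filter-based: draws any integer, then discards the odd ones.
evens_by_filter = st.integers().filter(lambda x: x % 2 == 0)

# Map-based: every draw is transformed into a valid value, nothing is discarded.
evens_by_map = st.integers().map(lambda x: x * 2)


@st.composite
def ordered_pair(draw):
    """Draw `low` first, then draw `high` so that low <= high always holds."""
    low = draw(st.integers(min_value=0, max_value=100))
    high = draw(st.integers(min_value=low, max_value=200))
    return low, high


@settings(max_examples=50, deadline=None, database=None)
@given(evens_by_map, ordered_pair())
def test_constructive_strategies(even, pair):
    assert even % 2 == 0
    low, high = pair
    assert low <= high


test_constructive_strategies()
```

The filter version is fine when most draws pass, but as the predicate gets stricter more draws are thrown away, whereas the map and composite versions stay valid by construction.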

@crypdick
Author

crypdick commented Feb 9, 2021

@cosmicBboy the solution you posted seems to break down when used with indexes. In particular, if you edit the "img_url": pa.Column(...) part to

    index=pa.Index(
        pa.String,
        allow_duplicates=False,
        nullable=False,
        name="img_url",
        checks=[
            is_valid_img_url,
            is_not_empty_series,
        ]
    ),

and then run my_schema.strategy().example(), we get a ValueError: Length mismatch: Expected axis has 7 elements, new values have 0 elements. However, when we use it in @given it works fine.

@cosmicBboy
Collaborator

@crypdick do you have a stacktrace of the error? this is definitely a bug

@crypdick
Author

crypdick commented Feb 9, 2021

@cosmicBboy sure, here you go:

  File "/home/richard/.config/JetBrains/PyCharm2021.1/scratches/scratch_2.py", line 85, in <module>
    print(multihot_dataset_schema.strategy().example())
  File "/home/richard/src/DENDRA/venv/lib/python3.8/site-packages/hypothesis/strategies/_internal/strategies.py", line 319, in example
    example_generating_inner_function()
  File "/home/richard/src/DENDRA/venv/lib/python3.8/site-packages/hypothesis/strategies/_internal/strategies.py", line 307, in example_generating_inner_function
    @settings(
  File "/home/richard/src/DENDRA/venv/lib/python3.8/site-packages/hypothesis/core.py", line 1163, in wrapped_test
    raise the_error_hypothesis_found
  File "/home/richard/src/DENDRA/venv/lib/python3.8/site-packages/pandera/strategies.py", line 117, in set_pandas_index
    df_or_series.index = index
  File "/home/richard/src/DENDRA/venv/lib/python3.8/site-packages/pandas/core/generic.py", line 5154, in __setattr__
    return object.__setattr__(self, name, value)
  File "pandas/_libs/properties.pyx", line 66, in pandas._libs.properties.AxisProperty.__set__
  File "/home/richard/src/DENDRA/venv/lib/python3.8/site-packages/pandas/core/generic.py", line 564, in _set_axis
    self._mgr.set_axis(axis, labels)
  File "/home/richard/src/DENDRA/venv/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 226, in set_axis
    raise ValueError(
ValueError: Length mismatch: Expected axis has 6 elements, new values have 0 elements

@cosmicBboy
Collaborator

hey @crypdick, #410 should fix the issue! There was a bug where the generated df and index would not have the same length, which led to the length mismatch.

Will be merging into master probably by tomorrow. In the meantime, you can circumvent this bug by providing a value for the size parameter: my_schema.strategy(size=<int>).example()

@crypdick
Author

Sounds great! Cc @daavidstein

@cosmicBboy
Collaborator

@crypdick let me know if you come across any other problems relating to this issue! we can reopen it if needed
