Parse Hypothesis strategies for schema inference, validation, and synthesis #399

crypdick · 2021-02-03T18:16:36Z

Is your feature request related to a problem? Please describe.
Pandera's new synthesis feature is very attractive, but is broken if the constraints are non-trivial (e.g. pa.Check.str_matches(complicated_regex)). Related discussion here.

Describe the solution you'd like
Ideally, we could use complex schemas for both validation and synthesis. For that to be possible, Pandera needs to generate data more efficiently, which means not using rejection sampling. Luckily, Hypothesis has solved the efficient synthesis problem, but Hypothesis strategies can't be used for validation.

It would be great if Pandera could parse a Hypothesis data_frames strategy into Schemas for validation:

my_schema = pa.infer_schema(hypothesis.extra.pandas.data_frames(...))

The Schema, then, could use hypothesis for efficient data generation.

Describe alternatives you've considered
I've tried disabling Hypothesis's health checks, but Hypothesis "literally ran out of random bytes to parse into your dataframe, and there's not really anything that [Hypothesis] can do about that".

Additional context
We were attracted to Pandera over Great Expectations due to the ability to create hypothesis strategies directly from schemas. This saves us the labor of maintaining separate validation schemas and Hypothesis strategies.

cosmicBboy · 2021-02-03T19:00:45Z

hey @crypdick, yes getting data synthesis right is pretty challenging and I'd like to smooth out the rough edges. Just so I get a better sense of the issue, can you provide an example schema that's causing the health check issues?

crypdick · 2021-02-03T19:26:20Z

@cosmicBboy sure! I had to obfuscate sensitive details but hopefully this gives you a rough idea.

import re

import pandera as pa

# simplified for security
IMG_URL_REGEX = re.compile(
    r"""
    ^s3://   # prefix
    [-a-z0-9]+/
    [A-Za-z0-9-]+/
    20[1-2][0-9]+/  # years 2010-2029
    [A-Za-z0-9-_,.]+/  # including _, commas, period
    v[0-9]/  # versions v0-v9
    2[4-5]/ 
    [0-9]+/
    [0-9]+  # file name
    [.]png$  # extension, escape dot
    """,
    re.X,
)


valid_ids = {str(i) for i in range(1, 1000)}   # placeholder for more complicated logic

is_valid_img_url = pa.Check.str_matches(IMG_URL_REGEX)
is_not_empty_series = pa.Check(lambda series_: len(series_) > 0, name="Series not empty")
is_valid_id = pa.Check.isin(valid_ids)

# class IDs joined by |
# this later gets filtered to ensure each element is a valid ID (invalid IDs throw Exceptions downstream)
numeric_str_delim_by_pipes = re.compile(
    r"""[0-9]{1,4}  # in reality, this is more complicated
        (\|[0-9]{1,4})*  # 0 or more IDs
        """,
    re.X,
)

grouped_by_s3uri_schema = pa.DataFrameSchema(
    columns={
        "img_url": pa.Column(
            pa.String,
            allow_duplicates=False,
            nullable=False,
            checks=[is_valid_img_url, is_not_empty_series],  # most restrictive check first
        ),
        "labels": pa.Column(
            pa.String,
            checks=[
                pa.Check.str_matches(numeric_str_delim_by_pipes),
                pa.Check(
                    lambda str_delim_pipes: all([id_ in valid_ids for id_ in str_delim_pipes.split("|")]),
                    element_wise=True,
                ),
            ],
        ),
    }
)

cosmicBboy · 2021-02-03T23:04:12Z

> It would be great if Pandera could parse a Hypothesis data_frames strategy into Schemas for validation:
>
> my_schema = pa.infer_schema(hypothesis.extra.pandas.data_frames(...))
>
> The Schema, then, could use hypothesis for efficient data generation.

This sounds complicated 😅 (I did sort of look into this when building out the data synthesis functionality but found it way over my head, I'm not sure if there's a straightforward way of parsing complex multi-part strategies)

Note that rejection sampling only occurs after the first check in the checks list, so inefficiencies in the first check would not be due to rejection sampling. Testing the code locally, it looks like the img_url checks trigger the data_too_large and large_base_example health checks, and the labels checks trigger filter_too_much.

I think it's up to pandera to make data generation as easy as possible and to provide users with an interface for customizing data synthesis with the hypothesis API when built-in checks/strategies aren't up to snuff.

Another potential solution: custom checks and strategies

The extensions API was designed to give users access to the hypothesis API for building custom data synthesis strategies. It lets you register custom checks into the Check namespace and associate each one with a data-generation strategy. You can effectively combine multiple checks into one, or implement any strategy you want with hypothesis (with a little bit of ceremony imposed by pandera).

You can do something like:

import re
from typing import Optional

import hypothesis
import hypothesis.strategies as st
import pandas as pd
import pandera as pa
import pandera.extensions as extensions


# simplified for security
IMG_URL_REGEX = re.compile(
    r"""
    ^s3://   # prefix
    [-a-z0-9]+/
    [A-Za-z0-9-]+/
    20[1-2][0-9]+/  # years 2010-2029
    [A-Za-z0-9-_,.]+/  # including _, commas, period
    v[0-9]/  # versions v0-v9
    2[4-5]/ 
    [0-9]+/
    [0-9]+  # file name
    [.]png$  # extension, escape dot
    """,
    re.X,
)

NUMERIC_STR_DELIM_REGEX = re.compile(
    r"""[0-9]{1,4}  # in reality, this is more complicated
        (\|[0-9]{1,4})*  # 0 or more IDs
        """,
    re.X,
)

VALID_IDS = [str(i) for i in range(1, 1000)]


# Define custom url strategy and check
def url_strategy(
    pandas_dtype: pa.PandasDtype,
    strategy: Optional[st.SearchStrategy] = None,
    *,
    url_regex,
):
    if strategy is None:
        # replace this with more efficient Hypothesis strategy if desired
        return st.from_regex(url_regex, fullmatch=True)
    raise pa.errors.BaseStrategyOnlyError(
        "'url_strategy' must be a base strategy"
    )


@extensions.register_check_method(
    statistics=["url_regex"],
    strategy=url_strategy,
    supported_types=pd.Series,
)
def valid_url(pandas_obj, *, url_regex):
    """Url regex check."""
    return pandas_obj.str.match(url_regex, na=False)


# Define custom label strategy and check
def labels_strategy(
    pandas_dtype: pa.PandasDtype,
    strategy: Optional[st.SearchStrategy] = None,
    *,
    valid_ids,
):
    if strategy is None:
        # replace this with more efficient Hypothesis strategy if desired
        return st.lists(
            st.sampled_from(valid_ids), unique=True, min_size=1  # use the check's statistic, not the global
        ).map("|".join)
    raise pa.errors.BaseStrategyOnlyError(
        "'labels_strategy' must be a base strategy"
    )


@extensions.register_check_method(
    statistics=["valid_ids"],
    strategy=labels_strategy,
    supported_types=pd.Series,
)
def valid_labels(pandas_obj, *, valid_ids):
    # combines the regex match check and the valid_ids check
    valid_ids = set(valid_ids)
    return pandas_obj.str.match(NUMERIC_STR_DELIM_REGEX) & (
        pandas_obj.map(lambda x: all(id_ in valid_ids for id_ in x.split("|")))
    )


schema = pa.DataFrameSchema(
    columns={
        "img_url": pa.Column(
            pa.String,
            allow_duplicates=False,
            nullable=False,
            checks=[
                pa.Check.valid_url(url_regex=IMG_URL_REGEX),
                pa.Check(
                    lambda series_: len(series_) > 0, name="Series not empty"
                ),
            ],
        ),
        "labels": pa.Column(
            pa.String,
            checks=pa.Check.valid_labels(valid_ids=VALID_IDS),
        ),
    }
)


@hypothesis.given(schema.strategy(size=5))
def test_schema(df):
    print(df)
    # test something

Phew! thanks for bearing with me. All that said, I did find some inefficiencies in the way the dataframe strategy was being constructed: #400 << this PR should actually make the img_url health checks go away and the code as you had it in your sample code snippet should work. Feel free to pull the master branch and try it out locally!

For the label generation code I'd recommend using the labels_strategy + valid_labels approach that I implemented above, since it generates data by sampling from the set of valid ids instead of from the regex, which guarantees that the |-delimited strings contain only valid ids.
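To make the difference concrete, here is a rough, stdlib-only sketch of the two approaches. It is not pandera's or Hypothesis's actual generation logic: `from_regex_shape` and `from_valid_ids` are hypothetical helpers, and the "regex-shaped" generator is a simple stand-in that emits 1 to 5 IDs of 1 to 4 random digits each. The point is only that candidates shaped by the regex often fail the follow-up ID check and have to be thrown away, while candidates built from the valid IDs never do.

```python
import random
import re

LABELS_REGEX = re.compile(r"[0-9]{1,4}(\|[0-9]{1,4})*")
VALID_IDS = {str(i) for i in range(1, 1000)}
VALID_ID_LIST = sorted(VALID_IDS)  # random.sample needs a sequence, not a set


def is_valid(labels: str) -> bool:
    """Element-wise check: every pipe-delimited ID must be a known ID."""
    return all(id_ in VALID_IDS for id_ in labels.split("|"))


def from_regex_shape(rng: random.Random) -> str:
    """Stand-in for regex-driven generation (assumption: 1-5 IDs, 1-4 digits each)."""
    ids = [
        "".join(rng.choice("0123456789") for _ in range(rng.randint(1, 4)))
        for _ in range(rng.randint(1, 5))
    ]
    return "|".join(ids)


def from_valid_ids(rng: random.Random) -> str:
    """Constructive generation: sample unique valid IDs, then join them."""
    return "|".join(rng.sample(VALID_ID_LIST, rng.randint(1, 5)))


rng = random.Random(0)

# Every regex-shaped candidate passes the first check, but many fail the second one.
candidates = [from_regex_shape(rng) for _ in range(2000)]
acceptance_rate = sum(is_valid(c) for c in candidates) / len(candidates)

# Constructed candidates pass both checks without any rejection.
constructed = [from_valid_ids(rng) for _ in range(2000)]

print(f"regex-shaped candidates surviving the ID check: {acceptance_rate:.0%}")
print("all constructed candidates valid:", all(is_valid(c) for c in constructed))
```

With a more complicated real-world regex and a sparser set of valid IDs, the share of surviving candidates would presumably be lower still, which is consistent with the filter_too_much health check firing.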

cosmicBboy · 2021-02-03T23:14:08Z

Other types of solutions that would be fairly heavy lifts, i.e. going into the guts of the pandera.strategies module:

supporting composite strategies as a way of chaining multiple checks together.
providing an interface for "multi-check resolution", which basically aggregates the statistics of multiple checks and constructs a single strategy from them.
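As a very rough illustration of what "multi-check resolution" could mean, the sketch below merges the statistics of several numeric bound checks into a single range and then draws from that range directly, so no value ever has to be rejected. Everything here is hypothetical: `BoundCheck`, `resolve_bounds`, and `draw` are made-up names for this sketch and are not part of pandera.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class BoundCheck:
    """Hypothetical stand-in for a check that carries a numeric statistic."""

    kind: str  # "ge" means value >= bound, "le" means value <= bound
    bound: int


def resolve_bounds(checks: List[BoundCheck]) -> Tuple[int, int]:
    """Aggregate the statistics of several checks into one (low, high) range."""
    lows = [c.bound for c in checks if c.kind == "ge"]
    highs = [c.bound for c in checks if c.kind == "le"]
    if not lows or not highs:
        raise ValueError("this sketch needs at least one lower and one upper bound")
    low, high = max(lows), min(highs)  # the tightest bounds win
    if low > high:
        raise ValueError(f"checks are unsatisfiable: {low} > {high}")
    return low, high


def draw(checks: List[BoundCheck], rng: random.Random) -> int:
    """Draw one value that satisfies every check, with no rejection step."""
    low, high = resolve_bounds(checks)
    return rng.randint(low, high)


checks = [BoundCheck("ge", 0), BoundCheck("ge", 10), BoundCheck("le", 99)]
rng = random.Random(0)
samples = [draw(checks, rng) for _ in range(100)]
print(resolve_bounds(checks), min(samples), max(samples))
```

A real implementation would have to handle many more check types (regexes, isin, custom callables), which is presumably where the heavy lifting comes in.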

crypdick · 2021-02-05T15:42:07Z

@cosmicBboy Tyvm for the detailed answer! When I copy-paste your code, I get a TypeError: url_strategy() missing 1 required keyword-only argument: 'check_regex'.

cosmicBboy · 2021-02-05T16:53:40Z

woops! just edited the code snippet, should work now

crypdick · 2021-02-05T17:10:16Z

@cosmicBboy you beat me to it :) ty for pointing me to the extensions docs, I hadn't noticed this feature before

Zac-HD · 2021-02-06T11:45:35Z

> > It would be great if Pandera could parse a Hypothesis data_frames strategy into Schemas for validation:
> > my_schema = pa.infer_schema(hypothesis.extra.pandas.data_frames(...))
> > The Schema, then, could use hypothesis for efficient data generation.
>
> This sounds complicated 😅 (I did sort of look into this when building out the data synthesis functionality but found it way over my head, I'm not sure if there's a straightforward way of parsing complex multi-part strategies)

Ah, yeah. As a Hypothesis core dev I wouldn't try this; complex strategies like data_frames() often have all the relevant structure inside @composite strategy functions and pulling it back out of the closures is totally infeasible.

A better approach, at least for element-wise checks, would be to get our "efficient filter rewriting" HypothesisWorks/hypothesis#2701 done - I think it would be reasonably simple to support numeric bounds and string regex patterns for an initial pass, and that would probably cover a lot of your use-cases.
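For readers unfamiliar with the idea, the sketch below shows by hand what such a rewrite amounts to for the two cases mentioned (numeric bounds and regex patterns): the same constraint expressed once as a filter and once in the strategy's own arguments. This is only an illustration written with public Hypothesis APIs; it says nothing about how the rewriting is or would be implemented internally, and the bounds and pattern are arbitrary example values.

```python
import re

import hypothesis.strategies as st
from hypothesis import given, settings

ID_PATTERN = re.compile(r"[0-9]{1,4}")

# Filter-based: draw arbitrary values, then discard those that fail the predicate.
filtered_ints = st.integers().filter(lambda x: 0 <= x <= 999)
filtered_ids = st.text(alphabet="0123456789", max_size=6).filter(ID_PATTERN.fullmatch)

# Constructive: the same constraints, encoded in the strategy arguments,
# so every drawn value is valid by construction.
bounded_ints = st.integers(min_value=0, max_value=999)
regex_ids = st.from_regex(ID_PATTERN, fullmatch=True)


@settings(max_examples=50)
@given(bounded_ints, regex_ids)
def check_constructive(number, id_):
    # Both properties hold without any filtering step.
    assert 0 <= number <= 999
    assert ID_PATTERN.fullmatch(id_)


check_constructive()
```

Automatic rewriting would let a library keep writing the filter form while getting the behavior of the constructive form.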

cosmicBboy · 2021-02-07T14:59:24Z

Thanks for the pointers @Zac-HD! Also, just wanted to say I'm a big fan of hypothesis, it's made a huge difference in the way I test DS/ML code (indeed, pandera strategies simply wraps hypothesis for convenience)

The current implementation of check strategy chaining in pandera does heavily use filter, really as a first pass to offer this functionality. I'll look into using composite and map more to make strategy chaining more efficient
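As an example of what building values with composite instead of filtering can look like, here is a hedged sketch of a strategy that assembles an img_url from its parts, so that each generated string matches the (simplified) IMG_URL_REGEX from earlier in the thread by construction. The part names (`bucket`, `project`, `dataset`, and so on), the size limits, and the numeric ranges are assumptions made for this sketch; they are not taken from the original schema.

```python
import re
import string

import hypothesis.strategies as st
from hypothesis import given, settings

# Same pattern as the simplified IMG_URL_REGEX in the thread, written on one line.
IMG_URL_REGEX = re.compile(
    r"^s3://[-a-z0-9]+/[A-Za-z0-9-]+/20[1-2][0-9]+/[A-Za-z0-9-_,.]+/"
    r"v[0-9]/2[4-5]/[0-9]+/[0-9]+[.]png$"
)

LOWER_ALNUM_DASH = string.ascii_lowercase + string.digits + "-"
ALNUM_DASH = string.ascii_letters + string.digits + "-"


def short_text(alphabet: str) -> st.SearchStrategy:
    """Non-empty, short strings over the given alphabet (size cap is arbitrary)."""
    return st.text(alphabet=alphabet, min_size=1, max_size=8)


@st.composite
def s3_img_urls(draw):
    """Assemble a URL piece by piece instead of filtering random strings."""
    bucket = draw(short_text(LOWER_ALNUM_DASH))
    project = draw(short_text(ALNUM_DASH))
    year = draw(st.integers(min_value=2010, max_value=2029))
    dataset = draw(short_text(ALNUM_DASH + "_,."))
    version = draw(st.integers(min_value=0, max_value=9))
    size = draw(st.sampled_from(["24", "25"]))
    group = draw(st.integers(min_value=0, max_value=9999))
    file_id = draw(st.integers(min_value=0, max_value=9999))
    return f"s3://{bucket}/{project}/{year}/{dataset}/v{version}/{size}/{group}/{file_id}.png"


@settings(max_examples=50)
@given(s3_img_urls())
def check_urls(url):
    # Every generated URL satisfies the validation regex.
    assert IMG_URL_REGEX.match(url)


check_urls()
```

A strategy like this could be returned from a custom url_strategy in place of st.from_regex, at the cost of keeping the regex and the strategy in sync by hand.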

crypdick · 2021-02-09T16:15:42Z

@cosmicBboy the solution you posted seems to break down when used with indexes. In particular, if you edit the "img_url": pa.Column(...) part to

    index=pa.Index(
        pa.String,
        allow_duplicates=False,
        nullable=False,
        name="img_url",
        checks=[
            is_valid_img_url,
            is_not_empty_series,
        ]
    ),

Then running my_schema.strategy().example() raises ValueError: Length mismatch: Expected axis has 7 elements, new values have 0 elements. However, it works fine when used in @given.

cosmicBboy · 2021-02-09T16:37:16Z

@crypdick do you have a stacktrace of the error? this is definitely a bug

crypdick · 2021-02-09T16:57:41Z

@cosmicBboy sure, here you go:

  File "/home/richard/.config/JetBrains/PyCharm2021.1/scratches/scratch_2.py", line 85, in <module>
    print(multihot_dataset_schema.strategy().example())
  File "/home/richard/src/DENDRA/venv/lib/python3.8/site-packages/hypothesis/strategies/_internal/strategies.py", line 319, in example
    example_generating_inner_function()
  File "/home/richard/src/DENDRA/venv/lib/python3.8/site-packages/hypothesis/strategies/_internal/strategies.py", line 307, in example_generating_inner_function
    @settings(
  File "/home/richard/src/DENDRA/venv/lib/python3.8/site-packages/hypothesis/core.py", line 1163, in wrapped_test
    raise the_error_hypothesis_found
  File "/home/richard/src/DENDRA/venv/lib/python3.8/site-packages/pandera/strategies.py", line 117, in set_pandas_index
    df_or_series.index = index
  File "/home/richard/src/DENDRA/venv/lib/python3.8/site-packages/pandas/core/generic.py", line 5154, in __setattr__
    return object.__setattr__(self, name, value)
  File "pandas/_libs/properties.pyx", line 66, in pandas._libs.properties.AxisProperty.__set__
  File "/home/richard/src/DENDRA/venv/lib/python3.8/site-packages/pandas/core/generic.py", line 564, in _set_axis
    self._mgr.set_axis(axis, labels)
  File "/home/richard/src/DENDRA/venv/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 226, in set_axis
    raise ValueError(
ValueError: Length mismatch: Expected axis has 6 elements, new values have 0 elements

cosmicBboy · 2021-02-12T01:44:22Z

hey @crypdick #410 should fix the issue! there was a bug leading to the length mismatch where the generated df and index would not have the same length.

Will be merging into master probably by tomorrow. In the meantime, you can circumvent this bug by providing a value for the size parameter: my_schema.strategy(size=<int>).example()

crypdick · 2021-02-12T13:24:05Z

Sounds great! Cc @daavidstein

cosmicBboy · 2021-02-13T18:54:19Z

@crypdick let me know if you come across any other problems relating to this issue! we can reopen it if needed

crypdick added the enhancement (New feature or request) label on Feb 3, 2021

cosmicBboy added this to the 0.6.2 release milestone Feb 5, 2021

cosmicBboy mentioned this issue Feb 12, 2021

bugfix: df data synthesis with size=None, fix CI #410

Merged

cosmicBboy closed this as completed in #410 Feb 12, 2021
