Hypothesis's pandas extension is slow compared to pydantic #3701

tmcclintock · 2023-07-17T15:39:30Z

This is a head-to-head comparison between generating a pydantic model with strategies.builds() and generating a dataframe with the pandas extension in hypothesis. The dataframe is limited to a single row, since builds() only produces one model instance at a time.

"""Test hypothesis with pydantic basemodel."""

from hypothesis import given, strategies
from hypothesis.extra import pandas as hyp_pandas

from pydantic import BaseModel


class PydanticExample(BaseModel):

    feature1: int
    feature2: float
    feature3: str



df_strat = hyp_pandas.data_frames(
    columns=[
        hyp_pandas.column("feature1", dtype=int),
        hyp_pandas.column("feature2", dtype=float),
        hyp_pandas.column("feature3", dtype=str),
    ],
    index=hyp_pandas.range_indexes(min_size=1, max_size=1)
)


@given(strategies.builds(PydanticExample))
def test_pydantic_example(example):
    PydanticExample.validate(example)


@given(df_strat)
def test_hypothesis_pandas_example(example):
    assert "feature1" in example
    assert len(example) == 1

I ran it with: pytest test_speed.py --durations=0 -v

Takeaway: the hypothesis pandas strategy is 9-10x slower than the pydantic one (0.78s vs 0.08s), as seen in this output:

============================================================ slowest durations =============================================================
0.78s call     test_speed.py::test_hypothesis_pandas_example
0.08s call     test_speed.py::test_pydantic_example

For background, this was originally brought up in an issue in the pandera project. pandera (basically) is a thin wrapper around the data_frames() strategy from hypothesis.extra.pandas.

My questions:

1. Do maintainers have a theory on what the cause of this speed discrepancy might be?
2. Is there appetite for optimizing this logic to make it faster? I'm happy to help where I can btw.

Thank you!


Zac-HD · 2023-07-17T16:04:16Z

1. There's just way, way more logic involved in generating a dataframe, due to the possibility of having many rows, interacting constraints, etc. By contrast, the pydantic model case is three built-in types, one function call, and no dtype conversions or validation or anything.
2. Yes, I'd be absolutely delighted to accept PRs for performance improvements. The only caveat is if a change makes future maintenance substantially harder.
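
For anyone wanting to see where the dataframe generation time actually goes before attempting a PR, one option is to run the test under a profiler. The following is a minimal sketch using only the standard library; `profile_call` is a hypothetical helper written for illustration, not part of Hypothesis or pandera.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return (result, text report).

    Hypothetical helper for illustration only. The report lists the
    `top` entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    # runcall profiles a single call and returns whatever func returns.
    result = profiler.runcall(func, *args, **kwargs)

    # Render the statistics into a string instead of printing to stdout.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()
```

Assuming the tests from the original post are importable, something like `_, report = profile_call(test_hypothesis_pandas_example)` followed by `print(report)` should work, since a test decorated with `@given` can be called with no arguments. Comparing that report with the one for `test_pydantic_example` would show which calls dominate in the dataframe case.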

Zac-HD · 2023-07-17T16:06:01Z

Zac-HD commented Nov 5, 2023
