Hypothesis's pandas extension is slow compared to pydantic #3701

Closed
tmcclintock opened this issue Jul 17, 2023 · 3 comments
Labels
performance go faster! use less memory!

Comments

@tmcclintock

This is a head-to-head comparison between pydantic and the pandas extension in hypothesis, where we are generating a single-row dataframe with hypothesis (since pydantic only generates one model instance at a time).

"""Test hypothesis with pydantic basemodel."""

from hypothesis import given, strategies
from hypothesis.extra import pandas as hyp_pandas

from pydantic import BaseModel


class PydanticExample(BaseModel):

    feature1: int
    feature2: float
    feature3: str



df_strat = hyp_pandas.data_frames(
    columns=[
        hyp_pandas.column("feature1", dtype=int),
        hyp_pandas.column("feature2", dtype=float),
        hyp_pandas.column("feature3", dtype=str),
    ],
    index=hyp_pandas.range_indexes(min_size=1, max_size=1)
)


@given(strategies.builds(PydanticExample))
def test_pydantic_example(example):
    PydanticExample.validate(example)


@given(df_strat)
def test_hypothesis_pandas_example(example):
    assert "feature1" in example
    assert len(example) == 1

I ran it with: pytest test_speed.py --durations=0 -v

Takeaway: the hypothesis pandas strategy is 9-10x slower than the pydantic one, as seen in this output:

============================================================ slowest durations =============================================================
0.78s call     test_speed.py::test_hypothesis_pandas_example
0.08s call     test_speed.py::test_pydantic_example
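
To put the two durations on a per-example footing, here is a rough back-of-the-envelope sketch. It assumes each test ran Hypothesis's default of 100 examples, which the output does not confirm; the variable names are illustrative only.

```python
# Durations copied from the pytest output above.
pandas_seconds = 0.78
pydantic_seconds = 0.08

# Assumption: each test ran Hypothesis's default of 100 examples.
assumed_examples = 100

ratio = pandas_seconds / pydantic_seconds
pandas_ms_per_example = pandas_seconds / assumed_examples * 1000
pydantic_ms_per_example = pydantic_seconds / assumed_examples * 1000

print(f"ratio: {ratio:.2f}x")
print(f"pandas strategy: {pandas_ms_per_example:.1f} ms/example")
print(f"pydantic builds: {pydantic_ms_per_example:.1f} ms/example")
```

Under that assumption the gap is a few milliseconds per generated dataframe, so it matters mostly for large test suites or high example counts.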

For background, this was originally brought up in an issue in the pandera project. pandera (basically) has a thin wrapper around the data_frames() strategy from hypothesis.extra.pandas.

My questions:

  1. Do maintainers have a theory on what the cause of this speed discrepancy might be?
  2. Is there appetite for optimizing this logic to make it faster? I'm happy to help where I can btw.

Thank you!

@Zac-HD
Member

Zac-HD commented Jul 17, 2023

  1. There's just way way more logic involved in generating a dataframe, due to the possibility of having many rows, interacting constraints, etc. By contrast the pydantic model case is three built-in types, one function call, and no dtype conversions or validation or anything.
  2. Yes, I'd be absolutely delighted to accept PRs for performance improvements - only caveat is if it makes future maintenance substantially harder.
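
For anyone looking to pick up a performance PR, one possible starting point is to profile the slow test and see where the time goes. The following is a minimal sketch using only the standard library; `profile_call` and `stand_in_workload` are hypothetical names, and the workload is a placeholder so the snippet runs without hypothesis installed.

```python
import cProfile
import io
import pstats


def profile_call(func, top=10):
    """Run func() under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func()
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


def stand_in_workload():
    # Hypothetical placeholder; swap in the real test function to profile it.
    return sum(i * i for i in range(50_000))


if __name__ == "__main__":
    print(profile_call(stand_in_workload))
```

To apply it to the reproduction above, pass `test_hypothesis_pandas_example` in place of `stand_in_workload`; a `@given`-decorated test can be called with no arguments. Sorting by cumulative time should show whether the cost sits in strategy drawing, dtype conversion, or dataframe construction, though the breakdown will depend on the installed versions.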

@Zac-HD Zac-HD added the performance go faster! use less memory! label Jul 17, 2023
@Zac-HD
Member

Zac-HD commented Jul 17, 2023

See also unionai-oss/pandera#404

@Zac-HD
Member

Zac-HD commented Nov 5, 2023

Closing this because there's no planned action on our end.

@Zac-HD Zac-HD closed this as completed Nov 5, 2023