Track mypy performance changes automatically #14187

Closed
JukkaL opened this issue Nov 25, 2022 · 15 comments

@JukkaL
Collaborator

JukkaL commented Nov 25, 2022

Currently it's difficult to find the root cause of a performance regression. We could automatically profile each mypy commit and generate a report with historical performance information, similar to how we track the performance of mypyc (e.g. https://github.com/mypyc/mypyc-benchmark-results/blob/master/reports/summary-main.md).

Add another benchmark to the mypyc benchmarks runner that type checks a fixed version of mypy using the current compiled mypy.

The benchmark will live in https://github.com/mypyc/mypyc-benchmarks.
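For illustration, a rough sketch of what such a self-check benchmark could look like; the vendored path, flags, and repeat count are assumptions rather than the actual mypyc-benchmarks setup:

```python
import subprocess
import time

# Hypothetical path to the vendored, fixed mypy checkout used as the workload.
VENDORED_MYPY = "vendored/mypy/mypy"


def run_self_check() -> float:
    """Type check the vendored mypy sources with the currently installed (compiled) mypy."""
    t0 = time.perf_counter()
    subprocess.run(
        ["mypy", "--no-incremental", VENDORED_MYPY],
        check=False,            # the run may report type errors; only the timing matters here
        capture_output=True,
    )
    return time.perf_counter() - t0


if __name__ == "__main__":
    # Repeat a few times and report the minimum, which is less sensitive to noise.
    times = [run_self_check() for _ in range(5)]
    print(f"best of {len(times)} runs: {min(times):.2f}s")
```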

@JukkaL JukkaL added topic-developer Issues relevant to mypy developers performance labels Nov 25, 2022
@JukkaL JukkaL self-assigned this Nov 25, 2022
@JukkaL
Collaborator Author

JukkaL commented Nov 25, 2022

I've started working on this. Some details:

  • We'll have a vendored copy of mypy in the mypyc-benchmarks repo so that the workload is always the same. This also makes it easy to make minor tweaks to e.g. fix errors when compiled with a newer mypy.
  • We'll need to pin PEP 561 dependencies such as pytest for repeatable results.
  • It would be nice to generate historical data to find older regressions that we might still want to fix, but it's not clear whether we can properly type check a recent mypy with very old mypy versions.
  • When updating to a newer Python or Ubuntu version there will be a discontinuity. We can adjust the old results with a relative performance factor to preserve a continuous series of measurements (see the sketch after this list) -- this is similar to what we've done before with mypyc benchmark results. Re-generating all historical benchmark results is too time consuming.
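A minimal sketch of how such a relative performance factor could be applied, with made-up numbers:

```python
def rescale_old_results(old_times: list[float], factor: float) -> list[float]:
    """Scale measurements from the old environment so they line up with the new one.

    `factor` is new_time / old_time for the same commit measured in both
    environments (e.g. the last commit benchmarked before the upgrade).
    """
    return [t * factor for t in old_times]


# Hypothetical numbers: the same commit took 42.0s before the upgrade and
# 39.9s after it, so old results get multiplied by 39.9 / 42.0 = 0.95.
adjusted = rescale_old_results([41.5, 42.0, 43.1], 39.9 / 42.0)
```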

@hauntsaninja
Collaborator

Related hauntsaninja/mypy_primer#30

@JukkaL
Collaborator Author

JukkaL commented Nov 28, 2022

Additional context: The mypyc benchmarks are run on a dedicated host that has been tweaked for precise performance measurements (e.g. no turbo boost, automatic updates disabled). I hope that we will be able to detect performance regressions below 0.5% (stretch goal: detect a 0.1% regression).

@JukkaL
Collaborator Author

JukkaL commented Nov 30, 2022

Based on initial results, detecting regressions that are under 0.5% seems hard. A 1.0% regression is probably clearly visible, and a 0.5% regression might be visible. Perhaps we can improve precision by tweaking the measurement logic (e.g. by dropping outliers), but it's not clear yet. I'll perform a few additional experiments.
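As an example of dropping outliers, one simple option is a trimmed mean; this is just a sketch, not necessarily what the benchmark runner will end up using:

```python
import statistics


def robust_mean(samples: list[float], trim: int = 1) -> float:
    """Drop the `trim` fastest and slowest samples, then average the rest.

    Occasional spikes from background activity inflate a plain mean;
    trimming (or falling back to the median) suppresses them.
    """
    if len(samples) <= 2 * trim:
        return statistics.median(samples)
    trimmed = sorted(samples)[trim:len(samples) - trim]
    return statistics.fmean(trimmed)
```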

@hauntsaninja
Collaborator

@JukkaL I think it would be helpful to type check repositories other than mypy. Here are the 20 slowest repositories to type check from mypy_primer (which also contains the exact commands and deps used).

https://github.com/pandas-dev/pandas
https://github.com/sympy/sympy
https://github.com/home-assistant/core
https://github.com/graphql-python/graphql-core
https://github.com/MaterializeInc/materialize
https://github.com/canonical/cloud-init
https://github.com/arviz-devs/arviz
https://github.com/pandas-dev/pandas-stubs
https://github.com/apache/spark
https://github.com/python/mypy
https://github.com/common-workflow-language/cwltool
https://github.com/Rapptz/discord.py
https://github.com/trailofbits/manticore
https://github.com/streamlit/streamlit
https://github.com/astropenguin/xarray-dataclasses
https://github.com/google/jax
https://github.com/python-poetry/poetry
https://github.com/urllib3/urllib3
https://github.com/pypa/pip
https://github.com/psycopg/psycopg

If I had to pick five from these, I'd maybe do:

https://github.com/pandas-dev/pandas
https://github.com/sympy/sympy
https://github.com/home-assistant/core
https://github.com/Rapptz/discord.py
https://github.com/FasterSpeeding/Tanjun

@A5rocks
Contributor

A5rocks commented Nov 30, 2022

Based on initial results, detecting regressions that are under 0.5% seems hard. A 1.0% regression is probably clearly visible, and a 0.5% regression might be visible. Perhaps we can improve precision by tweaking the measurement logic (e.g. by dropping outliers), but it's not clear yet. I'll perform a few additional experiments.

I notice that wall time is used. Maybe CPU instruction count would have less noise? I know it's what stuff like Rust's perf site uses by default (https://perf.rust-lang.org).

See also stuff like rust-lang/rust#104646 (comment)

I'm not sure it would be as useful, though, since (as I understand it) some amount of mypy time is spent in IO.
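For reference, a rough sketch of how an instruction count could be collected with Linux perf from Python; the CSV parsing assumes perf stat's -x output format, and reading the counters requires a permissive enough kernel.perf_event_paranoid setting:

```python
import subprocess


def instruction_count(cmd: list[str]) -> int:
    """Run `cmd` under Linux `perf stat` and return the retired instruction count."""
    result = subprocess.run(
        ["perf", "stat", "-x", ",", "-e", "instructions"] + cmd,
        capture_output=True,
        text=True,
    )
    # With -x, perf writes CSV lines like "1234567890,,instructions,..." to stderr.
    for line in result.stderr.splitlines():
        fields = line.split(",")
        if len(fields) > 2 and fields[2].startswith("instructions"):
            return int(fields[0])
    raise RuntimeError("no instruction count found in perf output")


# e.g. instruction_count(["mypy", "--no-incremental", "some_project/"])
```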

@JukkaL
Collaborator Author

JukkaL commented Dec 7, 2022

I'm not sure it would be as useful, though, since (as I understand it) some amount of mypy time is spent in IO.

Mypy does some IO and it would be good to track regressions related to IO as well, so this doesn't sound ideal for mypy.

I think it would be helpful to type check repositories other than mypy.

Agreed, this would be nice! I'm focusing on mypy at first, but we could add some extra repositories once the whole thing works reliably for mypy.

@JukkaL
Collaborator Author

JukkaL commented Dec 7, 2022

I tried a few different things to get more precise results, but I'm still stuck at a noise floor of about 1%. That's still enough to detect major performance regressions. I'm now collecting historical data, over the most recent 1000 mypy commits or so. It will take about two weeks of calendar time to complete (if everything goes smoothly).
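A sketch of how the commits to benchmark could be enumerated (assuming mypy's master branch; the checkout, compilation, and measurement steps are omitted):

```python
import subprocess


def commits_to_benchmark(repo: str, n: int = 1000) -> list[str]:
    """Return the most recent `n` commit hashes on master, oldest first."""
    out = subprocess.run(
        ["git", "-C", repo, "rev-list", f"--max-count={n}", "--reverse", "master"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.split()


# For each commit one would check it out, build a compiled mypy, run the
# self-check benchmark, and record the timing keyed by commit hash and date.
```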

I will also experiment with collecting performance data about interpreted mypy runs, in the hope of getting more accurate timings.

@JukkaL
Collaborator Author

JukkaL commented Dec 19, 2022

I now have collected performance data for about one year of commits, though the most recent commits are still missing.

Here's the data: https://github.com/mypyc/mypyc-benchmark-results/blob/master/reports/benchmarks/mypy_self_check.md

Performance has regressed by about 50% over a year, so it's time to focus on fixing the regressions! I hope that in the future we can keep the overall level of slowdown under 10% in any year (and preferably under 5%) so that perceived mypy performance stays at least roughly the same over time, when accounting for hardware improvements. Hardware has improved about 5-10% per year for single-threaded workloads in the last 10 years, on average.

The noise floor is about 1.5%. Perhaps with some smoothing we can bring it closer to 1.0%. Anyway, it's already good enough to find major regressions and improvements.
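As an example of the kind of smoothing that could help, here is a centered rolling median over per-commit timings (a sketch, not what the reports currently do):

```python
import statistics


def rolling_median(values: list[float], window: int = 5) -> list[float]:
    """Smooth a per-commit timing series with a centered rolling median.

    A median window tolerates the occasional noisy run better than a mean,
    which helps small regressions stand out from the noise floor.
    """
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(statistics.median(values[lo:hi]))
    return smoothed
```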

I started looking at reducing the impact of 48c4a47 some time ago. This was a previously known regression that got me interested in finding other regressions like that.

@JukkaL
Collaborator Author

JukkaL commented Dec 19, 2022

Here are some optimizations based on the performance data: #14316

@A5rocks
Contributor

A5rocks commented Dec 20, 2022

These changes are very nice! Would it be possible to get a preview of whether a PR introduces a performance regression before merging it (a la mypy_primer)? I'm not sure whether that's planned, but IMO it would be useful.

@JukkaL
Collaborator Author

JukkaL commented Dec 20, 2022

Would it be possible to get a preview of whether a PR introduces a performance regression before merging it (a la mypy_primer)?

I agree that this would be useful. I don't have immediate plans to implement this, since it would likely require an additional dedicated runner host to avoid noisy results on shared infra.

@A5rocks
Contributor

A5rocks commented Dec 22, 2022

I was looking at some other topics and found scala/scala-dev#338

Presumably most of the viable things listed there have already been tried (given that e.g. turbo boost has already been disabled), but it's incredibly comprehensive and it would be great to catch even smaller regressions!

@JukkaL
Collaborator Author

JukkaL commented Dec 28, 2022

I'm closing this now, as we have basic tracking of performance changes. I'll create another issue to track further improvements. In particular, it would be great to get the noise floor down to below 0.5%.
