Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

black performance scales non-linearly with the number of files to process #1951

Open
asottile opened this issue Feb 1, 2021 · 5 comments
Open
Labels
C: performance Black is too slow. Or too fast. T: bug Something isn't working

Comments

@asottile
Copy link
Contributor

asottile commented Feb 1, 2021

Describe the bug A clear and concise description of what the bug is.

To Reproduce Steps to reproduce the behavior:

I suspect this is related to #1950 -- or at least exacerbated by it

setup

rm -rf src
mkdir src
touch src/test_{1..10000}.py

1000 files

$ rm -rf t; time XDG_CACHE_HOME=$PWD/t black src/test_{1..1000}.py
All done! ✨ 🍰 ✨
1000 files left unchanged.

real	0m1.504s
user	0m1.467s
sys	0m0.195s

10000 files

$ rm -rf t; time XDG_CACHE_HOME=$PWD/t black src/test_{1..10000}.py
All done! ✨ 🍰 ✨
10000 files left unchanged.

real	0m55.354s
user	0m54.547s
sys	0m2.550s

I expect the second command to be approximately 10x slower than the first, however it's almost 30-40x worse

A guess is black is writing the cache for every single file and should instead batch it per run

Expected behavior A clear and concise description of what you expected to happen.

Environment (please complete the following information):

  • Version: [e.g. master] 71117e7
  • OS and Python version: [e.g. Linux/Python 3.7.4rc1] 3.8.5 on ubuntu 20.04

Does this bug also happen on master? To answer this, you have two options:

yes

Additional context Add any other context about the problem here.

@asottile asottile added the T: bug Something isn't working label Feb 1, 2021
@asottile asottile changed the title black performance scales non-lineraly with the number of files to process black performance scales non-linearly with the number of files to process Feb 1, 2021
@JelleZijlstra
Copy link
Collaborator

Good find! From looking at the code it appears we shouldn't write to the cache multiple times per run, so there must be some other superlinear behavior going on. Does it reproduce with your patch for #1950?

@asottile
Copy link
Contributor Author

asottile commented Feb 1, 2021

oh yeah I should have checked, yeah it does so my guess at the cache being at fault is probably incorrect

@ichard26 ichard26 added the C: performance Black is too slow. Or too fast. label Feb 3, 2021
@zsol
Copy link
Collaborator

zsol commented May 25, 2021

I had some spare time and after some profiling it looks like ensure_future is being called 10x more for 3x the amount of input. The vast majority of the calls are coming from the implementation of asyncio.wait because we're spamming it in a tight loop from https://github.com/psf/black/blob/main/src/black/__init__.py#L642

@zsol
Copy link
Collaborator

zsol commented May 25, 2021

In fact, adding an await asyncio.sleep(0.1) at the end of that loop fixes this issue and makes performance scale linearly with number of input files

@zsol
Copy link
Collaborator

zsol commented May 25, 2021

I wonder if we should just use asyncio.as_completed instead of the hand-rolled batching/waiting logic in https://github.com/psf/black/blob/main/src/black/__init__.py#L626-L659. Or maybe that would result in too much awaiting?

If so, we could bring out the big guns and use https://aioitertools.omnilib.dev/en/latest/api.html#aioitertools.asyncio.as_completed and https://aioitertools.omnilib.dev/en/latest/api.html#aioitertools.more_itertools.chunked

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C: performance Black is too slow. Or too fast. T: bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants