cache significantly slows down black due to pathlib #1950

Closed
asottile opened this issue Feb 1, 2021 · 5 comments · Fixed by #1953
Labels: C: performance (Black is too slow. Or too fast.), T: bug (Something isn't working)

Comments

@asottile (Contributor) commented on Feb 1, 2021

Describe the bug

To Reproduce

With a sufficiently large number of black runs, the cache accumulates pretty rapidly and causes a significant slowdown compared to uncached runs:

uncached

$ rm -rf t; time XDG_CACHE_HOME=$PWD/t black t.py
reformatted t.py
All done! ✨ 🍰 ✨
1 file reformatted.

real	0m0.295s
user	0m0.237s
sys	0m0.036s

cached

$ time black t.py
All done! ✨ 🍰 ✨
1 file left unchanged.

real	0m2.115s
user	0m1.999s
sys	0m0.089s

The cache in question isn't ~that large:

$ du -hs ~/.cache/black/20.8b1/*.pickle
3.8M	/home/asottile/.cache/black/20.8b1/cache.-.88.1.0.pickle
16K	/home/asottile/.cache/black/20.8b1/Grammar3.8.5.final.0.pickle
4.0K	/home/asottile/.cache/black/20.8b1/PatternGrammar3.8.5.final.0.pickle

All of the slowdown appears to be because pathlib is slow -- I've attached a pstats file and an SVG showing this:

profile graph

This was generated using the following procedure:

python -m cProfile -o out.pstats -m black t.py
gprof2dot -z __init__:6604:patched_main out.pstats | dot -Tsvg -o out.svg  # gprof2dot from `yelp-gprof2dot`, `dot` from graphviz

out.zip (contains svg and pstats file)
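
To make the pathlib overhead concrete, here is a small stand-alone micro-benchmark (a sketch of my own, not part of Black; the paths and entry count are made up) that round-trips a cache-shaped dict keyed by Path objects versus the same dict keyed by plain strings:

import pickle
import time
from pathlib import Path

N = 200_000
# Build a cache-shaped dict: resolved path -> (mtime, size), like Black's pickle.
path_cache = {
    Path(f"/tmp/project/pkg{i}/module{i}.py"): (1611104503.5 + i, 0)
    for i in range(N)
}
str_cache = {str(k): v for k, v in path_cache.items()}

for name, cache in (("Path keys", path_cache), ("str keys", str_cache)):
    blob = pickle.dumps(cache, protocol=4)
    start = time.perf_counter()
    pickle.loads(blob)  # loading is what every cached black run pays for
    elapsed = time.perf_counter() - start
    print(f"{name}: {len(blob) / 1e6:.1f} MB pickled, loads() took {elapsed:.3f}s")

The absolute numbers will vary by machine, but the comparison is the point: unpickling has to reconstruct a PosixPath object per key, whereas string keys deserialize almost for free.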

Expected behavior

The cache shouldn't make execution significantly slower.

Environment (please complete the following information):

  • Version: 20.8b1
  • OS and Python version: Python 3.8.5 on Ubuntu 20.04

Does this bug also happen on master?

Yes -- if I build up a new cache using the current default branch, it reproduces as well.

Additional context

Removing pathlib from the Cache makes the cache serialization cost ~essentially zero -- will follow up with a demonstration patch

asottile added the T: bug label on Feb 1, 2021
@asottile (Contributor, Author) commented on Feb 1, 2021

Here is a patch removing pathlib from the Cache:

$ git diff --staged | cat
diff --git a/src/black/__init__.py b/src/black/__init__.py
index 9034bf6..e5736f0 100644
--- a/src/black/__init__.py
+++ b/src/black/__init__.py
@@ -93,7 +93,7 @@ Transformer = Callable[["Line", Collection["Feature"]], Iterator["Line"]]
 Timestamp = float
 FileSize = int
 CacheInfo = Tuple[Timestamp, FileSize]
-Cache = Dict[Path, CacheInfo]
+Cache = Dict[str, CacheInfo]
 out = partial(click.secho, bold=True, err=True)
 err = partial(click.secho, fg="red", err=True)
 
@@ -724,7 +724,8 @@ def reformat_one(
             if write_back not in (WriteBack.DIFF, WriteBack.COLOR_DIFF):
                 cache = read_cache(mode)
                 res_src = src.resolve()
-                if res_src in cache and cache[res_src] == get_cache_info(res_src):
+                res_src_s = str(res_src)
+                if res_src_s in cache and cache[res_src_s] == get_cache_info(res_src):
                     changed = Changed.CACHED
             if changed is not Changed.CACHED and format_file_in_place(
                 src, fast=fast, write_back=write_back, mode=mode
@@ -6781,8 +6782,8 @@ def filter_cached(cache: Cache, sources: Iterable[Path]) -> Tuple[Set[Path], Set
     """
     todo, done = set(), set()
     for src in sources:
-        src = src.resolve()
-        if cache.get(src) != get_cache_info(src):
+        res_src = src.resolve()
+        if cache.get(str(res_src)) != get_cache_info(res_src):
             todo.add(src)
         else:
             done.add(src)
@@ -6794,7 +6795,7 @@ def write_cache(cache: Cache, sources: Iterable[Path], mode: Mode) -> None:
     cache_file = get_cache_file(mode)
     try:
         CACHE_DIR.mkdir(parents=True, exist_ok=True)
-        new_cache = {**cache, **{src.resolve(): get_cache_info(src) for src in sources}}
+        new_cache = {**cache, **{str(src.resolve()): get_cache_info(src) for src in sources}}
         with tempfile.NamedTemporaryFile(dir=str(cache_file.parent), delete=False) as f:
             pickle.dump(new_cache, f, protocol=4)
         os.replace(f.name, cache_file)

And here is the associated profile after this change:

out2

out2.zip

@JelleZijlstra (Collaborator) commented:

I see, so the issue is that we pickle Path objects instead of strings in the pickled cache? That does seem like useless overhead; I'd merge a PR making that change.

@chriskuehl commented:

++ this would be a big help, thanks @asottile! We recently discovered that our black runs were very slow under Jenkins due to a ~200 MB cache pickle:

>>> contents = pickle.load(open("19.10b0/cache.2,6.131.1.0.pickle", "rb"))

>>> type(contents)
<class 'dict'>

>>> len(contents)
1777951

>>> pprint(list(contents.items())[:5])
[(PosixPath('/ephemeral/jenkins/workspace/xxxxxxxx-52804/aaaaaaaa.py'),
  (1611104503.533281, 0)),
 (PosixPath('/ephemeral/jenkins/workspace/xxxxxxxx-52804/bbbbbbbb.py'),
  (1611104503.537281, 0)),
 (PosixPath('/ephemeral/jenkins/workspace/xxxxxxxx-52804/cccccccc.py'),
  (1611104503.4972813, 0)),
 (PosixPath('/ephemeral/jenkins/workspace/xxxxxxxx-52804/dddddddd.py'),
  (1611104503.557281, 0)),
 (PosixPath('/ephemeral/jenkins/workspace/xxxxxxxx-52804/eeeeeeee.py'),
  (1611104503.5212812, 576))]

Running black on a single file was taking 30 seconds until we cleared this cache (70+ minutes total when run using pre-commit on a large repo, compared to ~2 minutes after the cache was removed).

@JelleZijlstra (Collaborator) commented:

@chriskuehl some related thoughts: the Black cache on my company's CI instances is <1 MB on a repo with millions of lines, so I was surprised to see a 200 MB cache. I guess it's because the xxxxxxxx is some uuid-like string, so every Jenkins run generates new cache entries for you and your cache essentially grows forever. Storing strings instead of Path objects will therefore help, but your cache will still keep growing linearly. Our CI setup doesn't have that issue because we always use the same paths, which is nice because it means most Black runs in our CI hit the cache.

You could fix the unbounded growth on your side by configuring Black to use a fresh cache for every Jenkins run, or just by periodically deleting the cache. I'm trying to think of something that could be done in Black itself, but I'm not sure there is much we can do.
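
As a sketch of the first option (a hypothetical wrapper script, not something Black or Jenkins provides out of the box), a CI job could point XDG_CACHE_HOME at a throwaway temporary directory so per-run paths never accumulate across builds:

import os
import subprocess
import tempfile

# Run Black with a throwaway cache directory; the cache is discarded together
# with the temporary directory when the run finishes.
with tempfile.TemporaryDirectory() as cache_home:
    env = {**os.environ, "XDG_CACHE_HOME": cache_home}
    subprocess.run(["black", "."], env=env, check=False)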

@chriskuehl commented:

@JelleZijlstra you're right, it's because our workspace names are unique and generally only used once. We're running black without the cache now on CI to work around this (setting XDG_CACHE_HOME to a location inside the temporary workspace), but it was sort of counter-intuitive for us that a populated cache leads to slower runs rather than faster ones.

From our perspective, it would be nice if black cleaned up its cache once it reached a size where loading the cache takes 30+ seconds, but I'm not sure how easy that is to implement or if that will affect other use-cases.
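
A rough sketch of what such cleanup could look like, assuming the 20.8b1 cache layout (a pickled dict mapping resolved source paths to (mtime, size) tuples) -- prune_cache here is a hypothetical helper, not an existing Black API -- is to drop entries whose files no longer exist, which bounds growth when workspace paths are single-use:

import os
import pickle
from pathlib import Path

def prune_cache(cache_file: Path) -> None:
    """Drop cache entries whose source files no longer exist on disk."""
    with cache_file.open("rb") as f:
        cache = pickle.load(f)
    # Keys are Path objects in 20.8b1 (plain strings after the proposed patch);
    # str() handles both.
    pruned = {key: info for key, info in cache.items() if os.path.exists(str(key))}
    with cache_file.open("wb") as f:
        pickle.dump(pruned, f, protocol=4)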
