Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delay evaluation of re.compile to improve startup time. #6690

Open
chrahunt opened this issue Jul 9, 2019 · 11 comments
Open

Delay evaluation of re.compile to improve startup time. #6690

chrahunt opened this issue Jul 9, 2019 · 11 comments
Labels
state: needs discussion This needs some more discussion type: enhancement Improvements to functionality type: performance Commands take too long to run

Comments

@chrahunt
Copy link
Member

chrahunt commented Jul 9, 2019

What's the problem this feature will solve?

Speed up pip startup time, as motivated by #4768.

Describe the solution you'd like

Globally evaluate calls to re.compile lazily, only executing when needed.

Alternative Solutions

  1. Manually delaying evaluation by refactoring code that uses re.compile at global or class scope to do so as-needed.
  2. Create and use an explicit function for lazy-evaluated regex in our code.

A quick grep in pip/_internal shows 21 calls in our code, out of the 220 total mentioned below, so neither of these are likely to have a large impact.

Additional context

Profiling with cProfile indicates that a non-negligible part of our startup time is due to calls to re.compile. Evaluating it lazily shows some improvement. Just running pip, we see:

Metric Before After
all function calls 295461 216540
total time 0.224 s 0.182 s
calls to re.compile 220 32
time in re.compile 0.064 s 0.024 s

The times will always be machine/hardware-dependent, but it seems worth evaluating as an option just from the reduction in number of function calls.

To reproduce, run

test.sh
#!/bin/sh

set -x
d="$(mktemp -d)"
cd "$d"
python -m cProfile -o pip.cprof -m pip >/dev/null
python -c 'from pstats import Stats; Stats("pip.cprof").print_stats("re.py.*\(compile\)")'
# just for testing without requiring file patching
cat <<EOF > sitecustomize.py
import re


class Proxy:
    def __init__(self, source):
        self.__source = source
        self.__created = False

    def __getattribute__(self, name):
        if name.startswith('_Proxy_'):
            return object.__getattribute__(self, name)
        if not self.__created:
            self.__created = True
            self.__object = self.__source()

        return getattr(self.__object, name)


old_compile = re.compile

def re_compile(*args, **kwargs):
    return Proxy(lambda: old_compile(*args, **kwargs))

re.compile = re_compile
EOF
PYTHONPATH=$PWD python -m cProfile -o pip.cprof -m pip >/dev/null
python -c 'from pstats import Stats; Stats("pip.cprof").print_stats("re.py.*\(compile\)")'
Example output
+ mktemp -d
+ d=/tmp/user/1000/tmp.KtQjQEhJvT
+ cd /tmp/user/1000/tmp.KtQjQEhJvT
+ python -m cProfile -o pip.cprof -m pip
+ python -c from pstats import Stats; Stats("pip.cprof").strip_dirs().print_stats("re.py.*\(compile\)")
Mon Jul  8 21:21:36 2019    pip.cprof

         295461 function calls (285279 primitive calls) in 0.224 seconds

   Random listing order was used
   List reduced from 2370 to 1 due to restriction <'re.py.*\\(compile\\)'>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      220    0.000    0.000    0.064    0.000 re.py:232(compile)


+ cat
+ PYTHONPATH=/tmp/user/1000/tmp.KtQjQEhJvT python -m cProfile -o pip.cprof -m pip
+ python -c from pstats import Stats; Stats("pip.cprof").strip_dirs().print_stats("re.py.*\(compile\)")
Mon Jul  8 21:21:36 2019    pip.cprof

         216540 function calls (208328 primitive calls) in 0.182 seconds

   Random listing order was used
   List reduced from 2367 to 1 due to restriction <'re.py.*\\(compile\\)'>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       32    0.000    0.000    0.024    0.001 re.py:232(compile)

The important parts from the output show the numbers I included in the table above.

Run 1 (no change):

         295461 function calls (285279 primitive calls) in 0.224 seconds
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      220    0.000    0.000    0.064    0.000 re.py:232(compile)

Run 2 (lazy re.compile):

         216540 function calls (208328 primitive calls) in 0.182 seconds
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       32    0.000    0.000    0.024    0.001 re.py:232(compile)
@triage-new-issues triage-new-issues bot added the S: needs triage Issues/PRs that need to be triaged label Jul 9, 2019
@cjerdonek
Copy link
Member

cjerdonek commented Jul 9, 2019

Do you have any idea how many of these compiled expressions are used during startup, and where they are defined? It seems like you could figure this out using your proxy class.

[Oh, I see now from the table it’s 32, though I still wonder which modules the unused ones are defined in, and what the used ones are needed for during startup.]

@chrahunt
Copy link
Member Author

chrahunt commented Jul 9, 2019

Here you go:

Defined

path count
/lib/python3.7/site-packages/pip/_vendor/pyparsing.py 39
/lib/python3.7/http/cookiejar.py 20
/lib/python3.7/site-packages/pip/_vendor/distlib/util.py 15
/lib/python3.7/site-packages/pip/_vendor/pytoml/parser.py 13
/lib/python3.7/html/parser.py 12
/lib/python3.7/platform.py 10
/lib/python3.7/configparser.py 8
/lib/python3.7/site-packages/pip/_vendor/pkg_resources/__init__.py 6
/lib/python3.7/logging/config.py 6
/lib/python3.7/email/feedparser.py 5
/lib/python3.7/textwrap.py 5
/lib/python3.7/_markupbase.py 5
/lib/python3.7/http/cookies.py 4
/lib/python3.7/site-packages/pip/_internal/vcs/subversion.py 4
/lib/python3.7/email/header.py 3
/lib/python3.7/email/utils.py 3
/lib/python3.7/distutils/sysconfig.py 3
/lib/python3.7/site-packages/pip/_vendor/packaging/version.py 3
/lib/python3.7/site-packages/pip/_vendor/packaging/specifiers.py 3
/lib/python3.7/json/encoder.py 3
/lib/python3.7/site-packages/pip/_vendor/chardet/universaldetector.py 3
/lib/python3.7/site-packages/pip/_internal/models/link.py 3
/lib/python3.7/site-packages/pip/_internal/wheel.py 3
/lib/python3.7/site-packages/pip/_internal/req/req_file.py 3
/lib/python3.7/glob.py 2
/lib/python3.7/http/client.py 2
/lib/python3.7/distutils/fancy_getopt.py 2
/lib/python3.7/plistlib.py 2
/lib/python3.7/site-packages/pip/_vendor/colorama/ansitowin32.py 2
/lib/python3.7/json/decoder.py 2
/lib/python3.7/urllib/request.py 2
/lib/python3.7/site-packages/pip/_vendor/requests/utils.py 2
/lib/python3.7/site-packages/pip/_internal/download.py 2
/lib/python3.7/site-packages/pip/_vendor/html5lib/_inputstream.py 2
/lib/python3.7/site-packages/pip/_vendor/html5lib/serializer.py 2
/lib/python3.7/locale.py 1
/lib/python3.7/string.py 1
/lib/python3.7/urllib/parse.py 1
/lib/python3.7/email/_encoded_words.py 1
/lib/python3.7/email/message.py 1
/lib/python3.7/gettext.py 1
/lib/python3.7/distutils/dist.py 1
/lib/python3.7/site-packages/pip/_vendor/packaging/utils.py 1
/lib/python3.7/shlex.py 1
/lib/python3.7/html/__init__.py 1
/lib/python3.7/json/scanner.py 1
/lib/python3.7/site-packages/pip/_vendor/idna/core.py 1
/lib/python3.7/encodings/idna.py 1
/lib/python3.7/site-packages/pip/_vendor/cachecontrol/controller.py 1
/lib/python3.7/site-packages/pip/_internal/utils/encoding.py 1
/lib/python3.7/xml/etree/ElementPath.py 1
/lib/python3.7/site-packages/pip/_internal/pep425tags.py 1
/lib/python3.7/site-packages/pip/_vendor/distlib/scripts.py 1
/lib/python3.7/site-packages/pip/_internal/index.py 1
/lib/python3.7/site-packages/pip/_vendor/pytoml/utils.py 1
/lib/python3.7/site-packages/pip/_internal/vcs/git.py 1

Used

path count
/lib/python3.7/site-packages/pip/_vendor/pyparsing.py 6
/lib/python3.7/configparser.py 5
/lib/python3.7/site-packages/pip/_vendor/pkg_resources/__init__.py 3
/lib/python3.7/textwrap.py 3
/lib/python3.7/http/client.py 2
/lib/python3.7/site-packages/pip/_vendor/packaging/specifiers.py 2
/lib/python3.7/site-packages/pip/_vendor/packaging/version.py 2
/lib/python3.7/json/decoder.py 2
/lib/python3.7/_markupbase.py 2
/lib/python3.7/email/_encoded_words.py 1
/lib/python3.7/platform.py 1
/lib/python3.7/shlex.py 1
/lib/python3.7/http/cookies.py 1

Generated with

test.sh
#!/bin/sh

set -x
d="$(mktemp -d)"
cd "$d"
cat <<'EOF' > sitecustomize.py
import atexit
import re
import traceback

from collections import Counter


used = Counter()
created = Counter()


def format_counter(c):
    return (
        '|path|count|\n'
        '|-|-|\n' +
        '\n'.join(f'|`{k}`|{v}|' for k, v in c.most_common())
    )


def show_details():
    print('**Defined**\n')
    print(format_counter(created))
    print('\n**Used**\n')
    print(format_counter(used))


atexit.register(show_details)


class Proxy:
    def __init__(self, key, origin, source):
        self.__key = key
        self.__origin = origin
        self.__source = source
        self.__created = False

    def __getattribute__(self, name):
        if name.startswith('_Proxy_'):
            return object.__getattribute__(self, name)
        if not self.__created:
            print(f'used {self.__origin}')
            used[self.__key] += 1
            self.__created = True
            self.__object = self.__source()

        return getattr(self.__object, name)


def get_origin(tb):
    caller = tb[-2]
    _, file, *_ = caller.split()
    start = file.index('/lib/')
    return caller, file[start:-2]


old_compile = re.compile


def re_compile(*args, **kwargs):
    origin, file = get_origin(traceback.format_stack())
    print(origin)
    created[file] += 1
    return Proxy(file, origin, lambda: old_compile(*args, **kwargs))

re.compile = re_compile
EOF
PYTHONPATH=$PWD python -m pip

@cjerdonek
Copy link
Member

Thanks!

The reason I ask, by the way, is that there's another alternative, which is not to import modules / code that isn't needed for startup. Like, you could imagine pip having a minimal startup sub-package, where we put only the code needed for startup. That would give us greater control / visibility into what vendored libraries and standard library modules are needed for startup based on what (module-level) imports are in the startup sub-package.

@chrahunt
Copy link
Member Author

chrahunt commented Jul 9, 2019

I have tried tackling that, but the most difficult part for me is having enough and the right kind of testing to avoid regressions and increased overhead on development and code review.

Assuming this approach doesn't have major downsides, I don't think that they are mutually exclusive. It might be worth making a dedicated issue for each possible improvement so we can give each one focus in turn.

@cjerdonek
Copy link
Member

cjerdonek commented Jul 9, 2019

I guess for me personally I'd be more interested in reviewing and seeing progress on the more general dependency / import issue. Like, why does pip's startup need to import http/cookiejar.py and its 20 regexes, and why does it need to e.g. load the whole download.py and index.py machinery? While it may have an effect, swapping out re.compile() seems more like it's masking / papering over a deeper issue that should be fixed. If our dependencies are structured right, there may be little need to do anything with re.compile(), and I think we'll find that we'll get lots of other benefits for free, too.

As for tackling the dependency issue, I think it's just a matter of deciding on a target for how we want things to be and an incremental plan to get there, and then to do the PR's one at a time. The PR's should mostly just be moving things. And those kinds of changes would I think already be covered by tests since things would fail quickly if e.g. a function was imported from the wrong module. We'd probably want to settle on some kind of light-weight convention or pattern where we import the command or operation code we need inside a function once the code knows what command needs to be executed. There are also many opportunities for doing things towards this goal that can be done even without an overall goal / plan, like moving utility functions out of download.py and other "refactoring" / simplification of the module dependency graph.

@cjerdonek
Copy link
Member

cjerdonek commented Jul 9, 2019

Like, it seems like we shouldn't need to actually import all the command classes here just to get their name and summary in order to parse the command name:

commands_order = [
InstallCommand,
DownloadCommand,
UninstallCommand,
FreezeCommand,
ListCommand,
ShowCommand,
CheckCommand,
ConfigurationCommand,
SearchCommand,
WheelCommand,
HashCommand,
CompletionCommand,
DebugCommand,
HelpCommand,
] # type: List[Type[Command]]

It looks like the command class can simply be imported here once the name is parsed from the arguments and known:

command = commands_dict[cmd_name](isolated=("--isolated" in cmd_args))

@cjerdonek cjerdonek added state: needs discussion This needs some more discussion type: enhancement Improvements to functionality labels Jul 9, 2019
@triage-new-issues triage-new-issues bot removed S: needs triage Issues/PRs that need to be triaged labels Jul 9, 2019
@pfmoore
Copy link
Member

pfmoore commented Jul 9, 2019

I'd caution against getting sucked into looking at wider issues all at once. This is a nice, focused improvement which seems like it might help. Let's not block progress just for the sake of other potential improvements that may or may not happen. It's not like we can't do both.

So +1 from me on investigating this possibility further, and looking at other improvements in their own PRs.

@cjerdonek
Copy link
Member

I filed a separate issue for what I was suggesting: #6692. It's similarly focused and gets more at the root cause IMO as many of the re.compile lines will no longer be in the import path.

@xavfernandez
Copy link
Member

xavfernandez commented Jul 9, 2019

Given that cpython automatically caches the last 512 regexes used, is there really a need to pre-compile all those regexes ?

@pradyunsg
Copy link
Member

Related to #4768.

@pradyunsg
Copy link
Member

is there really a need to pre-compile all those regexes?

Interesting. The idea here however, IIUC, is to defer the compiling to later, instead of pre-compiling.

@ichard26 ichard26 added the type: performance Commands take too long to run label May 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
state: needs discussion This needs some more discussion type: enhancement Improvements to functionality type: performance Commands take too long to run
Projects
None yet
Development

No branches or pull requests

6 participants