Improve efficiency of entry-point parsing #283

Closed
jaraco opened this issue Feb 21, 2021 · 3 comments · Fixed by #317


@jaraco
Member

jaraco commented Feb 21, 2021

In #281, this project added support for uniqueness of distributions when parsing entry points. That change degraded the performance of entry-point parsing, because the metadata for every project now has to be loaded and inspected. In that PR, two suggestions were made to improve performance:

  • Rely on the discovery name for the distribution rather than the proper name in metadata to disambiguate distributions.
  • Rely on a custom, optimized parser instead of ConfigParser for parsing the entry points themselves.

Let's consider those two suggestions.

@anntzer
Contributor

anntzer commented Feb 21, 2021

You need to decide how well you want to handle malformed metadata (is misparsing it OK? is throwing some random unhelpful error OK?); obviously, the more carefully you want to handle it, the slower things will be. If you are OK with "we make no guarantees for malformed metadata", then the second patch at #281 (comment) should basically be usable as-is.

Likewise, it is up to you to decide how to handle distributions with different non-normalized names but identical normalized names. If you're fine with confusing them, then the first patch in the linked issue should likewise be close to usable.
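To make the "different non-normalized names but identical normalized names" case concrete, here is a minimal sketch. It assumes a PEP 503-style normalization (runs of `-`, `_`, `.` collapsed, then lowercased); the helper name `normalize` and the sample names are illustrative only, and the project's own normalization may differ in detail.

```python
import re


def normalize(name):
    """PEP 503-style normalization (an assumption for illustration)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def unique_by_normalized_name(names):
    """Keep the first name seen for each normalized form."""
    seen = set()
    result = []
    for name in names:
        key = normalize(name)
        if key not in seen:
            seen.add(key)
            result.append(name)
    return result


# Three spellings that all collapse to the same normalized name.
names = ["Zope.Interface", "zope-interface", "zope_interface"]
unique = unique_by_normalized_name(names)
```

Under this assumption, deduplicating on the normalized form keeps only the first of the three spellings, which is exactly the "confusing them" outcome described above.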

@jaraco
Member Author

jaraco commented Feb 22, 2021

Given that most metadata is mechanically generated, I'm okay with weak error handling, but I also have a good deal of respect for regularity in parsing. That is, I'd like to avoid routines that are heavily imperative and difficult to reason about.

The normalized-names challenge concerns me more, mainly because it's going to demand consideration for distributions that don't present normalized names at all. It's currently not part of the protocol, but simply an implementation detail of PathDistributions that they have a normalized name. Distributions from another source might not have a normalized name at all. I'm less worried about uniqueness variance between PathDistributions with normalized and non-normalized names. If there's a difference, that's going to cause trouble elsewhere. Also, relying on the normalized name as found in the filesystem adds a new dependency on that form, making it more difficult to later change that implementation detail. The only specified, reliable place to retrieve the name is through the metadata. This makes me wonder if maybe there's another approach that could optimize the loading of the proper name from the distribution. I think it's anything but straightforward.

@anntzer
Contributor

anntzer commented Feb 22, 2021

OK, let's deal with the simpler (first) problem first. I'd say the simple parser I wrote in the other thread is really as simple as it gets ("skip empty lines; if a line starts with a bracket it's a new group (record it); else it should have shape '{key} = {value}' and corresponds to a (group, key, value) entry"). It's actually simpler to follow than the ConfigParser-based parser (unless you are super-familiar with ConfigParser details, like what optionxform does...), and also shorter (as you can also delete the now-unused _from_config).
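For readers without the other thread at hand, the rule described above could look roughly like the following. This is a sketch written from the description in this comment, not the patch from #281; the function name `parse_entry_points` is hypothetical, and it makes no attempt to handle comments or malformed lines.

```python
def parse_entry_points(text):
    """Yield (group, key, value) triples from entry_points.txt-style text.

    Weak error handling by design: malformed input is not diagnosed.
    """
    group = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip empty lines
        if line.startswith("["):
            group = line[1:-1]  # "[console_scripts]" -> "console_scripts"
            continue
        # Anything else is expected to look like "key = value".
        key, _, value = line.partition("=")
        yield group, key.strip(), value.strip()


sample = """\
[console_scripts]
foo = pkg.mod:main

[my.group]
Bar = other:obj [extra]
"""
entries = list(parse_entry_points(sample))
```

Note that this sketch keeps keys exactly as written, whereas ConfigParser's default optionxform lowercases option names unless it is overridden.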
