Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hooks: pkg_resources: implement support for package content listing #5284

Merged
merged 6 commits into from Mar 23, 2021

Conversation

rokm
Copy link
Member

@rokm rokm commented Oct 25, 2020

Implement a custom pkg_resources provider for our FrozenImporter, by subclassing a pkg_resources.NullProvider and overriding its _has(), _isdir(), and _listdir() methods. These are required/used by:

  • resource_exists() function (or provider's has_resource() method)
  • resource_isdir() function (or provider's resource_isdir() method)
  • resource_listdir() function (or provider's resource_listdir() method)

and the default implementations provided by NullProvider raise NotImplementedError("Can't perform this operation for unregistered loader type").

The implementation supports both external data files (the ones collected into the frozen application's directory, e.g. from wheels) and embedded data files (collected into PYZ archive, e.g. from eggs). The former should suffice to fix the pkg_resources problems with metpy (#5247), branca (#5256) and folium (#5262).

Fixes the "Providers" part of #4881.

The behavior of the new PyiFrozenProvider is modelled after DefaultProvider and ZipProvider. Wherever the behavior between the two is inconsistent, we implement the one that makes more sense for our case. The one deviation
in behavior is that in contrast to native providers, we cannot list the source .py files, as they do not exist in the frozen application.

The two new added tests, test_pkg_resources_provider_source and test_pkg_resources_provider_frozen verify the behavior of providers in original (source) and frozen application. They both use the same script with several behavior checks with test package in either package or egg form. The "source" test can be seen as a sort of a sanity check for the "frozen" one; by design of the test script and transitivity of its results, both tests passing indicate that PyiFrozenProvider implements behavior that is in line with the behavior of the two reference providers.

Copy link
Member

@Legorooj Legorooj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shows the quality of your work that the only problems I can find are typos! However just for safety's sake, I'll ask @bwoodsend to review this as well.

PyInstaller/hooks/rthooks/pyi_rth_pkgres.py Outdated Show resolved Hide resolved
PyInstaller/hooks/rthooks/pyi_rth_pkgres.py Outdated Show resolved Hide resolved
@bwoodsend
Copy link
Member

I'd feel slightly more comfortable if those resource files weren't empty. When I do:

>>> pkg_resources.resource_string("pyi_pkgres_testpkg", "subpkg1/data/extra/extra_entry1.txt")
b''

I question if something went wrong - even though I know you get a FileNotFoundError if something genuinely has gone wrong.

Copy link
Member

@bwoodsend bwoodsend left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been code.interact() messing about with this and it's very satisfyingly consistent with unfrozen Python.

@rokm
Copy link
Member Author

rokm commented Oct 27, 2020

I'd feel slightly more comfortable if those resource files weren't empty. When I do:

>>> pkg_resources.resource_string("pyi_pkgres_testpkg", "subpkg1/data/extra/extra_entry1.txt")
b''

I question if something went wrong - even though I know you get a FileNotFoundError if something genuinely has gone wrong.

Good point - I'll populate the data files in test package.

@htgoebel htgoebel self-requested a review November 1, 2020 13:57
Copy link
Member

@htgoebel htgoebel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Impressive work! You've been so brave going down into the catacombs of pkg_resources. Wow!
Thanks for implementing this, this solves a major pain we have since long. And I like the idea of transparently including the files in the file-system.

Beside some nitpicking (see below and line comments), I have two relevant points:

1) Which takes precedense: PYZ or filesystem? This should be the same behavior as if a module is both in the PYZ and in the file-system (due to collect_data_files, including .pyc). I don't have the asnwer at hand, so please have a look at the code or test it.
Of course this needs to be documents, at least in the relase notes and at the top of the runt-time-hook..

2) AFAIU, TocFilesystem structure is a dict of dicts, containing pathlib.PurePaths as leafs. IMHO building this is much too much overhead:

  • Over the run-time of a program, resource access is rare and typically not done in a inner loop. This our Provider should not spend too much time, memory and eletrical power on speeding up the lookup of a resource. Chances are high that only a handful of resources will be requested even for a huge program.
    This is especially true since AFAIS, this is done for each call of get_provider (which is called again by echo resource_*() call), and not cached:
>>> p1 = get_provider("ansible")
>>> p2 = get_provider("ansible")
>>> id(p1)
140475825572560
>>> id(p2)
140475825511888
  • While using pathlib here is elegant and simple, the overhead is huge. Again, esp. since results seem not to be cached, and the current code is touching many strings and files.
  • Thus I suggest you have a look at ZipProvider and seem cant be imitated from there.

And now here are the minor remarks.

  • Please put at least a # (and line-ending) into the empty files. Empyt file are easily deleted accidentally.
  • Please choose different extensions for the data files, to avoid only .txt-files are collected ;-) If you want to stick with text-like file extensions, .rst and .md are good choices.
  • Please provide a Makefile for recreating the .egg. Some distributions liek Debian want to have pure source code, and a zip file is a binary.
  • Please use present tense in the commit messages.

tests/functional/test_pkg_resources_provider.py Outdated Show resolved Hide resolved
tests/functional/test_pkg_resources_provider.py Outdated Show resolved Hide resolved
tests/functional/test_pkg_resources_provider.py Outdated Show resolved Hide resolved
tests/functional/test_pkg_resources_provider.py Outdated Show resolved Hide resolved
tests/functional/test_pkg_resources_provider.py Outdated Show resolved Hide resolved
PyInstaller/hooks/rthooks/pyi_rth_pkgres.py Outdated Show resolved Hide resolved
tests/functional/scripts/pyi_pkg_resources_provider.py Outdated Show resolved Hide resolved
PyInstaller/hooks/rthooks/pyi_rth_pkgres.py Show resolved Hide resolved
PyInstaller/hooks/rthooks/pyi_rth_pkgres.py Outdated Show resolved Hide resolved
@rokm
Copy link
Member Author

rokm commented Nov 6, 2020

Thanks for the detailed review!

Impressive work! You've been so brave going down into the catacombs of pkg_resources. Wow!
Thanks for implementing this, this solves a major pain we have since long. And I like the idea of transparently including the files in the file-system.

Perhaps it's worth pointing out that from the issues I've seen, we could probably get away by implementing only the on-filesystem part of the behavior here.

If we aimed for that, it would probably be enough to just subclass the DefaultProvider, and override the _get() method to use the original implementation from NullProvider. Tthe implementation from DefaultProvider is filesystem-only, while the one from NullProvider uses loader.get_data(), and thus in case of FrozenImporter supports both on-filesystem and PYZ-embedded resources.

That's the reason why the initial attempt to use DefaultProvider instead of NullProvider broke the existing test_egg_zipped test in test_import.py.

Beside some nitpicking (see below and line comments), I have two relevant points:

1) Which takes precedense: PYZ or filesystem? This should be the same behavior as if a module is both in the PYZ and in the file-system (due to collect_data_files, including .pyc). I don't have the asnwer at hand, so please have a look at the code or test it.
Of course this needs to be documents, at least in the relase notes and at the top of the runt-time-hook..

The PYZ-embedded resources take precedence. This is to keep behavior consistent with FrozenImporter.get_data(), which also first checks the TOC and then filesystem. (This method is used by NullProvider. and PyiFrozenProvider._get(), and in turn by pkg_resources.resource_string()).

2) AFAIU, TocFilesystem structure is a dict of dicts, containing pathlib.PurePaths as leafs. IMHO building this is much too much overhead:

  • Over the run-time of a program, resource access is rare and typically not done in a inner loop. This our Provider should not spend too much time, memory and eletrical power on speeding up the lookup of a resource. Chances are high that only a handful of resources will be requested even for a huge program.
    This is especially true since AFAIS, this is done for each call of get_provider (which is called again by echo resource_*() call), and not cached:
>>> p1 = get_provider("ansible")
>>> p2 = get_provider("ansible")
>>> id(p1)
140475825572560
>>> id(p2)
140475825511888

Indeed, TocFilesystem is a prefix tree implemented as dict of dicts, where leaf nodes are either:

  • a string - last component of a file name (EDIT: actually, the leaf node is an empty string, and indicates that the parent is the last component of a filename)
  • empty dict - for indicating that the node is an empty directory; this is used to emulate PYZ-embedded sub-pacakge directories that originally contained only .py files and are now empty.

The prefix tree is there not so much to speed up the resource look-up, but more due to the fact that if we want to emulate PYZ-embedded filesystem, we need to somehow reconstruct the directory structure from filenames in the TOC. I could not come up with a better idea to do so - perhaps you have one?

But I agree with your performance concerns. Would you consider the following two modifications sufficient to alleviate those?

  • Caching of prefix trees (in memory, for the duration of the program run-time); i.e., a global dict with base package names for keys and TocFileystem structures for values. We could then also optimize this further to reuse the parts of base package trees for sub-packages and sub-modules, if necessary.

  • The prefix tree is actually needed only by the newly-implemented methods (_has(), _isdir() and _listdir()). We could defer the prefix tree creation (/retrieval from cache) until the first call to one of these methods. This way, just creating the Provider without using any resource query/listing will incur no performance penalty.

  • While using pathlib here is elegant and simple, the overhead is huge. Again, esp. since results seem not to be cached, and the current code is touching many strings and files.

  • Thus I suggest you have a look at ZipProvider and seem cant be imitated from there.

OK, I will take a closer look at ZipProvider implementation. If I recall correctly, it has split indices for directories and for files. Would you prefer this design (over the current combined prefix tree), or was your comment more in line with removing the use of pathlib in favor of os.path-based path manipulation?

And now here are the minor remarks.

  • Please put at least a # (and line-ending) into the empty files. Empyt file are easily deleted accidentally.

Aha, good point. I filled in the datafiles as per an earlier comment, but the .py files are indeed blank. Will add a comment line in them.

  • Please choose different extensions for the data files, to avoid only .txt-files are collected ;-) If you want to stick with text-like file extensions, .rst and .md are good choices.

Sure, we can diversify the extensions and contents of datafiles a bit.

  • Please provide a Makefile for recreating the .egg. Some distributions liek Debian want to have pure source code, and a zip file is a binary.

Good point. I thought zipped eggs were allowed, since we already have pyi_egg_zipped.egg checked into the repository...

But on the other hand, keeping the "source" and zipped egg in sync is somewhat a hassle, so it might actually be better to have the test build the egg from the source and use it...

  • Please use present tense in the commit messages.

Yeah, the commit messages are a bit of a mess right now, and reflect the actual incremental development history... I was hoping they would be squashed together with a summary message from the PR. Otherwise, I will squash/reorganize them a bit once all other concerns are addresses and the PR is ready for merge...

Copy link
Member

@htgoebel htgoebel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beside some minor nit-picking, this looks good for me. Okay to merge after this have been done and commits cleaned up.

(I would have applied these nit-picking changes myself, but commits are not yet clean up.)

news/5284.hooks.rst Outdated Show resolved Hide resolved
@rokm
Copy link
Member Author

rokm commented Jan 8, 2021

Alright, here's the updated PR, rebased on top of develop.

I've squashed all test-related commits into first commit, and most initial incremental work on the PyFrozenProvider into second one. Couple of larger subsequent modifications are still in separate commits, but we can squash them all into the second one, if that's preferable.

rokm added 6 commits March 3, 2021 10:41
The test script tests for behavior of resource_exists(), resource_isdir(),
and resource_listdir() functions from pkg_resources package (which
in turn call the methods with same name in the provider class).

The idea is to run the test script twice, once as unfrozen python
script and once as a frozen program. In both cases, the test package
is once present as a plain package directory and once as a zipped egg
(generated on-the-fly from the source directory).

This way, we test behavior of the original provider (DefaultProvider
or ZipProvider) and the provider used within the frozen application
(which we will need to implement to replace the currently used
NullProvider).
Implement PyiFrozenProvider that subclasses NullProvider from
pkg_resources and provides _has(), _isdir(), and _listdir()
methods. The implementation of these methods supports both
PYZ-embedded and on-filesystem resources, in that order of
precedence.
Because a directory may exist as both an embedded and on-filesystem
resource, we need to de-duplicate the results when listing the
filesystem in addition to embedded tree.
Add a block describing basic behavior of PyiFrozenProvider, w.r.t.
to PYZ-embedded and on-filesystem resources.
@bwoodsend bwoodsend merged commit 94c333d into pyinstaller:develop Mar 23, 2021
@rokm rokm deleted the pkg-resources branch May 5, 2021 09:47
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 16, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants