Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] options.packages.find.exclude not taking effect when include_package_data = True #3260

Open
abravalheri opened this issue Apr 11, 2022 · 10 comments
Labels
bug Needs Triage Issues that need to be evaluated for severity and status.

Comments

@abravalheri
Copy link
Contributor

setuptools version

62.1.0

Python version

Python 3.10.4 (main, Mar 30 2022, 23:50:39) [GCC 10.2.1 20210110]

OS

Tested in Alpine Linux 3.14, Debian bullseye and Ubuntu 20.04.4 LTS

Additional environment information

Python 3.10.4 (main, Mar 30 2022, 23:50:39) [GCC 10.2.1 20210110]

Package    Version
---------- -------
build      0.7.0
packaging  21.3
pep517     0.12.0
pip        22.0.4
pyparsing  3.0.8
setuptools 62.1.0
tomli      2.0.1
wheel      0.37.1

Description

For some reason, when include_package_data = True, subpackages excluded via options.packages.find.exclude are still added to the wheel.

Expected behavior

When listed via options.packages.find.exclude a subpackage should not be added to the wheel.

How to Reproduce

> docker run --rm -it python:3.10-bullseye /bin/bash

mkdir /tmp/mypkg
cd /tmp/mypkg
mkdir -p mypkg/tests
touch mypkg/__init__.py
touch mypkg/resource.txt
touch mypkg/tests/test_mypkg.py
touch mypkg/tests/test_file.txt

cat <<EOS > pyproject.toml
[build-system]
requires = ['setuptools']
build-backend = 'setuptools.build_meta'
EOS

cat <<EOS > setup.cfg
[metadata]
name = mypkg
version = 42

[options]
include_package_data = True
packages = find:

[options.packages.find]
exclude = *.tests*
EOS

cat <<EOS > MANIFEST.in
global-include *.py *.txt
global-exclude *.py[cod]
prune dist
prune build
EOS

python -m pip install -U pip build setuptools wheel
python -VV
python -m pip list

rm -rf dist build mypkg.egg-info
python -m build --no-isolation
# `--no-isolation` is used to guarantee we have control over the
# versions of the build dependencies
unzip -l dist/*.whl

Output

  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2022-04-11 17:49   mypkg/__init__.py
        0  2022-04-11 17:49   mypkg/resource.txt
        0  2022-04-11 17:49   mypkg/tests/__init__.py
        0  2022-04-11 17:49   mypkg/tests/test_file.txt
        0  2022-04-11 17:49   mypkg/tests/test_mypkg.py
      108  2022-04-11 17:49   mypkg-42.dist-info/METADATA
       92  2022-04-11 17:49   mypkg-42.dist-info/WHEEL
        6  2022-04-11 17:49   mypkg-42.dist-info/top_level.txt
      654  2022-04-11 17:49   mypkg-42.dist-info/RECORD
---------                     -------
      860                     9 files

The expected output should not include:

- mypkg/tests/__init__.py
- mypkg/tests/test_file.txt
- mypkg/tests/test_mypkg.py
@abravalheri abravalheri added bug Needs Triage Issues that need to be evaluated for severity and status. labels Apr 11, 2022
@abravalheri abravalheri changed the title [BUG] options.packages.find.exclude not making effect when include_package_data = True [BUG] options.packages.find.exclude not taking effect when include_package_data = True Apr 11, 2022
@jaraco
Copy link
Member

jaraco commented Apr 16, 2022

The expected output should not include...

The issue I see with this expectation is that under other circumstances, a subfolder in a package is data.

Consider if all you had is:

mypkg/tests/test_file.txt

In that case, the test file would be included as package data. Then if you add a python file to the directory:

mypkg/tests/test_file.txt
mypkg/tests/template.py

Now neither file gets included as data. That behavior would be counterintuitive and conflates two concerns.

Moreover, it seems perfectly legitimate to have a directory of python files as data, especially if they're not intended to be imported through the package, but used in another way (such as a template or scripts or other non-module behavior in a Python syntax).

I don't see a good solution here, except for Setuptools to get out of the business of managing packages, modules, and their data and to instead model discovery of files for the build that could be packages or modules or data, and provide mechanisms to include/exclude those generally.

@abravalheri
Copy link
Contributor Author

Thank you very much for the detailed explanation @jaraco.

I think that if we change mypkg/tests/test_mypkg.py to mypkg/tests/__init__.py, the example becomes fairly less controversial.

With this change it is more obvious that mypkg.tests is a package, and therefore the expected outcome would be the entire mypkg/tests folder to not be included in the wheel (due to [options.packages.find]). Does this make sense?

In this case, the existing implementation would still behave in an unexpected way and include the directory that is supposed to be excluded.

@abravalheri
Copy link
Contributor Author

The issue I see with this expectation is that under other circumstances, a subfolder in a package is data.

Would it be better if we consider all the subfolders as nested packages and then require the users to explicitly include this package?

I am OK with this assumption. This would also be coherent with Python's existing behaviour1.

Moreover, it seems perfectly legitimate to have a directory of python files as data, especially if they're not intended to be imported through the package, but used in another way (such as a template or scripts or other non-module behavior in a Python syntax).

This is a bit tricky... How do we know which Python files are meant to be data files or actual modules?
I think it should be fair to assume that anything that can be reached via the import ... or from ... import ... syntax is a legitimate Python module/package.

I don't see a good solution here, except for Setuptools to get out of the business of managing packages, modules, and their data and to instead model discovery of files for the build that could be packages or modules or data, and provide mechanisms to include/exclude those generally.

Please let me know what are your thoughts about this matter. I would be more than happy to participate in this discussion and see if I can help somehow.

Footnotes

  1. I don't know if it is accidental but after PEP 420 a folder nested inside of any sys.path entries will be "importable" even if it does not contain a __init__.py file - as long as all the directories are valid Python identifiers.

@jaraco
Copy link
Member

jaraco commented Apr 24, 2022

  1. I don't know if it is accidental but after PEP 420 a folder ... will be "importable"

It's intentional. During the design, we considered requiring a namespace package to have some sort of indicator that it's a Python package, but ultimately decided that the simpler approach of recognizing any directory as a package was more elegant.

That's why:

$ mkdir foo
$ python -c "import foo; print(foo.__name__)"
foo

That is, an empty directory is an empty Python package. Simple and elegant, though somewhat ambiguous (because a directory of rust code is also a Python package whose sole contents is rust files as data).

require the users to explicitly include this package?

I very much want to avoid users having to uniquely indicate items to be included. I'd like for there to be a discovery mechanism that works for 99% of cases and have options to tweak when the discovery is inaccurate. This is part of the zero-config vision... a user should be able to create a null distribution with no config, then add a folder to it, which creates an empty Python package, then add modules to that package, all having to duplicately tell Setuptools what they just did.

That is, Setuptools should infer the same thing that Python does for imports. So by default, what you create for Python gets packaged.

This is a bit tricky... How do we know which Python files are meant to be data files or actual modules?

I guess we can't reliably. I'd say we'd have to assume everything is a legitimate Python package unless indicated otherwise (such as through an exclusion directive or heuristic).

@abravalheri
Copy link
Contributor Author

abravalheri commented Apr 25, 2022

Thank you very much for the information Jason.

I very much want to avoid users having to uniquely indicate items to be included. I'd like for there to be a discovery mechanism that works for 99% of cases and have options to tweak when the discovery is inaccurate. This is part of the zero-config vision... a user should be able to create a null distribution with no config, then add a folder to it, which creates an empty Python package, then add modules to that package, all having to duplicately tell Setuptools what they just did.

If the users adopt include_package_data=True (the default for projects using PEP 621) + setuptools-scm (or relevant VCS plugin), they would be not that far away from this situation.

The difference is that we currently have 2 mechanisms to "trim the excesses":

  • exclude_package_data to remove some of the non Python files
  • find.packages.exclude to remove entire folders from the build (supposedly)

The second mechanism is broken by the current implementation of include_package_data.

Maybe going forward we would want:

  • To add all existing files inside of a package even if the user don't adopt setuptools-scm/relevant VCS plugin.
    When the user does adopt a VCS plugin, instead of adding all files we just add the ones given by the plugin.
    I don't know what would be the implications of this approach with compiled extensions
  • A single mechanism for removing any kind of "extra" files and folders, regardless if they are folders, Python scripts or data files.

@abravalheri
Copy link
Contributor Author

Status update:

In v62.3.0, we introduce a deprecation warning about the conflicting behaviour. After some time we can start tackling this issue.

@bittner
Copy link
Contributor

bittner commented Jun 29, 2023

Just as another voice to illustrate the current situation:

I try to use pyproject.toml as the one and only place for my package configuration. I stick to setuptools and have added setuptools_scm, because I need to add static files (templates, CSS, JS) and I didn't manage to include them in a generic way. (Yes, it's a Django project that I'm packaging.)

What I find irritating is that the find.exclude directive doesn't seem to have any effect (in my case?) neither on sdist nor on wheels. I need to add a MANIFEST.in file to make configuration files stay clear from the sdist package.

Example pyproject.toml

[build-system]
requires = ["setuptools>=61.0", "setuptools_scm[toml]>=7"]
build-backend = "setuptools.build_meta"

# ...

[tool.setuptools.packages.find]
exclude = ["tests*", ".*config", ".*ignore", ".*.yml"]
namespaces = false

The exclude doesn't have any effect here. When I run python -m build the project's tests/ folder is included entirely, just as all the configuration files. I can even remove the exclude = ... line above and the result remains the same: The wheel stays clear of tests and config files, the sdist doesn't.

MANIFEST.in needed

graft */static
graft */templates
global-exclude *.py[cod]
exclude MANIFEST.in tox.ini .*config .*ignore .*.yml
prune static
prune tests

Adding this MANIFEST.in file seems to be the only way to make tests and configuration files not included in the sdist. For the wheel the default algorithm seems to do the trick, on the other hand. (namespaces = false helps to keep directories like venv out of sight.)

Developer expectations

As a developer I would expect the options in the tool.setuptools.packages.find section to apply to both wheel and sdist. To everything that python -m build does, really.

(Side note: With the bad news of having to maintain a MANIFEST.in file it seems like the good news is that I can drop setuptools_scm, which has its own set of complications with Tox and CI/CD. But really, I'd like to do without additional files and tools. A pragmatic approach to synchronize the two worlds might be to generate a MANIFEST.in file on-the-fly from the configuration options in pyproject.toml's setuptools sections.)

@abravalheri
Copy link
Contributor Author

Hi @bittner, please note that the link you referenced mentions tool.setuptools.package-data, tool.setuptools.include-package-data and tool.setuptools.exclude-package-data. However your example makes use of a different field: tool.setuptools.packages.find.excludes.

Have you also tried to use the field mentioned by the documentation you linked?
The following example seem to work for me:

> docker run --rm -it python:3.9-bullseye /bin/bash
mkdir -p /tmp/myproj/myproj/
cd /tmp/myproj
touch myproj/.foobarignore
touch myproj/.foobarconfig
touch myproj/.helloworld.yml
touch myproj/__init__.py
touch myproj/tests123

cat <<EOF > pyproject.toml
[build-system]
requires = ["setuptools", "setuptools_scm[toml]>=7"]
build-backend = "setuptools.build_meta"

[project]
name = "hello_world"
dynamic = ["version"]

[tool.setuptools]
exclude-package-data = { "*" = ["tests*", ".*config", ".*ignore", ".*.yml"]}
EOF

git init .
git add .
git config --global user.email "you@example.com"
git commit -m "Initial commit"
python -m pip install -U build
python -m build
unzip -l dist/*.whl
# ...
# Successfully built hello_world-0.0.0.tar.gz and hello_world-0.0.0-py3-none-any.whl
# Archive:  dist/hello_world-0.0.0-py3-none-any.whl
#   Length      Date    Time    Name
# ---------  ---------- -----   ----
#         0  2023-06-30 12:39   myproj/__init__.py
#        56  2023-06-30 12:39   hello_world-0.0.0.dist-info/METADATA
#        92  2023-06-30 12:39   hello_world-0.0.0.dist-info/WHEEL
#         7  2023-06-30 12:39   hello_world-0.0.0.dist-info/top_level.txt
#       383  2023-06-30 12:39   hello_world-0.0.0.dist-info/RECORD
# ---------                     -------
#       538                     5 files

@bittner
Copy link
Contributor

bittner commented Jul 2, 2023

The following example seem to work for me:

Sure, the wheel is fine. It's the sdist that doesn't work for excluding tests and configuration files.

tar tvfz dist/*.tar.gz 
# drwxr-xr-x         0 hello_world-0.0.0/
# -rw-r--r--        55 hello_world-0.0.0/PKG-INFO
# drwxr-xr-x         0 hello_world-0.0.0/hello_world.egg-info/
# -rw-r--r--        55 hello_world-0.0.0/hello_world.egg-info/PKG-INFO
# -rw-r--r--       254 hello_world-0.0.0/hello_world.egg-info/SOURCES.txt
# -rw-r--r--         1 hello_world-0.0.0/hello_world.egg-info/dependency_links.txt
# -rw-r--r--         7 hello_world-0.0.0/hello_world.egg-info/top_level.txt
# drwxr-xr-x         0 hello_world-0.0.0/myproj/
# -rw-r--r--         0 hello_world-0.0.0/myproj/.foobarconfig
# -rw-r--r--         0 hello_world-0.0.0/myproj/.foobarignore
# -rw-r--r--         0 hello_world-0.0.0/myproj/.helloworld.yml
# -rw-r--r--         0 hello_world-0.0.0/myproj/__init__.py
# -rw-r--r--         0 hello_world-0.0.0/myproj/tests123
# -rw-r--r--       258 hello_world-0.0.0/pyproject.toml
# -rw-r--r--        38 hello_world-0.0.0/setup.cfg

Shouldn't the result be roughly the same?

In my understanding, the exclude-package-data only deals with package data, hence what is in the project root folder is not covered by the directive (but probably by the default file exclusion algorithm as hinted in the docs). If you add a tests/ folder that is successfully excluded. The root-level (configuration) files not.

@abravalheri
Copy link
Contributor Author

Shouldn't the result be roughly the same?

Not really. The interpretation of what sdists should be varies from person to person, but the general idea is of a platform independent intermediate artefact1.

There is a line of thought (to which I personally subscribe), that considers sdist as a "development snapshot" on steroids; i.e., it contains everything needed for a developer to build, test and perform all normal maintenance tasks, plus Python-specific packaging metadata. Distribution re-packagers usually that approach and encourage devs to leave docs and test files in the sdists (see discussion in https://discuss.python.org/t/should-sdists-include-docs-and-tests/14578).

According to this particular definition, sdists can diverge substantially from wheels.

In my understanding, the exclude-package-data only deals with package data, hence what is in the project root folder is not covered by the directive (but probably by the default file exclusion algorithm as hinted in the docs). If you add a tests/ folder that is successfully excluded. The root-level (configuration) files not.

That is exactly what happens, *package-data (and tool.setuptools.packages.find as well) impact the files inside package directories (the ones that get installed in the site-packages folder in your computer), not the root.

The packaging process for Python is particular, but you can think about about it in 2 stages:

flowchart LR
    0[working dir] -->|"(A) build_sdist"| sdist
    sdist -->|"(B) build_wheel"| wheel

If you want to modify the contents of the sdist (process A), you cannot do that via the pyproject.toml. Either you have to use MANIFEST.in or rely on the selection done by your plugin of choice (e.g. setuptools-scm).

setuptools-scm is quite pragmatic on that regard. It will consider everything tracked in the versioning system as "useful" for development purposes and therefore "worthy" to be part of the sdist. It is more a "convention-over-configuration" kind of thing. If this approach is not your cup of tea, MANIFEST.in is what you got.

Please note that the original problem reported in this issue is not about contents of sdist but rather about the inconsistencies when configuring include_package_data and packages.find.exclude when it comes to the final wheel artefact.

Footnotes

  1. The spectrum of "intermediate" is quite big. It can vary from pure raw source code to "pre-compiled" objects. E.g., let's consider Cython that can translate .pyx files into .c files. You could have an sdist with the original .pyx files, but you could also have an sdist with the intermediate .c, since those are also platform independent. The wheel, on the other hand will contain only the final compiled .so files.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Needs Triage Issues that need to be evaluated for severity and status.
Projects
None yet
3 participants