Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignoring package_data when include_package_data is True #1461

Closed
Tobotimus opened this issue Aug 18, 2018 · 13 comments · Fixed by #2844
Closed

Ignoring package_data when include_package_data is True #1461

Tobotimus opened this issue Aug 18, 2018 · 13 comments · Fixed by #2844
Labels
documentation enhancement major Needs Discussion Issues where the implementation still needs to be discussed.

Comments

@Tobotimus
Copy link

Tobotimus commented Aug 18, 2018

Hi, firstly, thanks for the work you maintainers do on this library :)

I could be incorrect in saying this, but it's come to my understanding that setting include_package_data=True in setup includes data files specified by MANIFEST.in and only in MANIFEST.in. At the very least, this behaviour is not clearly explained in the docs, and at the worst it is unexpected and implicit behaviour.

Here is a sentence from the docs which seems to be trying to say the above, although elsewhere in the docs include_package_data is not explained in this way:

If using the setuptools-specific include_package_data argument, files specified by package_data will not be automatically added to the manifest unless they are listed in the MANIFEST.in file.

Doesn't this make package_data redundant when include_package_data is True? Does it really make sense to have a param called include_package_data which implicitly ignores package_data?

In summary, this is the behaviour I seem to have come across when using these options in various combinations:

  1. Including files in package_data with include_package_data=False:
    • Files specified by package_data are included as well as everything in MANIFEST.in.
  2. Including files in package_data with include_package_data=True:
    • Only the files specified by MANIFEST.in are included, files in package_data not specified by MANIFEST.in are excluded.
  3. Omitting package_data with include_package_data=True:
    • Same result as point 2.

I'm sorry if I've got this wrong, any correspondence is much appreciated :)

@pganssle pganssle added Needs Triage Issues that need to be evaluated for severity and status. enhancement major Needs Discussion Issues where the implementation still needs to be discussed. documentation and removed Needs Triage Issues that need to be evaluated for severity and status. labels Oct 19, 2018
@pganssle
Copy link
Member

I tend to agree that the intuitive behavior would either be:

  1. include_package_data + package_data does the union of everything found between the two of them
  2. an error is raised as these are incompatible arguments

My preference would be 1, but that is a backwards-incompatible change that could probably cause all kinds of issues with existing packages, and a shocking number of projects don't test their packages as installed. I think we can detect if both are specified and switch it over to a warning for a few releases, then an error.

Also, I find include_package_data to be a very misleading name. I suspect we'd be better off deprecating this keyword entirely. It could be replaced with a find_package_data function similar to find_packages. Alternatively (and I like this less), we could do something like add an include_manifest_data or include_manifest_specified_data, which would have the "union of specified package data and manifest data" behavior.

@healiseu
Copy link

As a newcomer to the Python packaging world I experienced that confusion first hand

vanschelven added a commit to vanschelven/setuptools that referenced this issue Oct 24, 2019
vanschelven added a commit to vanschelven/setuptools that referenced this issue Nov 15, 2019
@dheerajmpai
Copy link

This is completely confusing and frustrating. Hope to see this resolved very soon

jaraco pushed a commit to vanschelven/setuptools that referenced this issue Dec 22, 2019
xuanduc987 added a commit to xuanduc987/tortoise-orm that referenced this issue Jun 29, 2020
When `include_package_data=True`, only files listed in MANIFEST.in are
included and the `package_data` option is ignored.

> If using the setuptools-specific include_package_data argument, files
> specified by package_data will not be automatically added to the manifest
> unless they are listed in the MANIFEST.in file.

see:
  - https://setuptools.readthedocs.io/en/latest/setuptools.html#including-data-files
  - pypa/setuptools#1461

fixes tortoise#439
grigi pushed a commit to tortoise/tortoise-orm that referenced this issue Jul 5, 2020
When `include_package_data=True`, only files listed in MANIFEST.in are
included and the `package_data` option is ignored.

> If using the setuptools-specific include_package_data argument, files
> specified by package_data will not be automatically added to the manifest
> unless they are listed in the MANIFEST.in file.

see:
  - https://setuptools.readthedocs.io/en/latest/setuptools.html#including-data-files
  - pypa/setuptools#1461

fixes #439
@jaraco
Copy link
Member

jaraco commented Dec 21, 2020

I suspect this issue is related to this hacky behavior.

jasongrout added a commit to jasongrout/jupyterlab that referenced this issue Feb 12, 2021
In jupyter/jupyter-packaging#66, jupyter_packaging switched (back) to using setuptools rather than distutils for sdist. We had problems before with the staging directory not being included when using the setuptools sdist (see conda-forge/jupyterlab-feedstock#235 and jupyter/jupyter-packaging#51). It seems that the real problem we were experiencing was that when you have include_package_data=True, setuptools apparently does not include package_data files that are not given in MANIFEST.in (see pypa/setuptools#1461, for example).

This tries to get our packaging working with setuptools. We’ll be able to test this better when we actually do a release.

I tested with just running `python setup.py sdist` in a fresh checkout, and it ended up building the javascript in the staging directory, and then it included the build and node_modules directory. Thus I prune these out explicitly here. However, I think our normal build process does not build in that directory, but rather copies over files from the dev_mode directory, so the prune may not be needed in our normal build process. It doesn’t hurt to have it in, though.

It may be that we can consolidate these include statements, for example, perhaps we can just do “graft jupyterlab/staging”. However, I went for simplicity in closely mirroring what we have in package_data in setup.py.
oschwald added a commit to maxmind/MaxMind-DB-Reader-python that referenced this issue Sep 18, 2021
@layday layday mentioned this issue Oct 27, 2021
2 tasks
timabbott pushed a commit to eeshangarg/python-zulip-api that referenced this issue Oct 27, 2021
According to the `setuptools` docs, once `include_package_data=True`
is passed to `setup()`, it will only include package data specified
in `MANIFEST.in`, and will ignore the `package_data` argument passed
to `setup()`. Therefore, `py.typed` was not included in our latest
PyPI release 0.8.1.

Since we specify all of our package data in the `MANIFEST.in` file
already, it makes more sense to include mention `py.typed` there so
that it is included in the expected fashion.

See pypa/setuptools#1461 for background.

Fixes zulip#732.
timabbott pushed a commit to eeshangarg/python-zulip-api that referenced this issue Oct 27, 2021
According to the `setuptools` docs, once `include_package_data=True`
is passed to `setup()`, it will only include package data specified
in `MANIFEST.in`, and will ignore the `package_data` argument passed
to `setup()`. Thus, the `py.typed` file was not included in our
latest PyPI release 0.8.1.

A quick way to fix this is to remove the `include_package_data=True`
argument and to let our explicit `package_data` argument values
govern what data is included in the release.

See pypa/setuptools#1461 for background.
@abravalheri
Copy link
Contributor

Hi @Tobotimus. I just would like to clarify, that package_data is not exactly ignored.

According to some experiments, this is the expected behaviour:

  • In wheels:

    • exclude_package_data will ALWAYS prevent "data files" from being included in the distribution;

    • after considering that, "data files" will be included in the distribution if:

      • package_data lists them OR
      • include_package_data=True AND MANIFEST.in includes them.
  • In sdists, "data files" will be included in the distribution if:

    • MANIFEST.in includes them OR
    • include_package_data=False (or not present)
      AND package_data lists them
      AND exclude_package_data does not list them

Please notice this considers the extreme case, when the data files are placed inside directories that are not valid Python packages (e.g. missing __init__.py files or whose names are not valid python identifiers) *1.

Also have in mind that "data files" outside the package directory are no longer recommended *2.

Finally, I might have done something wrong in the experiments, so any review/correction is appreciated.

abravalheri added a commit to abravalheri/setuptools that referenced this issue Nov 1, 2021
The inconsistency for the `package_data` configuration in sdists
when `include_package_data=True` in pypa#1461 have been causing some
problems for the community for a while, as also shown in pypa#2835.

As pointed out by
[@jaraco](pypa#1461 (comment)),
this was being caused by a mechanism to break the recursion between the
`egg_info` and `sdist` commands.

In summary the loop is caused by the following behaviour:

- the `egg_info` command uses a subclass of `sdist` (`manifest_maker`)
  to calculate the MANIFEST,
- the `sdist` class needs to know the MANIFEST to calculate the data files when
  `include_package_data=True`

Previously, the mechanism to break this loop was to simply ignore
the data files in `sdist` when `include_package_data=True`.

The approach implemented in this change was to replace this mechanism,
by allowing `manifest_maker` to override the `_safe_data_files` method
from `sdist`.

---

Please notice [an extensive experiment]
(https://github.com/abravalheri/experiment-setuptools-package-data)
was carried out to investigate the previous confusing behaviour.

There is also [a simplified theoretical analysis]
(pyscaffold/pyscaffold#535 (comment))
comparing the observed behavior in the experiment and the expected
one. This analysis point out to the same offender indicated by
[@jaraco](pypa#1461 (comment))
(which is being replaced in this change).
abravalheri added a commit to abravalheri/setuptools that referenced this issue Nov 1, 2021
The inconsistency for the `package_data` configuration in sdists
when `include_package_data=True` in pypa#1461 have been causing some
problems for the community for a while, as also shown in pypa#2835.

As pointed out by
[@jaraco](pypa#1461 (comment)),
this was being caused by a mechanism to break the recursion between the
`egg_info` and `sdist` commands.

In summary the loop is caused by the following behaviour:

- the `egg_info` command uses a subclass of `sdist` (`manifest_maker`)
  to calculate the MANIFEST,
- the `sdist` class needs to know the MANIFEST to calculate the data files when
  `include_package_data=True`

Previously, the mechanism to break this loop was to simply ignore
the data files in `sdist` when `include_package_data=True`.

The approach implemented in this change was to replace this mechanism,
by allowing `manifest_maker` to override the `_safe_data_files` method
from `sdist`.

---

Please notice [an extensive experiment]
(https://github.com/abravalheri/experiment-setuptools-package-data)
was carried out to investigate the previous confusing behaviour.

There is also [a simplified theoretical analysis]
(pyscaffold/pyscaffold#535 (comment))
comparing the observed behavior in the experiment and the expected
one. This analysis point out to the same offender indicated by
[@jaraco](pypa#1461 (comment))
(which is being replaced in this change).
@zoj613
Copy link

zoj613 commented Jun 20, 2022

Has there been a fix for this? Im having issues where building an sdist will include everything no matter the settings in pyproject.toml.

@abravalheri
Copy link
Contributor

abravalheri commented Jun 20, 2022

Hi @zoj613 can you share a link to your project (or a mock project in the case it is private, replicating its behaviour), so I can have a look?

The issue has been fixed for a while.
Please note that when storing metadata in pyproject.toml, include-package-data is True by default (so you have to set it to False if you want to opt out).

@zoj613
Copy link

zoj613 commented Jun 20, 2022

@abravalheri here is the project branch: https://github.com/zoj613/polyagamma/tree/remove_poetry

This is an attempt to replace poetry with setuptools since the only dependency of the project is numpy. It uses a local C-library that is compiled via cython. I have found issues while building a sdist and wheel locally. If I use setuptools-scm as a build requirement then it seems like it ignores everything I put in pyproject.toml and include everything in the root directory no matter the combination of the config. I even attempted to add a MANIFEST.in it still did not work. If I dont include setuptools-scm then im faced with the problem of the build not including the header file folder include and the wheel distribution only including the shared library and nothing else.

I'm really not sure how to move forward now. I've read the docs multiple times and followed the instructions but still not having any luck.

The tests in this draft PR seem to pass from an editable installation but the distribution files (sdist and wheel) are completely wrong.

@abravalheri
Copy link
Contributor

abravalheri commented Jun 20, 2022

Hi @zoj613, thank you very much for providing the link.

I tried the following:

--- /dev/null
+++ b/MANIFEST.in
@@ -0,0 +1,4 @@
+prune .github
+prune examples
+prune scripts
+prune tests
diff --git a/pyproject.toml b/pyproject.toml
index 27180ed..ac7bf8d 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -5,6 +5,7 @@ requires = [
     "setuptools>=61.0.0",
     "numpy==1.21.3; python_version=='3.10'",
     "numpy==1.19.0; python_version<='3.9'",
+    "setuptools-scm",
 ]
 build-backend = "setuptools.build_meta"

@@ -32,23 +33,12 @@ classifiers = [
 ]

 [tool.setuptools]
-zip-safe = false
-include-package-data = false
 packages = ["polyagamma"]

 [tool.setuptools.dynamic]
 version = {attr = "polyagamma.__version__"}

-[tools.setuptools.packages.find]
-exclude = [".github*", "examples*", "scripts*", "tests*"]
-include = ["include*"]
-
-[tool.setuptools.package-data]
-polyagamma = ["*.c", "*.pyi", "*.pxd"]
-"*" = ["include*"]
-
 [tool.setuptools.exclude-package-data]
-"*" = [".github", "examples", "scripts", "tests"]
 polyagamma = ["*.pyx"]

 [tool.coverage.run]

As a result, I managed to obtain the following distributions:

% unzip -l dist/*.whl
Archive:  dist/polyagamma-1.3.3-cp38-cp38-linux_x86_64.whl
  Length      Date    Time    Name
---------  ---------- -----   ----
      198  2022-06-20 11:06   polyagamma/__init__.pxd
      120  2022-06-20 11:30   polyagamma/__init__.py
      168  2022-06-20 11:06   polyagamma/__init__.pyi
  1847392  2022-06-20 11:56   polyagamma/_polyagamma.cpython-38-x86_64-linux-gnu.so
      859  2022-06-20 11:06   polyagamma/_polyagamma.pxd
     3270  2022-06-20 11:06   polyagamma/_polyagamma.pyi
     1525  2022-06-20 11:56   polyagamma-1.3.3.dist-info/LICENSE
    13576  2022-06-20 11:56   polyagamma-1.3.3.dist-info/METADATA
      103  2022-06-20 11:56   polyagamma-1.3.3.dist-info/WHEEL
       11  2022-06-20 11:56   polyagamma-1.3.3.dist-info/top_level.txt
      917  2022-06-20 11:56   polyagamma-1.3.3.dist-info/RECORD
---------                     -------
  1868139                     11 files
% tar tf dist/*.tar.gz
polyagamma-1.3.3/
polyagamma-1.3.3/.gitignore
polyagamma-1.3.3/LICENSE
polyagamma-1.3.3/MANIFEST.in
polyagamma-1.3.3/Makefile
polyagamma-1.3.3/PKG-INFO
polyagamma-1.3.3/README.md
polyagamma-1.3.3/include/
polyagamma-1.3.3/include/pgm_density.h
polyagamma-1.3.3/include/pgm_random.h
polyagamma-1.3.3/polyagamma/
polyagamma-1.3.3/polyagamma/__init__.pxd
polyagamma-1.3.3/polyagamma/__init__.py
polyagamma-1.3.3/polyagamma/__init__.pyi
polyagamma-1.3.3/polyagamma/_polyagamma.pxd
polyagamma-1.3.3/polyagamma/_polyagamma.pyi
polyagamma-1.3.3/polyagamma/_polyagamma.pyx
polyagamma-1.3.3/polyagamma.egg-info/
polyagamma-1.3.3/polyagamma.egg-info/PKG-INFO
polyagamma-1.3.3/polyagamma.egg-info/SOURCES.txt
polyagamma-1.3.3/polyagamma.egg-info/dependency_links.txt
polyagamma-1.3.3/polyagamma.egg-info/requires.txt
polyagamma-1.3.3/polyagamma.egg-info/top_level.txt
polyagamma-1.3.3/pyproject.toml
polyagamma-1.3.3/requirements-dev.txt
polyagamma-1.3.3/setup.cfg
polyagamma-1.3.3/setup.py
polyagamma-1.3.3/src/
polyagamma-1.3.3/src/pgm_alternate.c
polyagamma-1.3.3/src/pgm_alternate_trunc_points.h
polyagamma-1.3.3/src/pgm_common.c
polyagamma-1.3.3/src/pgm_common.h
polyagamma-1.3.3/src/pgm_density.c
polyagamma-1.3.3/src/pgm_devroye.c
polyagamma-1.3.3/src/pgm_macros.h
polyagamma-1.3.3/src/pgm_random.c
polyagamma-1.3.3/src/pgm_saddle.c

Which seems to be more or less what you want, isn't it? (PKG-INFO and *.egg-info are always generated/included by setuptools).
(Note that some people in the community do recommend to add the tests folder to the sdist, but that is a different debate...).

You can use MANIFEST.in to exclude directories/files included by default with setuptools-scm. prune will remove the entire-directory (other directives can be found here: https://packaging.python.org/en/latest/guides/using-manifest-in/).

If you are still having problems with the build, please try removing the cache with:

rm -rf build dist polyagamma.egg-info

@abravalheri
Copy link
Contributor

I think the main problem is that all sections with package-data in the name refer only to files inside the package folder (in your case the package folder is polyagamma), not outside.

Files outside the package folder need to controlled via MANIFEST.in.

@zoj613
Copy link

zoj613 commented Jun 20, 2022

Hi @abravalheri , thanks a lot your proposed patch works and I can reproduce your results locally. The problem now is that the shared library cannot be used. It appears as though the import statement cannot find it: I get the following error after installing the package via the wheel and sdist:

>>> from polyagamma import random_polyagamma
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zoj/dev/polyagamma/polyagamma/__init__.py", line 1, in <module>
    from ._polyagamma import (
ModuleNotFoundError: No module named 'polyagamma._polyagamma'

The shared library is named polyagamma._polyagamma in the setup.py file and compiles successfully so im not sure why it cannot be imported with this installation. Do you have any suggestion for me to fix this?

I have two other tangential questions:

  1. Why was the zip-safe = false option removed in your patch? I thought this option was important when packing extension modules?
  2. How could I use setuptools-scm to dynamically specify the version number using a generated polyagamma/_version.py file? When I used
[tool.setuptools.dynamic]
version = {attr = "polyagamma._version.version"}

it did not seem to work?

@abravalheri
Copy link
Contributor

The shared library is named polyagamma._polyagamma in the setup.py file and compiles successfully so im not sure why it cannot be imported with this installation. Do you have any suggestion for me to fix this?

If I had to guess, I would say that this problem is very likely to be caused because you are trying to test a package that uses a "flat-layout" from the project root.

Have you tried cd-ing into a different folder before running the Python REPL? Everytime you run python, it will add the working directory ($PWD) in front of the sys.path. This means that the polyagamma folder from your project structure can be overshadowing the polyagamma folder from your site-packages installation. When Python tries to load the local polyagamma folder, it cannot find any compiled file there...

The flat-layout is very tricky to use and has several caveats, specially for testing. I always recommend people to use the src-layout.

  1. Why was the zip-safe = false option removed in your patch?

I don't think zip-safe has any effect on wheel installations via pip. It might be useful only if someone is importing your package directly via pkg_resources (from a zip), but I think no one does that anymore nowadays...

  1. How could I use setuptools-scm to dynamically specify the version number...?

I believe you can have setuptools-scm to write the version it gets from git/mercurial to a file via the write_to configuration parameter in the [tool.setuptools-scm] table. Please check the details on https://github.com/pypa/setuptools_scm/#configuration-parameters. If I am not wrong, when you do that you don't need to use the version = {attr = ...} directive (you still need dynamic = ["version"]).

@zoj613
Copy link

zoj613 commented Jun 20, 2022

@abravalheri you were right, It worked once I changed to another directory. Thank you so much for the help. I will look into automating versioning via setuptools-scm at a later stage.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation enhancement major Needs Discussion Issues where the implementation still needs to be discussed.
Projects
None yet
7 participants