build_ext --inplace overwrites data files if they have been installed via MANIFEST.in #886

Open
vyasr opened this issue Mar 16, 2023 · 4 comments

@vyasr
Contributor

vyasr commented Mar 16, 2023

This bug is a fairly niche edge case, but it is nasty enough that I think it is worth fixing if at all possible, since the results are potentially quite bad: local edits to files get overwritten. Now that I have an MRE I am trying to zero in on a root cause, but I would appreciate insights if someone already knows what to do without further inspection. tl;dr: `setup.py build_ext --inplace` appears to be unsafe to use after `setup.py install` if a MANIFEST.in file is present and points to any file that may be modified.

Here is a gist with a project that demonstrates the basic problem (note that the `_hello.pyx` and `__init__.py` files should be placed in a `hello` subdirectory). First, run `python setup.py install`. Next, make any edit to `_hello.pyx` (I include `x=1` twice, so the simplest change is to comment out one of those lines) and run `python setup.py build_ext --inplace`. The edit that was just made will vanish.

The problem appears to be that `setup.py install` copies every file included by MANIFEST.in into scikit-build's cmake-install directory, while `build_ext --inplace` copies files from this install tree back into the source tree, again respecting MANIFEST.in. Only `setup.py install` actually copies the current version of a file into cmake-install. If no install command is ever run, there is nothing to copy back and everything seems to work fine, i.e. it is completely fine to only ever use `build_ext --inplace`. However, once install has been run even once, the files listed in the manifest exist in the install tree and are only updated by subsequent install commands. Meanwhile, every subsequent `build_ext --inplace` copies those files from the install tree back into the source directory. The result is that, once install has been run, `build_ext --inplace` is no longer safe to use because it overwrites all local changes with the state of the file at the last install.
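To make the sequence above concrete, here is a minimal simulation in plain Python. It does not call scikit-build at all: `simulate_install` and `simulate_build_ext_inplace` are hypothetical stand-ins for the two copy steps as I understand them, and the directory layout is only illustrative.

```python
import shutil
import tempfile
from pathlib import Path


def simulate_install(src: Path, install_tree: Path, manifest: list[str]) -> None:
    """Stand-in for `setup.py install`: refresh manifest files in the install tree."""
    for rel in manifest:
        dest = install_tree / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src / rel, dest)


def simulate_build_ext_inplace(src: Path, install_tree: Path, manifest: list[str]) -> None:
    """Stand-in for `build_ext --inplace`: copy staged manifest files back to the source."""
    for rel in manifest:
        staged = install_tree / rel
        if staged.exists():  # nothing to copy back if install was never run
            shutil.copy2(staged, src / rel)


def demo() -> tuple[str, str]:
    original = "x = 1\nx = 1\n"
    edited = "x = 1\n# x = 1\n"
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp, "src")
        install_tree = Path(tmp, "cmake-install")  # illustrative location
        manifest = ["hello/_hello.pyx"]
        pyx = src / manifest[0]
        pyx.parent.mkdir(parents=True)

        # Case 1: install has never been run, so the local edit survives.
        pyx.write_text(original)
        pyx.write_text(edited)
        simulate_build_ext_inplace(src, install_tree, manifest)
        without_install = pyx.read_text()

        # Case 2: install ran once before the edit, so the edit is overwritten.
        pyx.write_text(original)
        simulate_install(src, install_tree, manifest)
        pyx.write_text(edited)
        simulate_build_ext_inplace(src, install_tree, manifest)
        after_install = pyx.read_text()
    return without_install, after_install
```

Calling `demo()` returns the file contents for both cases: the first still contains the commented-out line, the second has reverted to the content that was staged by the simulated install.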

@vyasr
Contributor Author

vyasr commented Mar 16, 2023

It looks like specifying `package_data=...` instead of using `include_package_data` with MANIFEST.in works as expected. At minimum, the files are no longer incorrectly overwritten, and I verified that all the desired files are correctly included when creating a wheel using `pip wheel`.
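For reference, a rough sketch of what that workaround could look like in a `setup.py`. The package name, version, and glob patterns are assumptions modelled on the `hello` example project, not copied from the gist; `SETUP_KWARGS` and `main` are just names chosen for this sketch.

```python
# setup.py (sketch). Names and patterns below are illustrative assumptions.
SETUP_KWARGS = dict(
    name="hello",
    version="0.0.1",
    packages=["hello"],
    # List data files explicitly rather than relying on
    # include_package_data=True together with MANIFEST.in.
    package_data={"hello": ["*.pxd", "*.pyx"]},
)


def main() -> None:
    # Imported lazily so the kwargs can be inspected without scikit-build installed.
    from skbuild import setup

    setup(**SETUP_KWARGS)
```

In a real `setup.py`, `main()` would be called at module level. Note that `include_package_data` is deliberately left out of the keyword arguments here.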

scikit-build's approach of hooking setuptools to autogenerate a MANIFEST also seems to work. Since that manifest generation is a setuptools hook, it happens inside `setuptools.setup`, downstream of where scikit-build manually processes manifest entries, so it is not susceptible to the same bug.

@vyasr
Contributor Author

vyasr commented Mar 16, 2023

The underlying problem comes from this code block, which populates `package_data` with the contents of MANIFEST.in, followed by this subsequent block, which copies those files from the install tree into the source tree when developer mode is enabled. IIUC the problem is that this copy happens without the corresponding copy from the source tree into the install tree, which only happens on installation, so files that have been modified since the last install get overwritten. I don't have enough context to know exactly how this should be fixed, but I am happy to consult on possible solutions or to make a PR given some guidance on the appropriate behavior. There are probably edge cases that I don't fully understand that explain why the current sequence makes sense.

rapids-bot bot pushed a commit to rapidsai/rmm that referenced this issue Mar 16, 2023
… for wheels (#1233)

Using MANIFEST.in currently runs into a pretty nasty scikit-build bug (scikit-build/scikit-build#886) that results in any file included by the manifest being copied from the install tree back into the source tree whenever an in place build occurs after an install, overwriting any local changes. We need an alternative approach to ensure that all necessary files are included in built packages. There are two types:
- sdists: scikit-build automatically generates a manifest during sdist generation if we don't provide one, and that manifest is reliably complete. It contains all files needed for a source build up to the rmm C++ code (which has always been true and is something we can come back to improving later if desired).
- wheels: The autogenerated manifest is not used during wheel generation because the manifest generation hook is not invoked during wheel builds, so to include data in the wheels we must provide the `package_data` argument to `setup`. In this case we do not need to include CMake or pyx files because the result does not need to be possible to build from, it just needs pxd files for other packages to cimport if desired.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #1233
rapids-bot bot pushed a commit to rapidsai/raft that referenced this issue Mar 17, 2023
… for wheels (#1348)

Using MANIFEST.in currently runs into a pretty nasty scikit-build bug (scikit-build/scikit-build#886) that results in any file included by the manifest being copied from the install tree back into the source tree whenever an in place build occurs after an install, overwriting any local changes. We need an alternative approach to ensure that all necessary files are included in built packages. There are two types:
- sdists: scikit-build automatically generates a manifest during sdist generation if we don't provide one, and that manifest is reliably complete. It contains all files needed for a source build up to the raft C++ code (which has always been true and is something we can come back to improving later if desired).
- wheels: The autogenerated manifest is not used during wheel generation because the manifest generation hook is not invoked during wheel builds, so to include data in the wheels we must provide the `package_data` argument to `setup`. In this case we do not need to include CMake or pyx files because the result does not need to be possible to build from, it just needs pxd files for other packages to cimport if desired.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Ben Frederickson (https://github.com/benfred)

URL: #1348
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue Mar 17, 2023
… for wheels (#12960)

Using MANIFEST.in currently runs into a pretty nasty scikit-build bug (scikit-build/scikit-build#886) that results in any file included by the manifest being copied from the install tree back into the source tree whenever an in place build occurs after an install, overwriting any local changes. We need an alternative approach to ensure that all necessary files are included in built packages. There are two types:
- sdists: scikit-build automatically generates a manifest during sdist generation if we don't provide one, and that manifest is reliably complete. It contains all files needed for a source build up to the cudf C++ code (which has always been true and is something we can come back to improving later if desired).
- wheels: The autogenerated manifest is not used during wheel generation because the manifest generation hook is not invoked during wheel builds, so to include data in the wheels we must provide the `package_data` argument to `setup`. In this case we do not need to include CMake or pyx files because the result does not need to be possible to build from, it just needs pxd files for other packages to cimport if desired.

I also reverted #12945, which was a stopgap solution to avoid this underlying problem. That change would have caused import issues inside the python/cudf directory when installing (the lack of an inplace build would have made the source tree unimportable) so this fix removes that minor limitation introduced in that PR.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #12960
rapids-bot bot pushed a commit to rapidsai/cuml that referenced this issue Mar 17, 2023
… for wheels (#5278)

Using MANIFEST.in currently runs into a pretty nasty scikit-build bug (scikit-build/scikit-build#886) that results in any file included by the manifest being copied from the install tree back into the source tree whenever an in place build occurs after an install, overwriting any local changes. We need an alternative approach to ensure that all necessary files are included in built packages. There are two types:
- sdists: scikit-build automatically generates a manifest during sdist generation if we don't provide one, and that manifest is reliably complete. It contains all files needed for a source build up to the cuml C++ code (which has always been true and is something we can come back to improving later if desired).
- wheels: The autogenerated manifest is not used during wheel generation because the manifest generation hook is not invoked during wheel builds, so to include data in the wheels we must provide the `package_data` argument to `setup`. In this case we do not need to include CMake or pyx files because the result does not need to be possible to build from, it just needs pxd files for other packages to cimport if desired.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: #5278
rapids-bot bot pushed a commit to rapidsai/cugraph that referenced this issue Mar 17, 2023
… for wheels (#3342)

Using MANIFEST.in currently runs into a pretty nasty scikit-build bug (scikit-build/scikit-build#886) that results in any file included by the manifest being copied from the install tree back into the source tree whenever an in place build occurs after an install, overwriting any local changes. We need an alternative approach to ensure that all necessary files are included in built packages. There are two types:
- sdists: scikit-build automatically generates a manifest during sdist generation if we don't provide one, and that manifest is reliably complete. It contains all files needed for a source build up to the cugraph C++ code (which has always been true and is something we can come back to improving later if desired).
- wheels: The autogenerated manifest is not used during wheel generation because the manifest generation hook is not invoked during wheel builds, so to include data in the wheels we must provide the `package_data` argument to `setup`. In this case we do not need to include CMake or pyx files because the result does not need to be possible to build from, it just needs pxd files for other packages to cimport if desired.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Rick Ratzel (https://github.com/rlratzel)

URL: #3342
@henryiii
Contributor

I want to work toward making scikit-build behave as closely as possible to the way setuptools works by default. I'm mostly planning on doing this via scikit-build-core's setuptools support, which will replace scikit-build's code in the future, but I'm happy to slowly work toward improving this too. Packages and data files are really tricky in scikit-build.

@vyasr
Contributor Author

vyasr commented Apr 13, 2023

Totally understand the challenges here, and I agree with the strategy. IMHO this particular bug seems worth prioritizing a bit higher than waiting on scikit-build-core though. It's a recipe for potentially significant losses of local work for developers.

I didn't manage to pin down the root cause well enough to determine an optimal solution, but I suspect that this particular case can be patched. Do files from MANIFEST.in ever actually need to be copied back to the source tree? I don't recall whether that was being done by setuptools or scikit-build; if the latter, maybe it could simply be disabled? Alternatively, perhaps `build_ext` could always install MANIFEST.in files if an install dir exists, even though it isn't supposed to be installing?

If there isn't an easy workaround, perhaps certain code paths should simply be disabled and raise errors. Overwriting local changes seems far worse to me than having scikit-build fail with an error saying "build_ext is not supported after install when using MANIFEST.in" (assuming that situation can be robustly detected).
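As a rough illustration of what such a guard might look like, here is a sketch that refuses to copy back when a staged file differs from the local one. This is not scikit-build code: `StaleInstallTreeError` and `check_copy_back_is_safe` are hypothetical names, and comparing file contents is only one possible way to detect the situation.

```python
import filecmp
from pathlib import Path


class StaleInstallTreeError(RuntimeError):
    """Hypothetical error: copying back would overwrite local edits."""


def check_copy_back_is_safe(src: Path, install_tree: Path, manifest: list[str]) -> None:
    """Raise if any manifest file in the install tree differs from the source copy."""
    stale = []
    for rel in manifest:
        staged, local = install_tree / rel, src / rel
        # Only files present in both trees can be clobbered by a copy back.
        if staged.is_file() and local.is_file():
            if not filecmp.cmp(staged, local, shallow=False):
                stale.append(rel)
    if stale:
        raise StaleInstallTreeError(
            "build_ext is not supported after install when using MANIFEST.in; "
            f"these files differ from the install tree: {', '.join(stale)}"
        )
```

A check like this would have to run before the copy from the install tree into the source tree. It errs on the side of refusing, which matches the preference above for a hard error over silently lost work.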
