Data Files Support

Older packaging and installation methods in the Python ecosystem traditionally allowed installation of "data files", which are placed in a platform-specific location. However, the most common use case for data files distributed with a package is for use by the package itself, usually by including the data files inside the package directory.

Setuptools focuses on this most common type of data files and offers three ways of specifying which files should be included in your packages, as described in the following sections.

include_package_data

First, you can simply use the include_package_data keyword. For example, if the package tree looks like this:

project_root_directory
├── setup.py        # and/or setup.cfg, pyproject.toml
└── src
    └── mypkg
        ├── __init__.py
        ├── data1.rst
        ├── data2.rst
        ├── data1.txt
        └── data2.txt

and you supply this configuration:

setup.cfg

[options]
# ...
packages = find:
package_dir =
    = src
include_package_data = True

[options.packages.find]
where = src

setup.py

from setuptools import setup, find_packages
setup(
    # ...,
    packages=find_packages(where="src"),
    package_dir={"": "src"},
    include_package_data=True
)

pyproject.toml (BETA) [1]

[tool.setuptools]
# ...
# By default, include-package-data is true in pyproject.toml, so you do
# NOT have to specify this line.
include-package-data = true

[tool.setuptools.packages.find]
where = ["src"]

then all the .txt and .rst files will be automatically installed with your package, provided:

  1. These files are included via the MANIFEST.in file, like so:

    include src/mypkg/*.txt
    include src/mypkg/*.rst
  2. OR, they are being tracked by a revision control system such as Git, Mercurial or SVN, and you have configured an appropriate plugin such as setuptools-scm or setuptools-svn. (See the section below on Adding Support for Revision Control Systems for information on how to write such plugins.)
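One way to check that the expected files were actually packaged is to list the contents of the built wheel, which is an ordinary zip archive. The following is a minimal sketch, not part of Setuptools: the helper name list_data_files is hypothetical, and an in-memory archive stands in for a real wheel so that the example runs on its own.

```python
import io
import zipfile


def list_data_files(wheel):
    """Return archive members that are neither Python sources nor wheel metadata.

    `wheel` may be a path to a .whl file or a binary file-like object.
    (Hypothetical helper for illustration only.)
    """
    with zipfile.ZipFile(wheel) as archive:
        return sorted(
            name
            for name in archive.namelist()
            if not name.endswith("/")          # skip directory entries
            and not name.endswith(".py")       # skip Python sources
            and ".dist-info/" not in name      # skip wheel metadata
        )


# Stand-in for a real wheel: an in-memory zip with a similar layout.
fake_wheel = io.BytesIO()
with zipfile.ZipFile(fake_wheel, "w") as archive:
    for member in (
        "mypkg/__init__.py",
        "mypkg/data1.txt",
        "mypkg/data1.rst",
        "mypkg-1.0.dist-info/METADATA",
    ):
        archive.writestr(member, "")
fake_wheel.seek(0)

packaged = list_data_files(fake_wheel)
print(packaged)
```

With a real project, you would pass the path of the wheel produced by your build instead of the in-memory archive.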

package_data

By default, include_package_data considers all non-.py files found inside the package directory (src/mypkg in this case) as data files, and includes those that satisfy at least one of the above two conditions in the source distribution and, consequently, in the installation of your package. If you want finer-grained control over which files are included, you can also use the package_data keyword. For example, if the package tree looks like this:

project_root_directory
├── setup.py        # and/or setup.cfg, pyproject.toml
└── src
    └── mypkg
        ├── __init__.py
        ├── data1.rst
        ├── data2.rst
        ├── data1.txt
        └── data2.txt

then you can use the following configuration to capture the .txt and .rst files as data files:

setup.cfg

[options]
# ...
packages = find:
package_dir =
    = src

[options.packages.find]
where = src

[options.package_data]
mypkg =
    *.txt
    *.rst

setup.py

from setuptools import setup, find_packages
setup(
    # ...,
    packages=find_packages(where="src"),
    package_dir={"": "src"},
    package_data={"mypkg": ["*.txt", "*.rst"]}
)

pyproject.toml (BETA) [2]

[tool.setuptools.packages.find]
where = ["src"]

[tool.setuptools.package-data]
mypkg = ["*.txt", "*.rst"]

The package_data argument is a dictionary that maps from package names to lists of glob patterns. Note that data files specified using the package_data option need neither be included in a MANIFEST.in file nor be added by a revision control system plugin.

Note

If your glob patterns use paths, you must use a forward slash (/) as the path separator, even if you are on Windows. Setuptools automatically converts slashes to appropriate platform-specific separators at build time.

Note

Glob patterns do not automatically match dotfiles (directory or file names starting with a dot (.)). To include such files, you must explicitly start the pattern with a dot, e.g. .* to match .gitignore.
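The dotfile rule can be observed with the standard library's glob module. This is only an illustration under the assumption that the patterns in package_data follow the same dotfile convention as glob; the helper match_names is hypothetical and is not a Setuptools API.

```python
import glob
import os
import tempfile


def match_names(directory, pattern):
    """Return the sorted base names in `directory` that match `pattern`."""
    # glob.escape protects any special characters in the directory path itself.
    full_pattern = os.path.join(glob.escape(directory), pattern)
    return sorted(os.path.basename(path) for path in glob.glob(full_pattern))


with tempfile.TemporaryDirectory() as tmp:
    # Create one ordinary file and one dotfile.
    for name in ("data1.txt", ".gitignore"):
        open(os.path.join(tmp, name), "w").close()

    star_matches = match_names(tmp, "*")   # a bare * skips dotfiles
    dot_matches = match_names(tmp, ".*")   # a leading dot must be explicit

print(star_matches, dot_matches)
```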

You may have multiple top-level packages that share a common pattern of data files, for example:

project_root_directory
├── setup.py        # and/or setup.cfg, pyproject.toml
└── src
    ├── mypkg1
    │   ├── data1.rst
    │   ├── data1.txt
    │   └── __init__.py
    └── mypkg2
        ├── data2.txt
        └── __init__.py

Here, both packages mypkg1 and mypkg2 share a common pattern of having .txt data files. However, only mypkg1 has .rst data files. In such a case, if you want to use the package_data option, the following configuration will work:

setup.cfg

[options]
packages = find:
package_dir =
    = src

[options.packages.find]
where = src

[options.package_data]
* =
  *.txt
mypkg1 =
  data1.rst

setup.py

from setuptools import setup, find_packages
setup(
    # ...,
    packages=find_packages(where="src"),
    package_dir={"": "src"},
    package_data={"": ["*.txt"], "mypkg1": ["data1.rst"]},
)

pyproject.toml (BETA) [3]

[tool.setuptools.packages.find]
where = ["src"]

[tool.setuptools.package-data]
"*" = ["*.txt"]
mypkg1 = ["data1.rst"]

Notice that if you list patterns in package_data under the empty string "" in setup.py, and the asterisk * in setup.cfg and pyproject.toml, these patterns are used to find files in every package. For example, we use "" or * to indicate that the .txt files from all packages should be captured as data files. Also note how we can continue to specify patterns for individual packages, i.e. we specify that data1.rst from mypkg1 alone should be captured as well.
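The way the catch-all key combines with per-package keys can be modelled in a few lines. This is a simplified sketch of the behaviour described above, not the Setuptools implementation; the function patterns_for is hypothetical.

```python
def patterns_for(package_data, package):
    """Return the glob patterns that apply to `package`.

    Patterns under the "" key apply to every package; patterns under the
    package's own name are added on top of them.
    """
    return list(package_data.get("", [])) + list(package_data.get(package, []))


# Same mapping as in the setup.py example above.
package_data = {"": ["*.txt"], "mypkg1": ["data1.rst"]}

mypkg1_patterns = patterns_for(package_data, "mypkg1")
mypkg2_patterns = patterns_for(package_data, "mypkg2")
print(mypkg1_patterns, mypkg2_patterns)
```

Here mypkg1 receives both the shared pattern and its own, while mypkg2 receives only the shared one.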

Note

When building an sdist, the data files are also drawn from the package_name.egg-info/SOURCES.txt file, so make sure that this file is removed if the package_data list in setup.py is updated before calling setup.py.

Note

If using the include_package_data argument, files specified by package_data will not be automatically added to the manifest unless they are listed in the MANIFEST.in file or by a plugin like setuptools-scm or setuptools-svn.

exclude_package_data

Sometimes, the include_package_data or package_data options alone aren't sufficient to precisely define what files you want included. For example, consider a scenario where you have include_package_data=True, and you are using a revision control system with an appropriate plugin. Developers sometimes add directory-specific marker files (such as .gitignore, .gitkeep, .gitattributes, or .hgignore). These files are probably tracked by the revision control system, and therefore by default they will be included when the package is installed.

Supposing you want to prevent these files from being included in the installation (they are not relevant to Python or the package), then you could use the exclude_package_data option:

setup.cfg

[options]
# ...
packages = find:
package_dir =
    = src
include_package_data = True

[options.packages.find]
where = src

[options.exclude_package_data]
mypkg =
    .gitattributes

setup.py

from setuptools import setup, find_packages
setup(
    # ...,
    packages=find_packages(where="src"),
    package_dir={"": "src"},
    include_package_data=True,
    exclude_package_data={"mypkg": [".gitattributes"]},
)

pyproject.toml (BETA) [4]

[tool.setuptools.packages.find]
where = ["src"]

[tool.setuptools.exclude-package-data]
mypkg = [".gitattributes"]

The exclude_package_data option is a dictionary mapping package names to lists of wildcard patterns, just like the package_data option. And, just as with that option, you can use the empty string key "" in setup.py and the asterisk * in setup.cfg and pyproject.toml to match all top-level packages.

Any files that match these patterns will be excluded from installation, even if they were listed in package_data or were included as a result of using include_package_data.
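The effect of exclusion can be pictured as a final filter applied after the other two options have selected their files. The sketch below is illustrative only: apply_excludes is a hypothetical helper, and it uses fnmatch from the standard library rather than whatever matching Setuptools performs internally.

```python
from fnmatch import fnmatchcase


def apply_excludes(file_names, exclude_patterns):
    """Drop every name that matches at least one exclusion pattern."""
    return [
        name
        for name in file_names
        if not any(fnmatchcase(name, pattern) for pattern in exclude_patterns)
    ]


# Files selected for mypkg by include_package_data and/or package_data.
candidates = [".gitattributes", "data1.txt", "data2.rst"]

# Same pattern as in the exclude_package_data example above.
kept = apply_excludes(candidates, [".gitattributes"])
print(kept)
```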

Subdirectory for Data Files

A common pattern is where some (or all) of the data files are placed under a separate subdirectory. For example:

project_root_directory
├── setup.py        # and/or setup.cfg, pyproject.toml
└── src
    └── mypkg
        ├── data
        │   ├── data1.rst
        │   └── data2.rst
        ├── __init__.py
        ├── data1.txt
        └── data2.txt

Here, the .rst files are placed under a data subdirectory inside mypkg, while the .txt files are directly under mypkg.

In this case, the recommended approach is to treat data as a namespace package (see PEP 420). With package_data, the configuration might look like this:

setup.cfg

[options]
# ...
packages = find_namespace:
package_dir =
    = src

[options.packages.find]
where = src

[options.package_data]
mypkg =
    *.txt
mypkg.data =
    *.rst

setup.py

from setuptools import setup, find_namespace_packages
setup(
    # ...,
    packages=find_namespace_packages(where="src"),
    package_dir={"": "src"},
    package_data={
        "mypkg": ["*.txt"],
        "mypkg.data": ["*.rst"],
    }
)

pyproject.toml (BETA) [5]

[tool.setuptools.packages.find]
# scanning for namespace packages is true by default in pyproject.toml, so
# you do NOT need to include the following line.
namespaces = true
where = ["src"]

[tool.setuptools.package-data]
mypkg = ["*.txt"]
"mypkg.data" = ["*.rst"]

In other words, we allow Setuptools to scan for namespace packages in the src directory, which enables the data directory to be identified, and then, we separately specify data files for the root package mypkg, and the namespace package data under the package mypkg.
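To see why namespace scanning matters for the data directory, consider the following simplified model of package discovery. It is a sketch under stated assumptions, not the Setuptools algorithm: discover_packages is a hypothetical function that treats a directory as a package if it contains __init__.py, or unconditionally when namespaces is true.

```python
import os
import tempfile


def discover_packages(where, namespaces):
    """Return dotted names of directories under `where` treated as packages."""
    found = []
    for dirpath, dirnames, filenames in os.walk(where):
        dirnames.sort()  # make the traversal order deterministic
        rel = os.path.relpath(dirpath, where)
        if rel == ".":
            continue  # the search root itself is not a package
        if namespaces or "__init__.py" in filenames:
            found.append(rel.replace(os.sep, "."))
        else:
            dirnames.clear()  # do not descend below a non-package directory
    return found


with tempfile.TemporaryDirectory() as src:
    # Recreate the layout from the example: mypkg/__init__.py and mypkg/data/.
    os.makedirs(os.path.join(src, "mypkg", "data"))
    open(os.path.join(src, "mypkg", "__init__.py"), "w").close()
    open(os.path.join(src, "mypkg", "data", "data1.rst"), "w").close()

    regular_only = discover_packages(src, namespaces=False)
    with_namespaces = discover_packages(src, namespaces=True)

print(regular_only, with_namespaces)
```

Only when namespace scanning is enabled does mypkg.data appear, which is what allows it to be used as a key in package_data.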

With include_package_data the configuration is simpler: you simply need to enable scanning of namespace packages in the src directory and the rest is handled by Setuptools.

setup.cfg

[options]
packages = find_namespace:
package_dir =
    = src
include_package_data = True

[options.packages.find]
where = src

setup.py

from setuptools import setup, find_namespace_packages
setup(
    # ... ,
    packages=find_namespace_packages(where="src"),
    package_dir={"": "src"},
    include_package_data=True,
)

pyproject.toml (BETA) [6]

[tool.setuptools]
# ...
# By default, include-package-data is true in pyproject.toml, so you do
# NOT have to specify this line.
include-package-data = true

[tool.setuptools.packages.find]
# scanning for namespace packages is true by default in pyproject.toml, so
# you need NOT include the following line.
namespaces = true
where = ["src"]

Summary

In summary, the three options allow you to:

include_package_data

Accept all data files and directories matched by MANIFEST.in or added by a plugin (see Adding Support for Revision Control Systems).

package_data

Specify additional patterns to match files that may or may not be matched by MANIFEST.in or added by a plugin (see Adding Support for Revision Control Systems).

exclude_package_data

Specify patterns for data files and directories that should not be included when a package is installed, even if they would otherwise have been included due to the use of the preceding options.

Note

Due to the way the build process works, a data file that you include in your project and then stop including may be "orphaned" in your project's build directories, requiring you to run setup.py clean --all to fully remove them. This may also be important for your users and contributors if they track intermediate revisions of your project using Subversion; be sure to let them know when you make changes that remove files from inclusion so they can run setup.py clean --all.

Accessing Data Files at Runtime

Typically, existing programs manipulate a package's __file__ attribute in order to find the location of data files. For example, if you have a structure like this:

project_root_directory
├── setup.py        # and/or setup.cfg, pyproject.toml
└── src
    └── mypkg
        ├── data
        │   └── data1.txt
        ├── __init__.py
        └── foo.py

Then, in mypkg/foo.py, you may try something like this in order to access mypkg/data/data1.txt:

import os
data_path = os.path.join(os.path.dirname(__file__), 'data', 'data1.txt')
with open(data_path, 'r') as data_file:
    ...

However, this manipulation isn't compatible with PEP 302-based import hooks, including importing from zip files and Python Eggs. It is strongly recommended that, if you are using data files, you should use importlib.resources to access them. In this case, you would do something like this:

from importlib.resources import files
data_text = files('mypkg.data').joinpath('data1.txt').read_text()

importlib.resources was added in Python 3.7. However, the API illustrated in this code (using files()) was added only in Python 3.9 [7], and support for accessing data files via namespace packages was added only in Python 3.10 [8] (the data subdirectory is a namespace package under the root package mypkg). Therefore, you may find that this code works only in Python 3.10 and above. For other versions of Python, it is recommended that you use the importlib-resources backport, which provides the latest version of this library. In this case, the only change that has to be made to the above code is to replace importlib.resources with importlib_resources, i.e.

from importlib_resources import files
...

See the "Using" guide in the importlib-resources documentation for detailed instructions.
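The files() approach can be tried end to end without installing anything. The sketch below builds a throwaway package in a temporary directory and reads a data file from it; the package name demo_pkg and the file contents are made up for illustration. It assumes Python 3.9 or later and starts from the regular package, traversing into the data subdirectory with the / operator, so it does not rely on resource access through a namespace package.

```python
import importlib
import sys
import tempfile
from importlib.resources import files
from pathlib import Path

with tempfile.TemporaryDirectory() as root:
    # Lay out demo_pkg/__init__.py and demo_pkg/data/data1.txt.
    pkg_dir = Path(root) / "demo_pkg"
    (pkg_dir / "data").mkdir(parents=True)
    (pkg_dir / "__init__.py").write_text("")
    (pkg_dir / "data" / "data1.txt").write_text("hello from data1\n")

    # Make the temporary package importable for the duration of the demo.
    sys.path.insert(0, root)
    importlib.invalidate_caches()
    try:
        # files() locates the package; "/" walks to the resource inside it.
        data_text = (files("demo_pkg") / "data" / "data1.txt").read_text()
    finally:
        sys.path.remove(root)
        sys.modules.pop("demo_pkg", None)

print(data_text)
```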

Tip

Files inside the package directory should be read-only to avoid a series of common problems (e.g. when multiple users share a common Python installation, when the package is loaded from a zip file, or when multiple instances of a Python application run in parallel).

If your Python package needs to write to a file for shared data or configuration, you can use standard platform/OS-specific system directories, such as ~/.local/config/$appname or /usr/share/$appname/$version (Linux specific) [9]. A common approach is to add a read-only template file to the package directory that is then copied to the correct system directory if no pre-existing file is found.
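The template-copy approach might look like the following sketch. The helper ensure_user_copy, the file names, and the directory layout are all hypothetical; a temporary directory stands in for both the package directory and the user's configuration directory so that the example is self-contained.

```python
import shutil
import tempfile
from pathlib import Path


def ensure_user_copy(template, target):
    """Copy `template` to `target` unless `target` already exists."""
    target = Path(target)
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(template, target)
    return target


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a read-only template shipped inside the package.
    template = Path(tmp) / "pkg" / "default.cfg"
    template.parent.mkdir()
    template.write_text("color = blue\n")

    # Stand-in for a per-user, writable location.
    user_file = Path(tmp) / "user" / "app" / "settings.cfg"

    ensure_user_copy(template, user_file)      # first run: template is copied
    first_copy = user_file.read_text()

    user_file.write_text("color = red\n")      # the user edits their copy
    ensure_user_copy(template, user_file)      # later runs: the copy is kept
    after_second_call = user_file.read_text()

print(first_copy, after_second_call)
```

In a real application, the template would typically be read through importlib.resources and the target directory chosen with the help of a library such as platformdirs.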

Non-Package Data Files

Historically, setuptools by way of easy_install would encapsulate data files from the distribution into the egg (see the old docs). As eggs are deprecated and pip-based installs fall back to the platform-specific location for installing data files, there is no supported facility to reliably retrieve these resources.

Instead, the PyPA recommends that any data files you wish to be accessible at run time be included inside the package.



  1. Support for adding build configuration options via the [tool.setuptools] table in the pyproject.toml file is still in beta stage. See /userguide/pyproject_config.

  2. Support for adding build configuration options via the [tool.setuptools] table in the pyproject.toml file is still in beta stage. See /userguide/pyproject_config.

  3. Support for adding build configuration options via the [tool.setuptools] table in the pyproject.toml file is still in beta stage. See /userguide/pyproject_config.

  4. Support for adding build configuration options via the [tool.setuptools] table in the pyproject.toml file is still in beta stage. See /userguide/pyproject_config.

  5. Support for adding build configuration options via the [tool.setuptools] table in the pyproject.toml file is still in beta stage. See /userguide/pyproject_config.

  6. Support for adding build configuration options via the [tool.setuptools] table in the pyproject.toml file is still in beta stage. See /userguide/pyproject_config.

  7. Reference: https://importlib-resources.readthedocs.io/en/latest/using.html#migrating-from-legacy

  8. Reference: python/importlib_resources#196 (comment)

  9. These locations can be discovered with the help of third-party libraries such as platformdirs.