Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Perform a more thorough check of package_data structure #1769

Merged
merged 2 commits into from
Jul 23, 2019

Conversation

dhimmel
Copy link
Contributor

@dhimmel dhimmel commented May 22, 2019

Summary of changes

package_data and exclude_package_data expect to receive a dictionary whose values are lists rather than strings. Previously, passing a str value did not cause any warning or error, which is dangerous because users accidentally input a glob/path that then gets iterated into individual characters.

Refs #1459

Pull Request Checklist

  • Changes have tests
  • News fragment added in changelog.d. See documentation for details

@dhimmel
Copy link
Contributor Author

dhimmel commented May 22, 2019

Any advice on what test I should write and where I should put it would be appreciated.

@dhimmel
Copy link
Contributor Author

dhimmel commented May 22, 2019

As of a86588f, the two messages are:

DistutilsSetupError: 'package_data' must be a dictionary mapping package names to lists of wildcard patterns
WARNING: 'package_data' must be a dictionary mapping package names to lists of wildcard patterns, but the value for 'manubot' is a str not list

@@ -314,12 +318,14 @@ def check_package_data(dist, attr, value):
iter(v)
except TypeError:
break
if isinstance(v, str):
distutils.log.warn(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we've generally been doing warnings.warn instead of distutils.log.warn these days.

@pganssle
Copy link
Member

@dhimmel Thank you for for looking into this. I think you'll want to add tests into test_dist. If you switch over to use a warning instead of distutils.log.warn, you can use the pytest.raises and pytest.warns context managers to test both of these scenarios.

@dhimmel
Copy link
Contributor Author

dhimmel commented May 22, 2019

Thanks @pganssle for the suggestions.

As far as the check_package_data implementation goes, I considered a few options:

  1. warn if dictionary value is string
  2. warn if dictionary value is not list
  3. error if dictionary value is string
  4. error if dictionary value is not list

I've implemented 1 because 3 & 4 could cause setup.py files that currently work to begin failing. Does anyone have a preference between 1 or 2. Note that 1 and 2 is also an option, where the warning message would note what type of value was received and that it wasn't a list.

@dhimmel dhimmel force-pushed the issue-1459 branch 2 times, most recently from f7d9412 to b8904e2 Compare May 23, 2019 17:34
@dhimmel
Copy link
Contributor Author

dhimmel commented May 23, 2019

As of b8904e2, Python 2 tests fail when a unicode str is passed as a dictionary key. This became apparent because test_dist.py invokes from __future__ import unicode_literals. I am guessing we do not want check_package_data data to fail if a dictionary like the following is provided:

    package_data = {
        u'hello': ['*.msg'],
    }

Therefore, I will update check_package_data to be more lenient. But please let me know if Python 2 should only accept str and not unicode keys?

@pganssle
Copy link
Member

@dhimmel Python 2 should definitely accept unicode strings for keys.

@benoit-pierre
Copy link
Member

Would re-using assert_string_list make more sense? Have you checked the implementation for check_extras?

@dhimmel
Copy link
Contributor Author

dhimmel commented May 23, 2019

Would re-using assert_string_list make more sense?

Thanks for the the suggestion. In 93af6a5, I update the docstring for assert_string_list because it seems to raise an error when value=None. One issue I have with assert_string_list is that the implementation doesn't actually test for a list of strings. For example, it would pass with a tuple or generator of strings. Is that intended or not... IDK?

The code for check_package_data could be simplified by outsourcing to a function like assert_string_list, However my worry is about backwards compatibility. Currently, setup.py files that use str (rather than list of str) values for package_data will not fail installation and the package may function properly, despite the unintended interpretation of the the package_data configuration. Because of this, I assumed we'd want to take the following approach:

  • keep all existing checks that raise errors, i.e. don't increase the ability to accept bad input
  • issue warnings for any input that doesn't conform to the documentation, but that was previously accepted

If we are to stick by those guidelines, we can add assert_string_list to check that the dict values are lists of strings, but we'd have to except the DistutilsSetupError from assert_string_list and convert it to a warning. So yes, we could add an extra check, but this wouldn't shorten/simplify check_package_data. Or @benoit-pierre do you think backwards incompatibility is okay, where specs that used to pass now error?

@benoit-pierre
Copy link
Member

Would re-using assert_string_list make more sense?

Thanks for the the suggestion. In 93af6a5, I update the docstring for assert_string_list because it seems to raise an error when value=None. One issue I have with assert_string_list is that the implementation doesn't actually test for a list of strings. For example, it would pass with a tuple or generator of strings. Is that intended or not... IDK?

That's OK, yes.

The code for check_package_data could be simplified by outsourcing to a function like assert_string_list, However my worry is about backwards compatibility. Currently, setup.py files that use str (rather than list of str) values for package_data will not fail installation and the package may function properly, despite the unintended interpretation of the the package_data configuration. Because of this, I assumed we'd want to take the following approach:

* keep all existing checks that raise errors, i.e. don't increase the ability to accept bad input

* issue warnings for any input that doesn't conform to the documentation, but that was previously accepted

If we are to stick by those guidelines, we can add assert_string_list to check that the dict values are lists of strings, but we'd have to except the DistutilsSetupError from assert_string_list and convert it to a warning. So yes, we could add an extra check, but this wouldn't shorten/simplify check_package_data. Or @benoit-pierre do you think backwards incompatibility is okay, where specs that used to pass now error?

IMHO, printing a warning instead of erroring out does not make sense if a keyword value is not valid. Passing a string instead of a list of string for package_data as always been invalid, resulting in missing package data, so I don't see it as a backward incompatible change to enforce it.

@dhimmel
Copy link
Contributor Author

dhimmel commented May 24, 2019

IMHO, printing a warning instead of erroring out does not make sense if a keyword value is not valid

Okay in 0435d7a I switch to raising an error for any invalid values.

Passing a string instead of a list of string for package_data as always been invalid, resulting in missing package data, so I don't see it as a backward incompatible change to enforce it.

Note that the old behavior may not always result in missing package data, because on Linux the path / could include everything? Furthermore, missing package data usually makes it so some specific functionality breaks. However, this PR will now make it so the entire installation breaks. Not saying I'm against raising an error, just that I want the risks to be clear.

That's OK, yes.

I'm not sure that all iterables should be considered lists for the purposes of assert_string_list. For example, if the value is a generator that gets consumed during these initial checks, then that is a problem?

@dhimmel dhimmel changed the title check_package_data: warn if value is str Perform a more thorough check of package_data structure May 29, 2019
@dhimmel
Copy link
Contributor Author

dhimmel commented May 29, 2019

@pganssle @benoit-pierre ready for re-review. Also advice appreciated on whether this PR should address the potential issue with consumable iterables described above.

@dhimmel
Copy link
Contributor Author

dhimmel commented Jun 24, 2019

@pganssle @benoit-pierre ready for re-review.

@benoit-pierre
Copy link
Member

benoit-pierre commented Jun 25, 2019

Sorry for the delay.

I think it would be more useful if the exception message in case of invalid key showed the key value.

Was it you intent to check the exception message in the tests, or to provide a custom message when the check fails (see https://docs.pytest.org/en/latest/deprecations.html#raises-message-deprecated)? IMHO the former is better, and the tests can be parameterized (with added check for the "not a dictionary" case):

 setuptools/dist.py            | 12 ++++-----
 setuptools/tests/test_dist.py | 58 ++++++++++++++++++++++++-------------------
 2 files changed, 38 insertions(+), 32 deletions(-)

diff --git i/setuptools/dist.py w/setuptools/dist.py
index 4e9190b3..8e3e1f92 100644
--- i/setuptools/dist.py
+++ w/setuptools/dist.py
@@ -306,15 +306,15 @@ def check_test_suite(dist, attr, value):
 
 def check_package_data(dist, attr, value):
     """Verify that value is a dictionary of package names to glob lists"""
-    msg = (
-        "{!r} must be a dictionary mapping package names to lists of "
-        "string wildcard patterns".format(attr)
-    )
     if not isinstance(value, dict):
-        raise DistutilsSetupError(msg)
+        raise DistutilsSetupError(
+            "{!r} must be a dictionary mapping package names to lists of "
+            "string wildcard patterns".format(attr))
     for k, v in value.items():
         if not isinstance(k, six.string_types):
-            raise DistutilsSetupError(msg)
+            raise DistutilsSetupError(
+                "{!r} must be strings (got {!r})".format(
+                    'keys of {!r} dict'.format(attr), k))
         assert_string_list(dist, 'values of {!r} dict'.format(attr), v)
 
 
diff --git i/setuptools/tests/test_dist.py w/setuptools/tests/test_dist.py
index 091b3687..38cc1039 100644
--- i/setuptools/tests/test_dist.py
+++ w/setuptools/tests/test_dist.py
@@ -3,6 +3,7 @@
 from __future__ import unicode_literals
 
 import io
+import re
 from distutils.errors import DistutilsSetupError
 from setuptools.dist import (
     _get_unpatched,
@@ -270,35 +271,40 @@ def test_maintainer_author(name, attrs, tmpdir):
             assert line in pkg_lines_set
 
 
-@pytest.mark.filterwarnings('error')
-def test_check_package_data_okay():
-    package_data = {
+CHECK_PACKAGE_DATA_TESTS = (
+    # Valid.
+    ({
         '': ['*.txt', '*.rst'],
         'hello': ['*.msg'],
-    }
-    assert check_package_data(None, 'package_data', package_data) is None
-
-
-def test_check_package_data_invalid_key():
-    package_data = {
+    }, None),
+    # Not a dictionary.
+    ((
+        ('', ['*.txt', '*.rst']),
+        ('hello', ['*.msg']),
+    ), (
+        "'package_data' must be a dictionary mapping package"
+        " names to lists of string wildcard patterns"
+    )),
+    # Invalid key type.
+    ({
         400: ['*.txt', '*.rst'],
-    }
-    expected_message = (
-        "'package_data' must be a dictionary mapping package names to lists "
-        "of string wildcard patterns"
-    )
-    with pytest.raises(DistutilsSetupError, message=expected_message):
-        check_package_data(None, 'package_data', package_data)
-
-
-def test_check_package_data_invalid_value():
-    """error if str values: https://github.com/pypa/setuptools/issues/1459"""
-    package_data = {
+    }, (
+        "\"keys of 'package_data' dict\" "
+        "must be strings (got 400)"
+    )),
+    # Invalid value type.
+    ({
         'hello': '*.msg',
-    }
-    expected_message = (
+    }, (
         "\"values of 'package_data' dict\" "
         "must be a list of strings (got '*.msg')"
-    )
-    with pytest.raises(DistutilsSetupError, message=expected_message):
-        check_package_data(None, 'package_data', package_data)
+    )),
+)
+
+@pytest.mark.parametrize('package_data, expected_message', CHECK_PACKAGE_DATA_TESTS)
+def test_check_package_data(package_data, expected_message):
+    if expected_message is None:
+        assert check_package_data(None, 'package_data', package_data) is None
+    else:
+        with pytest.raises(DistutilsSetupError, match=re.escape(expected_message)):
+            check_package_data(None, 'package_data', package_data)

@benoit-pierre
Copy link
Member

As for the problem with generators, there's only one option: raise an error if a generator is passed, as we can't change the API of the check_xxx functions (distutils territory) to validate and return the validated value (returning the result of consuming it if a generator is passed).

dhimmel pushed a commit to dhimmel/setuptools that referenced this pull request Jun 26, 2019
@dhimmel
Copy link
Contributor Author

dhimmel commented Jun 26, 2019

Thanks @benoit-pierre for the suggested patch. I applied it in d6fb012.

You were correct that I misunderstood the "message" parameter of pytest.raises as described in the deprecation notice:

It is a common mistake to think this parameter will match the exception message, while in fact it only serves to provide a custom message in case the pytest.raises check fails.

I will now go through everything again and think about whether it makes sense to address the generator issue in this PR.

"{!r} must be strings (got {!r})".format(
'keys of {!r} dict'.format(attr), k))
"keys of the {!r} dict must be strings (got {!r})"
.format(attr, k)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking we might as well avoid the awkward nesting quoting when possible, such that this returns keys of 'package_data' dict must be strings (got 400) rather than "keys of 'package_data' dict" must be strings (got 400).

@dhimmel dhimmel force-pushed the issue-1459 branch 3 times, most recently from 903cfd2 to 993bbd6 Compare June 26, 2019 22:30
try:
# verify that value is not a single-use iterable
assert list(value) == list(value)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benoit-pierre does this seem like the right implementation to check whether value is consumable?

For my records, this was a helpful reference here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should just simplify the whole thing and just explicitly check for a list/tuple with isinstance(value, (tuple, list)). @jaraco, @pganssle?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using isinstance in this case is better than consuming the list and checking if it's consumed for sure.

I haven't had time to look at the rest of this PR and see where this assert_string_list is actually used - is there any chance that people would have legitimate reasons for passing a set to something that calls this?

Copy link
Contributor Author

@dhimmel dhimmel Jul 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, assert_string_list is called by dist.check_nsp, dist.check_package_data (added by this PR), and directly for some of the following:

setuptools/setup.py

Lines 92 to 114 in 3adda96

"distutils.setup_keywords": [
"eager_resources = setuptools.dist:assert_string_list",
"namespace_packages = setuptools.dist:check_nsp",
"extras_require = setuptools.dist:check_extras",
"install_requires = setuptools.dist:check_requirements",
"tests_require = setuptools.dist:check_requirements",
"setup_requires = setuptools.dist:check_requirements",
"python_requires = setuptools.dist:check_specifier",
"entry_points = setuptools.dist:check_entry_points",
"test_suite = setuptools.dist:check_test_suite",
"zip_safe = setuptools.dist:assert_bool",
"package_data = setuptools.dist:check_package_data",
"exclude_package_data = setuptools.dist:check_package_data",
"include_package_data = setuptools.dist:assert_bool",
"packages = setuptools.dist:check_packages",
"dependency_links = setuptools.dist:assert_string_list",
"test_loader = setuptools.dist:check_importable",
"test_runner = setuptools.dist:check_importable",
"use_2to3 = setuptools.dist:assert_bool",
"convert_2to3_doctests = setuptools.dist:assert_string_list",
"use_2to3_fixers = setuptools.dist:assert_string_list",
"use_2to3_exclude_fixers = setuptools.dist:assert_string_list",
],

def check_nsp(dist, attr, value):
"""Verify that namespace packages are valid"""
ns_packages = value
assert_string_list(dist, attr, ns_packages)

So I think the total list of setup keywords that end up hitting assert_string_list is:

  • eager_resources
  • namespace_packages
  • package_data
  • exclude_package_data
  • dependency_links
  • convert_2to3_doctests
  • use_2to3_fixers
  • use_2to3_exclude_fixers

Is the behavior of any of these setup keywords order dependent? If so, we probably want to not allow sets of strings?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should just simplify the whole thing and just explicitly check for a list/tuple with isinstance(value, (tuple, list))

Implemented in d9035ef

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dhimmel added a commit to dhimmel/setuptools that referenced this pull request Jul 3, 2019
value=None raises TypeError

DistutilsSetupError: 2 must be a list of strings (got None)
@benoit-pierre
Copy link
Member

I have squashed the history and simplified the news fragment. I think this is good to go, but I'll wait a few days before merging in case @pganssle / @jaraco disagree.

Ensure the dictionary values are lists/tuples of strings.

Fix pypa#1459.
@dquist
Copy link

dquist commented Aug 14, 2019

While I welcome better checks for package_data, this was a breaking change because packages that were relying on the previous behavior and were installing fine with 3.7.3 are now unable to be installed with python 3.7.4 and I had to run around trying to figure out why all of our CI pipelines were suddenly failing. Fortunately I was able to quickly patch the offending upstream project, but others may not be so fortunate.

This change should not have been introduced into a patch release.

@pganssle
Copy link
Member

While I welcome better checks for package_data, this was a breaking change because packages that were relying on the previous behavior and were installing fine with 3.7.3 are now unable to be installed with python 3.7.4 and I had to run around trying to figure out why all of our CI pipelines were suddenly failing.

I don't think this is anything to do with 3.7.3 vs 3.7.4, setuptools is an independent project from CPython. This was released as part of setuptools version 41.1.0, which was a bugfix release, not a patch release.

My understanding is that this is backwards compatible because this only raises an error on already broken workflows - i.e. it may have been silently failing and now it's loudly failing.

@dquist
Copy link

dquist commented Aug 19, 2019

I'm trying to understand what broke then, because I had several CI jobs that relied on the 'python:3.7 image' that have been working for months, then suddenly they started failing last week once the python images updated to pip 19.2.2.

With the updated docker images, my python repo with the incorrect package_data definition wouldn't install anymore but it would work again if I reverted back to the python 3.7.3 image.

What determines the version of setuptools that pip uses. Is it bundled? If so, I can't see a reference anywhere to the v41.1 release.

@dhimmel
Copy link
Contributor Author

dhimmel commented Aug 20, 2019

@dquist my guess is that the different docker images have different versions of setuptools included. Unless something is pinning setuptool<41.1.0, it seems to me that the more stringent package_data checking from this PR will eventually get included in most python Docker images when they are rebuilt.

I do agree that this change is not entirely backwards compatible. Before, packages with a broken setup.py could be installed and would mostly work since package_data is often non-essential. Now it seems these package cannot be installed with pip install because an error will be thrown during checks.

I wonder if one solution would be an option like pip install --skip-setup-checks that would skip performing these checks. It would be useful for tying to install packages with broken setup.py files that despite being broken still work for certain applications. I don't think pip has this option, and I don't know much about the relationship between pip and setuptools, so not sure where this could be implemented.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants