Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[KED-2898] Include nested pipeline configuration in packaging #992

Merged
merged 9 commits into from Nov 2, 2021

Conversation

merelcht
Copy link
Member

Description

Addresses the issue found in #914 where nested pipeline configuration files aren't included when the pipeline is packaged.

Development notes

With this change configuration files nested up to two levels deep from /parameters will be included. This is consistent with how we handle catalog, credential and parameter files.

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change and added my name to the list of supporting contributions in the RELEASE.md file
  • Added tests to cover my changes

Notice

  • I acknowledge and agree that, by checking this box and clicking "Submit Pull Request":

  • I submit this contribution under the Apache 2.0 license and represent that I am entitled to do so on behalf of myself, my employer, or relevant third parties, as applicable.

  • I certify that (a) this contribution is my original creation and / or (b) to the extent it is not my original creation, I am authorised to submit this contribution on behalf of the original creator(s) or their licensees.

  • I certify that the use of this contribution as authorised by the Apache 2.0 license does not violate the intellectual property rights of anyone else.

@merelcht merelcht requested a review from idanov as a code owner October 26, 2021 16:17
@merelcht merelcht self-assigned this Oct 26, 2021
@@ -942,6 +942,7 @@ def _generate_setup_file(
"config/**/parameters*",
"config/parameters*/**",
"config/parameters*/**/*",
"config/parameters*/**/**/*",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I right in saying if we want to go another level deep we need to do the same thing?

Why not do something dynamic for a bigger spectrum?

 [ f"config/parameters*/{'**/'*x}*" for x in range(0,5)]

>>> ['config/parameters*/*',
>>> 'config/parameters*/**/*',
>>> 'config/parameters*/**/**/*',
>>> 'config/parameters*/**/**/**/*',
>>> 'config/parameters*/**/**/**/**/*']

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's correct. This implementation is consistent with how deep the nesting goes for parameters, credentials and catalog files. I guess the question is how deep we need to go and how far do we want to support?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

** already looks arbitrarily deep recursively: https://stackoverflow.com/questions/32604656/what-is-the-glob-character.

So actually I'm not quite sure why this doesn't already work as we want 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I think we want f"parameters*/**/{pipeline_name}/**/*" rather than what we have currently. So basically your change on line 684. That should match something that has a directory called pipeline_name somewhere in the directory hierarchy, regardless of how deep it is. So all these should be found:

parameters*/{pipeline_name}/*
parameters*/a/{pipeline_name}/*
parameters*/{pipeline_name}/b/*
parameters*/a/{pipeline_name}/b/*
parameters*/a/b/c/d/{pipeline_name}/e/f/g/h/i/j/*

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I made that change as well here: https://github.com/quantumblacklabs/kedro/blob/a134ef72bf8440800bd5e58250a18e02150bd21a/kedro/framework/cli/pipeline.py#L684 @AntonyMilneQB

However, the extra line in generate_setup_file is also needed, because otherwise the config will still not be included in the final package.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like actually recursive globs aren't supported by setuptools, so even though we use the ** syntax it doesn't do anything different from just the single *. "The patterns here are glob-style patterns: * matches zero or more regular filename characters (on Unix, everything except forward slash; on Windows, everything except backslash and colon);" https://packaging.python.org/guides/using-manifest-in/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhh that is very annoying 😬 I'm a bit surprised our tests were already passing if ** doesn't work in setup.py at all, but I guess we must have not been testing that case.

Looking at the code I think it would be best if we stopped using globs in _generate_setup_file altogether and just use configs_to_package = _find_config_files as the single source of truth. Looks like conf_target should already have all the right parameters files in it, so all we should need to do is list recursively all (no need for any pattern matching) the files in that directory, pass that into _generate_setup_file and add README.md to the list?

Would be good to get @lorenabalan's thoughts on this one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I agree with the above, if setuptools is not doing what we want, let's compute the list of config files ourselves and just pass them to the setup.py file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤦

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the implementation as discussed.

Copy link
Contributor

@antonymilne antonymilne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -334,6 +334,52 @@ def test_package_modular_pipeline_with_nested_parameters(
assert "retail/config/parameters/retail.yml" in wheel_contents
assert "retail/config/parameters/retail_banking.yml" not in wheel_contents

def test_package_pipeline_with_deep_nested_parameters(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe worth updating this test (or making a new one) to cover a more deeply nested version, just to check our glob is working as expected? Like "conf" / "base" / "parameters" / "a" / "b" / "c"/ "d"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@datajoely
Copy link
Contributor

Looking good!

@merelcht merelcht merged commit 940d150 into master Nov 2, 2021
@merelcht merelcht deleted the KED-2898-pipeline-package-conf branch November 2, 2021 14:23
lvijnck pushed a commit to lvijnck/kedro that referenced this pull request Apr 7, 2022
…org#992)

Signed-off-by: Laurens Vijnck <laurens_vijnck@mckinsey.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants