Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[KED-2898] Include nested pipeline configuration in packaging #992

Merged
merged 9 commits into from Nov 2, 2021
1 change: 1 addition & 0 deletions RELEASE.md
Expand Up @@ -20,6 +20,7 @@
* Relaxed the bounds on the `plotly` requirement for `plotly.PlotlyDataSet`.
* `kedro pipeline package <pipeline>` now raises an error if the `<pipeline>` argument doesn't look like a valid Python module path (e.g. has `/` instead of `.`).
* Fixed slow startup because of catalog processing by reducing the exponential growth of extra processing during `_FrozenDatasets` creations.
* Fixed an issue where nested pipeline configuration was not included in the packaged pipeline.

## Minor breaking changes to the API

Expand Down
3 changes: 2 additions & 1 deletion kedro/framework/cli/pipeline.py
Expand Up @@ -681,7 +681,7 @@ def _package_pipeline( # pylint: disable=too-many-arguments
# config files not to confuse users and avoid useless file copies
configs_to_package = _find_config_files(
artifacts_to_package.pipeline_conf,
[f"parameters*/**/{pipeline_name}.yml", f"parameters*/**/{pipeline_name}/*"],
[f"parameters*/**/{pipeline_name}.yml", f"parameters*/**/{pipeline_name}/**/*"],
)

source_paths = (
Expand Down Expand Up @@ -942,6 +942,7 @@ def _generate_setup_file(
"config/**/parameters*",
"config/parameters*/**",
"config/parameters*/**/*",
"config/parameters*/**/**/*",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I right in saying if we want to go another level deep we need to do the same thing?

Why not do something dynamic for a bigger spectrum?

 [ f"config/parameters*/{'**/'*x}*" for x in range(0,5)]

>>> ['config/parameters*/*',
>>> 'config/parameters*/**/*',
>>> 'config/parameters*/**/**/*',
>>> 'config/parameters*/**/**/**/*',
>>> 'config/parameters*/**/**/**/**/*']

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's correct. This implementation is consistent with how deep the nesting goes for parameters, credentials and catalog files. I guess the question is how deep we need to go and how far do we want to support?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

** already looks arbitrarily deep recursively: https://stackoverflow.com/questions/32604656/what-is-the-glob-character.

So actually I'm not quite sure why this doesn't already work as we want 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I think we want f"parameters*/**/{pipeline_name}/**/*" rather than what we have currently. So basically your change on line 684. That should match something that has a directory called pipeline_name somewhere in the directory hierarchy, regardless of how deep it is. So all these should be found:

parameters*/{pipeline_name}/*
parameters*/a/{pipeline_name}/*
parameters*/{pipeline_name}/b/*
parameters*/a/{pipeline_name}/b/*
parameters*/a/b/c/d/{pipeline_name}/e/f/g/h/i/j/*

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I made that change as well here: https://github.com/quantumblacklabs/kedro/blob/a134ef72bf8440800bd5e58250a18e02150bd21a/kedro/framework/cli/pipeline.py#L684 @AntonyMilneQB

However, the extra line in generate_setup_file is also needed, because otherwise the config will still not be included in the final package.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like actually recursive globs aren't supported by setuptools, so even though we use the ** syntax it doesn't do anything different from just the single *. "The patterns here are glob-style patterns: * matches zero or more regular filename characters (on Unix, everything except forward slash; on Windows, everything except backslash and colon);" https://packaging.python.org/guides/using-manifest-in/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhh that is very annoying 😬 I'm a bit surprised our tests were already passing if ** doesn't work in setup.py at all, but I guess we must have not been testing that case.

Looking at the code I think it would be best if we stopped using globs in _generate_setup_file altogether and just use configs_to_package = _find_config_files as the single source of truth. Looks like conf_target should already have all the right parameters files in it, so all we should need to do is list recursively all (no need for any pattern matching) the files in that directory, pass that into _generate_setup_file and add README.md to the list?

Would be good to get @lorenabalan's thoughts on this one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I agree with the above, if setuptools is not doing what we want, let's compute the list of config files ourselves and just pass them to the setup.py file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤦

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the implementation as discussed.

]
}
setup_file_context = dict(
Expand Down
46 changes: 46 additions & 0 deletions tests/framework/cli/pipeline/test_pipeline_package.py
Expand Up @@ -312,6 +312,52 @@ def test_package_modular_pipeline_with_nested_parameters(
assert "retail/config/parameters/retail.yml" in wheel_contents
assert "retail/config/parameters/retail_banking.yml" not in wheel_contents

def test_package_pipeline_with_deep_nested_parameters(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe worth updating this test (or making a new one) to cover a more deeply nested version, just to check our glob is working as expected? Like "conf" / "base" / "parameters" / "a" / "b" / "c"/ "d"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

self, fake_repo_path, fake_project_cli, fake_metadata
):
CliRunner().invoke(
fake_project_cli, ["pipeline", "create", "retail"], obj=fake_metadata
)
deep_nested_param_path = Path(
fake_repo_path / "conf" / "base" / "parameters" / "deep" / "retail"
)
deep_nested_param_path.mkdir(parents=True, exist_ok=True)
(deep_nested_param_path / "params1.yml").touch()

deep_nested_param_path2 = Path(
fake_repo_path / "conf" / "base" / "parameters" / "retail" / "deep"
)
deep_nested_param_path2.mkdir(parents=True, exist_ok=True)
(deep_nested_param_path2 / "params1.yml").touch()

deep_nested_param_path3 = Path(
fake_repo_path / "conf" / "base" / "parameters" / "deep"
)
deep_nested_param_path3.mkdir(parents=True, exist_ok=True)
(deep_nested_param_path3 / "retail.yml").touch()

result = CliRunner().invoke(
fake_project_cli, ["pipeline", "package", "retail"], obj=fake_metadata
)

assert result.exit_code == 0
assert "Pipeline `retail` packaged!" in result.output

wheel_location = fake_repo_path / "src" / "dist"
assert f"Location: {wheel_location}" in result.output

wheel_name = _get_wheel_name(name="retail", version="0.1")
wheel_file = wheel_location / wheel_name
assert wheel_file.is_file()
assert len(list(wheel_location.iterdir())) == 1

# pylint: disable=consider-using-with
wheel_contents = set(ZipFile(str(wheel_file)).namelist())
assert "retail/config/parameters/deep/retail/params1.yml" in wheel_contents
assert "retail/config/parameters/retail/deep/params1.yml" in wheel_contents
assert "retail/config/parameters/retail.yml" in wheel_contents
assert "retail/config/parameters/deep/retail.yml" in wheel_contents

def test_pipeline_package_version(
self, fake_repo_path, fake_package_path, fake_project_cli, fake_metadata
):
Expand Down