Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an option to get an example pipeline #3295

Merged
merged 16 commits into from
Nov 22, 2023

Conversation

DimedS
Copy link
Contributor

@DimedS DimedS commented Nov 9, 2023

NOTE: Kedro datasets are moving from kedro.extras.datasets to a separate kedro-datasets package in
kedro-plugins repository. Any changes to the dataset implementations
should be done by opening a pull request in that repository.

Description

This PR introduces an --example flag to the kedro new command. Users can specify --example y/n directly in the command line to indicate their preference. If not specified, users will be prompted interactively, with 'No' as the default selection.

User flow:

  1. When a user selects Viz, PySpark, or both, the corresponding starter will be provided.
    If example=No is chosen in this scenario, data and config will be removed via hooks in the selected starter.
  2. If neither Viz nor PySpark is selected:
    If example=Yes is chosen, the 'spaceflights-starter' will be provided.
    If example=No is chosen, a standard template will be provided.
  3. Then final template will be stripped according to add_ons 1-5 selection, exception: Data folder, it will not be stripped if example=Yes chosen.

Development notes

  1. Temporary added a checkout for "3076-add-example-pipeline-to-hooks" branch of starters git repo to properly test together with changes in starters in PR 181 (643-663 lines in starters.py)
  2. Will add additional unit tests to cover my changes.

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

@DimedS
Copy link
Contributor Author

DimedS commented Nov 9, 2023

Do we need to copy postGen cookiecutter hooks to 4 spaceflights starters to enable stripping based on add_ons chosen, or we already have a better solution for that?
@merelcht @noklam @ankatiyar @SajidAlamQB

starter_path = "git+https://github.com/kedro-org/kedro-starters.git"
if add_ons:
if example_pipeline:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't correct. Even if the example pipeline isn't wanted, we might need to fetch from a starter template other than the default. @SajidAlamQB can probably help out here about what the flow should be.

Copy link
Contributor Author

@DimedS DimedS Nov 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@merelcht , thank you. You right, I need to consult with @SajidAlamQB what should we do in case of Spark and Viz but no example option choosen.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this doesn't need to be changed as we already select the right template with examples in them for pyspark, viz and pyspark & viz. These templates already contain the example pipelines but are removed in hooks/utils.py _handle_starter_setup. So the current default case is example=no. I think we just need a check in utils.py to skip _handle_starter_setup if example=yes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SajidAlamQB, many thanks, got it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think only for the default template [1-5], options without pyspark and viz, we don't already have example pipelines so we might need to change template to spaceflights-pandas in here if example=yes.

Copy link
Contributor Author

@DimedS DimedS Nov 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think only for the default template [1-5], options without pyspark and viz, we don't already have example pipelines so we might need to change template to spaceflights-pandas in here if example=yes.

yes, looks like logic should be different for pyspark, viz and without them:

  1. without pyspark, viz: If --example==yes, use sfaceflights-pandas starter, elif --example ==no, use default template
  2. with pyspark or viz or both: use starter anyway, but we need modifications in _handle_starter_setup

Last step: strip all of them with the same algorithm based on add_ons 1-5.

Is it correct?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this sounds right.

@DimedS DimedS linked an issue Nov 13, 2023 that may be closed by this pull request
@DimedS DimedS force-pushed the 3076-add-an-option-to-get-an-example-pipeline branch from c73d901 to 73d4acb Compare November 13, 2023 15:08
@noklam
Copy link
Contributor

noklam commented Nov 14, 2023

#3054

@DimedS
I notice the current order doesn't match with #3054, the example prompt should happen before project_name
image

In addition, I wonder is --example y/n needed? Can it be reduced to --example since it is a boolean option or is it unclear? @amandakys

@DimedS
Copy link
Contributor Author

DimedS commented Nov 14, 2023

#3054

@DimedS I notice the current order doesn't match with #3054, the example prompt should happen before project_name

In addition, I wonder is --example y/n needed? Can it be reduced to --example since it is a boolean option or is it unclear? @amandakys

Hm, I see that the last version of #3054 with --example on the end:
#3054 (comment)

@noklam
Copy link
Contributor

noklam commented Nov 14, 2023

I see - I didn't see the latest comment. We should push the project name as the first prompt, but it's out of scope for this PR.

@DimedS DimedS force-pushed the 3076-add-an-option-to-get-an-example-pipeline branch from 2ecafbf to 01d1c0d Compare November 14, 2023 11:31
Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
@DimedS DimedS force-pushed the 3076-add-an-option-to-get-an-example-pipeline branch from 32c4477 to 8c6d5c1 Compare November 15, 2023 16:21
@DimedS DimedS marked this pull request as ready for review November 15, 2023 16:59
@@ -3,4 +3,4 @@ include LICENSE.md
include kedro/framework/project/default_logging.yml
include kedro/ipython/*.png
include kedro/ipython/*.svg
recursive-include templates *
recursive-include kedro/templates *
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this intended?

Copy link
Contributor

@noklam noklam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all, thank you for the PR and the detailed description. Having it checkout to a feature starter branch save me a lot of time to review and test.

I leave some comments but I think the key questions are

  1. Can we simplify example_pipeline as a boolean flag from the CLI with just --example? If not, we should parse this early on as a boolean save ourselves from checking example_pipeline is a None or not at every step, since I think they don't make a difference.
  2. Validation - the validation function assume click and it also does parsing, I think should keep it separated.

kedro/framework/cli/starters.py Outdated Show resolved Hide resolved
kedro/framework/cli/starters.py Outdated Show resolved Hide resolved
extra_context["example_pipeline"] = (
example_pipeline # type: ignore
if example_pipeline is not None
else _validate_yn(None, None, extra_context.get("example_pipeline", None))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The _validate_yn is expected to used in click. It feels weird to have _validate_yn(None, None, ...). Can we simplify this a little bit, there is a lot of handling or None, if we can make sure this is a bool early on we can save ourselves a lot of checking.

kedro/framework/cli/starters.py Outdated Show resolved Hide resolved
kedro/framework/cli/starters.py Outdated Show resolved Hide resolved
kedro/framework/cli/starters.py Show resolved Hide resolved
Comment on lines 172 to 178
if "Data Structure" not in selected_add_ons_list and example_pipeline != "True":
_remove_dir(current_dir / "data")

if "Pyspark" in selected_add_ons_list:
if "Pyspark" in selected_add_ons_list and example_pipeline != "True":
_handle_starter_setup(selected_add_ons_list, python_package_name)

if "Kedro Viz" in selected_add_ons_list:
if "Kedro Viz" in selected_add_ons_list and example_pipeline != "True":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the logic here is correct but it is hard to follow. I notice that _handle_starter_setup is highly coupled with PySpark and Kedro Viz, in fact this function only get called if these plugins get selected.

  1. The only difference is if viz is selected, one extra reporting pipeline is removed. I think this should reflected in the argument of _handle_starter_setup, we don't really need selected_add_ons_list, a boolean flag is enough.

  2. Maybe it helps to have a nested if here, for me it easier to map it to the diagram.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally agree with @noklam here. I would rename _handle_starter_setup to make it clear this is removal of files, and I also think the code inside _handle_starter_setup can be more generic. We don't need to check pipeline names and test names, but just remove all files inside the pipelines folder etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated that part to clarify that it's only about Viz and PySpark, @merelcht do you think we should remove all files in tests/pipelines except of init.py?

@DimedS
Copy link
Contributor Author

DimedS commented Nov 16, 2023

  1. Can we simplify example_pipeline as a boolean flag from the CLI with just --example? If not, we should parse this early on as a boolean save ourselves from checking example_pipeline is a None or not at every step, since I think they don't make a difference.
  2. Validation - the validation function assume click and it also does parsing, I think should keep it separated.

@noklam , many thanks for your notes.
I agree with all you written:

  1. If we can use just --example flag not --example='y/n' in CLI, it will be simple - less validation and parsing, but I think we cannot do that - if client will not provide --example flag in CLI it doesn't mean that he don't want example (we cannot put False here) - we need to ask him in Prompt. So I think we should leave it the same to provide a possibility to write 'No' in CLI.
  2. I'm also unhappy with current validation situation, I tried to put CLI validation directly to @click.option in click.Choice(["y", "n", "yes", "no"] and parsing in callback=_validate_yn func (seems to be a wrong function naming better name that func _parsing, but validation and parsing are separated currently). But it seems unbeautiful because of specific callback parsing function format that you mentioned.
    And I agree that it's better to unify validation process (also wrote about it there) because it's 3 possible inputs (CLI, config, prompts) better to validate and parse them in the same code if it's possible.
    And you right, currently I don't provide a validation for loading from config, only parsing in _validate_yn().
    I will rewrite validation and parsing part.

Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part of the code

if template_path == str(TEMPLATE_PATH) or (
add_ons and ("Pyspark" in add_ons or "Kedro Viz" in add_ons)
):
if add_ons == "[]": # TODO: This should be a list
click.secho("\nYou have selected no add-ons")
else:
click.secho(f"\nYou have selected the following add-ons: {add_ons}")
should also be updated. Now you don't see which add-ons you have selected when you use an example.

kedro/framework/cli/starters.py Outdated Show resolved Hide resolved
kedro/framework/cli/starters.py Outdated Show resolved Hide resolved
kedro/templates/project/cookiecutter.json Outdated Show resolved Hide resolved
Comment on lines 172 to 178
if "Data Structure" not in selected_add_ons_list and example_pipeline != "True":
_remove_dir(current_dir / "data")

if "Pyspark" in selected_add_ons_list:
if "Pyspark" in selected_add_ons_list and example_pipeline != "True":
_handle_starter_setup(selected_add_ons_list, python_package_name)

if "Kedro Viz" in selected_add_ons_list:
if "Kedro Viz" in selected_add_ons_list and example_pipeline != "True":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally agree with @noklam here. I would rename _handle_starter_setup to make it clear this is removal of files, and I also think the code inside _handle_starter_setup can be more generic. We don't need to check pipeline names and test names, but just remove all files inside the pipelines folder etc.

DimedS and others added 7 commits November 17, 2023 18:35
Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
@DimedS DimedS requested a review from noklam November 20, 2023 16:51
@DimedS
Copy link
Contributor Author

DimedS commented Nov 20, 2023

It appears that I've addressed all the review comments and have also included new tests to cover the example option. Could you please review it again? Currently, I'm noticing an issue with the docs, I will investigate it further, other tests are ok.

Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part of the code

if template_path == str(TEMPLATE_PATH) or (
add_ons and ("Pyspark" in add_ons or "Kedro Viz" in add_ons)
):
if add_ons == "[]": # TODO: This should be a list
click.secho("\nYou have selected no add-ons")
else:
click.secho(f"\nYou have selected the following add-ons: {add_ons}")

should also be updated. Now you don't see which add-ons you have selected when you use an example.

@DimedS This still needs to be updated. If I choose an example I don't get a message showing me what add_ons I chose.

cookiecutter_args["directory"] = "spaceflights-pyspark-viz"
cookiecutter_args[
"checkout"
] = "3076-add-example-pipeline-to-hooks" # ToDel: temporary for test
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget to change this back before merging, and all other occurrences below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like I can change it back only after this PR will be merged, otherwise some tests will fail

kedro/framework/cli/starters.py Show resolved Hide resolved
kedro/templates/project/hooks/utils.py Outdated Show resolved Hide resolved
tests/framework/cli/test_starters.py Outdated Show resolved Hide resolved
@stichbury stichbury removed their request for review November 21, 2023 10:49
Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
@DimedS DimedS force-pushed the 3076-add-an-option-to-get-an-example-pipeline branch from b9e91b7 to 6c03f9f Compare November 21, 2023 12:11
@DimedS
Copy link
Contributor Author

DimedS commented Nov 21, 2023

@DimedS This still needs to be updated. If I choose an example I don't get a message showing me what add_ons I chose.

Sorry, missed it, fixed with other last comments

Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to be merged now! Nice work @DimedS 👏
Don't forget to add it to the release notes 😃

Copy link
Contributor

@SajidAlamQB SajidAlamQB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested this manually and all looks good! Thanks you @DimedS awesome work! 🌟

expected_files += sum(example_files_count)
expected_files += (
4 if "7" in add_ons_list else 0
) # add 3 .py and 1 parameters files in reporting for Viz
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments, these tests can be confusing to understand. 😁

Copy link
Contributor

@noklam noklam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! This is one of the more complicated piece.

Don't remember to update the RELEASE.md and revert the hardcoded branch.

        "checkout"
        ] = "3076-add-example-pipeline-to-hooks" 

@DimedS DimedS merged commit 9ea4ce8 into develop Nov 22, 2023
36 checks passed
@DimedS DimedS deleted the 3076-add-an-option-to-get-an-example-pipeline branch November 22, 2023 20:07
stichbury pushed a commit that referenced this pull request Nov 28, 2023
Add an option to get an example pipeline, fix and add new tests

---------

Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
Signed-off-by: Jo Stichbury <jo_stichbury@mckinsey.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add an option to get an example pipeline in the project creation flow
4 participants