-
Notifications
You must be signed in to change notification settings - Fork 872
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create a docs page about adding code beyond starter files #3852
base: main
Are you sure you want to change the base?
Create a docs page about adding code beyond starter files #3852
Conversation
Signed-off-by: Yury Fedotov <yury_fedotov@mckinsey.com>
Signed-off-by: Yury Fedotov <yury_fedotov@mckinsey.com>
Signed-off-by: Yury Fedotov <yury_fedotov@mckinsey.com>
@noklam, @astrojuanlu, could you please have a look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much for this contribution @yury-fedotov! This is already looking great ⭐
I left some comments, mostly small nits.
`__main__.py` file of your project, but such advanced customizations are not in scope of this article. | ||
``` | ||
|
||
This being the only constraint means that you can, for example: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be great if the points below are linked to the section that describe them in more detail. A bit like how we setup the configuration page with "how to.." https://kedro--3852.org.readthedocs.build/en/3852/configuration/configuration_basics.html#how-to-use-kedro-configuration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Merel, great suggestion, just implemented this. Rendered build here. Let me know if that's what you had in mind.
Comment on customizations Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com> Signed-off-by: Yury Fedotov <102987839+yury-fedotov@users.noreply.github.com>
For example in splitting Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com> Signed-off-by: Yury Fedotov <102987839+yury-fedotov@users.noreply.github.com>
Make provided examples title not bold Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com> Signed-off-by: Yury Fedotov <102987839+yury-fedotov@users.noreply.github.com>
And as where appropriate Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com> Signed-off-by: Yury Fedotov <102987839+yury-fedotov@users.noreply.github.com>
Organise spelling Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com> Signed-off-by: Yury Fedotov <102987839+yury-fedotov@users.noreply.github.com>
Proper wording for pip install note Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com> Signed-off-by: Yury Fedotov <102987839+yury-fedotov@users.noreply.github.com>
Implies to consists replacement Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com> Signed-off-by: Yury Fedotov <102987839+yury-fedotov@users.noreply.github.com>
Signed-off-by: Yury Fedotov <yury_fedotov@mckinsey.com> # Conflicts: # RELEASE.md
Signed-off-by: Yury Fedotov <yury_fedotov@mckinsey.com>
Hey @merelcht - thanks for comments again, I just finished implementing them. QQ also about the |
Yes, it's a soft check. I usually treat it as a guidance on grammar/word usage. It does also check for spelling mistakes, which is helpful. But you definitely don't need to address it all. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First of all, thank you for the contribution!
It would be great if @stichbury can help review the style of the docs a little bit, otherwise have you seen https://github.com/kedro-org/kedro/wiki/Kedro-documentation-style-guide already?
I like a lot of the recommendation here, but I hesitate to make it an official recommendation without the team discussing a bit more. Should we extract the content and make it a guest blog post instead? @astrojuanlu
read and collaborate on as your codebase grows. | ||
Those files also sometimes make new users think that Kedro requires code | ||
to be located only in those starter files, which is not true. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the clarification, but I found this sounds like coming from an user rather than the official docs.
#2512 (comment), we have an answer for this and the issue is still opened. Would it be better to actually write the documentation and link here instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate more on what exact change are you proposing here?
To reference this GH issue right in the code_beyond_starter_files.md
?
Or to make content of comments on that issue part of this new section of docs?
```{note} | ||
You actually can make Kedro look for pipeline registry in a different place by modifying the | ||
`__main__.py` file of your project, but such advanced customisations are not in scope for this section. | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we actually do this? My impressions is no because
def configure_project(package_name: str) -> None:
"""Configure a Kedro project by populating its settings with values
defined in user's settings.py and pipeline_registry.py.
"""
settings_module = f"{package_name}.settings"
settings.configure(settings_module)
pipelines_module = f"{package_name}.pipeline_registry"
pipelines.configure(pipelines_module)
I think these are the only two files that you cannot change the location, the comment I link above has a demo project that show what a "minimal Kedro Project" looks like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed:
kedro/kedro/framework/project/__init__.py
Lines 279 to 280 in 7391f4f
pipelines_module = f"{package_name}.pipeline_registry" | |
pipelines.configure(pipelines_module) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I removed this {note}
and mentioned that settings.py
also should be in the original location.
``` | ||
|
||
This being the only constraint means that you can, for example: | ||
* Add `utils.py` file to a pipeline folder and import utilities used by multiple |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We usually avoid utils.py
as much as possible as they are the bin for everything. It's ironic because kedro
do have utils
module that are left from years ago. It hasn't been growing though as we believe it's better to have explicit module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@noklam how would you then call a .py
file that is not nodes.py
or pipeline.py
for the purpose of this example? I was thinking of dataframe_utils.py
, but didn't like it because it adds to the impression that Kedro is only useful in data processing projects, which isn't true.
This being the only constraint means that you can, for example: | ||
* Add `utils.py` file to a pipeline folder and import utilities used by multiple | ||
functions in `nodes.py` from there. | ||
* [Share modules between pipelines](#sharing-modules-between-pipelines). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be really beneficial if we have something to showcase, maybe there are some projects in awesome-kedro
that we can link to?
Even just the tree
structure of a project would be great.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Add `utils.py` file to a pipeline folder and import utilities used by multiple | ||
functions in `nodes.py` from there. | ||
* [Share modules between pipelines](#sharing-modules-between-pipelines). | ||
Those could be utility functionalities, or your standalone module responsible for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those could be utility functionalities, or your standalone module responsible for | |
Those could be utility functionalities, or standalone module responsible for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
functions in `nodes.py` from there. | ||
* [Share modules between pipelines](#sharing-modules-between-pipelines). | ||
Those could be utility functionalities, or your standalone module responsible for | ||
the domain logic of the industry you work at. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the domain logic of the industry you work at. | |
the domain specific logic. |
or the domain logic of the specific industry
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
* Instead of having a single `pipeline.py` in your pipeline folder, split it, for example, | ||
into `historical_pipeline.py` and `inference_pipeline.py`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like we usually go for the namespace/modular structure more:
- pipelines
- historical
- inference
- pipeline.py
@astrojuanlu thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that historical
vs inference
sound like candidates for leveraging same pipeline just different namespace. To avoid the confusion, I changed the example to this:
If you have multiple large `Pipeline` objects defined in a single `pipeline.py`,
split them into separate `.py` files. For example, in `data_processing` pipeline
you may want to have `cleaning_pipeline.py` and `merging_pipeline.py`.
Is that better?
* Instead of having a single `pipeline.py` in your pipeline folder, split it, for example, | ||
into `historical_pipeline.py` and `inference_pipeline.py`. | ||
* Instead of registering many pipelines in `register_pipelines()` function one by one, | ||
create a few `tp.Dict[str, Pipeline]` objects in different places of the project |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
create a few `tp.Dict[str, Pipeline]` objects in different places of the project | |
create a few `dict[str, Pipeline]` objects in different places of the project |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, there were 2 places where I mention tp.Dict[str, Pipeline]
. Replaced to dict
in both.
into `historical_pipeline.py` and `inference_pipeline.py`. | ||
* Instead of registering many pipelines in `register_pipelines()` function one by one, | ||
create a few `tp.Dict[str, Pipeline]` objects in different places of the project | ||
and then make `register_pipelines()` return a union of those. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about find_pipelines()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added:
While Kedro features a [`find_pipelines()` functionality for autodiscovery of pipelines](../nodes_and_pipelines/pipeline_registry.md#pipeline-autodiscovery),
for large projects you may want a finer control and register pipelines manually.
I am happy to review/edit this if you decide to retain as docs, but please let me know what you decide about this:
LMK if you want to have a blog post instead, which sounds like a great idea, and then you could extract from the post if you wanted a longer-term piece of content that you maintain officially as docs. I wouldn't be able to craft the post without some notice and a ticket, but would definitely support the idea and help with later stages to edit and post, if you need it. Just ping me again when you decide how to go forward! |
Hi @stichbury, we've discussed this in backlog grooming and decided we'd like to get it in as docs. I really appreciate any help you can give getting this in shape 🙏 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for this contribution @yury-fedotov ! I left a couple of comments of outstanding things I saw.
I agree it would be nice to have this as a "How-To" guide (pretty much aligned with the https://diataxis.fr/ taxonomy and other parts of our docs). I would have it maybe under "Advanced usage", somewhere, although just for continuity, not really a fan of the "Advanced" name there.
```{note} | ||
You actually can make Kedro look for pipeline registry in a different place by modifying the | ||
`__main__.py` file of your project, but such advanced customisations are not in scope for this section. | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed:
kedro/kedro/framework/project/__init__.py
Lines 279 to 280 in 7391f4f
pipelines_module = f"{package_name}.pipeline_registry" | |
pipelines.configure(pipelines_module) |
│ ├── optimizer <-- Standalone package. | ||
│ └── dashboard <-- Standalone package, may import `optimizer`, but should not know anything about model training pipeline. | ||
├── requirements.txt <-- Linters, code formatters... Not dependencies of packages. | ||
├── pyproject.toml <-- Settings for those, like `[tool.isort]`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Notice that having this astray pyproject.toml
file here automatically communicates to pip
that this is potentially a Python package. This has been an ongoing source of pain in the ecosystem, see my summary here #2280 (comment)
Now what's the best practice to manage tool configs in monorepos for Python given this issue... I have no idea, to be honest
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed pyproject.toml
and replaced it by this (starts with "Examples of what you may want in the repo root"
).
└── repo_root
├── packages
│ ├── kedro_project <-- A Kedro project for ML model training.
│ │ ├── conf
│ │ ├── data
│ │ ├── notebooks
│ │ ├── ...
│ ├── optimizer <-- Standalone package.
│ └── dashboard <-- Standalone package, may import `optimizer`, but should not know anything about model training pipeline.
├── ...
├── ... Examples of what you may want in the repo root:
├── requirements.txt <-- Linters, code formatters... Not dependencies of packages.
├── ruff.toml <-- Or configs for other tools that you want to share between packages.
└── ...
For sure! I'm not sure if I'll be able to get to this in this sprint, but I'll add to my list; otherwise it'll be next week. To save holding you up, probably best to get it into the state where you want me to do a final review/update and let me know it is ready. |
Thanks for the comments team! I'll get to them next week and come back. |
Signed-off-by: Yury Fedotov <yury_fedotov@mckinsey.com> # Conflicts: # RELEASE.md
Signed-off-by: Yury Fedotov <yury_fedotov@mckinsey.com>
Signed-off-by: Yury Fedotov <yury_fedotov@mckinsey.com>
Signed-off-by: Yury Fedotov <yury_fedotov@mckinsey.com>
Signed-off-by: Yury Fedotov <yury_fedotov@mckinsey.com>
Signed-off-by: Yury Fedotov <yury_fedotov@mckinsey.com>
Signed-off-by: Yury Fedotov <yury_fedotov@mckinsey.com>
Signed-off-by: Yury Fedotov <yury_fedotov@mckinsey.com>
Signed-off-by: Yury Fedotov <yury_fedotov@mckinsey.com>
Description
Per this discussion in Slack with @noklam, I took a task of creating a guide on how to add code beyond starter files in Kedro.
P.S. It's my first contribution to the project, so I'd be happy to iterate as much as needed until it satisfies core team's needs. And appreciate any feedback.
Development notes
Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a
Signed-off-by
line in the commit message. See our wiki for guidance.If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.
Checklist
RELEASE.md
file