Remove the copying hack and add proper params querying capabilities in the DataCatalog #3732
Description
Many users have complained that Kedro is slow on big projects, and that slowness can be attributed to many different causes. However, one of the most prevalent causes is big parameter files that get expanded into hundreds of datasets on their own. That process takes a lot of time, and once the files become too big (a couple of MB), it shows up as a significant slowdown.
This draft PR is a proof of concept that drops the hack built into Kedro to support the convenience syntax `params:x.y.z` and replaces it with a proper query powered by `OmegaConf.select`. This way, every dataset that loads into a Python dictionary can provide the same functionality out of the box.
A side effect is that all those `params:xxxx` datasets disappear from the output of `catalog.list()`, which is something people have been annoyed by anyway. Nevertheless, it is still a breaking change, so we need to decide whether it requires a new breaking Kedro version or can go in as a bugfix/performance fix.
The profiling also revealed another potential source of slowness, the loading of the config files, which we should investigate further in the future.
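To illustrate the difference between the two approaches described above, here is a minimal pure-Python sketch. It is not Kedro's actual code and does not use OmegaConf; `expand_params` and `select_param` are hypothetical helpers invented for this example. The first eagerly creates one entry per nested key (roughly what the current hack does), while the second resolves a `params:x.y.z` query on demand, which is the idea behind delegating to `OmegaConf.select`.

```python
from typing import Any


def expand_params(params: dict, prefix: str = "params", sep: str = ":") -> dict:
    """Hypothetical helper: eagerly create one entry per nested key.

    Every intermediate dictionary and every leaf gets its own entry,
    so the number of entries grows with the size of the parameters file.
    """
    entries = {}
    for key, value in params.items():
        name = f"{prefix}{sep}{key}"
        entries[name] = value
        if isinstance(value, dict):
            # Nested keys are joined with "." after the initial "params:" prefix.
            entries.update(expand_params(value, name, "."))
    return entries


def select_param(params: dict, query: str) -> Any:
    """Hypothetical helper: resolve 'params:x.y.z' lazily by walking the dict."""
    _, _, path = query.partition(":")
    node: Any = params
    if not path:
        return node
    for part in path.split("."):
        node = node[part]  # raises KeyError if the path does not exist
    return node


parameters = {"x": {"y": {"z": 1}}, "a": 2}

# Eager expansion registers an entry for every level of nesting.
expanded = expand_params(parameters)

# Lazy selection keeps a single dictionary and looks up values when asked.
value = select_param(parameters, "params:x.y.z")
```

With the lazy approach nothing is precomputed, so the cost of a lookup depends on the depth of the query rather than on the total size of the parameters file.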
Development notes
Tested with an autogenerated 2.2MB parameters file. As you can see, nearly 2/3 of the run time is shaved off.
Before (~120s and 1.17GB memory):
After (~45s and peaked at ~1.10GB memory usage):
Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a `Signed-off-by` line in the commit message. See our wiki for guidance.
If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.
Checklist
RELEASE.md file