Partitioned dataset does not work with parallel runner because of caching in exists method #623
Comments
@nilsbore Would you be able to provide an example repository that we can use to reproduce the result? In addition, what was the previous version in which it worked? AFAIK we haven't introduced changes to ParallelRunner or PartitionedDataset, so I would like to understand whether this is a regression or a new bug.
I will see if I can put together an example project today. In the meantime, I'll address the other questions:
So I think it's pretty safe to say it's a regression with kedro 0.19.
I created a minimal example here: https://github.com/nilsbore/kedro-parallel-partitioned-bug. If you run
@noklam Again, please let me know if I should move this to the kedro repo instead. Thanks for the help.
@nilsbore Sorry for the late reply! I miss GitHub notifications all the time. You can always find me in our Slack (kedro.slack.org). Thank you for the example. This makes sense; I did see a few issues with the ParallelRunner due to We can keep this in this repository, as we monitor both repos with GitHub Projects. I can transfer the issue if we confirm the changes should be done on the
Description
When I run a pipeline containing partitioned datasets created during the run using the command
kedro run --runner=ParallelRunner
I get an error for the partitioned datasets when they are loaded by subsequent nodes: DatasetError: No partitions found in '<path>'.
Digging into the problem, it seems to be caused by the call catalog.exists(dataset) in the method _set_manager_datasets in ParallelRunner. This calls the method exists on PartitionedDataset, which in turn calls the method _list_partitions. This method has a cachedmethod decorator that causes subsequent calls to exists when running the pipeline to return False. Removing the cachedmethod decorator solves the issue.
It is unclear if this is a bug with PartitionedDataset or with ParallelRunner, so please let me know if I should move this to the kedro repo instead.
Context
I cannot run my pipeline containing intermediate partitioned datasets using parallel runner. This blocks me from updating to kedro 0.19.
Steps to Reproduce
1. Create a pipeline with an intermediate PartitionedDataset.
2. Run kedro run --runner=ParallelRunner.
Expected Result
The pipeline should run with no errors.
Actual Result
The pipeline fails with DatasetError: No partitions found in '<path>' when trying to load the intermediate partitioned dataset.
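To make the suspected mechanism from the Description concrete, here is a minimal sketch of how a cached partition listing can go stale. It does not use kedro: ToyPartitionedDataset, its _cache dict (standing in for whatever backs the cachedmethod decorator), the file name part_0.csv, and the RuntimeError (standing in for DatasetError) are all hypothetical. It assumes that the listing is cached while the directory is still empty and that the partitions are then written by something that does not clear that cache, such as another process.

```python
import tempfile
from pathlib import Path


class ToyPartitionedDataset:
    """Hypothetical stand-in for a partitioned dataset; not the kedro class."""

    def __init__(self, path):
        self._path = Path(path)
        self._cache = {}  # plays the role of the cache behind a cachedmethod

    def _list_partitions(self):
        # The listing is computed once and then served from the cache,
        # even if files appear in the directory afterwards.
        if "partitions" not in self._cache:
            self._cache["partitions"] = sorted(p.name for p in self._path.iterdir())
        return self._cache["partitions"]

    def exists(self):
        return bool(self._list_partitions())

    def load(self):
        partitions = self._list_partitions()
        if not partitions:
            # Stand-in for the DatasetError reported in the issue.
            raise RuntimeError(f"No partitions found in '{self._path}'")
        return {name: (self._path / name).read_text() for name in partitions}


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        dataset = ToyPartitionedDataset(tmp)

        # An early existence check caches the empty listing.
        before = dataset.exists()

        # A partition is written without going through `dataset`,
        # e.g. by a node running in another process (assumption).
        (Path(tmp) / "part_0.csv").write_text("a,b\n1,2\n")

        # The stale cache still reports that nothing exists.
        after = dataset.exists()

        # An object without a cached listing sees the new partition.
        fresh = ToyPartitionedDataset(tmp).exists()

        # Dropping the cached listing restores the expected answer.
        dataset._cache.clear()
        recovered = dataset.exists()

        return before, after, fresh, recovered


if __name__ == "__main__":
    print(demo())
```

In this toy version, calling load() on the stale object would raise the "No partitions found" error, while clearing the cache (or not caching at all, which is what removing the decorator amounts to) lets the same object see the partition. Whether the real fix belongs in PartitionedDataset or in ParallelRunner is the open question of this issue.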
Your Environment
Thank you for your efforts with Kedro!