Partitioned dataset does not work with parallel runner because of caching in exists method #623

Open
nilsbore opened this issue Mar 21, 2024 · 5 comments
Labels: bug (Something isn't working), Community (Issue/PR opened by the open-source community)

Comments

nilsbore commented Mar 21, 2024

Description

When I run a pipeline containing partitioned datasets that are created during the run, using the command kedro run --runner=ParallelRunner, I get an error when subsequent nodes load those datasets: DatasetError: No partitions found in '<path>'.

Digging into the problem, it seems to be caused by the call catalog.exists(dataset) in the ParallelRunner method _set_manager_datasets. This calls exists on PartitionedDataset, which in turn calls _list_partitions. That method has a cachedmethod decorator, which causes subsequent calls to exists during the pipeline run to return False. Removing the cachedmethod decorator solves the issue.
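To illustrate the mechanism described above, here is a minimal sketch using only the standard library. It is not Kedro's code: TinyPartitionedStore, its plain dict cache (standing in for the cachedmethod decorator), and invalidate_cache are hypothetical names made up for this example. It only shows how an early exists() call on an empty folder can pin an empty listing, so that later calls on the same object keep returning False even after partitions have been written by something else.

```python
import os
import tempfile


class TinyPartitionedStore:
    """Hypothetical stand-in for a partitioned dataset; not Kedro's implementation."""

    def __init__(self, path):
        self._path = path
        self._cache = {}  # plays the role of the cachedmethod cache

    def _list_partitions(self):
        # Memoize the first directory listing, as a caching decorator would.
        if "partitions" not in self._cache:
            self._cache["partitions"] = sorted(os.listdir(self._path))
        return self._cache["partitions"]

    def exists(self):
        return bool(self._list_partitions())

    def invalidate_cache(self):
        self._cache.clear()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        store = TinyPartitionedStore(tmp)
        before = store.exists()  # early check on an empty folder caches []

        # Something else (for example another copy of the dataset) writes a partition.
        with open(os.path.join(tmp, "part_0.csv"), "w") as fh:
            fh.write("x\n1\n")

        stale = store.exists()  # reuses the cached empty listing
        store.invalidate_cache()
        fresh = store.exists()  # re-lists the folder after the cache is cleared
    return before, stale, fresh


print(demo())
```

In this sketch the second exists() call still reports False until the cache is cleared, which matches the symptom reported here. Whether the stale cache in the real run lives in the main process or in a worker process is an assumption this example does not settle.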

It is unclear whether this is a bug in PartitionedDataset or in ParallelRunner, so please let me know if I should move this to the kedro repo instead.

Context

I cannot run my pipeline containing intermediate partitioned datasets using parallel runner. This blocks me from updating to kedro 0.19.

Steps to Reproduce

  1. Create a pipeline with intermediate datasets (created and consumed by subsequent nodes) of type PartitionedDataset.
  2. Run the pipeline using kedro run --runner=ParallelRunner.

Expected Result

The pipeline should run with no errors.

Actual Result

The pipeline fails with

`DatasetError: No partitions found in '<path>'`

when trying to load the intermediate partitioned dataset.

Your Environment

  • Kedro version used: version 0.19.3
  • Kedro datasets used: version 2.1.0
  • Python version used: Python 3.10.12
  • Operating system and version: Ubuntu 22.04

Thank you for your efforts with Kedro!

@noklam noklam added the bug Something isn't working label Mar 25, 2024
Contributor

noklam commented Mar 25, 2024

@nilsbore Would you be able to provide an example repository that we can use to reproduce the result? In addition, what was the last version in which it worked? AFAIK we haven't introduced changes to ParallelRunner or PartitionedDataset, so I would like to understand whether this is a regression or a new bug.

Author

nilsbore commented Mar 26, 2024

I will see if I can put together an example project today. In the meantime, I'll address the other questions:

  1. The last versions I tested where it works are kedro 0.18.14 and kedro-datasets 1.7.1.
  2. This commit added the _set_manager_datasets logic in ParallelRunner mentioned above. Looks like it's been there since 0.19.0.

So I think it's pretty safe to say it's a regression with kedro 0.19.

Author

I created a minimal example here: https://github.com/nilsbore/kedro-parallel-partitioned-bug . If you run kedro run --runner=ParallelRunner with an empty data folder, it will crash with DatasetError: No partitions found in '/path/to/kedro-parallel-partitioned-bug/data/a'. Running using kedro run with a clean data folder works as expected.

Author

nilsbore commented Apr 10, 2024

@noklam Again, please let me know if I should move this to the kedro repo instead. Thanks for the help.

@noklam noklam added the Community Issue/PR opened by the open-source community label Apr 10, 2024
Contributor

noklam commented Apr 10, 2024

@nilsbore Sorry for the late reply! I miss GitHub notifications all the time. You can always find me on our Slack (kedro.slack.org).

Thank you for the example. This makes sense. I have seen a few issues with the ParallelRunner due to SharedMemoryDataset since 0.19.0; maybe they are all related.

We can keep this in this repository, as we monitor both repos with a GitHub Project. I can transfer the issue if we confirm that the changes should be made in the kedro repo instead.
