Partitioned dataset does not work with parallel runner because of caching in exists method #623

nilsbore · 2024-03-21T09:23:42Z

Description

When I run a pipeline containing partitioned datasets that are created during the run, using the command `kedro run --runner=ParallelRunner`, I get an error when subsequent nodes load those datasets: `DatasetError: No partitions found in '<path>'`.

Digging into the problem, it seems to be caused by the call to `catalog.exists(dataset)` in the `_set_manager_datasets` method of `ParallelRunner`. This calls `exists` on `PartitionedDataset`, which in turn calls `_list_partitions`. That method has a `cachedmethod` decorator, so the empty listing computed before any partition has been written appears to be reused, and subsequent calls to `exists` during the pipeline run return `False`. Removing the `cachedmethod` decorator solves the issue.
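To illustrate the caching behaviour described above, here is a minimal sketch using only the standard library. `CachedListing` and `demo` are hypothetical stand-ins written for this illustration; they are not the kedro-datasets implementation, and the hand-rolled cache only approximates what a `cachedmethod` decorator does. The sketch assumes the listing is first computed while the directory is still empty and that the partition is then written by something that does not reset this object's cache.

```python
import tempfile
from pathlib import Path


class CachedListing:
    """Hypothetical stand-in for a dataset that caches its partition listing."""

    def __init__(self, path, use_cache=True):
        self._path = Path(path)
        self._use_cache = use_cache
        self._cache = None  # holds the first listing once it has been computed

    def _list_partitions(self):
        # Return the remembered listing, even if it is an empty list.
        if self._use_cache and self._cache is not None:
            return self._cache
        partitions = sorted(p.name for p in self._path.glob("*.csv"))
        if self._use_cache:
            self._cache = partitions
        return partitions

    def exists(self):
        return bool(self._list_partitions())


def demo(use_cache):
    """Check existence before and after a partition is written externally."""
    with tempfile.TemporaryDirectory() as tmp:
        dataset = CachedListing(tmp, use_cache=use_cache)
        before = dataset.exists()  # early check, directory is still empty
        # A different writer creates a partition without touching the cache.
        (Path(tmp) / "part_0.csv").write_text("a,b\n1,2\n")
        after = dataset.exists()
        return before, after


if __name__ == "__main__":
    print("with cache:   ", demo(use_cache=True))
    print("without cache:", demo(use_cache=False))
```

With the cache enabled, the second check still reports that nothing exists, which mirrors the reported symptom; with it disabled, the new partition is seen.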

It is unclear if this is a bug with PartitionedDataset or with ParallelRunner so please let me know if I should move this to the kedro repo instead.

Context

I cannot run my pipeline containing intermediate partitioned datasets using parallel runner. This blocks me from updating to kedro 0.19.

Steps to Reproduce

1. Create a pipeline with intermediate datasets (created by one node and consumed by subsequent nodes) of type `PartitionedDataset`.
2. Run the pipeline using `kedro run --runner=ParallelRunner`.

Expected Result

The pipeline should run with no errors.

Actual Result

The pipeline fails with

`DatasetError: No partitions found in '<path>'`

when trying to load the intermediate partitioned dataset.

Your Environment

Kedro version used: version 0.19.3
Kedro datasets used: version 2.1.0
Python version used: Python 3.10.12
Operating system and version: Ubuntu 22.04

Thank you for your efforts with Kedro!


noklam · 2024-03-25T13:10:03Z

@nilsbore Would you be able to provide an example repository that we can use to reproduce the result? In addition, what was the last version where it worked? AFAIK we haven't introduced changes to ParallelRunner or PartitionedDataset, so I would like to understand whether this is a regression or a new bug.

nilsbore · 2024-03-26T07:53:21Z

I will see if I can put together an example project today. In the meantime, I'll address the other questions:

- The last versions I tested where it works are kedro 0.18.14 and kedro-datasets 1.7.1.
- This commit added the `_set_manager_datasets` logic in ParallelRunner mentioned above. It looks like it has been there since 0.19.0.

So I think it's pretty safe to say it's a regression with kedro 0.19.

nilsbore · 2024-03-26T12:21:41Z

I created a minimal example here: https://github.com/nilsbore/kedro-parallel-partitioned-bug . If you run kedro run --runner=ParallelRunner with an empty data folder, it will crash with DatasetError: No partitions found in '/path/to/kedro-parallel-partitioned-bug/data/a'. Running using kedro run with a clean data folder works as expected.

nilsbore · 2024-04-10T14:05:32Z

@noklam Again, please let me know if I should move this to the kedro repo instead. Thanks for the help.

noklam · 2024-04-10T14:36:22Z

@nilsbore Sorry for the late reply! I miss GitHub notifications all the time. You can always find me in our Slack (kedro.slack.org).

Thank you for the example. This makes sense. I have seen a few issues with the ParallelRunner due to SharedMemoryDataset since 0.19.0; maybe they are all related.

We can keep this in this repository, as we monitor both repos with a GitHub Project. I can transfer the issue if we confirm the changes should be made in the kedro repo instead.

noklam added the "bug" (Something isn't working) label · Mar 25, 2024

noklam added the "Community" (Issue/PR opened by the open-source community) label · Apr 10, 2024
