
Packaged kedro pipeline does not work well on Databricks #1807

Open
noklam opened this issue Aug 24, 2022 · 18 comments
Labels
Issue: Bug Report 🐞 Bug that needs to be fixed Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation TD: implementation Tech Design topic on implementation of the issue

Comments

@noklam
Contributor

noklam commented Aug 24, 2022

Originated from internal users https://quantumblack.slack.com/archives/C7JTFSFUY/p1661236540255789

  • Packaged Kedro projects are run as a CLI -> Click generates a sys.exit(), which causes problems on Databricks.
@antonymilne
Contributor

antonymilne commented Aug 24, 2022

This is one of the things I have fixed in #1423. I really need to write it up so we can fix it properly (I have a better fix than that PR). Let me check the slack thread now anyway.

@noklam
Contributor Author

noklam commented Aug 26, 2022

Adding some more comments to document what happened. In summary, there are two issues:

  1. The CLI mode generates a sys.exit(), which some platforms don't welcome (e.g. Databricks), so users must revert to using KedroSession for a packaged Kedro library:
from kedro.framework.session import KedroSession
from kedro.framework.project import configure_project
package_name = "<your_package_name>"
configure_project(package_name)

with KedroSession.create(package_name) as session:
    session.run()
  2. Related to Make rich traceback and REPL handling more configurable #1728: rich overrides ipython._showtraceback, and Databricks then ignores any Python exception, treating a "databricks job" as successful even if the Python program terminates with an exception.

@antonymilne antonymilne added the Issue: Bug Report 🐞 Bug that needs to be fixed label Sep 5, 2022
@antonymilne
Contributor

A few more notes on this. The sys.exit problem is already known. It is actually completely independent of Databricks and happens in a plain IPython session too (which makes it much easier to try out). It's discussed further in #1423. There are ways around it that don't require you to invoke KedroSession explicitly. It basically boils down to invoking click in such a way that it doesn't do the sys.exit:

Adventures with click

I opened an issue on the click repo (which didn't go down well... 😅) describing the original source of problems 1, 2, 3. In short, when you run a click command it always exits afterwards with an Exception, even if execution was successful. This means it's impossible to run any code after that main in a Python script and also massively confuses ipython, which blows up.

For the record, some of the things that don't quite fix this are:

  • run(standalone_mode=False), as suggested on the click repo. This solves problems 2 and 3 but not 1
  • run.callback. This nearly gets you there but doesn't fill in the default arguments for you so you need to explicitly give all the arguments in the function call, which is horrible
  • run.main is equivalent to what we were doing before (run.__call__) so doesn't help
  • run.invoke is really close to doing it, but unlike run.forward it doesn't fill in default arguments, and it won't notice CLI arguments when called with python -m spaceflights --pipeline=ds
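For context, the sys.exit behaviour that the bullets above work around can be reproduced without click at all. Below is a minimal stdlib sketch (a hypothetical stand-in, not Kedro's actual main): click's standalone mode ends every invocation with sys.exit, even on success, so nothing after the call runs unless the host catches SystemExit.

```python
import sys

def main(argv=None):
    # Hypothetical stand-in for a Click-style entry point: standalone
    # mode ends every invocation with sys.exit, even on success.
    sys.exit(0)

# Calling main() directly would terminate the interpreter, so code after
# it never runs. Catching SystemExit shows what a host such as IPython
# has to cope with on every successful run:
try:
    main()
except SystemExit as exc:
    exit_code = exc.code

print(f"main() raised SystemExit with code {exit_code}")
```

This is why standalone_mode=False, which returns instead of exiting, is attractive for programmatic callers.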

Problem 2 is fixed by #1769:

A fix for the issue raised by @pstanisl in #1728. Hopefully this will be fixed on the rich side in Textualize/rich#2455 in due course, but for now I'm fixing it here because it's proving problematic, e.g. for @PedroAbreuQB. Jobs on Databricks that should be detected as Failed are currently marked as Succeeded.

The solution is just to disable rich.traceback.install on databricks.


@astrojuanlu
Member

This is an ongoing issue https://www.linen.dev/s/kedro/t/13153013/exit-traceback-most-recent-call-last-file-databricks-python-#14d50244-c91a-4738-b0de-11b822eb181b

Looking at the traceback, this is the code the user was trying to run:

params_string = ','.join(list_params)
main(
    [
        "--env", conf,
        "--pipeline", pipeline,
        f"--params={params_string}"
    ]
)

And filling the gaps, I think the main function does something like this:

run = _find_run_command(package_name)
run(*args, **kwargs)

which resembles https://github.com/kedro-org/kedro/blob/0.18.11/kedro/templates/project/%7B%7B%20cookiecutter.repo_name%20%7D%7D/src/%7B%7B%20cookiecutter.python_package%20%7D%7D/__main__.py

So, most likely the user is doing

from kedro_project.__main__ import main

main()

This touches on #2384 and #2682. I think we should address this soon.

@astrojuanlu
Member

In line with what I said in #2682 (comment),

@merelcht merelcht added the Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation label Jul 17, 2023
@merelcht merelcht added the TD: implementation Tech Design topic on implementation of the issue label Jul 28, 2023
@astrojuanlu
Member

astrojuanlu commented Nov 8, 2023

This bug report is more than 1 year old. Can we get an updated view of what's happening here?

We have 3 official workflows for Databricks:

On the other hand, in this page https://docs.kedro.org/en/stable/tutorial/package_a_project.html we're telling users to do this:

from <package_name>.__main__ import main

main(
    ["--pipeline", "__default__"]
)  # or simply main() if you don't want to provide any arguments

which in my opinion is conceptually broken, because we are abusing a CLI function to run it programmatically.

@noklam called this above a "downgrade", but what's the problem with running

with KedroSession.create(package_name) as session:
    session.run()

? Should we update our docs instead?

@idanov's #3191 seems to make that CLI function use click standalone mode conditionally, but in my opinion (1) it's a hack and (2) doesn't really help us pinpoint why our users are running code that is broken instead of following the instructions from our docs.

What am I missing?

Footnotes

  1. Note that dbx is now deprecated in favour of Databricks Asset Bundles, this change hasn't made it to our docs yet https://github.com/kedro-org/kedro/pull/3212/

@noklam
Contributor Author

noklam commented Nov 8, 2023

@astrojuanlu The important thing here is "Packaged project" and "Kedro Project"

In a simplified world, python -m <package> or the main function is basically bootstrap_project + KedroSession.run. There are some minor differences (i.e. cli.py). Basically, session is used for non-packaged projects and main is used for packaged projects.

The benefit of having an entrypoint is that most deployment systems support it, because it is just python -m. To use session.run you need a custom Python script for execution.

@noklam called this #1807 (comment) a "downgrade", but what's the problem with running

with KedroSession.create(package_name) as session:
    session.run()

See #1423 (comment); originally I even proposed kedro run --package_name=xxx

Using Databricks jobs https://docs.kedro.org/en/stable/deployment/databricks/databricks_deployment_workflow.html and here we are telling users to write a custom entry point, supposedly because of this issue. But isn't that workaround enough?

For this, I think this is a bad workaround, as it shouldn't need a special entrypoint; #3191 should replace it.
Assuming you didn't start with databricks-iris, you will need to do that copy & paste. This step can be removed because any Kedro project can share ONE entrypoint.

Honestly, I don't hate KedroSession, because users are already familiar with it and they can stick with it for both a normal Kedro project and a packaged one. I also see some extra benefits in the entrypoint and in keeping things consistent.

I think we need to be pragmatic here; whether or not a CLI is the best way to do this is arguable, and that will require more discussion and design. Unless we plan to change this in the short term, I think we should at least move forward with #3191 and continue the CLI vs click-wrapper discussion.

@astrojuanlu
Member

The important thing here is "Packaged project" and "Kedro Project"

In a simplified world, python -m <package> or the main function is basically bootstrap_project + KedroSession.run. There are some minor differences (i.e. cli.py). Basically, session is used for non-packaged projects and main is used for packaged projects.

Thanks, that's helpful. So this issue concerns the Databricks workflows that use packaged projects only, hence "Databricks jobs" at the moment. Please correct me if I'm wrong.

For this, I think this is a bad workaround as it shouldn't need a special entrypoint

I agree 💯 I understand then that the problem is that currently users are forgetting to do that step, and they expect the default entrypoint to just work.

@noklam
Contributor Author

noklam commented Nov 8, 2023

Thanks, that's helpful. So this issue concerns the Databricks workflows that use packaged projects only, hence "Databricks jobs" at the moment. Please correct me if I'm wrong.

It is more generic: from my_package import main wouldn't work nicely in any interactive environment (IPython, Jupyter) because of the sys.exit. Of course, you can still work around it by manually doing bootstrap_project.

#3237 has more context, but #3191 goes beyond just Databricks; it also affects integrating Kedro with other downstream tasks. (The CLI way doesn't work currently, so only KedroSession.run works.)

@astrojuanlu
Member

Of course you can still use the workaround by manually doing bootstrap_project

This has been my point all along: that this shouldn't be a "workaround", it should be our recommended way of doing it.

I'm not blocking #3191 on this discussion, but I'm hoping we can have it some day. Reason being: running a Kedro project programmatically requires either abusing these CLI functions, or adding some magic (bootstrap_project? configure_project? what do these functions do?) that obscure what's happening.

@idanov
Member

idanov commented Nov 9, 2023

In the past we had a few nice .puml diagrams, showing the whole list of possible execution flows in different modes of execution, but it seems that they have been deleted from the repo. But in general, the idea is that we have two distinct regimes:

  • running during development, i.e. off the source code
  • running during production, i.e. off of a packaged project

bootstrap_project is only used in the first regime, because it adds src/ (or the custom directory you've decided to use) to the Python import path. configure_project is the one allowing you to import things from your own project through kedro.framework.project, and it is essential for having a uniform way for Kedro and Kedro plugins to access project-specific things like settings and registered pipelines without knowing the project package name (which is obviously different for different Kedro projects). configure_project is what gives Kedro the django-like experience for settings.
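To make the configure_project idea above concrete, here is a toy, stdlib-only analogue (a hypothetical sketch, not Kedro's implementation; Kedro's real configure_project imports <package_name>.settings and re-exports it through kedro.framework.project). The point is that a shared registry lets framework code reach project-specific things without hard-coding the package name.

```python
# Toy analogue of the configure_project mechanism (hypothetical sketch,
# not Kedro's real code): a shared registry that framework code can query
# without knowing the project's package name up front.
_project = {}

def configure_project(package_name_value: str) -> None:
    # The real configure_project imports <package_name>.settings and
    # re-exports it; this sketch only records which package is "active".
    _project["package_name"] = package_name_value

def package_name() -> str:
    # Framework code and plugins can now ask "which project is running?"
    # in a uniform way, regardless of the project's actual name.
    return _project["package_name"]

configure_project("spaceflights")  # "spaceflights" is just an example name
print(package_name())
```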

It's probably worth documenting that for advanced users who would like to know how Kedro works internally and maybe retrieve the old .puml diagrams we've had.

@astrojuanlu
Member

.puml diagrams for reference https://github.com/kedro-org/kedro/tree/0.17.7/docs/draft

@astrojuanlu
Member

I generated a reproducer that doesn't depend on Kedro.

Source code: https://github.com/astrojuanlu/simple-click-entrypoint

Job definition:

[screenshot]

Result:

[screenshot]

[screenshot]

@astrojuanlu
Member

Reported this upstream MicrosoftDocs/azure-docs#116964

@astrojuanlu
Member

Did a bit more debugging in #3191 (review): the sys.ps1 trick proposed there is not working. There's not enough clarity on how Databricks launches the entry point.

It's clear that passing standalone_mode=False should work - the problem is how to detect when to pass it.
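For reference, the sys.ps1 detection idea could be sketched like this (stdlib only; as noted above, this heuristic did not detect how Databricks launches the entry point, so treat it purely as an illustration of the approach):

```python
import sys

def is_interactive() -> bool:
    # sys.ps1 only exists once an interactive prompt is running;
    # sys.flags.interactive additionally covers `python -i`.
    # NOTE: this heuristic failed to detect Databricks' launcher.
    return hasattr(sys, "ps1") or sys.flags.interactive == 1

# A conditional entry point would pass standalone_mode=False only in
# interactive hosts, keeping normal CLI behaviour everywhere else:
standalone = not is_interactive()
print(standalone)  # False inside a REPL, True in a plain script
```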

@astrojuanlu
Member

astrojuanlu commented Feb 19, 2024

The sorry state of this issue is as follows:

  • Microsoft created an internal work item to improve the docs, no way to track progress in public
  • Databricks (in private communication) expressed that they'd rather not document the low-level details of how Databricks invokes the code
  • Our conditional activation of the standalone mode in Make default project entrypoint interactive (Databricks) friendly #3191 didn't work, so I went ahead and closed the PR
  • Unconditionally using standalone mode is probably unwise, since the CLI would likely misbehave

We need to keep brainstorming. I honestly think the best way forward would be to simplify the default __main__ and entry point #3051, ideally making them independent from our Click wrappers and avoiding having 5 different context-dependent ways of running a Kedro project #3237, but maybe there are other possible solutions.

@noklam
Contributor Author

noklam commented Feb 28, 2024

Leave a comment here: #3237 (comment). This goes beyond Databricks so I want to keep the conversation on the broader topic.
