
Packaged kedro pipeline does not work well on Databricks #1807

Open
noklam opened this issue Aug 24, 2022 · 18 comments
Labels
Issue: Bug Report 🐞 Bug that needs to be fixed Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation TD: implementation Tech Design topic on implementation of the issue

Comments

@noklam
Contributor

noklam commented Aug 24, 2022

Originated from internal users https://quantumblack.slack.com/archives/C7JTFSFUY/p1661236540255789

  • Packaged Kedro projects are run as a CLI -> Click generates a sys.exit(), which causes problems on Databricks.
@antonymilne
Contributor

antonymilne commented Aug 24, 2022

This is one of the things I have fixed in #1423. I really need to write it up so we can fix it properly (I have a better fix than that PR). Let me check the slack thread now anyway.

@noklam
Contributor Author

noklam commented Aug 26, 2022

Adding some more comments to document what happened. In summary, there are two issues:

  1. The CLI mode generates a sys.exit(), which some platforms don't welcome (e.g. Databricks), so users must revert to using KedroSession for a packaged Kedro library:
from kedro.framework.session import KedroSession
from kedro.framework.project import configure_project
package_name = "<your_package_name>"
configure_project(package_name)

with KedroSession.create(package_name) as session:
    session.run()
  2. Related to Make rich traceback and REPL handling more configurable #1728: rich overrides ipython._showtraceback, and Databricks then ignores any Python exception, treating a "databricks job" as successful even if the Python program terminates with an exception.

@antonymilne antonymilne added the Issue: Bug Report 🐞 Bug that needs to be fixed label Sep 5, 2022
@antonymilne
Contributor

A few more notes on this. The sys.exit problem is already known. It is actually completely independent of Databricks and happens in a plain IPython session too (which makes it much easier to try out). It's discussed further in #1423. There are ways around it that don't require you to invoke KedroSession explicitly. It basically boils down to invoking click in such a way that it doesn't do the sys.exit:

Adventures with click

I opened an issue on the click repo (which didn't go down well... 😅) describing the original source of problems 1, 2, 3. In short, when you run a click command it always exits afterwards with an Exception, even if execution was successful. This means it's impossible to run any code after that main in a Python script and also massively confuses ipython, which blows up.

For the record, some of the things that don't quite fix this are:

  • run(standalone_mode=False), as suggested on the click repo. This solves problems 2 and 3 but not 1
  • run.callback. This nearly gets you there but doesn't fill in the default arguments for you so you need to explicitly give all the arguments in the function call, which is horrible
  • run.main is equivalent to what we were doing before (run.__call__) so doesn't help
  • run.invoke is really close to doing it, but unlike run.forward it doesn't fill in default arguments, and it won't notice CLI arguments when called with python -m spaceflights --pipeline=ds
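For context, the sys.exit behaviour that the bullets above work around can be reproduced without click at all. Below is a minimal stdlib sketch (a hypothetical stand-in, not Kedro's actual main): click's standalone mode ends every invocation with sys.exit, even on success, so nothing after the call runs unless the host catches SystemExit.

```python
import sys

def main(argv=None):
    # Hypothetical stand-in for a Click-style entry point: standalone
    # mode ends every invocation with sys.exit, even on success.
    sys.exit(0)

# Calling main() directly would terminate the interpreter, so code after
# it never runs. Catching SystemExit shows what a host such as IPython
# has to cope with on every successful run:
try:
    main()
except SystemExit as exc:
    exit_code = exc.code

print(f"main() raised SystemExit with code {exit_code}")
```

This is why standalone_mode=False, which returns instead of exiting, is attractive for programmatic callers.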

Problem 2 is fixed by #1769:

A fix for the issue raised by @pstanisl in #1728. Hopefully this will be fixed on the rich side in Textualize/rich#2455 in due course, but for now I'm fixing it here because it's proving problematic, e.g. for @PedroAbreuQB. Jobs on Databricks that should be detected as Failed are currently marked as Succeeded.

The solution is just to disable rich.traceback.install on databricks.


@astrojuanlu
Member

This is an ongoing issue https://www.linen.dev/s/kedro/t/13153013/exit-traceback-most-recent-call-last-file-databricks-python-#14d50244-c91a-4738-b0de-11b822eb181b

Looking at the traceback, this is the code the user was trying to run:

params_string = ','.join(list_params)
main(
    [
        "--env", conf,
        "--pipeline", pipeline,
        f"--params={params_string}"
    ]
)

And filling the gaps, I think the main function does something like this:

run = _find_run_command(package_name)
run(*args, **kwargs)

which resembles https://github.com/kedro-org/kedro/blob/0.18.11/kedro/templates/project/%7B%7B%20cookiecutter.repo_name%20%7D%7D/src/%7B%7B%20cookiecutter.python_package%20%7D%7D/__main__.py

So, most likely the user is doing

from kedro_project.__main__ import main

main()

This touches on #2384 and #2682. I think we should address this soon.

@astrojuanlu
Member

In line with what I said in #2682 (comment),

@merelcht merelcht added the Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation label Jul 17, 2023
@merelcht merelcht added the TD: implementation Tech Design topic on implementation of the issue label Jul 28, 2023
@astrojuanlu
Member

astrojuanlu commented Nov 8, 2023

This bug report is more than 1 year old. Can we get an updated view of what's happening here?

We have 3 official workflows for Databricks:

On the other hand, in this page https://docs.kedro.org/en/stable/tutorial/package_a_project.html we're telling users to do this:

from <package_name>.__main__ import main

main(
    ["--pipeline", "__default__"]
)  # or simply main() if you don't want to provide any arguments

which in my opinion is conceptually broken, because we are abusing a CLI function to run it programmatically.

@noklam called this above a "downgrade", but what's the problem with running

with KedroSession.create(package_name) as session:
    session.run()

? Should we update our docs instead?

@idanov's #3191 seems to make that CLI function use click standalone mode conditionally, but in my opinion (1) it's a hack and (2) doesn't really help us pinpoint why our users are running code that is broken instead of following the instructions from our docs.

What am I missing?

Footnotes

  1. Note that dbx is now deprecated in favour of Databricks Asset Bundles, this change hasn't made it to our docs yet https://github.com/kedro-org/kedro/pull/3212/

@noklam
Contributor Author

noklam commented Nov 8, 2023

@astrojuanlu The important thing here is "Packaged project" and "Kedro Project"

In a simplified world, python -m <package> or the main function is basically bootstrap_project + KedroSession.run. There are some minor differences (i.e. cli.py). Basically, session is used for non-packaged projects and main is used for packaged projects.

The benefit of having an entrypoint is that most deployment systems support it, because it is just python -m. To use session.run you need a custom Python script for execution.

@noklam called this #1807 (comment) a "downgrade", but what's the problem with running

with KedroSession.create(package_name) as session:
    session.run()

See #1423 (comment); originally I even proposed kedro run --package_name=xxx

Using Databricks jobs https://docs.kedro.org/en/stable/deployment/databricks/databricks_deployment_workflow.html and here we are telling users to write a custom entry point, supposedly because of this issue. But isn't that workaround enough?

For this, I think this is a bad workaround, as it shouldn't need a special entrypoint; #3191 should replace it.
Assuming you didn't start with databricks-iris, you will need to do that copy & paste. This step can be removed because any Kedro project can share ONE entrypoint.

Honestly, I don't hate KedroSession, because users are already familiar with it and they can stick with it for both a normal Kedro project and a packaged one. I also see some extra benefits in the entrypoint and in keeping things consistent.

I think we need to be pragmatic here; whether or not a CLI is the best way to do this is arguable, and that will require more discussion and design. Unless we plan to change this in the short term, I think we should at least move forward with #3191 and continue the CLI vs click-wrapper discussion.

@astrojuanlu
Member

The important thing here is "Packaged project" and "Kedro Project"

In a simplified world, python -m <package> or the main function is basically bootstrap_project + KedroSession.run. There are some minor differences (i.e. cli.py). Basically, session is used for non-packaged projects and main is used for packaged projects.

Thanks, that's helpful. So this issue concerns the Databricks workflows that use packaged projects only, hence "Databricks jobs" at the moment. Please correct me if I'm wrong.

For this, I think this is a bad workaround as it shouldn't need a special entrypoint

I agree 💯 I understand then that the problem is that currently users are forgetting to do that step, and they expect the default entrypoint to just work.

@noklam
Contributor Author

noklam commented Nov 8, 2023

Thanks, that's helpful. So this issue concerns the Databricks workflows that use packaged projects only, hence "Databricks jobs" at the moment. Please correct me if I'm wrong.

It is more generic: from my_package import main wouldn't work nicely in any interactive environment (IPython, Jupyter) because of the sys.exit. Of course, you can still work around it by manually doing bootstrap_project.

#3237 has more context, but #3191 goes beyond just Databricks; it also affects integrating Kedro with other downstream tasks. (The CLI way doesn't work currently, so only KedroSession.run works.)

@astrojuanlu
Member

Of course you can still use the workaround by manually doing bootstrap_project

This has been my point all along: that this shouldn't be a "workaround", it should be our recommended way of doing it.

I'm not blocking #3191 on this discussion, but I'm hoping we can have it some day. Reason being: running a Kedro project programmatically requires either abusing these CLI functions, or adding some magic (bootstrap_project? configure_project? what do these functions do?) that obscure what's happening.

@idanov
Member

idanov commented Nov 9, 2023

In the past we had a few nice .puml diagrams, showing the whole list of possible execution flows in different modes of execution, but it seems that they have been deleted from the repo. But in general, the idea is that we have two distinct regimes:

  • running during development, i.e. off the source code
  • running during production, i.e. off of a packaged project

bootstrap_project is only used in the first regime, because it adds src/ (or the custom directory you've decided to use) to the Python import path. configure_project is the one allowing you to import things from your own project through kedro.framework.project, and it is essential for having a uniform way for Kedro and Kedro plugins to access project-specific things like settings and registered pipelines without knowing the project package name (which is obviously different for different Kedro projects). configure_project is what gives Kedro the django-like experience for settings.
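To make the configure_project idea above concrete, here is a toy, stdlib-only analogue (a hypothetical sketch, not Kedro's implementation; Kedro's real configure_project imports <package_name>.settings and re-exports it through kedro.framework.project). The point is that a shared registry lets framework code reach project-specific things without hard-coding the package name.

```python
# Toy analogue of the configure_project mechanism (hypothetical sketch,
# not Kedro's real code): a shared registry that framework code can query
# without knowing the project's package name up front.
_project = {}

def configure_project(package_name_value: str) -> None:
    # The real configure_project imports <package_name>.settings and
    # re-exports it; this sketch only records which package is "active".
    _project["package_name"] = package_name_value

def package_name() -> str:
    # Framework code and plugins can now ask "which project is running?"
    # in a uniform way, regardless of the project's actual name.
    return _project["package_name"]

configure_project("spaceflights")  # "spaceflights" is just an example name
print(package_name())
```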

It's probably worth documenting that for advanced users who would like to know how Kedro works internally and maybe retrieve the old .puml diagrams we've had.

@astrojuanlu
Member

.puml diagrams for reference https://github.com/kedro-org/kedro/tree/0.17.7/docs/draft

@astrojuanlu
Member

I generated a reproducer that doesn't depend on Kedro.

Source code: https://github.com/astrojuanlu/simple-click-entrypoint

Job definition:

[screenshot]

Result:

[screenshot]

[screenshot]

@astrojuanlu
Member

Reported this upstream MicrosoftDocs/azure-docs#116964

@astrojuanlu
Member

Did a bit more debugging in #3191 (review): the sys.ps1 trick proposed there is not working. There's not enough clarity on how Databricks launches the entry point.

It's clear that passing standalone_mode=False should work - the problem is how to detect when to pass it.
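For reference, the sys.ps1 detection idea could be sketched like this (stdlib only; as noted above, this heuristic did not detect how Databricks launches the entry point, so treat it purely as an illustration of the approach):

```python
import sys

def is_interactive() -> bool:
    # sys.ps1 only exists once an interactive prompt is running;
    # sys.flags.interactive additionally covers `python -i`.
    # NOTE: this heuristic failed to detect Databricks' launcher.
    return hasattr(sys, "ps1") or sys.flags.interactive == 1

# A conditional entry point would pass standalone_mode=False only in
# interactive hosts, keeping normal CLI behaviour everywhere else:
standalone = not is_interactive()
print(standalone)  # False inside a REPL, True in a plain script
```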

@astrojuanlu
Member

astrojuanlu commented Feb 19, 2024

The sorry state of this issue is as follows:

  • Microsoft created an internal work item to improve the docs, no way to track progress in public
  • Databricks (in private communication) expressed that they'd rather not document the low-level details of how Databricks invokes the code
  • Our conditional activation of the standalone mode in Make default project entrypoint interactive (Databricks) friendly #3191 didn't work, so I went ahead and closed the PR
  • Unconditionally using standalone mode is probably unwise, since the CLI would likely misbehave

We need to keep brainstorming. I honestly think the best way forward would be to simplify the default __main__ and entry point #3051, ideally making them independent from our Click wrappers and avoiding having 5 different context-dependent ways of running a Kedro project #3237, but maybe there are other possible solutions.

@noklam
Contributor Author

noklam commented Feb 28, 2024

Leave a comment here: #3237 (comment). This goes beyond Databricks so I want to keep the conversation on the broader topic.
