
Initialization failure handling control #757

Open
wajda opened this issue Oct 23, 2023 · 2 comments · May be fixed by #758

@wajda
Contributor

wajda commented Oct 23, 2023

Add a configuration property to control how the agent should behave on initialization failures.
Currently, when an error occurs during the initialization phase of the agent (e.g. misconfiguration, a failed handshake with the server, etc.), the error is logged and the Spark job carries on.

23/10/23 17:09:34 ERROR SparkLineageInitializer: Spline initialization failed! Spark Lineage tracking is DISABLED.
...
23/10/23 17:09:35 INFO CodeGenerator: Code generated in 77.142417 ms
23/10/23 17:09:35 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 198.9 KiB, free 4.6 GiB)
23/10/23 17:09:36 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 33.9 KiB, free 4.6 GiB)
23/10/23 17:09:36 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:59667 (size: 33.9 KiB, free: 4.6 GiB)
23/10/23 17:09:36 INFO SparkContext: Created broadcast 0 from csv at CodelessInitExampleJob.scala:34
23/10/23 17:09:36 INFO FileSourceScanExec: Planning scan with bin packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes.
23/10/23 17:09:36 INFO SparkContext: Starting job: csv at CodelessInitExampleJob.scala:34
23/10/23 17:09:36 INFO DAGScheduler: Got job 0 (csv at CodelessInitExampleJob.scala:34) with 1 output partitions
23/10/23 17:09:36 INFO DAGScheduler: Final stage: ResultStage 0 (csv at CodelessInitExampleJob.scala:34)
23/10/23 17:09:36 INFO DAGScheduler: Parents of final stage: List()

Such behavior was chosen with the aim of not affecting the Spark job and not interrupting potentially higher-priority (from the operational perspective) processes. But sometimes, when lineage is mandatory for the user, they might prefer the Spark job to fail explicitly instead of silently continuing without lineage tracking.
This behavior could be controlled by a config property:

spline.onInitFailure = LOG | BREAK

The default value would be LOG, which corresponds to the current behavior. The BREAK mode would simply propagate the error to the Spark process, causing the Spark job to fail.
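
For illustration only, a minimal sketch of how the proposed property might be passed to the agent, assuming it would be read, like the agent's other options, from the Spark configuration under the spark.spline. prefix (the exact wiring is an assumption, not part of this proposal):

# Hypothetical usage of the proposed property, set via Spark conf
spark-submit \
  --conf "spark.spline.onInitFailure=BREAK" \
  my-job.jar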

@wajda wajda added this to the 2.1.0 milestone Oct 23, 2023
@wajda wajda changed the title Initialization failure behavior control Initialization failure handling control Oct 23, 2023
@wajda wajda self-assigned this Oct 23, 2023
@wajda wajda linked a pull request Oct 23, 2023 that will close this issue
@wajda
Contributor Author

wajda commented Nov 2, 2023

I've got a better idea. Instead of a boolean property we could have an ErrorHandler interface with two default implementations:

  • strict (default) - would re-throw all exceptions
  • tolerant - would log exceptions and allow the Spark job to proceed.

The user could then implement their own logic if they want to (inspired by a discussion with @yruslan).
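
For illustration, a minimal Scala sketch of what such an interface could look like; the trait and object names (SplineInitErrorHandler, StrictErrorHandler, TolerantErrorHandler) are made up for this example and are not part of the agent's API:

import org.slf4j.LoggerFactory

// Illustrative SPI for handling agent initialization errors (names are assumptions).
trait SplineInitErrorHandler {
  def handle(e: Throwable): Unit
}

// "strict" (proposed default): re-throw the error, so the Spark job fails.
object StrictErrorHandler extends SplineInitErrorHandler {
  override def handle(e: Throwable): Unit = throw e
}

// "tolerant": log the error and let the Spark job proceed without lineage tracking.
object TolerantErrorHandler extends SplineInitErrorHandler {
  private val log = LoggerFactory.getLogger(getClass)
  override def handle(e: Throwable): Unit =
    log.error("Spline initialization failed! Spark Lineage tracking is DISABLED.", e)
}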

Things for consideration:

  • Should we apply the error handler only to codeless init, or regardless of the init method? The reason I hesitate is that when doing a programmatic init the user can always catch and handle the exceptions themselves, so why have two competing ways of doing the same thing?
  • Should we add a third implementation of the handler (e.g. called pragmatic or legacy) that would mimic the previous version's behaviour?

@uday1409
Contributor

uday1409 commented Nov 4, 2023

@wajda @cerveada

When lineage is initialized in codeless mode, i.e. by installing the jar on the Spark driver (Spark env: Databricks) using init scripts, and initialization fails because the lineage server is unavailable or for other reasons, the jobs are not carried on as said above. Rather, the driver becomes unresponsive and Spark commands get cancelled automatically without any error.

To work around this issue, I have used the console dispatcher as a fallback, so that when there is a failure with the HTTP server the jobs still proceed as is. The current behavior is that when the server is not available, the agent just throws an exception into the running environment and the driver becomes unresponsive.
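
For reference, the fallback setup looks roughly like this (property names follow the agent's dispatcher documentation as I recall it and may differ between agent versions, so treat the exact keys as an assumption; the producer URL is a placeholder):

spline.lineageDispatcher=fallback
spline.lineageDispatcher.fallback.primaryDispatcher=http
spline.lineageDispatcher.fallback.fallbackDispatcher=console
spline.lineageDispatcher.http.producer.url=http://localhost:8080/producer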

Please consider this when the changes for this feature are made.
