Enhance TaskContext and add task failure root cause #3410

mingmwang · 2022-09-09T07:30:30Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
(This section helps Arrow developers understand the context and why for this feature, in addition to the what)

In current Datafusion/Ballista plan's execution path, the error result mapping is a chaos, the root cause of the task failure is not clear, especially when the plan tree contains pipeline breakers.
To make the error handling clear and make it easy to reasoning about the real task failure reason, the suggestion is to add a new method to TaskContext to set the task failure root cause into TaskContext.

Below is an example of Ballista shuffle read execution path, the arrow '->' represents the error/result return flow.

BallistaClient.fetch_partition(): BallistaError::General -> Ballista ShuffleReaderExec.fetch_partition(): DataFusionError::Execution -> Ballista ShuffleReaderExec.execute() : SendableRecordBatchStream -> ->RecordBatchStreamAdapter.poll_next(): ArrowResult::ExternalError ->

We can see the root error is mapped to Ballista Error -> DataFusion Error -> Arrow Error. And depends on how the
stream is polled and consumed, the Arrow Error can be mapped to DataFusion Error or Ballista Error again.
There are back and forth error mapping in the code base, this makes the root case reasoning become a challenge.

Want to hear your thoughts.

@andygrove @alamb @thinkharderdev

Describe the solution you'd like
A clear and concise description of what you want to happen.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

The text was updated successfully, but these errors were encountered:

mingmwang · 2022-09-09T08:32:51Z

@yahoNanJing

alamb · 2022-09-09T14:57:39Z

I agree the wrapping is confusing and I think in general the wrapping of errors is not valuable. We see similar wrapping obfuscation in IOx as well.

The datafusion code already tries to unwrap an ArrowError here (when converting from DataFusionError --> ArrowError)

https://github.com/apache/arrow-datafusion/blob/master/datafusion/common/src/error.rs#L209-L217

I wonder if we can do something similar in Ballista? Or maybe write some "unwrapper" that would strip out the useless wrapping errors before display.

yahoNanJing · 2022-09-09T15:41:41Z

Thanks Alamb, the context here is not just for displaying. Recently we are working on error handling and error recovering which is useful for distributed execution. And we hope to do error recovering for some special error case and we need to know the root cause of the error. If an error is wrapped too much, it will be very difficult to make decision of which error to recover and others not.

alamb · 2022-09-10T09:59:42Z

And we hope to do error recovering for some special error case and we need to know the root cause of the error. If an error is wrapped too much, it will be very difficult to make decision of which error to recover and others not.

That is a neat usecase and it makes sense.

Arrow and DataFusion don't have very specific errors (unlike some other rust crate where the specific error call site often has its own unique variant). This makes coding easier as there is less error machinery to maintain, but it makes distinguishing between errors potentially more challenging -- All you have is the handful of variants and a message.

As the rest of the rust ecosystem works on stabalizing "error context" (aka being able to walk up the chain of error messages) maybe finding the root cause will become simpler.

I mostly mention this as maybe it would be a good time to figure out if we can add a source: of each ArrowError, and DataFusionError as a more general purpose method than adding such info to the TaskContext

yahoNanJing · 2022-09-10T15:30:00Z

I mostly mention this as maybe it would be a good time to figure out if we can add a source: of each ArrowError, and DataFusionError as a more general purpose method than adding such info to the TaskContext

Agree that it's a more standard way and it would be easier for error recovering purpose. @mingmwang, what do you think?

mingmwang · 2022-09-13T02:11:47Z

@alamb @yahoNanJing This would help and this approach is more rusty. But it still depends on the programer to propagate the errors correctly in the execution path. And there is another case that we still could not catch the error easily.
If we add such info to TaskContext, errors can be propagated easily and there is less burden to the programer.

        tokio::spawn(async move {

        .............................
        });

alamb · 2022-09-13T13:21:06Z

But it still depends on the programer to propagate the errors correctly in the execution path. And there is another case that we still could not catch the error easily.
If we add such info to TaskContext, errors can be propagated easily and there is less burden to the programer.

Yes I agree care would be needed in various cases. I am not opposed to adding information to the TaskContext -- perhaps we can brainstorm additional improvements to errors / error handling as a follow on project. I know @tustvold has some ideas in apache/arrow-rs#2711 etc so maybe it would be part of a broader ecosystem improvment

tustvold · 2022-09-13T13:32:49Z

If we add such info to TaskContext, errors can be propagated easily and there is less burden to the programer.

Calling tokio::spawn without doing anything with the join handle is not a good idea, not only will you potentially silently drop errors, as you elude to, but more problematically you will miss panics. This has been a frequent pain point, see the work by @crepererum to fix this https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+is%3Aclosed+author%3Acrepererum. Longer term I'm hoping to move away from tokio see #2504, in part to do away with these types of problems, but I'm not sure when I'll get back to working on that. In the meantime, a tokio::spawn that doesn't somehow await the JoinHandle is an anti-pattern IMO.

additional improvements to errors / error handling

If we are to go down this route, I think the question to ask is "what users are there of the current structured error handling". In particular, could we just use something like anyhow which we already have a transitive dependency on as a result of prost... This has a fairly straightforward API for downcasting errors - https://docs.rs/anyhow/latest/anyhow/trait.Context.html#effect-on-downcasting

And we hope to do error recovering for some special error case and we need to know the root cause of the error.

@mingmwang Perhaps you could expand on the particular error cases you are looking to handle. The major one that comes to my mind are network failures, which I would have thought would make more sense to handle at the layer responsible for network calls. Perhaps see https://github.com/apache/arrow-rs/blob/master/object_store/src/client/retry.rs#L121 for inspiration??

I mostly mention this as maybe it would be a good time to figure out if we can add a source: of each ArrowError, and DataFusionError as a more general purpose method than adding such info to the TaskContext

This seems to be where the ecosystem as a whole is headed, and makes sense to me. See rust-lang/rust#58520 for work to stabilize a mechanism to "walk" the tree

crepererum · 2022-09-13T13:53:06Z

Calling tokio::spawn without doing anything with the join handle is not a good idea, not only will you potentially silently drop errors, as you elude to, but more problematically you will miss panics.

It's not only about error handling though but also about task cancellation. Esp. in a server application, it's good to stop a query from running when the original request is cancelled. Uncaptured task handles will prevent this from happening, since tokio will still drive the execution.

alamb · 2022-09-13T19:13:27Z

See rust-lang/rust#58520 for work to stabilize a mechanism to "walk" the tree

Yes, this is what I really want out of errors in DataFusion and Arrow

alamb · 2022-09-14T12:32:08Z

Filed apache/arrow-rs#2725 to discuss better error handling in arrow (and thus also datafusion and ballista)

mingmwang added the enhancement New feature or request label Sep 9, 2022

alamb mentioned this issue Sep 14, 2022

Ability to get "source" / and error stack from an ArrowError to help debugging apache/arrow-rs#2725

Open

mingmwang mentioned this issue Nov 30, 2022

Helper to get "root" error #4435

Closed

comphead mentioned this issue Aug 21, 2023

Improve DataFusion error handling #7360

Open

23 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enhance TaskContext and add task failure root cause #3410

Enhance TaskContext and add task failure root cause #3410

mingmwang commented Sep 9, 2022 •

edited

mingmwang commented Sep 9, 2022

alamb commented Sep 9, 2022 •

edited

yahoNanJing commented Sep 9, 2022

alamb commented Sep 10, 2022

yahoNanJing commented Sep 10, 2022 •

edited

mingmwang commented Sep 13, 2022 •

edited

alamb commented Sep 13, 2022

tustvold commented Sep 13, 2022 •

edited

crepererum commented Sep 13, 2022

alamb commented Sep 13, 2022

alamb commented Sep 14, 2022

Enhance TaskContext and add task failure root cause #3410

Enhance TaskContext and add task failure root cause #3410

Comments

mingmwang commented Sep 9, 2022 • edited

mingmwang commented Sep 9, 2022

alamb commented Sep 9, 2022 • edited

yahoNanJing commented Sep 9, 2022

alamb commented Sep 10, 2022

yahoNanJing commented Sep 10, 2022 • edited

mingmwang commented Sep 13, 2022 • edited

alamb commented Sep 13, 2022

tustvold commented Sep 13, 2022 • edited

crepererum commented Sep 13, 2022

alamb commented Sep 13, 2022

alamb commented Sep 14, 2022

mingmwang commented Sep 9, 2022 •

edited

alamb commented Sep 9, 2022 •

edited

yahoNanJing commented Sep 10, 2022 •

edited

mingmwang commented Sep 13, 2022 •

edited

tustvold commented Sep 13, 2022 •

edited