Ability to get "source" / and error stack from an ArrowError to help debugging #2725

alamb · 2022-09-14T11:26:17Z

(note I am filing this in arrow-rs rather than datafusion as the same applies to lower level arrow errors and we would love to follow the same model in both projects and because it just came up in the context of refactoring the arrow crate: #2711 (comment))

I also harbor perhaps unrealistic dreams, we can do something in arrow that is reasonable and show it works in a real set of projects, and then write about it / blog about it, use that as a bully pulpit to move some of the rust error projects along more speedily

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

When developing IOx (which uses DataFusion, which uses Arrow) when an error is encountered, I get an error like this:

thread 'influxrpc::read_group::test_grouped_series_set_plan_first' panicked at 'running plans: NotImplemented("Physical plan does not support logical expression selector_first(#h2o.b AS b, #h2o.time)")', query_tests/src/influxrpc/util.rs:16:10
stack backtrace:

This is challenging for several reasons:

I don't know were this error originated (aka what module / source / line number). The source and line number reported (util.rs in this case) is where the error was detected not the source of the error.
Even if I can find where the error originated (by grepping the source code of arrow and datatafusion) I don't know what the call stack was when that code was invoked. In this case I don't know what type of plan was being converted when the error happened.

What I typically do (please don't laugh) to debug such errors is:

Patch my project to use a local checkout of datafusion and/or arrow source code
Change the error site in datafusion and/or arrow from returning an error to panic!
Rerun my test with RUST_BACKTRACE=1 set so I can get a backtrace

While this works it is both annoying and I suspect more than most users will be willing to bear as they don't already have local checkouts of arrow and datafusion.

What I really want is I want is errors in Arrow (and then also Datafusion) to provide a trace like what python provides. Stylistically:

DataFusionError::Anonymous(), make_plan, /my/datafusion_checkout/physical_planner.rs:100 
  DataFusionError::NotImplemented("Physical plan does not support logical expression selector_first(#h2o.b AS b, #h2o.time)"), make_plan,  /my/arrow/checkout/datatypes/schema.rs:42
    ArrowError::NotImplemented("RecordBatch schema doesn't match), /my/arrow/checkout/datatypes/schema.rs:42

Where the Anonymous is meant to signify a location where ? was used to propagate the error.

I can write flesh out these ideas more if anyone is interested.

Items I do not care about

Note I don't want to get into a discussion about providing runtime backtrace support. I would be very happy to only have function names (ideally with line numbers) from arrow / datafusion and any other projects that add the support explicitly. I would also be fine with using a proper backtrace in the implementation but I don't want this ticket to get bogged down like other RFCs seem to have)

I also would like to avoid requiring every error site to have a different error enum to get this feature (though whatever we do here shouldn't prevent adding new error variants for error structured reporting)

Describe the solution you'd like
What I would like is some way to annotate Arrow errors with the source location it came from as well as any causing error and a way to walk the chain. Perhaps we can start with some macros -- here is a crazy idea to start thinking about

// in source.rs
if do_the_thing() {
  Ok(())
} else {
  arrow_err!(Schema, "The thing errored: {}", 42)
}

Which on error would result in an error like

enum ArrowError {
  SchemaError { 
    message: String,
    source_file: Option<&'static str>,
    source_line: Option<&'static str>,
    cause: Option<Box<dyn Error>>,
  }
  ...
}

We could potentially use a similar macro for annotating the result of ? (somehow)

Describe alternatives you've considered
We could wait for any of the various Rust RFCs in error handling to stabilize such as https://rust-lang.github.io/rfcs/0201-error-chaining.html or https://rust-lang.github.io/rfcs/2504-fix-error.html.

However, given how long they have been outstanding I am not going to hold my breath.

Additional context
@yahoNanJing and @mingmwang are discussing similar things on apache/datafusion#3410 (comment), I believe

The most recent time I hit this was in https://github.com/influxdata/influxdb_iox/pull/5606

@tustvold @andygrove and I discussed error handling in Arrow as well in #2711 (comment)

The text was updated successfully, but these errors were encountered:

tustvold · 2022-09-14T11:47:32Z

Thank you for starting this conversation.

Further to the great points you raise, I think there are a couple of things worth highlighting about the approach that I think work relatively well and I think we should try to keep:

Additional information can be added to an error without this being a breaking change or necessitating new types or variants, there are downsides to using opaque strings, but their opacity is sometimes a strength
The use of common error types allows for effective use of ?, without having to derive From, which is subject to orphan restrictions, or use map_err or similar which bloats the actual code
The error handling is not doing anything sophisticated, this makes it easy for contributors, maintainers and users
The error handling by and large is pretty unobtrusive, the interface between ArrowError, DataFusionError, etc.. stands out in part because most of the time the error handling "just works"

I think there are also some points worth highlighting, that you allude to, but that are weaknesses of the current approach

The current variants are somewhat arbitrary, they are both too high level to be programmatically useful, and lower-level than necessary for just printing context to the user
It is non-trivial to add additional context to an existing error, for example, annotating a Not Implemented error with information about in what DataFusion operator it was encountered
There is no obvious way to programmatically get the "underlying cause" and potentially use this to drive high-level error handling (Enhance TaskContext and add task failure root cause datafusion#3410), the errors are too opaque for this need
Arrow, DataFusion and Ballista all have their own error enumeration, which results in cumbersome mapping logic back and forth, and seems unnecessary when they have identical needs for error handling

alamb · 2022-09-14T12:27:53Z

which results in cumbersome mapping logic back and forth, and seems unnecessary when they have identical needs for error handling

I agree the mapping logic is cumbersome and unnecessary in most cases (when it mostly needed to return the correct error type). I would say the errors needed are different, though the general purpose handling of them I suppose could be said to be the same

tustvold · 2022-09-14T13:37:33Z

I would like to propose switching to using anyhow for a couple of reasons, but appreciate other opinions may differ.

Ecosystem Adoption

There is a wide ecosystem already using anyhow, it sits at 53M downloads currently. For reference tokio sits at 66M.

This means:

It is well supported and maintained by people other than ourselves
It is well documented and understood by a wide subset of users
It integrates with, and is being used to drive improvements to the standard Error abstraction in Rust - https://github.com/rust-lang/rfcs/blob/master/text/2504-fix-error.md

Simple and Ergonomic

It is easy to use, unobtrusive, and allows focusing on the actual logic

fn get_cluster_info() -> Result<ClusterMap> {
    let config = std::fs::read_to_string("cluster.json")?;
    let map: ClusterMap = serde_json::from_str(&config)?;
    Ok(map)
}

We can easily return different types of errors, without needing to define any custom structs, enumerations, etc... It just works and gets out of the way.

Additional Context

Can easily add additional context to an existing error, without this resulting in a breaking API change or needing to change focus from the function to some error declaration at the top of, or in some other file entirely.

fn main() -> Result<()> {
    ...
    it.detach().context("Failed to detach the important thing")?;

    let content = std::fs::read(path)
        .with_context(|| format!("Failed to read instrs from {}", path))?;
    ...
}

This will then get printed as

Error: Failed to read instrs from ./path/to/instrs.json

Caused by:
    No such file or directory (os error 2)

Backtraces

Enabling the backtrace feature flag enables capturing and displaying backtraces if RUST_BACKTRACE=1 or RUST_LIB_BACKTRACE=1.

It is worth noting capturing backtraces is not cheap, however, in arrow and DataFusion errors are not used for control flow, but for exceptions, where the cost of a backtrace will be negligible compared to the cost of the failing network call, columnar kernel, etc...

Error: Failed to read instrs from ./path/to/instrs.json

Caused by:
    No such file or directory (os error 2)

Stack backtrace:
   0: <E as anyhow::context::ext::StdError>::ext_context
             at /git/anyhow/src/backtrace.rs:26
   1: core::result::Result<T,E>::map_err
             at /git/rustc/src/libcore/result.rs:596
   2: anyhow::context::<impl anyhow::Context<T,E> for core::result::Result<T,E>>::with_context
             at /git/anyhow/src/context.rs:58
   3: testing::main
             at src/main.rs:5
   4: std::rt::lang_start
             at /git/rustc/src/libstd/rt.rs:61
   5: main
   6: __libc_start_main
   7: _start

Downcasting

A given anyhow::Error can be downcast to a particular concrete error type, if that appears anywhere in the context chain. This means that

Adding additional context is not a breaking change
Programmatic handling of errors is still possible

It is hard to think of cases where it would make sense to perform this form of error handling within the context of a query engine, but perhaps we could envisage some system that can respond to OutOfMemory errors during query processing and select some alternative evaluation strategy, or something (I am aware there are better ways to handle this than relying on error propagation).

#[derive(Debug, Error)]
struct OutOfMemory {
   ....
}

fn fallible_alloc() -> Result<T> {
    ...
    bail!(OutOfMemory)
}

fn my_operator() -> Result<()> {
    let foo = fallible_alloc().context("my_operator")?;
    ...
}

fn handler() {
    match my_operator() {
        Ok(_) => todo!(),
        Err(e) if e.is::<OutOfMemory>() => {
            todo!(),
        }
        Err(e) => todo!()
    }
}

Here we can see that downcasting works, even when there are intervening operators adding additional context to aid in debugging

alamb · 2022-09-14T16:28:30Z

I think the example of reacting to OOM in #2725 (comment) looks very nice @tustvold

I read up on anyhow and I like what I saw. The docs on how display worked make a lot of sense to me https://docs.rs/anyhow/1.0.65/anyhow/struct.Error.html 👍

Also, the number of other crates which use it looks like it is widely adopted across the ecosystem (6172 when I looked): https://crates.io/crates/anyhow/reverse_dependencies.

I think the next step might be to make a small POC of what "switching to using anyhow" might look like and we could make a POC in datafusion to see what types of downstream changes would be needed potentially (I am happy to try the datafusion PR)

alamb · 2022-09-14T16:30:03Z

It is worth noting capturing backtraces is not cheap, however, in arrow and DataFusion errors are not used for control flow, but for exceptions, where the cost of a backtrace will be negligible compared to the cost of the failing network call, columnar kernel, etc...

It is also appears that the backtrace capture is controlled by environment variable so we can have even finer grained control

tustvold · 2022-09-14T17:16:22Z

I think the next step might be to make a small POC

I think this is probably best done after #2594, not only will this avoid huge merge conflicts, but it would also provide a nice way to incrementally migrate one crate at a time, starting with the leaf crates.

This will also give some time to build consensus around this as a path forward

crepererum · 2022-11-30T09:44:10Z

Notes on anyhow and a DataFusion PoV: It would be nice if we could have some kind of machine-readable "error kind", e.g. "resource exhausted". This would be important to generate proper end-user error codes (e.g. for HTTP interfaces). So I would expect that we still use common "leaf" error types.

tustvold · 2022-12-09T10:21:28Z

Unless anybody objects I plan to start this process in the coming days, starting with the parquet crate as it is fairly self-contained, and there have been further community requests for this functionality in #3285.

I will send a note out to the mailing list shortly

alamb · 2022-12-09T14:51:50Z

I think making a proposal in code is probably a good next step so we can see exactly what it would look like

alamb · 2023-01-30T14:40:00Z

I think #3567 fixes the first part of this issue (getting the source)

alamb added the enhancement Any new improvement worthy of a entry in the changelog label Sep 14, 2022

This was referenced Sep 14, 2022

Split out arrow-schema (#2594) #2711

Merged

Enhance TaskContext and add task failure root cause apache/datafusion#3410

Open

This was referenced Sep 20, 2022

Improve error propagation delta-io/delta-rs#812

Open

Reduce dependencies of datafusion-sql crate apache/datafusion#3566

Merged

Move Object Store to using std::io::Error #2348

Closed

tustvold mentioned this issue Dec 7, 2022

Add External variant to ParquetError #3285

Closed

This was referenced Jan 19, 2023

Panic on Key Overflow in Dictionary Builders #3562

Closed

Add external variant to ParquetError (#3285) #3574

Merged

tustvold mentioned this issue Aug 23, 2023

Add underlying std::io::Error to IoError and add IpcError variant #4726

Merged

tustvold mentioned this issue Nov 17, 2023

Improve C Data Interface and Add Integration Testing Entrypoints #5080

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ability to get "source" / and error stack from an ArrowError to help debugging #2725

Ability to get "source" / and error stack from an ArrowError to help debugging #2725

alamb commented Sep 14, 2022 •

edited

tustvold commented Sep 14, 2022 •

edited

alamb commented Sep 14, 2022

tustvold commented Sep 14, 2022 •

edited

alamb commented Sep 14, 2022

alamb commented Sep 14, 2022

tustvold commented Sep 14, 2022

crepererum commented Nov 30, 2022

tustvold commented Dec 9, 2022

alamb commented Dec 9, 2022 •

edited

alamb commented Jan 30, 2023

Ability to get "source" / and error stack from an ArrowError to help debugging #2725

Ability to get "source" / and error stack from an ArrowError to help debugging #2725

Comments

alamb commented Sep 14, 2022 • edited

tustvold commented Sep 14, 2022 • edited

alamb commented Sep 14, 2022

tustvold commented Sep 14, 2022 • edited

alamb commented Sep 14, 2022

alamb commented Sep 14, 2022

tustvold commented Sep 14, 2022

crepererum commented Nov 30, 2022

tustvold commented Dec 9, 2022

alamb commented Dec 9, 2022 • edited

alamb commented Jan 30, 2023

alamb commented Sep 14, 2022 •

edited

tustvold commented Sep 14, 2022 •

edited

tustvold commented Sep 14, 2022 •

edited

alamb commented Dec 9, 2022 •

edited