Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ability to get "source" / and error stack from an ArrowError to help debugging #2725

Open
alamb opened this issue Sep 14, 2022 · 10 comments
Open
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@alamb
Copy link
Contributor

alamb commented Sep 14, 2022

(note I am filing this in arrow-rs rather than datafusion as the same applies to lower level arrow errors and we would love to follow the same model in both projects and because it just came up in the context of refactoring the arrow crate: #2711 (comment))

I also harbor perhaps unrealistic dreams, we can do something in arrow that is reasonable and show it works in a real set of projects, and then write about it / blog about it, use that as a bully pulpit to move some of the rust error projects along more speedily

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

When developing IOx (which uses DataFusion, which uses Arrow) when an error is encountered, I get an error like this:

thread 'influxrpc::read_group::test_grouped_series_set_plan_first' panicked at 'running plans: NotImplemented("Physical plan does not support logical expression selector_first(#h2o.b AS b, #h2o.time)")', query_tests/src/influxrpc/util.rs:16:10
stack backtrace:

This is challenging for several reasons:

  1. I don't know were this error originated (aka what module / source / line number). The source and line number reported (util.rs in this case) is where the error was detected not the source of the error.
  2. Even if I can find where the error originated (by grepping the source code of arrow and datatafusion) I don't know what the call stack was when that code was invoked. In this case I don't know what type of plan was being converted when the error happened.

What I typically do (please don't laugh) to debug such errors is:

  1. Patch my project to use a local checkout of datafusion and/or arrow source code
  2. Change the error site in datafusion and/or arrow from returning an error to panic!
  3. Rerun my test with RUST_BACKTRACE=1 set so I can get a backtrace

While this works it is both annoying and I suspect more than most users will be willing to bear as they don't already have local checkouts of arrow and datafusion.

What I really want is I want is errors in Arrow (and then also Datafusion) to provide a trace like what python provides. Stylistically:

DataFusionError::Anonymous(), make_plan, /my/datafusion_checkout/physical_planner.rs:100 
  DataFusionError::NotImplemented("Physical plan does not support logical expression selector_first(#h2o.b AS b, #h2o.time)"), make_plan,  /my/arrow/checkout/datatypes/schema.rs:42
    ArrowError::NotImplemented("RecordBatch schema doesn't match), /my/arrow/checkout/datatypes/schema.rs:42

Where the Anonymous is meant to signify a location where ? was used to propagate the error.

I can write flesh out these ideas more if anyone is interested.

Items I do not care about

Note I don't want to get into a discussion about providing runtime backtrace support. I would be very happy to only have function names (ideally with line numbers) from arrow / datafusion and any other projects that add the support explicitly. I would also be fine with using a proper backtrace in the implementation but I don't want this ticket to get bogged down like other RFCs seem to have)

I also would like to avoid requiring every error site to have a different error enum to get this feature (though whatever we do here shouldn't prevent adding new error variants for error structured reporting)

Describe the solution you'd like
What I would like is some way to annotate Arrow errors with the source location it came from as well as any causing error and a way to walk the chain. Perhaps we can start with some macros -- here is a crazy idea to start thinking about

// in source.rs
if do_the_thing() {
  Ok(())
} else {
  arrow_err!(Schema, "The thing errored: {}", 42)
}

Which on error would result in an error like

enum ArrowError {
  SchemaError { 
    message: String,
    source_file: Option<&'static str>,
    source_line: Option<&'static str>,
    cause: Option<Box<dyn Error>>,
  }
  ...
}

We could potentially use a similar macro for annotating the result of ? (somehow)

Describe alternatives you've considered
We could wait for any of the various Rust RFCs in error handling to stabilize such as https://rust-lang.github.io/rfcs/0201-error-chaining.html or https://rust-lang.github.io/rfcs/2504-fix-error.html.

However, given how long they have been outstanding I am not going to hold my breath.

Additional context
@yahoNanJing and @mingmwang are discussing similar things on apache/datafusion#3410 (comment), I believe

The most recent time I hit this was in https://github.com/influxdata/influxdb_iox/pull/5606

@tustvold @andygrove and I discussed error handling in Arrow as well in #2711 (comment)

@alamb alamb added the enhancement Any new improvement worthy of a entry in the changelog label Sep 14, 2022
@tustvold
Copy link
Contributor

tustvold commented Sep 14, 2022

Thank you for starting this conversation.

Further to the great points you raise, I think there are a couple of things worth highlighting about the approach that I think work relatively well and I think we should try to keep:

  • Additional information can be added to an error without this being a breaking change or necessitating new types or variants, there are downsides to using opaque strings, but their opacity is sometimes a strength
  • The use of common error types allows for effective use of ?, without having to derive From, which is subject to orphan restrictions, or use map_err or similar which bloats the actual code
  • The error handling is not doing anything sophisticated, this makes it easy for contributors, maintainers and users
  • The error handling by and large is pretty unobtrusive, the interface between ArrowError, DataFusionError, etc.. stands out in part because most of the time the error handling "just works"

I think there are also some points worth highlighting, that you allude to, but that are weaknesses of the current approach

  • The current variants are somewhat arbitrary, they are both too high level to be programmatically useful, and lower-level than necessary for just printing context to the user
  • It is non-trivial to add additional context to an existing error, for example, annotating a Not Implemented error with information about in what DataFusion operator it was encountered
  • There is no obvious way to programmatically get the "underlying cause" and potentially use this to drive high-level error handling (Enhance TaskContext and add task failure root cause datafusion#3410), the errors are too opaque for this need
  • Arrow, DataFusion and Ballista all have their own error enumeration, which results in cumbersome mapping logic back and forth, and seems unnecessary when they have identical needs for error handling

@alamb
Copy link
Contributor Author

alamb commented Sep 14, 2022

which results in cumbersome mapping logic back and forth, and seems unnecessary when they have identical needs for error handling

I agree the mapping logic is cumbersome and unnecessary in most cases (when it mostly needed to return the correct error type). I would say the errors needed are different, though the general purpose handling of them I suppose could be said to be the same

@tustvold
Copy link
Contributor

tustvold commented Sep 14, 2022

I would like to propose switching to using anyhow for a couple of reasons, but appreciate other opinions may differ.

Ecosystem Adoption

There is a wide ecosystem already using anyhow, it sits at 53M downloads currently. For reference tokio sits at 66M.

This means:

Simple and Ergonomic

It is easy to use, unobtrusive, and allows focusing on the actual logic

fn get_cluster_info() -> Result<ClusterMap> {
    let config = std::fs::read_to_string("cluster.json")?;
    let map: ClusterMap = serde_json::from_str(&config)?;
    Ok(map)
}

We can easily return different types of errors, without needing to define any custom structs, enumerations, etc... It just works and gets out of the way.

Additional Context

Can easily add additional context to an existing error, without this resulting in a breaking API change or needing to change focus from the function to some error declaration at the top of, or in some other file entirely.

fn main() -> Result<()> {
    ...
    it.detach().context("Failed to detach the important thing")?;

    let content = std::fs::read(path)
        .with_context(|| format!("Failed to read instrs from {}", path))?;
    ...
}

This will then get printed as

Error: Failed to read instrs from ./path/to/instrs.json

Caused by:
    No such file or directory (os error 2)

Backtraces

Enabling the backtrace feature flag enables capturing and displaying backtraces if RUST_BACKTRACE=1 or RUST_LIB_BACKTRACE=1.

It is worth noting capturing backtraces is not cheap, however, in arrow and DataFusion errors are not used for control flow, but for exceptions, where the cost of a backtrace will be negligible compared to the cost of the failing network call, columnar kernel, etc...

Error: Failed to read instrs from ./path/to/instrs.json

Caused by:
    No such file or directory (os error 2)

Stack backtrace:
   0: <E as anyhow::context::ext::StdError>::ext_context
             at /git/anyhow/src/backtrace.rs:26
   1: core::result::Result<T,E>::map_err
             at /git/rustc/src/libcore/result.rs:596
   2: anyhow::context::<impl anyhow::Context<T,E> for core::result::Result<T,E>>::with_context
             at /git/anyhow/src/context.rs:58
   3: testing::main
             at src/main.rs:5
   4: std::rt::lang_start
             at /git/rustc/src/libstd/rt.rs:61
   5: main
   6: __libc_start_main
   7: _start

Downcasting

A given anyhow::Error can be downcast to a particular concrete error type, if that appears anywhere in the context chain. This means that

  • Adding additional context is not a breaking change
  • Programmatic handling of errors is still possible

It is hard to think of cases where it would make sense to perform this form of error handling within the context of a query engine, but perhaps we could envisage some system that can respond to OutOfMemory errors during query processing and select some alternative evaluation strategy, or something (I am aware there are better ways to handle this than relying on error propagation).

#[derive(Debug, Error)]
struct OutOfMemory {
   ....
}

fn fallible_alloc() -> Result<T> {
    ...
    bail!(OutOfMemory)
}

fn my_operator() -> Result<()> {
    let foo = fallible_alloc().context("my_operator")?;
    ...
}

fn handler() {
    match my_operator() {
        Ok(_) => todo!(),
        Err(e) if e.is::<OutOfMemory>() => {
            todo!(),
        }
        Err(e) => todo!()
    }
}

Here we can see that downcasting works, even when there are intervening operators adding additional context to aid in debugging

@alamb
Copy link
Contributor Author

alamb commented Sep 14, 2022

I think the example of reacting to OOM in #2725 (comment) looks very nice @tustvold

I read up on anyhow and I like what I saw. The docs on how display worked make a lot of sense to me https://docs.rs/anyhow/1.0.65/anyhow/struct.Error.html 👍

Also, the number of other crates which use it looks like it is widely adopted across the ecosystem (6172 when I looked): https://crates.io/crates/anyhow/reverse_dependencies.

I think the next step might be to make a small POC of what "switching to using anyhow" might look like and we could make a POC in datafusion to see what types of downstream changes would be needed potentially (I am happy to try the datafusion PR)

@alamb
Copy link
Contributor Author

alamb commented Sep 14, 2022

It is worth noting capturing backtraces is not cheap, however, in arrow and DataFusion errors are not used for control flow, but for exceptions, where the cost of a backtrace will be negligible compared to the cost of the failing network call, columnar kernel, etc...

It is also appears that the backtrace capture is controlled by environment variable so we can have even finer grained control

@tustvold
Copy link
Contributor

I think the next step might be to make a small POC

I think this is probably best done after #2594, not only will this avoid huge merge conflicts, but it would also provide a nice way to incrementally migrate one crate at a time, starting with the leaf crates.

This will also give some time to build consensus around this as a path forward

@crepererum
Copy link
Contributor

Notes on anyhow and a DataFusion PoV: It would be nice if we could have some kind of machine-readable "error kind", e.g. "resource exhausted". This would be important to generate proper end-user error codes (e.g. for HTTP interfaces). So I would expect that we still use common "leaf" error types.

@tustvold
Copy link
Contributor

tustvold commented Dec 9, 2022

Unless anybody objects I plan to start this process in the coming days, starting with the parquet crate as it is fairly self-contained, and there have been further community requests for this functionality in #3285.

I will send a note out to the mailing list shortly

@alamb
Copy link
Contributor Author

alamb commented Dec 9, 2022

I think making a proposal in code is probably a good next step so we can see exactly what it would look like

@alamb
Copy link
Contributor Author

alamb commented Jan 30, 2023

I think #3567 fixes the first part of this issue (getting the source)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

No branches or pull requests

3 participants