
Client API error handling overhaul in a new submission call #5244

Draft
dpc wants to merge 3 commits into master from 24-05-08-redo-submit-tx

Conversation

@dpc (Contributor) commented May 9, 2024

Introduce a new tx submission call

See #5238

To achieve that, some additional changes were needed.

Overhaul error handling

Use JSON-RPC error codes and ApiError as the carrier for API errors. This is a big change that requires some thinking through.

The JSON-RPC standard says that all non-server-defined error codes are for application use; jsonrpsee calls these "Call" errors.

It would generally make our errors much more idiomatic, which should make life easier for third-party clients and tools. Over time we should stop relying on error messages and use error codes, which now become first-class citizens.

The errors will no longer be serialized as a whole, but a JSON-RPC error also has a data field that can carry free-form extra data. If needed, even consensus-encoded fields can be passed in it (like I do for DynInputError etc. here).

Adapt strategy error handling

The "call" errors need to get consensus on, in a similar way that normal values do. This requires some tweak to error strategy.

@dpc requested review from a team as code owners May 9, 2024 17:55
@dpc changed the title 24 05 08 redo submit tx → Client API error handling overhaul in a new submission call May 9, 2024
@dpc force-pushed the 24-05-08-redo-submit-tx branch from c7d2933 to 270856a May 9, 2024 18:11
@dpc force-pushed the 24-05-08-redo-submit-tx branch from 270856a to 19ac545 May 9, 2024 18:41
Cargo.toml (review thread resolved)
@elsirion added the "needs further review" label (Merged before reaching ideal amount of code review) May 14, 2024
@elsirion (Contributor) left a comment:

  • We should not use SerdeEncodable<Result<_, _>>; that's just horrible (not extensible, hard to introspect, …). The new error type looks good to me.
  • Sticking to standards is nice, but given that there likely won't be a second client impl, it probably isn't something we should expend too much energy on. Do error codes actually help us long term?
  • Touching the query strategy makes me feel uneasy, huge potential to break stuff and cause us a lot of debugging.

Comment on lines +186 to +195
// If data doesn't match, that's OK, just ignore it
for ((code, message), count) in &count_by_no_data {
    if num_peers.threshold() <= *count {
        return Some(ApiError {
            code: *code,
            message: message.clone(),
            data: None,
        });
    }
}
Contributor:

Is it generally true that we can ignore data? Feels like it differs case-by-case?

Contributor Author (@dpc):

The reality is that there can be consensus on the code, but not on the details. It's up to the caller to figure out what to do about it. I would expect most client calls to be OK with just a code. Snooping in the data seems brittle anyway, and I would expect it to be useful mostly for debugging.

Comment on lines +197 to +199
// If there's a consensus on the code, then go with that code, and pick any
// message. Better than nothing, and message is not all that important, maybe
// it is just a spelling difference.
Contributor:

That will be a debugging nightmare imo. Can't we somehow preserve all the messages? Why are we trying to treat n API calls like one? For state machines to make decisions we should use much clearer semantics and not just "maybe we agreed on the data, maybe on only the code".

Comment on lines +204 to +239
#[derive(Debug, Error, Clone, Eq, PartialEq, Serialize, Deserialize)]
#[serde(tag = "type")]
pub enum TransactionError {
    #[error("The transaction {txid} is unbalanced (in={inputs}, out={outputs}, fee={fee})")]
    UnbalancedTransaction {
        inputs: Amount,
        outputs: Amount,
        fee: Amount,
        txid: TransactionId,
    },
    #[error("The transaction's {txid} signature is invalid: tx={tx}, sig={sig}, key={key}")]
    InvalidSignature {
        txid: TransactionId,
        tx: String,
        sig: String,
        key: String,
    },
    #[error("The transaction's {txid} signature scheme is not supported: variant={variant}")]
    UnsupportedSignatureScheme { txid: TransactionId, variant: u64 },
    #[error("The transaction {txid} did not have the correct number of signatures")]
    InvalidWitnessLength { txid: TransactionId },
    #[error("The transaction {txid} had an invalid input at index {}: {}", .input_idx, .error)]
    Input {
        #[serde(with = "crate::encoding::as_hex")]
        error: DynInputError,
        input_idx: u64,
        txid: TransactionId,
    },
    #[error("The transaction {txid} had an invalid output at index {}: {}", .output_idx, .error)]
    Output {
        #[serde(with = "crate::encoding::as_hex")]
        error: DynOutputError,
        output_idx: u64,
        txid: TransactionId,
    },
}
Contributor:

Using this error is an improvement imo. nit: could factor out the txid.

Comment on lines +112 to +113
if let Err(e) = context.api().submit_transaction(tx.clone()).await {
    if let Some(call_error) = e.try_to_call_error(context.num_peers()) {
Contributor:

nit: could avoid nesting by using map_err
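
One possible shape of that flattening, as a sketch only (whether map_err or err() + and_then fits best depends on the surrounding function's return type, which isn't shown in this snippet):

if let Some(call_error) = context
    .api()
    .submit_transaction(tx.clone())
    .await
    .err()
    .and_then(|e| e.try_to_call_error(context.num_peers()))
{
    return call_error.message;
}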

if SUBMIT_ENDPOINT_API_VERSION <= context.core_api_version() {
    if let Err(e) = context.api().submit_transaction(tx.clone()).await {
        if let Some(call_error) = e.try_to_call_error(context.num_peers()) {
            return call_error.message;
Contributor:

I think this new message preserves most of the information the previous one had, so this seems ok.

Comment on lines +88 to 110
let call_error_count = self.errors.iter().filter(|(_, e)| e.is_call_err()).count();
let non_call_error_count = self.errors.len() - call_error_count;
if non_call_error_count == self.num_peers.one_honest() {
    // If there are enough non-application errors that there's no way to get consensus,
    // there's no reason to continue.
    QueryStep::Failure {
        general: Some(anyhow!(
            "Received {} out of {} non-call errors from peers: {}",
            self.num_peers.threshold(),
            self.num_peers,
            self.format_errors()
        )),
        peers: mem::take(&mut self.errors),
    }
} else if call_error_count == self.num_peers.threshold() {
    // For a call error to surface as a federation-level error, a threshold
    // of responses needs to be call errors.
    QueryStep::Failure {
        general: Some(anyhow!(
            "Received {} out of {} call errors from peers: {}",
            self.num_peers.threshold(),
            self.num_peers,
            self.format_errors()
Contributor:

Do we really need the query strategy to be more clever here? I guess it makes sense for later optimizations that back off if we are offline, but for now we'll re-try anyway afaik, so it doesn't matter whether we keep re-trying inside the strategy or via external loops.

Also, these query strategies are so central to the client that any behavior change should get a test that it actually does the intended thing (would probably be worth fuzz testing even).

Contributor:

We need the different thresholds here since a query strategy should fail on f + 1 RPC errors, whereas for call errors (like TransactionError) we need a threshold of 2f + 1 to achieve consensus, so we cannot use f + 1 for call errors. Imo RPC errors and TransactionErrors are two very different concepts and should not be squashed into the same struct just because they are both called "errors".
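
To make the two thresholds concrete, a small illustrative example assuming the usual n = 3f + 1 sizing; the helper names echo the one_honest()/threshold() calls in the quoted strategy code, but these free functions are only a sketch, not the crate's API:

fn one_honest(n: usize) -> usize {
    let f = (n - 1) / 3; // maximum number of faulty peers
    f + 1
}

fn threshold(n: usize) -> usize {
    let f = (n - 1) / 3;
    n - f // equals 2f + 1 when n = 3f + 1
}

fn main() {
    // 4-peer federation (f = 1): 2 transport/RPC failures already make the
    // 3-peer threshold unreachable, while 3 matching call errors are needed
    // before the error counts as the federation-level result.
    assert_eq!(one_honest(4), 2);
    assert_eq!(threshold(4), 3);
}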

Contributor:

Squashing these errors together requires us to always remember to call is_call_err to differentiate them when handling them in the client; for example, only retry when is_call_err returns false.

}
if SUBMIT_ENDPOINT_API_VERSION <= context.core_api_version() {
    if let Err(e) = context.api().submit_transaction(tx.clone()).await {
        if let Some(call_error) = e.try_to_call_error(context.num_peers()) {
Contributor:

Switching to try_to_call_error changes the semantics here and relaxes consensus requirements. Previously peers had to agree on the error data, now it can be only the error code in the worst case. Do we want that? Have we thought of all the potential side-effects?

@joschisan (Contributor) commented May 21, 2024

Concept NACK: RPC errors and transaction errors are two completely different concepts and should not be squashed into one; for example, RPC errors can be retried while transaction errors are final and usually imply a critical failure like a faulty federation. Squashing them into one requires us to always differentiate via the is_call_err method and apply different logic, like different thresholds in the query strategy. The additional complexity introduced in code as central as the query strategies is imo not worth it; I feel very strongly about this, as query strategies are generally very easy to mess up.

In fact it seems to me like this PR already introduces a race condition, since transaction errors will now not be retried by ThresholdConsensus anymore. ThresholdConsensus is designed for updating values, so it will retry to get new values if it cannot establish consensus; however, it will not retry RPC errors, so it still fails in case of an offline federation. Now consider the case where a transaction becomes valid - like claiming an incoming contract - in a 3/4 federation.

Assume guardian 0 is offline while the incoming contract is confirmed (hence 1, 2 and 3 are online). The client will see the incoming contract as confirmed and create a transaction to claim it by submitting it to all peers via ThresholdConsensus. Now guardian 0 comes back online but is behind the federation's status, while guardian 3 goes offline. The client will now receive a transaction error on submission from guardian 0 since it is behind, Ok(txid) from guardians 1 and 2, and an RPC error from guardian 3. Currently, ThresholdConsensus would retry guardian 0 until it returns Ok(txid) as well and return Ok(txid) as the federation's consensus. However, with this PR guardian 0 is not retried, and the ErrorStrategy also cannot establish a threshold of either call or non-call errors (3 and 2 respectively) since it only has one of each.

In the current code this would lead to a panic with "Query strategy ran out of peers to query without returning a result"; with your PR, however, the query strategy would at least fail and return an error.

As you can see, the behaviour of those strategies is quite subtle and I would avoid further complexity here at all costs to keep this maintainable. Actually, I would much prefer to further simplify the query strategies instead, like I did in #5308.

@dpc (Contributor, Author) commented May 22, 2024

  • Sticking to standards is nice, but given that there likely won't be a second client impl, it probably isn't something we should expend too much energy on. Do error codes actually help us long term?

We need some way to enumerate potential failure cases. Using a Decodable enum was kinda trying to do that, but it is just too rigid in every other respect.

Error codes are a robust way to enumerate and distinguish error/outcome cases.

And since the client side needs to get consensus on the error itself, getting consensus on the error code seems most natural. Otherwise, if consensus on the application error is to be established on the entirety of the error (like its encoding or serialization), it's impossible to even fix a typo in an error.

I'm not strongly set on JSON-RPC error codes ... but they do have nice properties, like not requiring another level of wrapping an error inside a success on the wire.

  • Touching the query strategy makes me feel uneasy, huge potential to break stuff and cause us a lot of debugging.

There are two different aspects here:

  • How is the application-level (aka transaction) error represented on the wire.
  • How is it implemented in the client code.

Conceptually, one way or the other, the result of an API call to a peer has 3 outcomes:

enum ApiResult {
    TransportSuccess(CallOutcome),
    TransportError,
}

enum CallOutcome {
    ApplicationError(/* ... */),
    ApplicationSuccess,
}

The code in this PR is clumsy w.r.t. the query strategy, because it doesn't convert the underlying JSON-RPC error to something like ApiResult. I think with a bit better refactoring, the query strategy logic could largely remain intact.

As you can see the behaviour of those strategies is quite subtle and I would avoid further complexity here at all costs to keep this maintainable.

Instead of being scared of it, maybe we should add some tests.

Query strategies are (or at least should be) very convenient to test. We call them with responses, they give back outputs that we can assert.

There are lots of places in the code that deal with mutable state, and have side-effects, etc. I don't find query strategies particularly scary.

@dpc (Contributor, Author) commented May 22, 2024

To sum up:

  • I agree that the changes to the error handling query strategies here are clumsy, but I think it can be done better. It didn't occur to me at the time, and I wanted to wrap it up so there's something to talk about.
  • When the client needs to establish some form of consensus on the non-success application-level outcome, what exactly should it do that on? Codes, error messages, some serialization of the whole thing? Whatever you do it on becomes near-unevolvable, as changing it will break client-side consensus forming. Because of that, it seems to me that some form of minimal "codes" is the way to go. "message" and "data" in my mind should be used for debugging purposes only, or in very special circumstances. And I feel (much less strongly) that if it will be "codes", we might as well flatten it into a "jsonrpc error", as it has code, message and data.
  • If the client code would "unflatten" the flattened jsonrpc error early, so that the query strategy logic can remain the same, would using jsonrpc error codes be OK with you? If not - what else would you prefer and why? (See the sketch after this list.)
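
A rough sketch of the "unflatten early" idea, reusing the ApiResult/CallOutcome shape from the earlier comment; all concrete types below (PeerRpcError, the code/message payloads) are placeholders, not jsonrpsee's or fedimint's actual API:

// Placeholder for whatever error the peer connection layer returns.
enum PeerRpcError {
    // Transport/server-side failure (timeout, connection refused, ...).
    Transport(String),
    // Application-defined "call" error carried as a JSON-RPC error object.
    Call { code: i32, message: String },
}

enum CallOutcome<T> {
    ApplicationSuccess(T),
    ApplicationError { code: i32, message: String },
}

enum ApiResult<T> {
    TransportSuccess(CallOutcome<T>),
    TransportError(String),
}

// Convert each peer response up front so the query strategies keep operating
// on a plain three-way outcome instead of inspecting jsonrpc errors themselves.
fn unflatten<T>(result: Result<T, PeerRpcError>) -> ApiResult<T> {
    match result {
        Ok(value) => ApiResult::TransportSuccess(CallOutcome::ApplicationSuccess(value)),
        Err(PeerRpcError::Call { code, message }) => {
            ApiResult::TransportSuccess(CallOutcome::ApplicationError { code, message })
        }
        Err(PeerRpcError::Transport(msg)) => ApiResult::TransportError(msg),
    }
}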

@elsirion @joschisan

@elsirion (Contributor) commented:

First of all, I think @joschisan, @dpc and I should get on a call about this. You both have valid points and trying to get to a solution that everyone can agree on is very hard asynchronously.

Using a Decodable enum was kinda trying to do that, but it is just too rigid

Could we just add a default variant? We need to model possible variants very rigidly here anyway since adding an unknown error that old clients can't interpret could lead to backwards incompatibility (e.g. they wouldn't know if it's retryable).
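
For reference, one way the "default variant" idea could look on the serde side of the TransactionError enum quoted above; this is a sketch only (the catch-all variant and its name are hypothetical, and it does not help with any consensus encoding of the error):

use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(tag = "type")]
pub enum TransactionErrorCompat {
    // ...the existing variants would go here...
    InvalidWitnessLength { txid: u64 }, // txid type simplified for this sketch
    // Old clients decode any variant they don't know yet into this catch-all
    // instead of failing to deserialize -- at the cost of not knowing whether
    // the unknown error is retryable.
    #[serde(other)]
    Unknown,
}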

Otherwise if consensus on the application error is to be established on the entirety of the error (like it's encoding or serialization), it's impossible to even fix a typo in an error.

The encoding of the error in question doesn't use strings, but enums (and thus indices).

Query strategies are (or at least should be) very convenient to test. We call them with responses, they give back outputs that we can assert.

Well, I remember us having some very nasty bugs resulting from subtle errors in query strategies that @joschisan debugged for days back then. These were leading to very hard to debug, timing and online-ness dependent payment failures and loss of funds. Not messing them up again is a high priority.

I second that they should be tested, but a comprehensive test suite would need to simulate different response orders, missing responses, etc. Fuzzing that would be great imo or possibly even exhaustive testing if the combinatorial complexity isn't too bad.

@dpc (Contributor, Author) commented May 22, 2024

Could we just add a default variant?

We could, but if the query strategy is just comparing the whole encoding, then any new variant will not be considered the same error, even if it's supposed to be.

The encoding of the error in question doesn't use strings, but enums (and thus indices).

Right, my bad. So let's say - adding a new field, so it's possible to print the extra information. Adding a new field requires a new variant, even if that new variant means exactly the same thing as the old variant and just carries more data.

The variants are an enumeration of possible ... encoded states. Error codes are an enumeration of possible "not OK logical outcomes". Coupling them is the issue.

We need to model possible variants very rigidly here anyway since adding an unknown error that old clients can't interpret could lead to backwards incompatibility (e.g. they wouldn't know if it's retryable).

Good point. We might not be able to add any new error codes to existing RPC calls at all. If you need more error codes, it needs to be a new API call, in a new API version.

BTW. That also made me realize that in a submit, possibly multiple inputs/outputs can have issues, so why would we return only the first one? That could possibly spook some logic somewhere.

Also - right now we completely ignore the transaction error details. If enough peers return a transaction error, the details are just ignored. So possibly submit should just return a single application error code "transaction invalid" and that's about it. Everything else would be just "extra metadata" for debugging purposes.

I know that @joschisan was describing submit errors being useful for lnv2, but maybe that should be facilitated by another, module-specific call? After the lnv2 client module's tx gets "transaction invalid" from "submit", it can call some extra endpoint to figure out what to do about it. This way it's actually possible to evolve the whole thing.

Module specific errors being returned from "submit" seems just wrong anyway.

I second that they should be tested, but a comprehensive test suite would need to simulate different response orders, missing responses, etc.

The timing doesn't matter, only the order of responses. So nothing fancy is really needed:

let mut query_strategy = SomeQueryStrategy::new();
assert_eq!(query_strategy.process(PeerId(0), SomeFakeResponse { a: 1 }), QueryStep::Continue);
assert_eq!(query_strategy.process(PeerId(1), SomeFakeResponse { a: 1 }), QueryStep::Continue);
assert_eq!(query_strategy.process(PeerId(2), SomeFakeResponse { a: 1 }), QueryStep::Success);

Sure, one could write a property test (fuzzing seems too low level, property testing fits better here) to cover a lot of combinations, but in practice there are only a handful of boundary conditions worth testing to guard against dpc accidentally breaking them in a freak refactoring accident.

@dpc marked this pull request as draft May 22, 2024 22:40
@elsirion (Contributor) commented May 23, 2024

BTW. That also made me realize that in a submit, possibly multiple inputs/outputs can have issues, so why would we return only the first one?

Why waste compute on validating a tx that will not be accepted?

I know that joschisan was describing submit errors being useful for lnv2, but maybe that should be facilitated by another module-specific call? After lnv2-client-module tx gets "transaction invalid" from "submit", it can call some extra endpoint to figure out what to do about. This way it's actually possible to evolve the whole thing.

Then we'd need to save information about rejected transactions, which iirc @joschisan specifically removed when migrating to AlephBFT (likely because of the DoS risk).

property testing fits better here

I was thinking of (ab)using fuzzer input for choosing message ordering 😅 I need to read up on what property testing actually is 🙈 I guess something similar, just less hacky.

@dpc (Contributor, Author) commented May 24, 2024

@elsirion

BTW. That also made me realize that in a submit, possibly multiple inputs/outputs can have issues, so why would we return only the first one?

Why waste compute on validating a tx that will not be accepted?

I don't mean we should accept it. The federation's submit call should reject it, but why would we return only the details of the first error that was encountered? What if there are other errors? If client logic were to ever depend on the details of the rejection, there's no guarantee that those details will be returned, because there's always a possibility that some other, irrelevant error was encountered and returned first.

Then we'd need to save information about rejected transactions

No, we wouldn't. On a rejected transaction the client could e.g. send that whole transaction (or the relevant parts) to e.g. the LNv2 module API, so that the module can validate the LNv2 parts and respond to the client with whatever details the client needs to do what it needs to do.

To sum up: What I'm saying is that submit should just return Ok or Rejected. Anything else is unworkable, AFAICT. If any client logic needs to react to a rejection in certain cases, it can use the rejection as a trigger to use additional module-specific APIs to figure out the details it needs.
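
Sketched out, the proposal amounts to something like the following; the names and the placeholder txid type are hypothetical, not part of this PR:

use serde::{Deserialize, Serialize};

// Placeholder for the real TransactionId type used elsewhere in the PR.
type TxId = [u8; 32];

// The federation only reports accepted/rejected; a client that needs to react
// to a rejection follows up with module-specific API calls for the details.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum SubmitOutcome {
    Accepted { txid: TxId },
    Rejected,
}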
