
Client API error handling overhaul in a new submission call #5244

Draft
dpc wants to merge 3 commits into master from 24-05-08-redo-submit-tx

Conversation

@dpc (Contributor) commented May 9, 2024

Introduce a new tx submission call

See #5238

To achieve that, some additional changes were needed.

Overhaul error handling

Use JSON-RPC error codes and ApiError as the carrier for API errors. This is a big change that requires some thinking through.

The JSON-RPC standard says that all non-server-defined error codes are for application use; jsonrpsee calls these "Call" errors.

It would generally make our errors much more idiomatic, which should make life easier for third-party clients and tools. Over time we should stop relying on error messages and use error codes, which now become first-class citizens.

The errors will no longer be serialized as a whole, but a JSON-RPC error also has a data field that can carry free-form extra data. If needed, even consensus-encoded fields can be passed in it (like I do for DynInputError etc. here).

Adapt strategy error handling

The "call" errors need to get consensus on, in a similar way that normal values do. This requires some tweak to error strategy.

@dpc requested review from a team as code owners May 9, 2024 17:55
@dpc changed the title 24 05 08 redo submit tx → Client API error handling overhaul in a new submission call May 9, 2024
@dpc force-pushed the 24-05-08-redo-submit-tx branch from c7d2933 to 270856a May 9, 2024 18:11
@dpc force-pushed the 24-05-08-redo-submit-tx branch from 270856a to 19ac545 May 9, 2024 18:41
Cargo.toml (review thread resolved)
@elsirion added the "needs further review" label (Merged before reaching ideal amount of code review) May 14, 2024
@elsirion (Contributor) left a comment:

  • We should not use SerdeEncodable<Result<_, _>>; that's just horrible (not extensible, hard to introspect, …). The new error type looks good to me.
  • Sticking to standards is nice, but given that there likely won't be a second client impl, it probably isn't something we should expend too much energy on. Do error codes actually help us long term?
  • Touching the query strategy makes me feel uneasy, huge potential to break stuff and cause us a lot of debugging.

Comment on lines +186 to +195
// If data doesn't match, that's OK, just ignore it
for ((code, message), count) in &count_by_no_data {
    if num_peers.threshold() <= *count {
        return Some(ApiError {
            code: *code,
            message: message.clone(),
            data: None,
        });
    }
}
Contributor:

Is it generally true that we can ignore data? Feels like it differs case-by-case?

Contributor Author (@dpc):

The reality is that there can be consensus on the code, but not on the details. It's up to the caller to figure out what to do about it. I would expect most client calls to be OK with just a code. Snooping in the data seems brittle anyway, and I would expect it to be useful mostly for debugging.

Comment on lines +197 to +199
// If there's a consensus on the code, then go with that code, and pick any
// message. Better than nothing, and message is not all that important, maybe
// it is just a spelling difference.
Contributor:

That will be a debugging nightmare imo. Can't we somehow preserve all the messages? Why are we trying to treat n API calls like one? For state machines to make decisions we should use much clearer semantics and not just "maybe we agreed on the data, maybe on only the code".

Comment on lines +204 to +239
#[derive(Debug, Error, Clone, Eq, PartialEq, Serialize, Deserialize)]
#[serde(tag = "type")]
pub enum TransactionError {
    #[error("The transaction {txid} is unbalanced (in={inputs}, out={outputs}, fee={fee})")]
    UnbalancedTransaction {
        inputs: Amount,
        outputs: Amount,
        fee: Amount,
        txid: TransactionId,
    },
    #[error("The transaction's {txid} signature is invalid: tx={tx}, sig={sig}, key={key}")]
    InvalidSignature {
        txid: TransactionId,
        tx: String,
        sig: String,
        key: String,
    },
    #[error("The transaction's {txid} signature scheme is not supported: variant={variant}")]
    UnsupportedSignatureScheme { txid: TransactionId, variant: u64 },
    #[error("The transaction {txid} did not have the correct number of signatures")]
    InvalidWitnessLength { txid: TransactionId },
    #[error("The transaction {txid} had an invalid input at index {}: {}", .input_idx, .error)]
    Input {
        #[serde(with = "crate::encoding::as_hex")]
        error: DynInputError,
        input_idx: u64,
        txid: TransactionId,
    },
    #[error("The transaction {txid} had an invalid output at index {}: {}", .output_idx, .error)]
    Output {
        #[serde(with = "crate::encoding::as_hex")]
        error: DynOutputError,
        output_idx: u64,
        txid: TransactionId,
    },
}
Contributor:

Using this error is an improvement imo. nit: could factor out the txid.

Comment on lines +112 to +113
if let Err(e) = context.api().submit_transaction(tx.clone()).await {
    if let Some(call_error) = e.try_to_call_error(context.num_peers()) {
Contributor:

nit: could avoid nesting by using map_err
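
One possible shape of that flattening, as a sketch only (whether map_err or err() + and_then fits best depends on the surrounding function's return type, which isn't shown in this snippet):

if let Some(call_error) = context
    .api()
    .submit_transaction(tx.clone())
    .await
    .err()
    .and_then(|e| e.try_to_call_error(context.num_peers()))
{
    return call_error.message;
}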

if SUBMIT_ENDPOINT_API_VERSION <= context.core_api_version() {
    if let Err(e) = context.api().submit_transaction(tx.clone()).await {
        if let Some(call_error) = e.try_to_call_error(context.num_peers()) {
            return call_error.message;
Contributor:

I think this new message preserves most of the information the previous one had, so this seems ok.

Comment on lines +88 to 110
let call_error_count = self.errors.iter().filter(|(_, e)| e.is_call_err()).count();
let non_call_error_count = self.errors.len() - call_error_count;
if non_call_error_count == self.num_peers.one_honest() {
    // If there are enough non-application errors that there's no way to get consensus,
    // there's no reason to continue.
    QueryStep::Failure {
        general: Some(anyhow!(
            "Received {} out of {} non-call errors from peers: {}",
            self.num_peers.threshold(),
            self.num_peers,
            self.format_errors()
        )),
        peers: mem::take(&mut self.errors),
    }
} else if call_error_count == self.num_peers.threshold() {
    // For a call error to surface as a federation-level error, a threshold
    // of responses needs to be call errors.
    QueryStep::Failure {
        general: Some(anyhow!(
            "Received {} out of {} call errors from peers: {}",
            self.num_peers.threshold(),
            self.num_peers,
            self.format_errors()
Contributor:

Do we really need the query strategy to be more clever here? I guess it makes sense for later optimizations that back off if we are offline, but for now we'll re-try anyway afaik, so it doesn't matter whether we keep re-trying inside the strategy or via external loops.

Also, these query strategies are so central to the client that any behavior change should get a test that it actually does the intended thing (would probably be worth fuzz testing even).

Contributor:

We need the different thresholds here since a query strategy should fail on f + 1 RPC errors, whereas for call errors (like TransactionError) we need a threshold of 2f + 1 to achieve consensus, so we cannot use f + 1 for call errors. Imo RPC errors and TransactionErrors are two very different concepts and should not be squashed into the same struct just because they are both called "errors".
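
To make the two thresholds concrete, a small illustrative example assuming the usual n = 3f + 1 sizing; the helper names echo the one_honest()/threshold() calls in the quoted strategy code, but these free functions are only a sketch, not the crate's API:

fn one_honest(n: usize) -> usize {
    let f = (n - 1) / 3; // maximum number of faulty peers
    f + 1
}

fn threshold(n: usize) -> usize {
    let f = (n - 1) / 3;
    n - f // equals 2f + 1 when n = 3f + 1
}

fn main() {
    // 4-peer federation (f = 1): 2 transport/RPC failures already make the
    // 3-peer threshold unreachable, while 3 matching call errors are needed
    // before the error counts as the federation-level result.
    assert_eq!(one_honest(4), 2);
    assert_eq!(threshold(4), 3);
}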

Contributor:

Squashing these errors together requires us to always remember to call is_call_err to differentiate them when handling them in the client; for example, only retry when is_call_err returns false.

}
if SUBMIT_ENDPOINT_API_VERSION <= context.core_api_version() {
    if let Err(e) = context.api().submit_transaction(tx.clone()).await {
        if let Some(call_error) = e.try_to_call_error(context.num_peers()) {
Contributor:

Switching to try_to_call_error changes the semantics here and relaxes consensus requirements. Previously peers had to agree on the error data, now it can be only the error code in the worst case. Do we want that? Have we thought of all the potential side-effects?

@joschisan (Contributor) commented May 21, 2024

Concept NACK: RPC errors and transaction errors are two completely different concepts and should not be squashed into one; for example, RPC errors can be retried while transaction errors are final and usually imply a critical failure like a faulty federation. Squashing them into one requires us to always differentiate via the is_call_err method and apply different logic, like different thresholds in the query strategy. The additional complexity introduced in code as central as the query strategies is imo not worth it; I feel very strongly about this, as query strategies are generally very easy to mess up.

In fact it seems to me like this PR already introduces a race condition, since transaction errors will now not be retried by ThresholdConsensus anymore. ThresholdConsensus is designed for updating values, so it will retry to get new values if it cannot establish consensus; however, it will not retry RPC errors, so it still fails in case of an offline federation. Now consider the case where a transaction becomes valid - like claiming an incoming contract - in a 3/4 federation.

Assume guardian 0 is offline while the incoming contract is confirmed (hence 1, 2 and 3 are online). The client will see the incoming contract as confirmed and create a transaction to claim it by submitting it to all peers via ThresholdConsensus. Now guardian 0 comes back online but is behind the federation's status, while guardian 3 goes offline. The client will now receive a transaction error on submission from guardian 0 since it is behind, Ok(txid) from guardians 1 and 2, and an RPC error from guardian 3. Currently, ThresholdConsensus would retry guardian 0 until it returns Ok(txid) as well and return Ok(txid) as the federation's consensus. However, with this PR guardian 0 is not retried, and the ErrorStrategy also cannot establish a threshold of either call or non-call errors (3 and 2 respectively) since it only has one of each.

In the current code this would lead to a panic with "Query strategy ran out of peers to query without returning a result"; with your PR, however, the query strategy would at least fail and return an error.

As you can see, the behaviour of those strategies is quite subtle and I would avoid further complexity here at all costs to keep this maintainable. Actually, I would much prefer to further simplify the query strategies instead, like I did in #5308.

@dpc (Contributor, Author) commented May 22, 2024

  • Sticking to standards is nice, but given that there likely won't be a second client impl, it probably isn't something we should expend too much energy on. Do error codes actually help us long term?

We need some way to enumerate potential failure cases. Using a Decodable enum was kinda trying to do that, but it is just too rigid in every other respect.

Error codes are a robust way to enumerate and distinguish error/outcome cases.

And since the client side needs to get consensus on the error itself, getting consensus on the error code seems most natural. Otherwise, if consensus on the application error is to be established on the entirety of the error (like its encoding or serialization), it's impossible to even fix a typo in an error.

I'm not strongly set on JSON-RPC error codes ... but they do have nice properties, like not requiring another level of wrapping an error inside a success on the wire.

  • Touching the query strategy makes me feel uneasy, huge potential to break stuff and cause us a lot of debugging.

There are two different aspects here:

  • How is the application-level (aka transaction) error represented on the wire.
  • How is it implemented in the client code.

Conceptually, one way or the other, the result of an API call to a peer has 3 outcomes:

enum ApiResult {
    TransportSuccess(CallOutcome),
    TransportError,
}

enum CallOutcome {
    ApplicationError(/* ... */),
    ApplicationSuccess,
}

The code in this PR is clumsy w.r.t. the query strategy, because it doesn't convert the underlying JSON-RPC error to something like ApiResult. I think with a bit better refactoring, the query strategy logic could largely remain intact.

As you can see the behaviour of those strategies is quite subtle and I would avoid further complexity here at all costs to keep this maintainable.

Instead of being scared of it, maybe we should add some tests.

Query strategies are (or at least should be) very convenient to test. We call them with responses, they give back outputs that we can assert.

There are lots of places in the code that deal with mutable state, and have side-effects, etc. I don't find query strategies particularly scary.

@dpc (Contributor, Author) commented May 22, 2024

To sum up:

  • I agree that the changes to the error handling query strategies here are clumsy, but I think it can be done better. It didn't occur to me at the time, and I wanted to wrap it up so there's something to talk about.
  • When the client needs to establish some form of consensus on the non-success application-level outcome, what exactly should it do that on? Codes, error messages, some serialization of the whole thing? Whatever you do it on becomes near-unevolvable, as changing it will break client-side consensus forming. Because of that, it seems to me that some form of minimal "codes" is the way to go. "message" and "data" in my mind should be used for debugging purposes only, or in very special circumstances. And I feel (much less strongly) that if it will be "codes", we might as well flatten it into a "jsonrpc error", as it has code, message and data.
  • If the client code would "unflatten" the flattened jsonrpc error early, so that the query strategy logic can remain the same, would using jsonrpc error codes be OK with you? If not - what else would you prefer and why? (See the sketch after this list.)
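
A rough sketch of the "unflatten early" idea, reusing the ApiResult/CallOutcome shape from the earlier comment; all concrete types below (PeerRpcError, the code/message payloads) are placeholders, not jsonrpsee's or fedimint's actual API:

// Placeholder for whatever error the peer connection layer returns.
enum PeerRpcError {
    // Transport/server-side failure (timeout, connection refused, ...).
    Transport(String),
    // Application-defined "call" error carried as a JSON-RPC error object.
    Call { code: i32, message: String },
}

enum CallOutcome<T> {
    ApplicationSuccess(T),
    ApplicationError { code: i32, message: String },
}

enum ApiResult<T> {
    TransportSuccess(CallOutcome<T>),
    TransportError(String),
}

// Convert each peer response up front so the query strategies keep operating
// on a plain three-way outcome instead of inspecting jsonrpc errors themselves.
fn unflatten<T>(result: Result<T, PeerRpcError>) -> ApiResult<T> {
    match result {
        Ok(value) => ApiResult::TransportSuccess(CallOutcome::ApplicationSuccess(value)),
        Err(PeerRpcError::Call { code, message }) => {
            ApiResult::TransportSuccess(CallOutcome::ApplicationError { code, message })
        }
        Err(PeerRpcError::Transport(msg)) => ApiResult::TransportError(msg),
    }
}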

@elsirion @joschisan

@elsirion (Contributor) commented:

First of all, I think @joschisan, @dpc and I should get on a call about this. You both have valid points and trying to get to a solution that everyone can agree on is very hard asynchronously.

Using a Decodable enum was kinda trying to do that, but it is just too rigid

Could we just add a default variant? We need to model possible variants very rigidly here anyway since adding an unknown error that old clients can't interpret could lead to backwards incompatibility (e.g. they wouldn't know if it's retryable).
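
For reference, one way the "default variant" idea could look on the serde side of the TransactionError enum quoted above; this is a sketch only (the catch-all variant and its name are hypothetical, and it does not help with any consensus encoding of the error):

use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(tag = "type")]
pub enum TransactionErrorCompat {
    // ...the existing variants would go here...
    InvalidWitnessLength { txid: u64 }, // txid type simplified for this sketch
    // Old clients decode any variant they don't know yet into this catch-all
    // instead of failing to deserialize -- at the cost of not knowing whether
    // the unknown error is retryable.
    #[serde(other)]
    Unknown,
}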

Otherwise if consensus on the application error is to be established on the entirety of the error (like it's encoding or serialization), it's impossible to even fix a typo in an error.

The encoding of the error in question doesn't use strings, but enums (and thus indices).

Query strategies are (or at least should be) very convenient to test. We call them with responses, they give back outputs that we can assert.

Well, I remember us having some very nasty bugs resulting from subtle errors in query strategies that @joschisan debugged for days back then. These were leading to very hard to debug, timing and online-ness dependent payment failures and loss of funds. Not messing them up again is a high priority.

I second that they should be tested, but a comprehensive test suite would need to simulate different response orders, missing responses, etc. Fuzzing that would be great imo or possibly even exhaustive testing if the combinatorial complexity isn't too bad.

@dpc (Contributor, Author) commented May 22, 2024

Could we just add a default variant?

We could, but if the query strategy is just comparing the whole encoding, then any new variant will not be considered the same error, even if it's supposed to be.

The encoding of the error in question doesn't use strings, but enums (and thus indices).

Right, my bad. So let's say - adding a new field, so it's possible to print the extra information. Adding a new field requires a new variant, even if that new variant means exactly the same thing as the old variant and just carries more data.

The variants are an enumeration of possible ... encoded states. Error codes are an enumeration of possible "not OK logical outcomes". Coupling them is the issue.

We need to model possible variants very rigidly here anyway since adding an unknown error that old clients can't interpret could lead to backwards incompatibility (e.g. they wouldn't know if it's retryable).

Good point. We might not be able to add any new error codes to existing RPC calls at all. If you need more error codes, it needs to be a new API call, in a new API version.

BTW. That also made me realize that in a submit, possibly multiple inputs/outputs can have issues, so why would we return only the first one? That could possibly spook some logic somewhere.

Also - right now we completely ignore the transaction error details. If enough peers return a transaction error, the details are just ignored. So possibly submit should just return a single application error code "transaction invalid" and that's about it. Everything else would be just "extra metadata" for debugging purposes.

I know that @joschisan was describing submit errors being useful for lnv2, but maybe that should be facilitated by another, module-specific call? After the lnv2 client module's tx gets "transaction invalid" from "submit", it can call some extra endpoint to figure out what to do about it. This way it's actually possible to evolve the whole thing.

Module specific errors being returned from "submit" seems just wrong anyway.

I second that they should be tested, but a comprehensive test suite would need to simulate different response orders, missing responses, etc.

The timing doesn't matter, only the order of responses. So nothing fancy is really needed:

let mut query_strategy = SomeQueryStrategy::new();
assert_eq!(query_strategy.process(PeerId(0), SomeFakeResponse { a: 1 }), QueryStep::Continue);
assert_eq!(query_strategy.process(PeerId(1), SomeFakeResponse { a: 1 }), QueryStep::Continue);
assert_eq!(query_strategy.process(PeerId(2), SomeFakeResponse { a: 1 }), QueryStep::Success);

Sure, one could write a property test (fuzzing seems too low level, property testing fits better here) to cover a lot of combinations, but in practice there are only a handful of boundary conditions worth testing to guard against dpc accidentally breaking them in a freak refactoring accident.

@dpc marked this pull request as draft May 22, 2024 22:40
@elsirion (Contributor) commented May 23, 2024

BTW. That also made me realize that in a submit, possibly multiple inputs/outputs can have issues, so why would we return only the first one?

Why waste compute on validating a tx that will not be accepted?

I know that joschisan was describing submit errors being useful for lnv2, but maybe that should be facilitated by another module-specific call? After lnv2-client-module tx gets "transaction invalid" from "submit", it can call some extra endpoint to figure out what to do about. This way it's actually possible to evolve the whole thing.

Then we'd need to save information about rejected transactions, which iirc @joschisan specifically removed when migrating to AlephBFT (likely because of the DoS risk).

property testing fits better here

I was thinking of (ab)using fuzzer input for choosing message ordering 😅 I need to read up on what property testing actually is 🙈 I guess something similar, just less hacky.

@dpc (Contributor, Author) commented May 24, 2024

@elsirion

BTW. That also made me realize that in a submit, possibly multiple inputs/outputs can have issues, so why would we return only the first one?

Why waste compute on validating a tx that will not be accepted?

I don't mean we should accept it. The federation's submit call should reject it, but why would we return only the details of the first error that was encountered? What if there are other errors? If client logic were to ever depend on the details of the rejection, there's no guarantee that those details will be returned, because there's always a possibility that some other, irrelevant error was encountered and returned first.

Then we'd need to save information about rejected transactions

No, we wouldn't. On a rejected transaction the client could e.g. send that whole transaction (or the relevant parts) to e.g. the LNv2 module API, so that the module can validate the LNv2 parts and respond to the client with whatever details the client needs to do what it needs to do.

To sum up: What I'm saying is that submit should just return Ok or Rejected. Anything else is unworkable, AFAICT. If any client logic needs to react to a rejection in certain cases, it can use the rejection as a trigger to use additional module-specific APIs to figure out the details it needs.
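
Sketched out, the proposal amounts to something like the following; the names and the placeholder txid type are hypothetical, not part of this PR:

use serde::{Deserialize, Serialize};

// Placeholder for the real TransactionId type used elsewhere in the PR.
type TxId = [u8; 32];

// The federation only reports accepted/rejected; a client that needs to react
// to a rejection follows up with module-specific API calls for the details.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum SubmitOutcome {
    Accepted { txid: TxId },
    Rejected,
}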
