Client API error handling overhaul in a new submission call #5244
- We should not use SerdeEncodable<Result<_, _>>, that's just horrible (not extensible, hard to introspect, …). The new error type looks good to me.
- Sticking to standards is nice, but given that there likely won't be a second client impl it probably isn't something we should expend too much energy on. Do error codes actually help us long term?
- Touching the query strategy makes me feel uneasy, huge potential to break stuff and cause us a lot of debugging.
// If data doesn't match, that's OK, just ignore it
for ((code, message), count) in &count_by_no_data {
    if num_peers.threshold() <= *count {
        return Some(ApiError {
            code: *code,
            message: message.clone(),
            data: None,
        });
    }
}
Is it generally true that we can ignore data? Feels like it differs case-by-case?
The reality is that there can be consensus on the code, but not on the details. It's up to the caller to figure out what to do about it. I would expect most client calls to be OK with just a code. Snooping in the data seems brittle anyway, and I would expect it to be useful mostly for debugging.
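The fallback described above (agree on everything if possible, otherwise on code and message, otherwise on the code alone) could be sketched roughly as follows. This is only an illustration: the ApiError struct here is a simplified stand-in with a string data field, and consensus_error is a hypothetical helper, not the function from this PR.

```rust
use std::collections::BTreeMap;

/// Simplified, hypothetical stand-in for the `ApiError` discussed in this PR.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ApiError {
    pub code: i32,
    pub message: String,
    pub data: Option<String>,
}

/// Returns an error once `threshold` peers agree on it, trying the strictest
/// match first: the full error, then code and message, then the code alone.
pub fn consensus_error(errors: &[ApiError], threshold: usize) -> Option<ApiError> {
    // 1. Full agreement, including `data`.
    for candidate in errors {
        if errors.iter().filter(|e| *e == candidate).count() >= threshold {
            return Some(candidate.clone());
        }
    }

    // 2. Agreement on code and message; `data` is dropped.
    let mut by_code_message: BTreeMap<(i32, &str), usize> = BTreeMap::new();
    for e in errors {
        *by_code_message.entry((e.code, e.message.as_str())).or_default() += 1;
    }
    for ((code, message), count) in &by_code_message {
        if *count >= threshold {
            return Some(ApiError {
                code: *code,
                message: (*message).to_owned(),
                data: None,
            });
        }
    }

    // 3. Agreement on the code only; keep the first message seen for that code.
    let mut by_code: BTreeMap<i32, usize> = BTreeMap::new();
    for e in errors {
        *by_code.entry(e.code).or_default() += 1;
    }
    for (code, count) in &by_code {
        if *count >= threshold {
            let message = errors
                .iter()
                .find(|e| e.code == *code)
                .map(|e| e.message.clone())
                .unwrap_or_default();
            return Some(ApiError { code: *code, message, data: None });
        }
    }

    None
}
```

With this shape the caller can tell how strong the agreement was only by checking whether data is present, which is part of what the surrounding comments are debating.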
// If there's a consensus on the code, then go with that code, and pick any
// message. Better than nothing, and message is not all that important, maybe
// it is just a spelling difference.
That will be a debugging nightmare imo. Can't we somehow preserve all the messages? Why are we trying to treat n API calls like one? For state machines to make decisions we should use much clearer semantics and not just "maybe we agreed on the data, maybe on only the code".
#[derive(Debug, Error, Clone, Eq, PartialEq, Serialize, Deserialize)]
#[serde(tag = "type")]
pub enum TransactionError {
    #[error("The transaction {txid} is unbalanced (in={inputs}, out={outputs}, fee={fee})")]
    UnbalancedTransaction {
        inputs: Amount,
        outputs: Amount,
        fee: Amount,
        txid: TransactionId,
    },
    #[error("The transaction's {txid} signature is invalid: tx={tx}, sig={sig}, key={key}")]
    InvalidSignature {
        txid: TransactionId,
        tx: String,
        sig: String,
        key: String,
    },
    #[error("The transaction's {txid} signature scheme is not supported: variant={variant}")]
    UnsupportedSignatureScheme { txid: TransactionId, variant: u64 },
    #[error("The transaction {txid} did not have the correct number of signatures")]
    InvalidWitnessLength { txid: TransactionId },
    #[error("The transaction {txid} had an invalid input at index {}: {}", .input_idx, .error)]
    Input {
        #[serde(with = "crate::encoding::as_hex")]
        error: DynInputError,
        input_idx: u64,
        txid: TransactionId,
    },
    #[error("The transaction {txid} had an invalid output at index {}: {}", .output_idx, .error)]
    Output {
        #[serde(with = "crate::encoding::as_hex")]
        error: DynOutputError,
        output_idx: u64,
        txid: TransactionId,
    },
}
Using this error is an improvement imo. nit: could factor out the txid.
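One possible reading of the nit, sketched with plain std types: the txid moves into an outer struct and the enum only describes what went wrong. TransactionErrorKind, the u64 aliases and the reduced set of variants are assumptions made for illustration; the real type uses thiserror and serde derives that are left out here.

```rust
use std::fmt;

// Hypothetical stand-ins for the real `TransactionId` and `Amount` types.
pub type TransactionId = u64;
pub type Amount = u64;

/// What went wrong, without repeating the transaction id in every variant.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TransactionErrorKind {
    Unbalanced { inputs: Amount, outputs: Amount, fee: Amount },
    InvalidWitnessLength,
    UnsupportedSignatureScheme { variant: u64 },
}

/// The id is stored once, next to the kind.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TransactionError {
    pub txid: TransactionId,
    pub kind: TransactionErrorKind,
}

impl fmt::Display for TransactionError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Shared prefix, followed by the kind-specific part.
        write!(f, "The transaction {} ", self.txid)?;
        match &self.kind {
            TransactionErrorKind::Unbalanced { inputs, outputs, fee } => {
                write!(f, "is unbalanced (in={inputs}, out={outputs}, fee={fee})")
            }
            TransactionErrorKind::InvalidWitnessLength => {
                write!(f, "did not have the correct number of signatures")
            }
            TransactionErrorKind::UnsupportedSignatureScheme { variant } => {
                write!(f, "uses an unsupported signature scheme: variant={variant}")
            }
        }
    }
}

impl std::error::Error for TransactionError {}
```

The trade-off is one extra level of nesting on the wire in exchange for not having to remember the txid field when adding a variant.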
if let Err(e) = context.api().submit_transaction(tx.clone()).await {
    if let Some(call_error) = e.try_to_call_error(context.num_peers()) {
nit: could avoid nesting by using map_err
if SUBMIT_ENDPOINT_API_VERSION <= context.core_api_version() {
    if let Err(e) = context.api().submit_transaction(tx.clone()).await {
        if let Some(call_error) = e.try_to_call_error(context.num_peers()) {
            return call_error.message;
I think this new message preserves most of the information the previous one had, so this seems ok.
let call_error_count = self.errors.iter().filter(|(_, e)| e.is_call_err()).count();
let non_call_error_count = self.errors.len() - call_error_count;
if non_call_error_count == self.num_peers.one_honest() {
    // If there enough non-application errors that there's no way to get consensus,
    // there's no reason to continue.
    QueryStep::Failure {
        general: Some(anyhow!(
            "Received {} out of {} non-call errors from peers: {}",
            self.num_peers.threshold(),
            self.num_peers,
            self.format_errors()
        )),
        peers: mem::take(&mut self.errors),
    }
} else if call_error_count == self.num_peers.threshold() {
    // For a call-errors to surface as a federation-level, it needs get a threshold
    // of responses being call-errors.
    QueryStep::Failure {
        general: Some(anyhow!(
            "Received errors from {} peers: {}",
            self.threshold,
            "Received {} out of {} call errors from peers: {}",
            self.num_peers.threshold(),
            self.num_peers,
            self.format_errors()
Do we really need the query strategy to be more clever here? I guess it makes sense for later optimizations that back off if we are offline, but for now we'll re-try anyway afaik, so doesn't matter if we keep re-trying inside the strategy or using external loops.
Also, these query strategies are so central to the client that any behavior change should get a test that it actually does the intended thing (would probably be worth fuzz testing even).
We need the different threshold here since a query strategy should fail on f + 1 rpc errors, however on call errors (like TransactionError) we need a threshold of 2f + 1 to achieve consensus; if we use f+1 for call errors. Imo rpc errors and TransactionErrors are two very different concepts and should not be squashed into the same struct just because they are both called "Errors".
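The two thresholds can be made concrete with a small helper. The method names one_honest and threshold follow the calls quoted in the diff, but this NumPeers struct and its max_evil method are a sketch under the usual n >= 3f + 1 assumption, not the project's actual implementation.

```rust
/// Hypothetical helper mirroring the `num_peers` calls quoted in the diff.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct NumPeers(pub usize);

impl NumPeers {
    /// Maximum number of faulty peers `f` tolerated, assuming `n >= 3f + 1`.
    pub fn max_evil(self) -> usize {
        self.0.saturating_sub(1) / 3
    }

    /// `f + 1`: any set of this size contains at least one honest peer.
    /// Once this many peers are unreachable, fewer than `threshold()` remain.
    pub fn one_honest(self) -> usize {
        self.max_evil() + 1
    }

    /// `n - f`, which equals `2f + 1` when `n = 3f + 1`: the number of matching
    /// responses needed before a value or a call error counts as consensus.
    pub fn threshold(self) -> usize {
        self.0 - self.max_evil()
    }
}
```

For a federation of 4 this gives f = 1, so 2 RPC errors are enough to give up (only 2 peers remain, below the threshold of 3), while a call error needs 3 matching responses.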
Squashing these errors requires us to always keep in mind to call is_call_err to differentiate when handling them in the client; for example only retry when is_call_err returns false.
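A sketch of the alternative being argued for here: keep the two kinds of failure as separate variants so the retry decision is encoded in the type instead of behind an is_call_err check. PeerOutcome and its variant names are hypothetical, and the retry rule is the one stated in the comment above (retry only when the error is not a call error).

```rust
/// Hypothetical three-way result of a single API call to one peer.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum PeerOutcome<T, E> {
    /// The peer processed the call and returned a value.
    Success(T),
    /// The peer processed the call and rejected it (e.g. a transaction error).
    CallError(E),
    /// The call never completed: connection failure, timeout, and so on.
    RpcError(String),
}

impl<T, E> PeerOutcome<T, E> {
    /// Only transport-level failures are worth retrying under the rule above.
    pub fn is_retryable(&self) -> bool {
        matches!(self, PeerOutcome::RpcError(_))
    }
}

/// Returns the indices of the peers whose calls should be attempted again.
pub fn peers_to_retry<T, E>(outcomes: &[PeerOutcome<T, E>]) -> Vec<usize> {
    outcomes
        .iter()
        .enumerate()
        .filter(|(_, outcome)| outcome.is_retryable())
        .map(|(idx, _)| idx)
        .collect()
}
```

With an exhaustive match on such an enum, the compiler points out every place that has to decide how to treat a new kind of outcome.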
}
if SUBMIT_ENDPOINT_API_VERSION <= context.core_api_version() {
    if let Err(e) = context.api().submit_transaction(tx.clone()).await {
        if let Some(call_error) = e.try_to_call_error(context.num_peers()) {
Switching to try_to_call_error changes the semantics here and relaxes consensus requirements. Previously peers had to agree on the error data, now it can be only the error code in the worst case. Do we want that? Have we thought of all the potential side-effects?
Concept NACK: Rpc errors and transaction errors are two completely different concepts and should not be squashed into one. For example, rpc errors can be retried, while transaction errors are final and usually imply a critical failure like a faulty federation. Squashing them into one requires us to always differentiate via the is_call_err method and apply different logic, like different thresholds in the query strategy. The additional complexity introduced in code as central as the query strategies is imo not worth it; I feel very strongly about this as query strategies are generally very easy to mess up.
In fact it seems to me like this pr already introduced a race condition, since transaction errors will no longer be retried by ThresholdConsensus. ThresholdConsensus is designed for updating values, so it will retry to get new values if it cannot establish consensus; however, it will not retry rpc errors, so it still fails in case of an offline federation.
Now consider the case where a transaction becomes valid, like claiming an incoming contract, in a 3/4 federation. Assume guardian 0 is offline while the incoming contract is confirmed (hence 1, 2, 3 are online). The client will see the incoming contract as confirmed and create a transaction to claim it by submitting it to all peers via ThresholdConsensus. Now guardian 0 comes back online but is behind the federation's status, while guardian 3 goes offline. The client will now receive a transaction error on submission from guardian 0 since it is behind, Ok(txid) from guardians 1 and 2, and an rpc error from guardian 3. Currently, ThresholdConsensus would retry guardian 0 until it returns Ok(txid) as well and return Ok(txid) as the federation's consensus. However, with this pr guardian 0 is not retried, and the ErrorStrategy also cannot establish a threshold of either call or non-call errors of 3 or 2, since it only has one of each.
In the current code this would lead to a panic with "Query strategy ran out of peers to query without returning a result"; with your pr, however, the query strategy would at least fail and return an error. As you can see, the behaviour of those strategies is quite subtle, and I would avoid further complexity here at all costs to keep this maintainable. Actually, I would much prefer to further simplify the query strategies instead, like I did in #5308.
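The scenario can be checked with a small tally. This is a simplified model, not the real ThresholdConsensus or error strategy: Response, Verdict and evaluate are hypothetical, the thresholds assume n >= 3f + 1, and retries are modelled simply by calling evaluate again with updated responses.

```rust
/// Hypothetical classification of one peer's response to the submit call.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Response {
    Ok,
    CallError,
    RpcError,
}

/// Hypothetical outcome of tallying one round of responses.
#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    Success,
    CallFailure,
    RpcFailure,
    Undecided,
}

/// Tallies one response per peer against the two thresholds discussed above:
/// `n - f` for values and call errors, `f + 1` for rpc errors.
pub fn evaluate(responses: &[Response], num_peers: usize) -> Verdict {
    let max_evil = num_peers.saturating_sub(1) / 3;
    let one_honest = max_evil + 1;
    let threshold = num_peers - max_evil;
    let count = |kind: Response| responses.iter().filter(|r| **r == kind).count();

    if count(Response::Ok) >= threshold {
        Verdict::Success
    } else if count(Response::RpcError) >= one_honest {
        Verdict::RpcFailure
    } else if count(Response::CallError) >= threshold {
        Verdict::CallFailure
    } else {
        Verdict::Undecided
    }
}
```

For the 3/4 federation above, the responses are a call error from guardian 0, Ok from guardians 1 and 2, and an rpc error from guardian 3: two Ok (below 3), one rpc error (below 2) and one call error (below 3), so the tally stays undecided unless guardian 0 is asked again.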
We need some way to enumerate potential failure cases. Using error codes is a robust way to enumerate and distinguish error/outcome cases. And since the client side needs to get a consensus on the error itself, getting consensus on the error code seems most natural. Otherwise, if consensus on the application error is to be established on the entirety of the error (like its encoding or serialization), it's impossible to even fix a typo in an error. I'm not strongly set on jsonrpc error codes ... but they do have nice properties, like not requiring another level of wrapping an error in a success on the wire.
There are two different aspects here:
Conceptually one way or the other the result of an api call to a peer has 3 outcomes:
The code in this PR is clumsy w.r.t. query strategy, because it doesn't convert the underlying jsonrpc error to something like
Instead of being scared of it, maybe we should add some tests. Query strategies are (or at least should be) very convenient to test. We call them with responses, they give back outputs that we can assert. There are lots of places in the code that deal with mutable state, and have side-effects, etc. I don't find query strategies particularly scary.
To sum up:
First of all, I think @joschisan, @dpc and I should get on a call about this. You both have valid points and trying to get to a solution that everyone can agree on is very hard asynchronously.
Could we just add a default variant? We need to model possible variants very rigidly here anyway since adding an unknown error that old clients can't interpret could lead to backwards incompatibility (e.g. they wouldn't know if it's retryable).
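As an illustration of what a default variant could look like on the client side, the sketch below maps any code the client does not recognise to a catch-all. SubmitErrorCode, the single known variant and the numeric value 1000 are all made up for this example.

```rust
/// Hypothetical client-side view of the error codes a submit call can return.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SubmitErrorCode {
    TransactionInvalid,
    /// Default variant: a code this client version does not know about.
    Unknown(i32),
}

impl SubmitErrorCode {
    /// Made-up numeric value, used only for this illustration.
    pub const TRANSACTION_INVALID: i32 = 1000;

    /// Maps a raw code to a known variant, falling back to `Unknown`.
    pub fn from_code(code: i32) -> Self {
        match code {
            Self::TRANSACTION_INVALID => SubmitErrorCode::TransactionInvalid,
            other => SubmitErrorCode::Unknown(other),
        }
    }
}
```

An old client can then at least surface an unknown code instead of failing to decode the response, although it still has no way of knowing whether that error is retryable, which is the backwards-compatibility concern raised above.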
The encoding of the error in question doesn't use strings, but enums (and thus indices).
Well, I remember us having some very nasty bugs resulting from subtle errors in query strategies that @joschisan debugged for days back then. These were leading to very hard to debug, timing and online-ness dependent payment failures and loss of funds. Not messing them up again is a high priority. I second that they should be tested, but a comprehensive test suite would need to simulate different response orders, missing responses, etc. Fuzzing that would be great imo or possibly even exhaustive testing if the combinatorial complexity isn't too bad.
We could, but if the query strategy is just comparing the whole encoding, then any new variant will not be considered the same error, even if it is supposed to be.
Right, my bad. So let's say - adding a new field, so it's possible to print the extra information. Adding a new field requires a new variant, even if that new variant means exactly the same thing as the old variant and just carries more data. The variants are an enumeration of possible ... encoded states. Error codes are an enumeration of possible "not OK logical outcomes". Coupling them is the issue.
Good point. We might not be able to add any new error codes to the existing rpc calls altogether. If you need more error codes, it needs to be a new api call, on a new api version. BTW. That also made me realize that in a Also - right now we completely ignore the transaction details. If enough peers return a transaction error, it is just ignored. So possibly submit should just return a single application error code "transaction invalid" and that's about it. Everything else would be just "extra metadata" for debugging purposes. I know that @joschisan was describing submit errors being useful for lnv2, but maybe that should be facilitated by another module-specific call? After lnv2-client-module tx gets "transaction invalid" from "submit", it can call some extra endpoint to figure out what to do about it. This way it's actually possible to evolve the whole thing. Module specific errors being returned from "submit" seems just wrong anyway.
The timing doesn't matter, only the order of responses. So nothing fancy is really needed:
let mut query_strategy = SomeQueryStrategy::new();
assert_eq!(query_strategy.process(PeerId(0), SomeFakeResponse { a: 1}), QueryStep::Continue);
assert_eq!(query_strategy.process(PeerId(1), SomeFakeResponse { a: 1}), QueryStep::Continue);
assert_eq!(query_strategy.process(PeerId(2), SomeFakeResponse { a: 1}), QueryStep::Success);
Sure, one could write a property test (fuzzing seems too low level, property testing fits better here) to test a lot of combinations, but in practice there are only a handful of boundary conditions worth testing in case dpc accidentally breaks them in a freak refactoring accident.
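To show that such a test really needs nothing beyond feeding responses in a chosen order, here is a self-contained toy strategy in the same spirit. ThresholdAgreement is a hypothetical, much simplified stand-in (it has no failure or retry steps), and PeerId and QueryStep are redefined locally rather than taken from the codebase.

```rust
use std::collections::BTreeMap;

/// Local stand-in for the peer identifier type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct PeerId(pub u16);

/// Simplified step result: either keep waiting or report the agreed value.
#[derive(Debug, PartialEq, Eq)]
pub enum QueryStep<R> {
    Continue,
    Success(R),
}

/// Toy strategy: succeeds once `threshold` peers have returned equal responses.
pub struct ThresholdAgreement<R> {
    threshold: usize,
    responses: BTreeMap<PeerId, R>,
}

impl<R: Clone + Eq> ThresholdAgreement<R> {
    pub fn new(threshold: usize) -> Self {
        Self { threshold, responses: BTreeMap::new() }
    }

    /// Records the latest response of `peer` and checks for agreement on it.
    pub fn process(&mut self, peer: PeerId, response: R) -> QueryStep<R> {
        self.responses.insert(peer, response.clone());
        let agreeing = self.responses.values().filter(|r| **r == response).count();
        if agreeing >= self.threshold {
            QueryStep::Success(response)
        } else {
            QueryStep::Continue
        }
    }
}
```

A test then calls process with one response per line and asserts each returned step, so different response orders or a disagreeing peer are just different sequences of calls.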
Why waste compute on validating a tx that will not be accepted?
Then we'd need to save information about rejected transactions, which iirc @joschisan specifically removed when migrating to AlephBFT (likely because of the DoS risk).
I was thinking of (ab)using fuzzer input for choosing message ordering 😅 I need to read up on what property testing actually is 🙈 I guess something similar, just less hacky.
I don't mean we should accept it. Federation's
No we wouldn't. On a rejected transaction, the client could e.g. send that whole transaction (or relevant parts) to e.g. the LNv2 module API, so that the module can validate the LNv2 parts and respond to the client with whatever details the client needs to do what it needs to do. To sum up: What I'm saying is that
Introduce a new tx submission call
See #5238
To achieve that some more changes were needed.
Overhaul error handling
Use jsonrpc error codes and ApiError as a carrier for api errors. This is a big change that requires some thinking through.
The JsonRPC standard says that all non-server-defined error codes are for application use, which jsonrpsee calls "Call"-errors.
It would generally make our errors much more idiomatic, which should make the job of 3rd-party clients and tools easier. Over time we should stop relying on error messages and use error codes, which now become first-class citizens.
The errors will now not be serialized, but a JsonRPC error also has a data field that can encode free-form extra data. If needed, even consensus-encoded fields can be passed in it (like I do for DynInputError etc. here).
Adapt strategy error handling
The "call" errors need to reach consensus, in a similar way that normal values do. This requires some tweaks to the error strategy.
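For reference, the JSON-RPC 2.0 specification reserves the codes from -32768 to -32000 (inclusive) for pre-defined errors and leaves the rest of the range to applications. A minimal sketch of how an error carrier could use that split follows; this ApiError is a simplified stand-in with a string data field, and is_reserved_code is a hypothetical helper rather than anything from jsonrpsee.

```rust
/// Simplified, hypothetical stand-in for the `ApiError` carrier described above.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ApiError {
    pub code: i32,
    pub message: String,
    /// Free-form extra data, kept as an opaque string in this sketch.
    pub data: Option<String>,
}

/// True for codes the JSON-RPC 2.0 specification reserves for pre-defined errors.
pub fn is_reserved_code(code: i32) -> bool {
    (-32768..=-32000).contains(&code)
}

impl ApiError {
    /// Application-defined ("call") errors are the ones outside the reserved range.
    pub fn is_call_err(&self) -> bool {
        !is_reserved_code(self.code)
    }
}
```

Under this split, -32601 (method not found) is a protocol-level error, while something like a hypothetical code 1000 for an invalid transaction would be a call error that peers need to reach consensus on.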