Panic: called fail_with on already-failed connection state, from msg=GetData #1599

Closed
dconnolly opened this issue Jan 15, 2021 · 14 comments · Fixed by #1600 or #1721
Labels
A-rust Area: Updates to Rust code, C-bug Category: This is a bug, C-cleanup Category: This is a cleanup

Comments

@dconnolly
Contributor

dconnolly commented Jan 15, 2021

Duplicate

This panic will be fixed by the fail_with refactor - so this ticket is a duplicate of #1610.

Motivation

These panics are related to #1510, which was partly fixed by #1531.

Edit: we knew #1531 fixed some potential causes of #1510, but we weren't sure if we'd found them all.

Issue report 1 is from commit 8a7b023.
Issue report 2 is from commit 1edb379 (#1620 + #1622).
The backtraces don't contain any useful information.

Analysis 2

  1. A peer calls GetHeaders then GetData (get blocks) on the same connection.
  2. GetHeaders (or some intervening request) fails with "Connection reset by peer".
  3. Zebra panics because processing the GetData request fails with a "Broken pipe" error.

When the GetHeaders request fails, Zebra should stop processing any further requests from that peer. Instead, it continues to try to process inbound messages from that peer.
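
A minimal sketch of that intended behaviour, using illustrative names (the `Connection` struct, `failed` field, and `handle_inbound_request` below are assumptions for the sketch, not the real zebra-network types): once the connection has failed, later requests are dropped instead of being driven into another `fail_with` call.

    // Illustrative sketch only, not the real zebra-network state machine.
    struct Connection {
        // Set to Some(error) once the connection has failed.
        failed: Option<String>,
    }

    impl Connection {
        fn handle_inbound_request(&mut self, request: &str) {
            if let Some(original) = &self.failed {
                // e.g. GetHeaders already failed with "Connection reset by peer",
                // so the later GetData is dropped instead of hitting a
                // "Broken pipe" error and panicking inside fail_with.
                eprintln!("dropping {request} on failed connection: {original}");
                return;
            }
            // ... drive the peer request as usual ...
        }
    }

    fn main() {
        let mut conn = Connection {
            failed: Some("Connection reset by peer".into()),
        };
        conn.handle_inbound_request("GetData");
    }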

Analysis 1

Possibly resolved in #1600.

This panic can happen in two different ways:

  1. The inbound service calls drive_peer_request, which doesn't check for a failed connection state before performing tasks that can potentially call fail_with.
  2. The connection code calls async functions after checking for a failed connection state, so it's possible that the connection can fail between the check and a subsequent call to fail_with.

So I suggest that we add the missing failed connection check, downgrade the panic to a warning, and add additional debugging information.
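
A sketch of that suggested shape, extending the illustrative `Connection` from the sketch under Analysis 2 above (in the real code the warning would presumably go through `tracing` rather than `eprintln!`; this is not the actual patch):

    impl Connection {
        fn fail_with(&mut self, new_error: String) {
            if let Some(original) = &self.failed {
                // Downgraded from a panic: warn, keep the original error as
                // debugging context, and let the connection keep shutting down.
                eprintln!(
                    "warning: fail_with called on already-failed connection \
                     (original: {original}, new: {new_error})"
                );
                return;
            }
            self.failed = Some(new_error);
        }
    }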

Error 1

called fail_with on already-failed connection state

Metadata 1

key            value
version        1.0.0-alpha.0
git commit     8a7b023
Zcash network  Mainnet
location       /zebra/zebra-network/src/peer/connection.rs:515:13

SpanTrace 1

SpanTrace:
   0: zebra_network::peer::connection::msg_as_req
           with msg=getdata
             at zebra-network/src/peer/connection.rs:728
   1: zebra_network::peer::handshake::peer
           with addr=218.8.47.71:8233
             at zebra-network/src/peer/handshake.rs:198
   2: zebrad::application::
           with zebrad="8a7b023" net="Main"
             at zebrad/src/application.rs:254

Error 2

calling fail_with on already-failed connection state: failed connections must stop processing pending requests and responses, then close the connection.
state: Failed
original error: SharedPeerError(Serialization(Io(Os { code: 104, kind: ConnectionReset, message: "Connection reset by peer" })))
new error: SharedPeerError(Serialization(Io(Os { code: 32, kind: BrokenPipe, message: "Broken pipe" })))
client receiver: ClientRequestReceiver { inner: Receiver { inner: Some(BoundedInner { buffer: 0, state: 0, message_queue: Queue { head: 0x7fdcd4830060, tail: UnsafeCell }, parked_queue: Queue { head: 0x7fdcb41f21f0, tail: UnsafeCell }, num_senders: 2, recv_task: AtomicWaker }) } }

Metadata 2

key            value
version        1.0.0-alpha.1
git commit     1edb379
Zcash network  Mainnet
location       /home/dev/zebra/zebra-network/src/peer/connection.rs:535:13

SpanTrace 2

SpanTrace:
   0: zebra_network::peer::connection::msg_as_req
           with msg=getdata
             at zebra-network/src/peer/connection.rs:762
   1: zebra_network::peer::handshake::peer
           with addr=163.172.222.30:8233
             at zebra-network/src/peer/handshake.rs:198
   2: zebrad::application::
           with zebrad="1edb379e" net="Main"
             at zebrad/src/application.rs:254

Logs 2

Jan 30 07:09:34.078  INFO {zebrad="1edb379e" net="Main"}:sync:obtain_tips: zebra_state::service: pruned utxo requests old_len=4 new_len=0 prune_count=4 tip=Some((Height(1129116), block::Hash("0000000002121f23ad2a4fa88aa1c700e0d0535354d0e8d477f92e454d5fdf81")))
Jan 30 07:09:34.078  INFO {zebrad="1edb379e" net="Main"}:sync:obtain_tips:state: zebra_state::util: created block locator tip_height=Height(1129116) min_locator_height=1129017 locators=[Height(1129116), Height(1129115), Height(1129114), Height(1129112), Height(1129108), Height(1129100), Height(1129084), Height(1129052), Height(1129017)]
Jan 30 07:09:34.330  INFO {zebrad="1edb379e" net="Main"}:sync: zebrad::components::sync: exhausted prospective tip set
Jan 30 07:09:34.330  INFO {zebrad="1edb379e" net="Main"}:sync: zebrad::components::sync: waiting to restart sync timeout=61s
Jan 30 07:10:35.331  INFO {zebrad="1edb379e" net="Main"}:sync: zebrad::components::sync: starting sync, obtaining new tips
Jan 30 07:10:35.332  INFO {zebrad="1edb379e" net="Main"}:sync:obtain_tips:state: zebra_state::util: created block locator tip_height=Height(1129116) min_locator_height=1129017 locators=[Height(1129116), Height(1129115), Height(1129114), Height(1129112), Height(1129108), Height(1129100), Height(1129084), Height(1129052), Height(1129017)]
Jan 30 07:10:41.332  INFO {zebrad="1edb379e" net="Main"}:sync: zebrad::components::sync: exhausted prospective tip set                                                                                     
Jan 30 07:10:41.332  INFO {zebrad="1edb379e" net="Main"}:sync: zebrad::components::sync: waiting to restart sync timeout=61s
Jan 30 07:11:42.333  INFO {zebrad="1edb379e" net="Main"}:sync: zebrad::components::sync: starting sync, obtaining new tips
Jan 30 07:11:42.334  INFO {zebrad="1edb379e" net="Main"}:sync:obtain_tips:state: zebra_state::util: created block locator tip_height=Height(1129116) min_locator_height=1129017 locators=[Height(1129116), Height(1129115), Height(1129114), Height(1129112), Height(1129108), Height(1129100), Height(1129084), Height(1129052), Height(1129017)]
Jan 30 07:11:48.335  INFO {zebrad="1edb379e" net="Main"}:sync: zebrad::components::sync: extending tips tips.len=1 in_flight=3 lookahead_limit=2000
Jan 30 07:11:54.337  INFO {zebrad="1edb379e" net="Main"}:sync: zebrad::components::sync: exhausted prospective tip set                                                                                     
Jan 30 07:11:54.337  INFO {zebrad="1edb379e" net="Main"}:sync: zebrad::components::sync: waiting to restart sync timeout=61s
Jan 30 07:12:19.657  INFO {zebrad="1edb379e" net="Main"}:peer{addr=163.172.222.30:8233}:msg_as_req{msg=getheaders}: zebra_state::service: pruned utxo requests old_len=7 new_len=5 prune_count=2 tip=Some((Height(1129116), block::Hash("0000000002121f23ad2a4fa88aa1c700e0d0535354d0e8d477f92e454d5fdf81")))   
Jan 30 07:12:19.658  INFO {zebrad="1edb379e" net="Main"}:peer{addr=163.172.222.30:8233}:msg_as_req{msg=getheaders}:state: zebra_state::service: responding to peer GetBlocks or GetHeaders final_height=Height(1129116) response_len=20 chain_tip_height=Height(1129116) stop_height=None intersection_height=Some(Height(1129096))
The application panicked (crashed).
Message:  calling fail_with on already-failed connection state ...
@dconnolly dconnolly added C-bug Category: This is a bug, A-rust Area: Updates to Rust code, S-needs-triage Status: A bug report needs triage, S-needs-investigation Status: Needs further investigation labels Jan 15, 2021
@dconnolly dconnolly added this to To Do in 🦓 via automation Jan 15, 2021
@dconnolly dconnolly added this to No Estimate in Effort Affinity Grouping via automation Jan 15, 2021
@teor2345
Contributor

@dconnolly do you have the logs leading up to this panic?

@teor2345
Contributor

It's hard to tell without a backtrace, but it looks like we're failing on this fail_with call:

self.fail_with(PeerError::WrongMessage("getdata with mixed item types"));

@teor2345 teor2345 moved this from No Estimate to XS - 2 in Effort Affinity Grouping Jan 15, 2021
@teor2345 teor2345 added C-cleanup Category: This is a cleanup and removed S-needs-investigation Status: Needs further investigation, S-needs-triage Status: A bug report needs triage labels Jan 15, 2021
@teor2345 teor2345 added this to the 2021 Sprint 1 milestone Jan 15, 2021
@dconnolly
Contributor Author

@dconnolly do you have the logs leading up to this panic?

Not on hand; this was reported via Sentry. I can go digging in the logger.

@teor2345
Contributor

@dconnolly do you have the logs leading up to this panic?

Not on hand; this was reported via Sentry. I can go digging in the logger.

That might be helpful, but I think I've found at least two potential causes already.

teor2345 added a commit to teor2345/zebra that referenced this issue Jan 15, 2021
zebra-network's Connection expects that `fail_with` is only called once
per connection, but the overload handling code continues to process the
current request after an overload error, potentially leading to further
failures.

Closes ZcashFoundation#1599
🦓 automation moved this from To Do to Done Jan 18, 2021
Effort Affinity Grouping automation moved this from XS - 2 to Done Jan 18, 2021
dconnolly pushed a commit that referenced this issue Jan 18, 2021
zebra-network's Connection expects that `fail_with` is only called once
per connection, but the overload handling code continues to process the
current request after an overload error, potentially leading to further
failures.

Closes #1599
@teor2345 teor2345 reopened this Feb 7, 2021
🦓 automation moved this from Done to In progress Feb 7, 2021
@teor2345
Contributor

teor2345 commented Feb 7, 2021

This issue wasn't completely fixed by #1600 or #1531; I'm still seeing it on commit 1edb379 (#1620 + #1622).

@mpguerra mpguerra modified the milestones: 2021 Sprint 1, 2021 Sprint 3 Feb 8, 2021
@mpguerra mpguerra removed this from the 2021 Sprint 3 milestone Feb 8, 2021
@mpguerra mpguerra added this to the 2021 Sprint 2 milestone Feb 8, 2021
@teor2345 teor2345 modified the milestones: 2021 Sprint 2, 2021 Sprint 3 Feb 8, 2021
@teor2345 teor2345 moved this from Done to XS - 2 in Effort Affinity Grouping Feb 8, 2021
@teor2345 teor2345 moved this from XS - 2 to No Estimate in Effort Affinity Grouping Feb 9, 2021
@teor2345
Contributor

teor2345 commented Feb 9, 2021

Duplicate of #1610

@teor2345 teor2345 marked this as a duplicate of #1610 Feb 9, 2021
@teor2345
Contributor

Marking this as high, because we're actually seeing the bug happen reasonably frequently.

@mpguerra
Contributor

Shall we close this duplicate? I keep getting distracted by it before realizing we're doing the work in #1610...

@mpguerra mpguerra removed this from the 2021 Sprint 3 milestone Feb 16, 2021
@teor2345
Contributor

@mpguerra would it help to rename the issue?

I'd really like to keep it in the sprint, because it's the underlying bug that we have to fix. We just think the refactor is the best way to fix it, because we've tried little patches before, and they haven't solved all the panics.

As an alternative, we could move the bug tags over to #1610, and mark this issue as a duplicate using GitHub syntax. Which is a bit fussy.

@teor2345
Contributor

Duplicate of #1610

@mpguerra
Contributor

@mpguerra would it help to rename the issue?

I'd really like to keep it in the sprint, because it's the underlying bug that we have to fix. We just think the refactor is the best way to fix it, because we've tried little patches before, and they haven't solved all the panics.

As an alternative, we could move the bug tags over to #1610, and mark this issue as a duplicate using GitHub syntax. Which is a bit fussy.

It's ok, let's keep it then, I think I'm used to seeing it and connecting it to the other one by now!

@teor2345
Contributor

Just to be clear, it's the GitHub "Duplicate of" syntax that's a bit fussy, not anyone in the team!

Unlike Closes or Fixes, it needs to be in a comment by itself.

🦓 automation moved this from In progress to Done Feb 19, 2021
Effort Affinity Grouping automation moved this from No Estimate to Done Feb 19, 2021
@teor2345
Contributor

We think this issue was caused by continuing to send block or transaction messages after failure:

            Response::Blocks(blocks) => {
                // Generate one block message per block.
                for block in blocks.into_iter() {
                    if let Err(e) = self.peer_tx.send(Message::Block(block)).await {
                        self.fail_with(e);
                    }
                }
            }

https://github.com/ZcashFoundation/zebra/pull/1721/files/5a64d988e03f799e7b3495cbc00409799016c85f#r580708590

This code was rewritten to exit early by #1721.
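
A sketch of the exit-early shape (not the exact #1721 diff, which was larger): stop sending as soon as the first block message fails, so `fail_with` is only reached once from this loop.

            Response::Blocks(blocks) => {
                // Generate one block message per block, but stop at the first
                // send error instead of calling fail_with again for each
                // remaining block.
                for block in blocks.into_iter() {
                    if let Err(e) = self.peer_tx.send(Message::Block(block)).await {
                        self.fail_with(e);
                        break;
                    }
                }
            }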

@teor2345
Copy link
Contributor

#1721 was reverted by #1803, and we did a much smaller fix in #1818.

@mpguerra mpguerra reopened this Feb 26, 2021
🦓 automation moved this from Done to In progress Feb 26, 2021
🦓 automation moved this from In progress to Done Feb 26, 2021