p2p: race condition in dialer scheduler #23430

Open
fxfactorial opened this issue Aug 22, 2021 · 10 comments

@fxfactorial
Contributor

WARNING: DATA RACE                                                                                                    
Read at 0x00c03c44d2a0 by goroutine 152:                                                                              
  github.com/ethereum/go-ethereum/p2p.(*dialScheduler).updateStaticPool()                                             
      github.com/ethereum/go-ethereum/p2p/dial.go:433 +0x2e4                                                          
  github.com/ethereum/go-ethereum/p2p.(*dialScheduler).expireHistory.func1()                                          
      github.com/ethereum/go-ethereum/p2p/dial.go:378 +0xcf                                                           
  github.com/ethereum/go-ethereum/p2p.(*expHeap).expire()                                                             
      github.com/ethereum/go-ethereum/p2p/util.go:59 +0x10a 
  github.com/ethereum/go-ethereum/p2p.(*dialScheduler).expireHistory()
      github.com/ethereum/go-ethereum/p2p/dial.go:375 +0x124
  github.com/ethereum/go-ethereum/p2p.(*dialScheduler).loop()
      github.com/ethereum/go-ethereum/p2p/dial.go:308 +0x239a
  github.com/ethereum/go-ethereum/p2p.newDialScheduler.func2()
      github.com/ethereum/go-ethereum/p2p/dial.go:186 +0x53 
  github.com/panjf2000/ants/v2.(*goWorker).run.func1()
      github.com/panjf2000/ants/v2@v2.4.5/worker.go:70 +0xd9

Previous write at 0x00c03c44d2a0 by goroutine 1038:
  github.com/ethereum/go-ethereum/p2p.(*dialTask).resolve() 
      github.com/ethereum/go-ethereum/p2p/dial.go:537 +0x204
  github.com/ethereum/go-ethereum/p2p.(*dialTask).run()
      github.com/ethereum/go-ethereum/p2p/dial.go:498 +0xf9 
  github.com/ethereum/go-ethereum/p2p.(*dialScheduler).startDial.func1()
      github.com/ethereum/go-ethereum/p2p/dial.go:465 +0x58 
  github.com/panjf2000/ants/v2.(*goWorker).run.func1()
      github.com/panjf2000/ants/v2@v2.4.5/worker.go:70 +0xd9


This is off of the latest master.

@MariusVanDerWijden
Member

Hmm that is not latest master afaict. Could you link the commit you're using?

@MariusVanDerWijden
Member

I fixed some races here: #23434
I don't think it's the one you found, though.

@fjl
Contributor

fjl commented Aug 23, 2021

The race in this issue is on t.dest. When resolve() runs, it may update t.dest. Meanwhile, on the dialer loop, we run updateStaticPool when a peer is removed, and this reads t.dest.
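
To illustrate the access pattern being described, here is a minimal, self-contained sketch. The names dialTask, dest, resolve, and updateStaticPool mirror p2p/dial.go, but the types and function bodies are simplified stand-ins rather than the real code; running it with go run -race produces a report of the same shape as the one above.

// Hypothetical, simplified reproduction of the conflicting accesses.
package main

import (
	"sync"
	"time"
)

type Node struct{ id string }

func (n *Node) ID() string { return n.id }

type dialTask struct {
	dest *Node // written by resolve(), read by the dialer loop
}

// resolve runs on a worker goroutine and may replace t.dest with a
// freshly resolved node record.
func (t *dialTask) resolve() {
	time.Sleep(time.Millisecond)  // stand-in for DNS/discovery latency
	t.dest = &Node{id: t.dest.id} // unsynchronized write
}

// updateStaticPool stands in for the dialer-loop code path that reads
// t.dest while the worker may still be resolving.
func updateStaticPool(t *dialTask) string {
	return t.dest.ID() // unsynchronized read of the same field
}

func main() {
	t := &dialTask{dest: &Node{id: "enode://..."}}
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); t.resolve() }()
	go func() { defer wg.Done(); _ = updateStaticPool(t) }()
	wg.Wait()
}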

@fxfactorial
Contributor Author

My bad, it was the latest master of bsc, and I should have written the description @fjl wrote. Yes, that's the problem.

@fjl
Contributor

fjl commented Aug 24, 2021

I'm actually not sure how to fix this race correctly. The check on t.dest in updateStaticPool seems off to me; maybe it should just be removed.

@fxfactorial
Contributor Author

fxfactorial commented Aug 24, 2021

Agreed - it wasn't clear to me how to fix it immediately either, and I don't want to add locks just for this. If the node is being removed anyway, then updating .dest shouldn't be needed.

@fjl fjl changed the title race condition in dialer scheduler p2p: race condition in dialer scheduler Aug 24, 2021
@fjl fjl self-assigned this Oct 10, 2021
@ligi ligi removed the status:triage label Aug 4, 2022
@MariusVanDerWijden
Member

This issue has become quite stale now. I can't reproduce the race condition anymore. Feel free to reopen if this is still an issue with the current master.

@fjl
Contributor

fjl commented Mar 13, 2023

I'm sure this race is not fixed. It's just hard to reproduce because it requires resolving the node, which takes variable time.

@holiman
Contributor

holiman commented Jun 14, 2023

@fjl how about solving it by just making it an atomic.Pointer?
Like this: holiman@79b7615
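
For illustration only, a minimal sketch of what the atomic.Pointer approach could look like; this is not the code from the linked commit, and the destPtr accessor and field layout are assumptions.

// Hypothetical sketch: dest becomes an atomic.Pointer so the resolver's
// write and the dialer loop's read no longer race.
package main

import (
	"fmt"
	"sync/atomic"
)

type Node struct{ id string }

func (n *Node) ID() string { return n.id }

type dialTask struct {
	dest atomic.Pointer[Node] // replaces the plain *Node field
}

// destPtr loads the current destination without holding any lock.
func (t *dialTask) destPtr() *Node { return t.dest.Load() }

// resolve publishes the freshly resolved node atomically, so a concurrent
// reader observes either the old or the new pointer, never a torn value.
func (t *dialTask) resolve(resolved *Node) {
	t.dest.Store(resolved)
}

func main() {
	t := &dialTask{}
	t.dest.Store(&Node{id: "original"})
	t.resolve(&Node{id: "resolved"})
	fmt.Println(t.destPtr().ID())
}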

@fjl
Contributor

fjl commented Jun 14, 2023

Yeah, I guess this would be a possible fix, though I think there must be a better way to fix it.

The main issue is really just that the main loop wants access to the node ID, and it reads that through .dest.ID(). So another fix would be to store a separate, immutable reference to the original non-resolved *enode.Node and use that to read the ID.
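
For comparison, a minimal sketch of that alternative; the staticNode field, newDialTask constructor, and id accessor are illustrative names, not actual go-ethereum code.

// Hypothetical sketch: keep an immutable reference to the original,
// non-resolved node and let the dialer loop read the ID only from it.
package main

import "fmt"

type ID [32]byte

type Node struct{ id ID }

func (n *Node) ID() ID { return n.id }

type dialTask struct {
	staticNode *Node // original non-resolved node; never reassigned
	dest       *Node // may be replaced by resolve(); only touched on the worker goroutine
}

func newDialTask(dest *Node) *dialTask {
	return &dialTask{staticNode: dest, dest: dest}
}

// id reads the node ID through the immutable field, so it cannot race
// with resolve() rewriting dest on another goroutine.
func (t *dialTask) id() ID { return t.staticNode.ID() }

func main() {
	t := newDialTask(&Node{})
	fmt.Println(t.id())
}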
