
Add start stop e2e test #1705

Merged · 21 commits merged into master from piersy/add-start-stop-e2e-test-rebased on Dec 16, 2021

Conversation

piersy
Contributor

@piersy piersy commented Sep 29, 2021

Description

This PR adds a test that starts and stops nodes in a small network and verifies that the network can continue to function correctly throughout the process.

A quick overview of what the test does (a sketch in Go follows the list):

  1. Start 4 nodes
  2. Verify they can confirm a tx
  3. Stop one node
  4. Verify the remaining 3 can confirm a tx
  5. Stop another node
  6. Verify the remaining 2 cannot confirm a tx
  7. Start the last stopped node
  8. Verify that the previously un-confirmable tx can now be confirmed by the 3 running nodes.
  9. Verify that the 3 running nodes can confirm a further tx
  10. Start the remaining node
  11. Verify that the network can process another tx and that all sent txs have been processed by all nodes.

Other changes

This PR makes minSyncPeers, dialHistoryExpiration and InboundThrottleTime configurable; these values were previously hardcoded and needed to be tunable for this test to run, and to run quickly.
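
As a rough sketch of the idea (the struct and its location here are assumptions, not the exact diff):

package config // hypothetical home for these knobs

import "time"

// Previously hardcoded values, now tunable so that a small test network can
// sync from a single peer and can redial stopped nodes quickly.
type Tunables struct {
	MinSyncPeers          int           // eth: minimum peers required before starting a sync (was 5)
	DialHistoryExpiration time.Duration // p2p: how long to wait before redialing a node
	InboundThrottleTime   time.Duration // p2p: minimum gap between inbound connection attempts from one IP
}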

This PR also ensures correct shutdown of the miner, worker and backend.Backend, and fixes a number of data races and race conditions. Each fix is described in its commit message, so it's probably easiest to review this PR commit by commit, since these changes otherwise look unrelated.

I think this PR would be best merged rather than 'squashed and merged', due to the large number of commits containing unrelated fixes.

Upstream PRs that came out of this work

Tested

I managed to run this test 200 times on my local machine with no failures.

Related issues

#1642

Backwards compatibility

Yes

@piersy piersy force-pushed the piersy/add-start-stop-e2e-test-rebased branch 5 times, most recently from 2fce4a5 to 00a9f7b on September 30, 2021 21:29
@piersy piersy marked this pull request as ready for review September 30, 2021 21:46
@piersy piersy requested review from trianglesphere and a team as code owners September 30, 2021 21:46
@piersy piersy changed the title Piersy/add start stop e2e test rebased Add start stop e2e tes Sep 30, 2021
@piersy piersy changed the title Add start stop e2e tes Add start stop e2e test Sep 30, 2021
@piersy piersy force-pushed the piersy/add-start-stop-e2e-test-rebased branch from 00a9f7b to 00358de on October 1, 2021 07:12
@codecov

codecov bot commented Oct 1, 2021

Codecov Report

Merging #1705 (801a432) into master (58ca798) will increase coverage by 0.05%.
The diff coverage is 59.13%.

❗ Current head 801a432 differs from pull request most recent head b9ba309. Consider uploading reports for the commit b9ba309 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1705      +/-   ##
==========================================
+ Coverage   56.03%   56.08%   +0.05%     
==========================================
  Files         606      606              
  Lines       80293    80333      +40     
==========================================
+ Hits        44989    45057      +68     
+ Misses      31769    31737      -32     
- Partials     3535     3539       +4     
Impacted Files Coverage Δ
eth/backend.go 0.00% <0.00%> (ø)
miner/miner.go 0.00% <0.00%> (ø)
p2p/peer.go 73.54% <0.00%> (-0.58%) ⬇️
core/blockchain.go 59.52% <25.64%> (-1.42%) ⬇️
consensus/istanbul/core/core.go 68.95% <75.00%> (+0.12%) ⬆️
consensus/istanbul/backend/backend.go 50.80% <100.00%> (ø)
consensus/istanbul/backend/handler.go 29.58% <100.00%> (+0.45%) ⬆️
consensus/istanbul/backend/peer_handler.go 58.33% <100.00%> (ø)
consensus/istanbul/core/handler.go 83.46% <100.00%> (+0.26%) ⬆️
eth/downloader/downloader.go 76.62% <100.00%> (+0.14%) ⬆️
... and 22 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 58ca798...b9ba309.

@piersy piersy force-pushed the piersy/add-start-stop-e2e-test-rebased branch from 00358de to a04b020 on October 1, 2021 13:55
@piersy
Contributor Author

piersy commented Oct 1, 2021

Coverage from tests in ./e2e_test/... for ./consensus/istanbul/... at commit a04b020

coverage:  51.5% of statements across all listed packages
coverage:  64.1% of statements in consensus/istanbul
coverage:  51.8% of statements in consensus/istanbul/backend
coverage:   0.0% of statements in consensus/istanbul/backend/backendtest
coverage:  50.0% of statements in consensus/istanbul/backend/internal/db
coverage:  46.5% of statements in consensus/istanbul/backend/internal/enodes
coverage:  23.6% of statements in consensus/istanbul/backend/internal/replica
coverage:  71.0% of statements in consensus/istanbul/core
coverage:   0.0% of statements in consensus/istanbul/proxy
coverage:  75.3% of statements in consensus/istanbul/uptime
coverage: 100.0% of statements in consensus/istanbul/uptime/store
coverage:  51.8% of statements in consensus/istanbul/validator
coverage:  79.2% of statements in consensus/istanbul/validator/random

@piersy piersy added this to the Reliability track milestone Oct 1, 2021
@piersy piersy force-pushed the piersy/add-start-stop-e2e-test-rebased branch from a04b020 to 99a2d7f on October 4, 2021 11:51
@piersy piersy self-assigned this Oct 4, 2021
@@ -210,6 +210,7 @@ type Backend struct {

aWallets atomic.Value

wg sync.WaitGroup
Contributor

Given the number of properties Backend already has, I'd consider a much more descriptive name for this waitgroup (and a small comment explaining why it's needed): closeReadyWG, or something along those lines.

Contributor Author

So it looks like we may just be able to remove this wg, since the go-routine it was waiting for doesn't appear to be necessary.

@@ -196,6 +196,7 @@ type BlockChain struct {

quit chan struct{} // blockchain quit channel
wg sync.WaitGroup // chain processing wait group for shutting down
wgMu sync.Mutex // chain processing wait group for shutting down
Contributor

Is this mutex intended to make sure the Wait happens after the first Add? AFAIK sync.WaitGroup is thread-safe in its use and there shouldn't be any need to synchronize access to it, but I may be missing some border case.

Contributor Author

@piersy piersy Oct 5, 2021

This mutex ensures that Wait and Add cannot be called concurrently; see golang/go#23842.

Contributor

Ah, right, it's thread-safe if you call Add when the count is already > 0, but if it's exactly 0 then it's not thread-safe. Either a broken API or a weird use on our part.

Given that in this particular use case Add should never be called from a different goroutine without the other lock, I think we'd better encapsulate this in a new type (DynamicWaitGroup, or whatever name you can think of) with its own internal mutex, both to avoid adding one more reference to the blockchain type and to make sure that any new code written here cannot accidentally call Add without the lock.
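
A minimal sketch of the kind of wrapper suggested here (the name and exact API are assumptions, not code from this PR):

package syncutil // hypothetical package

import "sync"

// DynamicWaitGroup wraps sync.WaitGroup so that Add can safely race with Wait:
// once Wait has been called, further Adds are refused instead of racing the
// counter at zero (golang/go#23842).
type DynamicWaitGroup struct {
	mu      sync.Mutex
	wg      sync.WaitGroup
	waiting bool
}

// Add registers n operations, returning false if Wait has already been called,
// so the caller knows not to start the work.
func (d *DynamicWaitGroup) Add(n int) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.waiting {
		return false
	}
	d.wg.Add(n)
	return true
}

// Done marks one operation complete.
func (d *DynamicWaitGroup) Done() { d.wg.Done() }

// Wait blocks new Adds, then waits for in-flight operations to finish.
func (d *DynamicWaitGroup) Wait() {
	d.mu.Lock()
	d.waiting = true
	d.mu.Unlock()
	d.wg.Wait()
}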

Contributor Author

Good suggestion, but I've actually cherry-picked a fix from upstream that was merged during the lifetime of this PR, because I think that will cause fewer problems when merging.

ethereum/go-ethereum#22853

trie/database.go Outdated
@@ -778,10 +778,15 @@ func (db *Database) Commit(node common.Hash, report bool, callback func(common.H
// commit is the private locked version of Commit.
func (db *Database) commit(hash common.Hash, batch ethdb.Batch, uncacher *cleaner, callback func(common.Hash)) error {
// If the node does not exist, it's a previously committed node

db.lock.RLock()
node, ok := db.dirties[hash]
if !ok {
Contributor

Isn't this the same as moving the RUnlock call right after the db.dirties[hash] read?

Contributor Author

Yes it is :) thank you!
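
That simplification as a sketch, using the fields shown in the diff above (the rest of the commit function is elided):

db.lock.RLock()
node, ok := db.dirties[hash]
db.lock.RUnlock() // release immediately after the map read instead of holding the lock across the branch
if !ok {
	// If the node does not exist, it's a previously committed node.
	return nil
}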

@hbandura
Contributor

hbandura commented Oct 5, 2021

I'm confused about some changes. Are the p2p changes and the eth/config changes actually part of this PR, or is GitHub botching the diff due to a merge? If they are, they seem big (well, not big, but important) and I'd rather have them in a separate PR to make them stand out more.

@piersy
Contributor Author

piersy commented Oct 5, 2021

I'm confused about some changes. Are the p2p changes and the eth/config changes actually part of this PR, or is GitHub botching the diff due to a merge? If they are, they seem big (well, not big, but important) and I'd rather have them in a separate PR to make them stand out more.

Hey @hbandura, they are meant to be part of this PR. I think the commit messages make this clear.

Allow configuring the min peers needed to sync
In order to run small scale e2e tests reliably we need to be able to let
nodes sync from just one other node. This commit allows the minimum sync
peers to be configured, where previously it was hardcoded to 5.

And

Allow modification of dial and connection timeouts
Now dialHistoryExpiration and inboundThrottleTime can be configured via
config. This allows for the writing of fast tests.

I would suggest reviewing this PR commit by commit for a better understanding of all the individual changes it contains.

We could still split these changes out if you think that would be better. I generally try to keep changes together with the thing that precipitated them, since that makes it clear why they are needed; otherwise you end up with PRs containing changes that don't seem to serve any purpose. That's why I kept all these changes together: they are all needed for this test to work.

@piersy piersy requested a review from hbandura October 5, 2021 12:47
@piersy
Contributor Author

piersy commented Oct 5, 2021

Coverage from tests in ./e2e_test/... for ./consensus/istanbul/... at commit 1ab51ee

coverage:  51.4% of statements across all listed packages
coverage:  64.1% of statements in consensus/istanbul
coverage:  51.7% of statements in consensus/istanbul/backend
coverage:   0.0% of statements in consensus/istanbul/backend/backendtest
coverage:  50.0% of statements in consensus/istanbul/backend/internal/db
coverage:  46.0% of statements in consensus/istanbul/backend/internal/enodes
coverage:  24.3% of statements in consensus/istanbul/backend/internal/replica
coverage:  70.8% of statements in consensus/istanbul/core
coverage:   0.0% of statements in consensus/istanbul/proxy
coverage:  75.3% of statements in consensus/istanbul/uptime
coverage: 100.0% of statements in consensus/istanbul/uptime/store
coverage:  51.8% of statements in consensus/istanbul/validator
coverage:  79.2% of statements in consensus/istanbul/validator/random

@@ -549,7 +551,11 @@ func (sb *Backend) Commit(proposal istanbul.Proposal, aggregatedSeal types.Istan
return err
}
}
go sb.onNewConsensusBlock(block, result.Receipts, result.Logs, result.State)
sb.wg.Add(1)
Contributor

Doesn't this waitgroup face the same problem with the concurrent Add? Or are Commit and Close both called from the same thread?

Contributor Author

Well spotted, this is not safe. I'm thinking of just removing the go-routine; it doesn't seem to affect the tests and it makes the code simpler.

@piersy piersy modified the milestone: Reliability track Oct 20, 2021
@piersy piersy requested a review from hbandura October 21, 2021 08:46
Contributor

@hbandura hbandura left a comment

Approving it, but with some comments; feel free to address them.

}

// Close locks the mutex, then closes it.
func (cm *ClosableMutex) Close() {
Contributor

Should Close work if the mutex is locked?

Contributor Author

Hey @hernan, this mutex was added by the go-ethereum team; it's part of the fix I cherry-picked from upstream.

So they would know best, but my understanding is that Close should wait to acquire the lock before closing the channel, since closing the channel before obtaining the lock could result in calls to Unlock panicking.
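
For context, a sketch of how such a channel-backed mutex can work, consistent with the behaviour described above (modelled on ethereum/go-ethereum#22853; details may differ from the exact cherry-picked code):

package syncx

// ClosableMutex is a mutex that can also be closed; once closed it can never
// be taken again. A token in the 1-buffered channel means "unlocked".
type ClosableMutex struct {
	ch chan struct{}
}

func NewClosableMutex() *ClosableMutex {
	ch := make(chan struct{}, 1)
	ch <- struct{}{}
	return &ClosableMutex{ch}
}

// TryLock attempts to take the lock, returning false if the mutex was closed.
func (cm *ClosableMutex) TryLock() bool {
	_, ok := <-cm.ch
	return ok
}

// Unlock returns the token. It must only be called while holding the lock.
func (cm *ClosableMutex) Unlock() {
	select {
	case cm.ch <- struct{}{}:
	default:
		panic("Unlock of already-unlocked ClosableMutex")
	}
}

// Close locks the mutex, then closes it. Taking the lock first is what stops a
// concurrent Unlock from panicking on a send to a closed channel.
func (cm *ClosableMutex) Close() {
	if _, ok := <-cm.ch; !ok {
		panic("Close of already-closed ClosableMutex")
	}
	close(cm.ch)
}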

@@ -0,0 +1,64 @@
// Copyright 2021 The go-ethereum Authors
Contributor

"The Celo Authors"

Contributor Author

This is the fix cherry-picked from go-ethereum.

@@ -549,8 +549,7 @@ func (sb *Backend) Commit(proposal istanbul.Proposal, aggregatedSeal types.Istan
return err
}
}
go sb.onNewConsensusBlock(block, result.Receipts, result.Logs, result.State)

sb.onNewConsensusBlock(block, result.Receipts, result.Logs, result.State)
Contributor

Is running this in a goroutine no longer necessary due to the other synchronization added in the worker?

Contributor Author

I don't know if it ever was necessary. All I did was remove the go-routine and nothing broke.

@@ -351,6 +357,8 @@ func (w *worker) mainLoop() {
var cancel context.CancelFunc
var wg sync.WaitGroup

// Ensure that block construction is complete before exiting this function.
Contributor

Expand on why block construction needs to complete before exiting, if possible.

Contributor Author

@piersy piersy Oct 27, 2021

I think what I meant there was just that we wait for all started go-routines to finish before exiting, and it's the process of block construction that starts those go-routines. It's not actually important that block construction has finished, just that all started go-routines have.

So what about:

// Ensure that all started go-routines finish before exiting this function.
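
For illustration, the shape of that pattern in the worker's main loop (a simplified sketch; the channel names and helper here are assumptions, not the PR's literal code):

package miner

import "sync"

type worker struct {
	newWorkCh chan struct{}
	exitCh    chan struct{}
}

func (w *worker) constructAndSubmitNewBlock() { /* block construction elided */ }

func (w *worker) mainLoop() {
	var wg sync.WaitGroup
	// Ensure that all started go-routines finish before exiting this function.
	defer wg.Wait()

	for {
		select {
		case <-w.newWorkCh:
			wg.Add(1) // Add always happens here, before the deferred Wait can run
			go func() {
				defer wg.Done()
				w.constructAndSubmitNewBlock()
			}()
		case <-w.exitCh:
			return // the deferred wg.Wait() blocks until in-flight work is done
		}
	}
}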

@piersy piersy force-pushed the piersy/add-start-stop-e2e-test-rebased branch from 49c2f9d to ff1dd46 on December 15, 2021 12:20
In order to run small scale e2e tests reliably we need to be able to let
nodes sync from just one other node. This commit allows the minimum sync
peers to be configured, where previously it was hardcoded to 5.
Now dialHistoryExpiration and inboundThrottleTime can be configured via
config. This allows for the writing of fast tests.
Ensure that Add is called before Wait.
backend.Backend.Commit was starting a goroutine but not waiting for it
to complete.

This commit adds a wait group to wait for the goroutine to complete when
backend.Backend is closed.
Also make miner wait for goroutines to end at Shutdown.
Ensure Add is called before Wait
core.current was being read via calls to core.ParentCommits from the
worker's main loop, and written via calls to core.Stop coming from the
miner's update loop.

The fix is to add a mutex to synchronise access to current.

==================
WARNING: DATA RACE
Write at 0x00c00114dd98 by goroutine 329:
  github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).Stop()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/handler.go:67 +0x8e
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).StopValidating()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:707 +0x193
  github.com/celo-org/celo-blockchain/miner.(*worker).stop()
      /home/pierspowlesland/projects/celo-blockchain/miner/worker.go:220 +0xbb
  github.com/celo-org/celo-blockchain/miner.(*Miner).update()
      /home/pierspowlesland/projects/celo-blockchain/miner/miner.go:173 +0x6fb

Previous read at 0x00c00114dd98 by goroutine 471:
  github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).ParentCommits()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/core.go:204 +0x4a
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).addParentSeal.func1()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:989 +0x38a
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).addParentSeal()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:1019 +0x3ab
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).Prepare()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:404 +0x404
  github.com/celo-org/celo-blockchain/miner.prepareBlock()
      /home/pierspowlesland/projects/celo-blockchain/miner/block.go:89 +0x62f
  github.com/celo-org/celo-blockchain/miner.(*worker).constructAndSubmitNewBlock()
      /home/pierspowlesland/projects/celo-blockchain/miner/worker.go:242 +0x84
  github.com/celo-org/celo-blockchain/miner.(*worker).mainLoop.func1.1()
      /home/pierspowlesland/projects/celo-blockchain/miner/worker.go:379 +0x5c
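
A minimal sketch of that fix (field names and the round-state type are assumptions; the same guard-with-a-mutex pattern recurs in several of the race fixes below):

package core

import "sync"

type messageSet struct{ /* elided */ }

type roundState struct{ parentCommits *messageSet }

type core struct {
	currentMu sync.RWMutex // assumed name: guards current
	current   *roundState
}

// ParentCommits is called from the worker's main loop.
func (c *core) ParentCommits() *messageSet {
	c.currentMu.RLock()
	defer c.currentMu.RUnlock()
	if c.current == nil {
		return nil
	}
	return c.current.parentCommits // read of current is now synchronised with Stop's write
}

// Stop is called from the miner's update loop.
func (c *core) Stop() {
	c.currentMu.Lock()
	c.current = nil // previously an unsynchronised write racing ParentCommits
	c.currentMu.Unlock()
}
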
core.resetResendRoundChangeTimer was being called from the main engine
go-routine and core.stopResendRoundChangeTimer was being called via
core.Stop from the miner's update go-routine. Both accessed
resendRoundChangeMessageTimer without synchronisation.

This commit adds a mutex to ensure synchronised access to
resendRoundChangeMessageTimer.

==================
WARNING: DATA RACE
Read at 0x00c01377f568 by goroutine 72:
  github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).stopResendRoundChangeTimer()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/core.go:710 +0x47
  github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).stopAllTimers()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/core.go:719 +0x54
  github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).Stop()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/handler.go:61 +0x44
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).StopValidating()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:707 +0x193
  github.com/celo-org/celo-blockchain/miner.(*worker).stop()
      /home/pierspowlesland/projects/celo-blockchain/miner/worker.go:220 +0xbb
  github.com/celo-org/celo-blockchain/miner.(*Miner).update()
      /home/pierspowlesland/projects/celo-blockchain/miner/miner.go:173 +0x6fb

Previous write at 0x00c01377f568 by goroutine 312:
  github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).resetResendRoundChangeTimer()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/core.go:780 +0x368
  github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).resendRoundChangeMessage()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/core.go:795 +0x7b
  github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).handleResendRoundChangeEvent()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/handler.go:247 +0x353
  github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).handleEvents()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/handler.go:143 +0x77e
The engine's current round state was being read without synchronisation
via core.CurrentView by the miner; it was also being written in the
engine's stop function.

The fix is to use the pre-existing mutex to synchronise accesses through
CurrentView.

==================
WARNING: DATA RACE
Read at 0x00c000bb20b8 by goroutine 506:
  github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).CurrentView()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/core.go:195 +0x4a
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend.waitCoreToReachSequence()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:1203 +0x24c
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).addParentSeal.func1()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:982 +0x130
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).addParentSeal()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:1019 +0x3ab
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).Prepare()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:404 +0x404
  github.com/celo-org/celo-blockchain/miner.prepareBlock()
      /home/pierspowlesland/projects/celo-blockchain/miner/block.go:89 +0x62f
  github.com/celo-org/celo-blockchain/miner.(*worker).constructAndSubmitNewBlock()
      /home/pierspowlesland/projects/celo-blockchain/miner/worker.go:242 +0x84
  github.com/celo-org/celo-blockchain/miner.(*worker).mainLoop.func1.1()
      /home/pierspowlesland/projects/celo-blockchain/miner/worker.go:379 +0x5c

Previous write at 0x00c000bb20b8 by goroutine 93:
  github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).Stop()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/handler.go:67 +0x8e
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).StopValidating()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:707 +0x193
  github.com/celo-org/celo-blockchain/miner.(*worker).stop()
      /home/pierspowlesland/projects/celo-blockchain/miner/worker.go:220 +0xbb
  github.com/celo-org/celo-blockchain/miner.(*Miner).update()
      /home/pierspowlesland/projects/celo-blockchain/miner/miner.go:173 +0x6fb
The eth.chainSyncer main loop periodically calls
eth.chainSyncer.startSync, which fires off a sync operation in its own
goroutine that will eventually result in Downloader.Cancel being
called.

The eth.chainSyncer also calls Downloader.Cancel when exiting its main
loop.

This resulted in a data race on Downloader.ancientLimit; the solution
was to add a mutex to lock over Downloader.ancientLimit in
Downloader.Cancel.

==================
WARNING: DATA RACE
Write at 0x00c01a0e3ce0 by goroutine 466:
  github.com/celo-org/celo-blockchain/eth/downloader.(*Downloader).Cancel()
      /home/pierspowlesland/projects/celo-blockchain/eth/downloader/downloader.go:615 +0x68
  github.com/celo-org/celo-blockchain/eth/downloader.(*Downloader).synchronise()
      /home/pierspowlesland/projects/celo-blockchain/eth/downloader/downloader.go:444 +0x601
  github.com/celo-org/celo-blockchain/eth/downloader.(*Downloader).Synchronise()
      /home/pierspowlesland/projects/celo-blockchain/eth/downloader/downloader.go:354 +0xb5
  github.com/celo-org/celo-blockchain/eth.(*ProtocolManager).doSync()
      /home/pierspowlesland/projects/celo-blockchain/eth/sync.go:327 +0x1ad
  github.com/celo-org/celo-blockchain/eth.(*chainSyncer).startSync.func1()
      /home/pierspowlesland/projects/celo-blockchain/eth/sync.go:303 +0x59

Previous write at 0x00c01a0e3ce0 by goroutine 182:
  github.com/celo-org/celo-blockchain/eth/downloader.(*Downloader).Cancel()
      /home/pierspowlesland/projects/celo-blockchain/eth/downloader/downloader.go:615 +0x68
  github.com/celo-org/celo-blockchain/eth/downloader.(*Downloader).Terminate()
      /home/pierspowlesland/projects/celo-blockchain/eth/downloader/downloader.go:635 +0xcc
  github.com/celo-org/celo-blockchain/eth.(*chainSyncer).loop()
      /home/pierspowlesland/projects/celo-blockchain/eth/sync.go:232 +0x65b
p2p.PurposeFlag is not thread-safe but was being accessed from many
go-routines without synchronisation via p2p.Peer.HasPurpose.

The solution was to lock over the access in HasPurpose using the mutex
that already existed to synchronise access to a peer's purposes flag.

==================
WARNING: DATA RACE
Read at 0x00c001520680 by goroutine 291:
  github.com/celo-org/celo-blockchain/p2p.(*Peer).HasPurpose()
      /home/pierspowlesland/projects/celo-blockchain/p2p/peer.go:167 +0x5fd
  github.com/celo-org/celo-blockchain/eth.(*peer).PurposeIsSet()
      /home/pierspowlesland/projects/celo-blockchain/eth/peer.go:696 +0x516
  github.com/celo-org/celo-blockchain/eth.(*ProtocolManager).FindPeers()
      /home/pierspowlesland/projects/celo-blockchain/eth/handler.go:1055 +0x612
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*validatorPeerHandler).ReplaceValidatorPeers()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/peer_handler.go:147 +0x201
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend/internal/enodes.(*ValidatorEnodeDB).RefreshValPeers()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/internal/enodes/val_enode_db.go:480 +0x641
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).RefreshValPeers()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/backend.go:846 +0x264
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).StartValidating()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:691 +0x3a5
  github.com/celo-org/celo-blockchain/miner.(*worker).start()
      /home/pierspowlesland/projects/celo-blockchain/miner/worker.go:210 +0x1a2
  github.com/celo-org/celo-blockchain/miner.(*Miner).update()
      /home/pierspowlesland/projects/celo-blockchain/miner/miner.go:161 +0x6a4

Previous write at 0x00c001520680 by goroutine 343:
  github.com/celo-org/celo-blockchain/p2p.(*Peer).AddPurpose()
      /home/pierspowlesland/projects/celo-blockchain/p2p/peer.go:148 +0xd5
  github.com/celo-org/celo-blockchain/p2p.(*Server).run.func3()
      /home/pierspowlesland/projects/celo-blockchain/p2p/server.go:866 +0x2d1
  github.com/celo-org/celo-blockchain/p2p.(*Server).run()
      /home/pierspowlesland/projects/celo-blockchain/p2p/server.go:916 +0x241a
announceRunning is set in istanbul.Backend.StartAnnouncing and
istanbul.Backend.StopAnnouncing, and was also being accessed without
synchronisation from within istanbul.Backend.newChainHead, which is
executed repeatedly from within its own go-routine.

This commit read-locks over announceRunning in
istanbul.Backend.newChainHead to avoid a data race.

Unfortunately we can't see both sides of this data race; it seems the
stack was corrupted somehow, since the first stack trace cannot be
correct (how could fmt.Fscanf call TestStartStopValidators?). I've
shortened the stack with an ellipsis because it was also very long.

==================
WARNING: DATA RACE
Write at 0x00c001448520 by goroutine 354:
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).StopAnnouncing()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:749 +0x12b
  github.com/celo-org/celo-blockchain/eth.(*Ethereum).stopAnnounce()
      /home/pierspowlesland/projects/celo-blockchain/eth/backend.go:507 +0x2ad
  github.com/celo-org/celo-blockchain/eth.(*Ethereum).Stop()
      /home/pierspowlesland/projects/celo-blockchain/eth/backend.go:575 +0x2ae
  github.com/celo-org/celo-blockchain/node.(*Node).stopServices()
      /home/pierspowlesland/projects/celo-blockchain/node/node.go:309 +0x150
  github.com/celo-org/celo-blockchain/node.(*Node).Close()
      /home/pierspowlesland/projects/celo-blockchain/node/node.go:221 +0x1bb
  github.com/celo-org/celo-blockchain/test.(*Node).Close()
      /home/pierspowlesland/projects/celo-blockchain/test/node.go:312 +0x3b8
  github.com/celo-org/celo-blockchain/test.Network.Shutdown()
      /home/pierspowlesland/projects/celo-blockchain/test/node.go:498 +0x9a
  runtime.call32()
      /usr/local/go/src/runtime/asm_amd64.s:551 +0x3d
  testing.(*T).FailNow()
      <autogenerated>:1 +0x44
  github.com/stretchr/testify/require.NoError()
      /home/pierspowlesland/go/pkg/mod/github.com/stretchr/testify@v1.4.0/require/require.go:974 +0x104
  github.com/celo-org/celo-blockchain/e2e_test_test.TestStartStopValidators()
      /home/pierspowlesland/projects/celo-blockchain/e2e_test/e2e_test.go:168 +0x15ce
  fmt.Fscanf()
      /usr/local/go/src/fmt/scan.go:143 +0xee
  fmt.Sscanf()
      /usr/local/go/src/fmt/scan.go:114 +0x191
  github.com/syndtr/goleveldb/leveldb/storage.fsParseName()
      /home/pierspowlesland/go/pkg/mod/github.com/syndtr/goleveldb@v1.0.1-0.20190923125748-758128399b1d/leveldb/storage/file_storage.go:643 +0xa6
  github.com/syndtr/goleveldb/leveldb/storage.(*fileStorage).List()
      /home/pierspowlesland/go/pkg/mod/github.com/syndtr/goleveldb@v1.0.1-0.20190923125748-758128399b1d/leveldb/storage/file_storage.go:458 +0x309
  github.com/syndtr/goleveldb/leveldb.(*DB).checkAndCleanFiles()
      /home/pierspowlesland/go/pkg/mod/github.com/syndtr/goleveldb@v1.0.1-0.20190923125748-758128399b1d/leveldb/db_util.go:52 +0x301
  github.com/syndtr/goleveldb/leveldb.openDB()
      /home/pierspowlesland/go/pkg/mod/github.com/syndtr/goleveldb@v1.0.1-0.20190923125748-758128399b1d/leveldb/db.go:136 +0x9bb
  fmt.(*ss).doScanf()
      /usr/local/go/src/fmt/scan.go:1230 +0x411
  fmt.Fscanf()
  ...

Previous read at 0x00c001448520 by goroutine 203:
  [failed to restore the stack]
@piersy piersy force-pushed the piersy/add-start-stop-e2e-test-rebased branch from ff1dd46 to 6c36712 on December 15, 2021 18:03
piersy and others added 6 commits December 15, 2021 19:52
Removing the go-routine does not seem to have any effect on the tests.
It seems that very occasionally we are hitting this error condition
and it's not clear why.
…2853)

This change removes misuses of sync.WaitGroup in BlockChain. Before this change,
block insertion modified the WaitGroup counter in order to ensure that Stop would wait
for pending operations to complete. This was racy and could even lead to crashes
if Stop was called at an unfortunate time. The issue is resolved by adding a specialized
'closable' mutex, which prevents chain modifications after stopping while also
synchronizing writers with each other.

Co-authored-by: Felix Lange <fjl@twurst.com>
The start stop test would sometimes time out while nodes were stopped
and the Network.Shutdown method would print errors relating to trying to
stop a node twice.

The shutdown mechanism has been updated to allow for more flexibility in
deciding what errors are printed, allowing us to avoid printing double
shutdown errors when we have manually stopped nodes.
@piersy piersy force-pushed the piersy/add-start-stop-e2e-test-rebased branch from 6c36712 to 1ab51ee on December 15, 2021 19:53
@mcortesi mcortesi merged commit 75f0db3 into master Dec 16, 2021
@mcortesi mcortesi deleted the piersy/add-start-stop-e2e-test-rebased branch December 16, 2021 13:05