fix: puller rewrite and bug fixes #3437
Conversation
pkg/localstore/subscription_pull.go
Outdated
@@ -44,18 +43,16 @@ func (db *DB) SubscribePull(ctx context.Context, bin uint8, since, until uint64)

	chunkDescriptors := make(chan storage.Descriptor)

-	in, out, clean := flipflop.NewFallingEdge(flipFlopBufferDuration, flipFlopWorstCaseDuration)
+	trigger := make(chan struct{}, 1)
These triggers would be fired on Put operations. As the channel is buffered with capacity one, wouldn't the Put operation block on the pullIndex iteration completing every time? How would this work?
See line 208 in subscription_pull.go: it is a select with a default case, so it does not block.
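The non-blocking trigger pattern being described (a capacity-one channel written with a select/default) can be sketched as follows; the names here are illustrative, not the actual localstore code:

```go
package main

import "fmt"

// notify performs a non-blocking send on a buffered trigger channel.
// If a trigger is already pending, the default case is taken and the
// caller (e.g. a Put operation) never blocks; bursts of notifications
// coalesce into a single pending trigger.
func notify(trigger chan struct{}) bool {
	select {
	case trigger <- struct{}{}:
		return true // trigger queued
	default:
		return false // a trigger is already pending; coalesced
	}
}

func main() {
	trigger := make(chan struct{}, 1)
	fmt.Println(notify(trigger)) // true: buffer was empty
	fmt.Println(notify(trigger)) // false: coalesced, no blocking
	<-trigger                    // the subscriber consumes the pending trigger
	fmt.Println(notify(trigger)) // true again
}
```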
pkg/puller/puller.go
Outdated
	p.metrics.LiveWorkerErrCounter.Inc()

	if errors.Is(err, context.Canceled) {
This is risky to do if the peer is not reachable: we would stay in this retry loop, which could cause a lot of CPU usage. Maybe we should have a retry counter.
Also, I would add a log message saying which attempt it is when we call SyncInterval again; this can help us identify the problem.
A unit test also needs to be added for this behaviour.
We check for a specific error, context cancellation, and unreachable peers do not produce a context cancellation error. The loop also sleeps for 5 minutes when this error is detected, so it is not a tight spinning loop.
pkg/localstore/subscription_pull.go
Outdated
@@ -171,9 +167,9 @@ func (db *DB) SubscribePull(ctx context.Context, bin uint8, since, until uint64)
	defer db.pullTriggersMu.Unlock()

	for i, t := range db.pullTriggers[bin] {
-		if t == in {
+		if t == trigger {
			db.pullTriggers[bin] = append(db.pullTriggers[bin][:i], db.pullTriggers[bin][i+1:]...)
The channel can be closed here.
That will trigger the select above; it should be okay like this.
I am not so sure about removing the falling edge detection on subscription. The reasoning in the PR description, that it is related to node bootup, does not have to be true. That detection is a protection for any situation where a large number of chunks triggers the subscription to call pullIndex.Iterate too frequently, causing extensive I/O. I think that removing this kind of protection should be backed up with measurements.
4fde41d to a6a7ca9 (Compare)
As a measure of precaution, I would suggest adding regression tests for all issues that these changes are addressing, both for validation and to protect against reintroducing the same issues in the future.
Additionally, I get this test case failure:
=== CONT TestDepthChange/move_peer_around
/Users/janos/go/projects/ethswarm.org/bee/pkg/puller/puller_test.go:440: got unexpected interval: [], want [[1 1]] bin 3
which passes on the master branch.
	}
	return
	top, _, err := p.syncer.SyncInterval(ctx, peer, bin, from, pullsync.MaxCursor)
	if err != nil {
Same as the comment above. We should ideally look for particular errors before restarting; if we get terminal errors like a stream reset, we should quit early.
see comment above
	},
	pullSync: []mockps.Option{
		mockps.WithCursors([]uint64{1}),
		mockps.WithSyncError(errors.New("sync error"))},
It would be better if this were a function which returns an error a few times and then returns success. We should also test whether syncing can restart correctly.
Regarding "test if this can restart correctly": what do you mean?
This will always return an error from SyncInterval. Instead, we should return an error and then succeed on the next call.
	// bound to fail.
	ctxC, cancelC := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancelC()
	if err := p.syncer.CancelRuid(ctxC, peer, ruid); err != nil {
Why is this being removed? This makes it a breaking change in the protocols.
@janos Can you add some brief description about this cancel ruid functionality? Maybe we are missing something?
The whole pullsync cancel protocol is a way to signal a call to a context's cancel function for a particular syncing request, to avoid unnecessary waiting in the pullsync handler if the "client" syncing peer has problems storing intervals. Since stream termination (reset) can be detected only in the stream's Read or Write methods, a mechanism was needed to terminate the other functions that happen in between when there is an error on the other peer.
I would add that if the CancelRuid functionality is removed, an alternative approach to the possible goroutine leak is required, and the functionality should be cleaned up completely, since without CancelRuid calls the whole pullsync cancel protocol can be removed.
@@ -23,7 +23,6 @@ linters:
   - importas
   - ineffassign
   - misspell
-  - nakedret
I have no preference on naked vs. non-naked returns, but I could not see a strong enough need in this PR to change the linter configuration, and consequently the coding policy. The changes with naked returns have no functional requirement, while naked returns can be a source of subtle problems.
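One example of the kind of subtle problem naked returns can cause, as a generic illustration rather than code from this PR: with named result parameters, a bare return silently returns whatever the result variables currently hold, so a forgotten error assignment goes unnoticed.

```go
package main

import "fmt"

// div has named results, so the bare `return` in the b == 0 branch
// silently returns (0, nil). The author presumably meant to set err
// there, but nothing at the return site makes the omission visible.
func div(a, b int) (q int, err error) {
	if b == 0 {
		// missing: err = errors.New("division by zero")
		return
	}
	q = a / b
	return
}

func main() {
	q, err := div(1, 0)
	fmt.Println(q, err) // 0 <nil>: the error case was silently dropped
}
```

An explicit `return 0, err` at each site would have made the missing assignment obvious, which is essentially what the nakedret linter enforces.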
Checklist
Description
Fixes bugs tagged in the reserve-bugs branch. Some bugs are:
We also remove the falling edge detector. It was originally put in place because of high CPU usage during bootup while Kademlia was establishing new connections, but the puller now waits for node warm-up.
Open API Spec Version Changes (if applicable)
Motivation and Context (Optional)
Related Issue (Optional)
Screenshots (if appropriate):