fix(snapshot): Enable snapshot streaming after bulkload #8938
base: main
Conversation
worker/draft.go (Outdated)
// Need a delay otherwise it interferes with starting of the Raft loop
time.AfterFunc(1*time.Second, func() {
Can I ask how it might interfere with the starting of the Raft loop?
Yeah, and we should avoid it. This will turn into a bad idea soon enough.
When the node initializes, it begins in the follower state, even in a single-alpha cluster. This pause ensures that the leader election process completes, allowing the node to acquire leadership. Without this pause, the node tries to capture a snapshot but fails the check in n.AmLeader(), resulting in no snapshot being taken.
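To make that timing concrete, here is a small runnable toy model of the explanation above; node, AmLeader, and trySnapshot are stand-ins for the dgraph internals, not the real API:

```go
package main

import (
	"log"
	"sync/atomic"
	"time"
)

// node is a stand-in for dgraph's worker node; only the timing matters here.
type node struct{ leader atomic.Bool }

func (n *node) AmLeader() bool { return n.leader.Load() }

func (n *node) trySnapshot() {
	// Without the startup delay, this check runs before the election has
	// finished, AmLeader() is still false, and no snapshot is taken.
	if !n.AmLeader() {
		log.Println("not leader yet; skipping snapshot")
		return
	}
	log.Println("leader; taking snapshot")
}

func main() {
	n := &node{}
	// Simulate the single-node election completing shortly after startup.
	time.AfterFunc(200*time.Millisecond, func() { n.leader.Store(true) })
	// The 1s delay from the diff gives the election time to finish first.
	time.AfterFunc(1*time.Second, n.trySnapshot)
	time.Sleep(2 * time.Second)
}
```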
We need more tests as we discussed, plus a few questions and minor comments on the dgraphtest package.
@@ -626,6 +702,32 @@ func (c *LocalCluster) Client() (*GrpcClient, func(), error) {
	return &GrpcClient{Dgraph: client}, cleanup, nil
}

// ClientForAlpha returns a grpc client that can talk to a specific Alpha in the cluster
func (c *LocalCluster) ClientForAlpha(id int) (*GrpcClient, func(), error) {
We should define this function such that the existing (*LocalCluster).Client function can use it too. This feels like duplicated code.
Defined one for conn; clients are a little too high-level to be able to use one for the other. Let me know if the modified piece is an OK abstraction.
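For readers following along, a rough sketch of the kind of shared connection helper being discussed; the type, names, and delegation are illustrative, not the actual dgraphtest code:

```go
package sketch

import (
	"fmt"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// localCluster stands in for dgraphtest.LocalCluster, reduced to the alpha
// addresses needed to show the shared helper.
type localCluster struct {
	alphaAddrs []string
}

// connForAlpha holds the one copy of the dialing logic.
func (c *localCluster) connForAlpha(id int) (*grpc.ClientConn, error) {
	if id < 0 || id >= len(c.alphaAddrs) {
		return nil, fmt.Errorf("invalid alpha id %d", id)
	}
	return grpc.Dial(c.alphaAddrs[id],
		grpc.WithTransportCredentials(insecure.NewCredentials()))
}

// client picks any alpha by delegating to the same helper, so a Client and a
// ClientForAlpha no longer duplicate connection setup.
func (c *localCluster) client() (*grpc.ClientConn, error) {
	return c.connForAlpha(0)
}
```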
I have added more tests and extended the dgraphtest package to support some of the functions required for the new tests. This PR is ready for another round of reviews.
As discussed, we need to think about more test cases. A few minor comments and questions.
	return fmt.Errorf("failed to assign state. error: %s", resp.Errors[0].Message)
}

if resp, err := parseAssignIdResponse(body); err == nil && resp.StartId != "" {
What if there is an error here?
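If the parse error should not be swallowed, one possible shape of the fix, as a fragment in the style of the hunk above (parseAssignIdResponse and StartId come from the diff; the error wrapping is illustrative):

```go
resp, err := parseAssignIdResponse(body)
if err != nil {
	return errors.Wrap(err, "error parsing assignId response")
}
if resp.StartId != "" {
	// ...continue as before...
}
```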
@@ -92,13 +93,29 @@ type dnode interface {
	zeroURL(*LocalCluster) (string, error)
}

type nodeType interface {
Why did we define a new interface? Can we not use the existing interface?
if err := os.MkdirAll(pDir, os.ModePerm); err != nil {
	return nil, errors.Wrap(err, "error creating bulk dir")
}
if err != nil {
Why are we checking for the error here again?
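If the second check is indeed leftover from an earlier revision, the hunk would shrink to something like this (a sketch, assuming no other err is in scope):

```go
// err is scoped to the if statement, so no later `if err != nil` can see it;
// the trailing check would be dead (or would test an unrelated err).
if err := os.MkdirAll(pDir, os.ModePerm); err != nil {
	return nil, errors.Wrap(err, "error creating bulk dir")
}
```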
@@ -493,3 +493,28 @@ func (c *LocalCluster) BulkLoad(opts BulkOpts) error {
		return nil
	}
}

func (c *LocalCluster) CopyBulkLoadDirsToAlphaMounts() error {
Why are we doing the copy when we can just let the data stream?
This is for the test where we want to copy the p directories into each of the alphas in the cluster and start them. This is different from the case where we let the data stream from one alpha to all the other alphas (covered in TestBulkLoaderSnapshotPDirinAlpha0).
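A minimal sketch of that copy-based setup, under assumed paths and bulk output layout (os.CopyFS requires Go 1.23+; the real CopyBulkLoadDirsToAlphaMounts may differ):

```go
package sketch

import (
	"fmt"
	"os"
	"path/filepath"
)

// copyPDirToMounts replicates one bulkloaded p directory into every alpha's
// mount before the cluster starts, as in the copy-based test above.
func copyPDirToMounts(bulkOutDir string, alphaMounts []string) error {
	src := filepath.Join(bulkOutDir, "0", "p") // assumed bulk output layout
	for _, mount := range alphaMounts {
		dst := filepath.Join(mount, "p")
		if err := os.CopyFS(dst, os.DirFS(src)); err != nil {
			return fmt.Errorf("copying %s to %s: %w", src, dst, err)
		}
	}
	return nil
}
```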
@@ -183,6 +184,13 @@ func (cc ClusterConfig) WithBulkLoadOutDir(dir string) ClusterConfig {
	return cc
}

// WithBulkLoadpDirIn sets the p dir for the alphas. This controls
I think this option is not clear; it took me some time to make sense of it. We should try to come up with a better name and possible values.
// Start and query each alpha
for i := 0; i < 3; i++ {
nit: remove the newlines
@@ -1837,6 +1852,11 @@ func (n *node) InitAndStartNode() {
	n.SetRaft(raft.StartNode(n.Cfg, peers))
	// Trigger election, so this node can become the leader of this single-node cluster.
	n.canCampaign = true
	// Also trigger a snapshot so that this node can take a snapshot if required
Can we loop on whether a leader is chosen instead?
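The loop-based alternative could look like this sketch: poll until the node reports leadership or a timeout expires, instead of sleeping for a fixed second. isLeader stands in for dgraph's n.AmLeader(); the interval and timeout are illustrative.

```go
package sketch

import "time"

// waitForLeadership polls for leadership rather than relying on a fixed
// delay, returning false if the election never completes in time.
func waitForLeadership(isLeader func() bool, timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if isLeader() {
			return true
		}
		time.Sleep(50 * time.Millisecond)
	}
	return false
}
```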
Description: Enable snapshot streaming after bulkload

Summary: Previously, the bulkloaded p directory couldn't stream to a new alpha due to a commitTs of zero. In this PR, the commitTs is sourced from the p directory, allowing the alpha to create a snapshot and subsequently stream it to another alpha.
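A minimal sketch of where such a commitTs could come from, assuming the p directory is a Badger store (this uses Badger's public MaxVersion; dgraph's actual bookkeeping is more involved):

```go
package main

import (
	"fmt"
	"log"

	badger "github.com/dgraph-io/badger/v4"
)

func main() {
	// Open the bulkloaded p directory read-only.
	db, err := badger.Open(badger.DefaultOptions("./p").WithReadOnly(true))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// MaxVersion is the highest commit timestamp present in the store; a
	// non-zero value is what lets the alpha create and stream a snapshot.
	fmt.Println("max commit ts:", db.MaxVersion())
}
```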
Tests:

1. Create a p directory using bulkload. Then, initiate one alpha using the bulkloaded p directory and start a second alpha without any p directory. Query both alphas to ensure the snapshot has been successfully generated and that the data is accessible from both instances.
2. Create a p directory using bulkload. Then, copy the bulkloaded p directory into all alphas and start the cluster. Query all alphas to ensure that the data is accessible from all instances.
3. Move the zero timestamp and load a p directory using bulkload. Then use this p directory on a fresh cluster (both zero and alpha are new). Validate that the query doesn't work without moving the timestamp of the new zero.

Closes: https://dgraph.atlassian.net/browse/DGRAPHCORE-214
Docs: NA