
Allow non-hardcoded ports for multinode cluster #167

Merged
merged 1 commit into cockroachdb:master from the allow-non-hardcoded-ports branch on Mar 16, 2023

Conversation

Contributor

@rafiss rafiss commented Mar 14, 2023

This required the join and init args to be computed on the fly, since we can no longer infer them until after a node has started.

informs cockroachdb/cockroach#98549
informs cockroachdb/cockroach#98612
informs cockroachdb/cockroach#98594
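
For readers unfamiliar with the approach, here is a minimal, self-contained sketch of how the --join argument can be assembled on the fly from whichever other nodes have already reported a pgURL. The pgURL, set, and u names mirror the snippets quoted in the review comments below; the function itself is only an illustration of the idea, not the exact diff.

package testserver

import (
	"fmt"
	"net/url"
	"strings"
	"time"
)

// pgURLChan mirrors the per-node struct referenced in the review below: set is
// closed once the node has reported its SQL URL, and u holds that URL.
type pgURLChan struct {
	set chan struct{}
	u   *url.URL
}

// joinArgForNode builds the --join flag for nodeID from whichever other nodes
// have already published a pgURL. A node that has not reported a URL within
// the timeout is skipped, and a placeholder address is used if no other node
// has started yet, since the start command always requires --join.
func joinArgForNode(nodeID int, pgURL []*pgURLChan, timeout time.Duration) string {
	var joinAddrs []string
	for otherNodeID, ch := range pgURL {
		if otherNodeID == nodeID {
			continue
		}
		select {
		case <-ch.set:
			joinAddrs = append(joinAddrs, fmt.Sprintf("localhost:%s", ch.u.Port()))
		case <-time.After(timeout):
			// The other node hasn't started yet; don't add a join address for it.
		}
	}
	if len(joinAddrs) == 0 {
		return "--join=localhost:0"
	}
	return fmt.Sprintf("--join=%s", strings.Join(joinAddrs, ","))
}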

@cockroach-teamcity
Member

This change is Reviewable

@rafiss rafiss force-pushed the allow-non-hardcoded-ports branch 9 times, most recently from 8539e8c to c087468 on March 15, 2023 15:28
Contributor

@ZhouXing19 ZhouXing19 left a comment


LGTM! I just left comments about adding more logs.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @pawalt, @rafiss, @renatolabs, and @smg260)


testserver/testservernode.go line 74 at r1 (raw file):

			joinAddrs = append(joinAddrs, fmt.Sprintf("localhost:%s", ts.pgURL[otherNodeID].u.Port()))
		case <-time.After(5 * time.Second):
			// If the other node hasn't started yet, don't add the join arg.

nit: might be helpful for future debugging if we can log something here

And should we also retry connecting to this node?


testserver/testservernode.go line 78 at r1 (raw file):

	}
	joinArg := fmt.Sprintf("--join=%s", strings.Join(joinAddrs, ","))

nit: can we add a log after exiting the loop, just to show which nodes will eventually be joined?
or maybe more generally just print out the whole startCmd right before executing it.
Just for debugging purposes.


testserver/testservernode.go line 84 at r1 (raw file):

			// The start command always requires a --join arg, so we fake one
			// if we don't have any yet.
			joinArg = "--join=localhost:0"

just curious why we need this ...? Can't we just not set the join flag?

@ZhouXing19
Contributor

testserver/testservernode.go line 127 at r1 (raw file):

	err := currCmd.Start()
	if currCmd.Process != nil {
		log.Printf("process %d started: %s", currCmd.Process.Pid, strings.Join(args, " "))

Should we also include the env vars in the log?
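
For illustration, a hedged sketch of what that log line could look like once the environment is included. cmd.Env, cmd.Args, and cmd.Process.Pid are standard os/exec fields; the wrapper function and the exact format are assumptions for this example, not the code in this PR.

package testserver

import (
	"log"
	"os/exec"
	"strings"
)

// logStartCmd records the pid, environment variables, and full argument list
// of a freshly started process, which makes failed test runs easier to debug.
func logStartCmd(cmd *exec.Cmd) {
	if cmd.Process == nil {
		return
	}
	log.Printf("process %d started with env [%s]: %s",
		cmd.Process.Pid, strings.Join(cmd.Env, " "), strings.Join(cmd.Args, " "))
}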

@ZhouXing19
Contributor

testserver/testservernode.go line 74 at r1 (raw file):

Previously, ZhouXing19 (Jane Xing) wrote…

nit: might be helpful for future debugging if we can log something here

And should we also retry connecting to this node?

nvm I just saw the log in the if currCmd.Process != nil {} below

@ZhouXing19
Contributor

testserver/testservernode.go line 78 at r1 (raw file):

Previously, ZhouXing19 (Jane Xing) wrote…

nit: can we add a log after exiting the loop, just to show which nodes will eventually be joined?
or maybe more generally just print out the whole startCmd right before executing it.
Just for debugging purposes.

ditto, just saw the log below

Contributor

@renatolabs renatolabs left a comment


Thank you for these fixes! Mostly LGTM except for a few questions.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @pawalt, @rafiss, and @smg260)


testserver/testservernode.go line 71 at r1 (raw file):

		}
		select {
		case <-ts.pgURL[otherNodeID].set:

What are the concurrency guarantees we attempt to make with testserver? I see there's some effort in these files to guard against concurrent access, but with this code, it would be problematic if we called StartNode and StopNode concurrently. Is that a problem?


testserver/testservernode.go line 73 at r1 (raw file):

		case <-ts.pgURL[otherNodeID].set:
			joinAddrs = append(joinAddrs, fmt.Sprintf("localhost:%s", ts.pgURL[otherNodeID].u.Port()))
		case <-time.After(5 * time.Second):

IIUC, this "timeout" is here to filter out nodes that have not been started yet, right? Is there any way we could reflect that with ts.pgURL instead? It seems right now we are creating a fixed-size slice on startup, meaning there's no way, at this point, to know which nodes we have attempted to start and which we have not. IMO, this would be more reliable if we knew which nodes we should wait for, and then failed if we didn't hear their PGURL after some time (5s or 10s).

Another thing is that if we wanted to start a (say) 10-node server, wouldn't it mean that we would spend a lot of time sleeping in this loop?

Contributor Author

@rafiss rafiss left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @pawalt, @renatolabs, @smg260, and @ZhouXing19)


testserver/testservernode.go line 71 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

What are the concurrency guarantees we attempt to make with testserver? I see there's some effort in these files to guard against concurrent access, but with this code, it would be problematic if we called StartNode and StopNode concurrently. Is that a problem?

There's one test called TestRestartNodeParallel, and I took that to be the "contract" offered by this package.

It does concurrently call StopNode and StartNode, but the same node is never started and stopped concurrently.
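
To make that contract concrete, here is a hedged sketch of the usage pattern TestRestartNodeParallel exercises: different nodes are restarted from different goroutines, but no single node is started and stopped at the same time. The multiNodeServer interface and its error-returning signatures are assumptions made for this example, not the package's actual API.

package testserver_test

import (
	"sync"
	"testing"
)

// multiNodeServer captures just the two methods discussed in this review.
type multiNodeServer interface {
	StartNode(i int) error
	StopNode(i int) error
}

// restartNodesInParallel restarts every node from its own goroutine: different
// nodes restart concurrently, but no single node is ever started and stopped
// at the same time.
func restartNodesInParallel(t *testing.T, ts multiNodeServer, numNodes int) {
	var wg sync.WaitGroup
	for i := 0; i < numNodes; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			if err := ts.StopNode(i); err != nil {
				t.Error(err)
				return
			}
			if err := ts.StartNode(i); err != nil {
				t.Error(err)
			}
		}(i)
	}
	wg.Wait()
}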


testserver/testservernode.go line 73 at r1 (raw file):

IIUC, this "timeout" is here to filter out nodes that have not been started yet, right?

that's right.

Is there any way we could reflect that with ts.pgURL instead?

Hm, that makes sense. We can add another channel there.

Another thing is that if we wanted to start a (say) 10-node server, wouldn't it mean that we would spend a lot of time sleeping in this loop?

Yeah, I did notice this. Initially I had this as case default:, but that made the TestRestartNodeParallel test (see above) fail, since everything would be unable to find a join port.
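
For illustration, a hedged sketch of the direction discussed here: give each node's pgURLChan an extra started channel (as a later revision of this PR does), skip nodes that were never started without sleeping, and fail loudly if a started node never reports its URL. Apart from the field names quoted in the review, everything below is an assumption for the example.

package testserver

import (
	"fmt"
	"net/url"
	"time"
)

// pgURLChan sketches the per-node state with the extra signal discussed above.
type pgURLChan struct {
	started chan struct{} // closed once the node's start command has been executed
	set     chan struct{} // closed once the node's pgURL is known
	u       *url.URL
}

// joinAddrsFor collects "localhost:<port>" for every other node that has been
// started, waiting up to timeout for each of them to report a pgURL. Nodes
// that were never started are skipped immediately instead of sleeping.
func joinAddrsFor(nodeID int, pgURL []*pgURLChan, timeout time.Duration) ([]string, error) {
	var joinAddrs []string
	for otherNodeID, ch := range pgURL {
		if otherNodeID == nodeID {
			continue
		}
		select {
		case <-ch.started:
			// The start command for this node has been issued; wait for its URL.
		default:
			continue // never started: nothing to wait for, no sleeping
		}
		select {
		case <-ch.set:
			joinAddrs = append(joinAddrs, fmt.Sprintf("localhost:%s", ch.u.Port()))
		case <-time.After(timeout):
			return nil, fmt.Errorf("node %d was started but did not report a pgURL within %s", otherNodeID, timeout)
		}
	}
	return joinAddrs, nil
}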


testserver/testservernode.go line 74 at r1 (raw file):

Previously, ZhouXing19 (Jane Xing) wrote…

nvm I just saw the log in the if currCmd.Process != nil {} below

done


testserver/testservernode.go line 78 at r1 (raw file):

Previously, ZhouXing19 (Jane Xing) wrote…

ditto, just saw the log below

done


testserver/testservernode.go line 84 at r1 (raw file):

Previously, ZhouXing19 (Jane Xing) wrote…

just curious why we need this ...? Can't we just not set the join flag?

The start command requires the --join flag.


testserver/testservernode.go line 127 at r1 (raw file):

Previously, ZhouXing19 (Jane Xing) wrote…

Should we also include the env vars in the log?

sure, sgtm

Contributor

@renatolabs renatolabs left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @pawalt, @rafiss, @smg260, and @ZhouXing19)


testserver/testserver.go line 127 at r2 (raw file):

type pgURLChan struct {
	// started will be closed after the start command is executed.
	started chan struct{}

Wouldn't this be more clearly expressed with a bool? What am I missing?

Contributor Author

@rafiss rafiss left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @pawalt, @renatolabs, @smg260, and @ZhouXing19)


testserver/testserver.go line 127 at r2 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Wouldn't this be more clearly expressed with a bool? What am I missing?

It's possible for two different goroutines to access this. E.g., in TestRestartNodeParallel, one goroutine could be writing the bool at the same time another goroutine is reading it. Or do you mean use an AtomicBool?

Contributor

@renatolabs renatolabs left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @pawalt, @rafiss, @smg260, and @ZhouXing19)


testserver/testserver.go line 127 at r2 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

It's possible for two different goroutines to access this. E.g., in TestRestartNodeParallel, one goroutine could be writing the bool at the same time another goroutine is reading it. Or do you mean use an AtomicBool?

I meant a regular bool. The testserver keeps one pgURLChan per node, and it seems our contract is that we don't call StartNode and StopNode for the same node concurrently. When do we have a race?

Contributor Author

@rafiss rafiss left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @pawalt, @renatolabs, @smg260, and @ZhouXing19)


testserver/testserver.go line 127 at r2 (raw file):

Previously, renatolabs (Renato Costa) wrote…

I meant a regular bool. The testserver keeps one pgURLChan per node, and it seems our contract is that we don't call StartNode and StopNode for the same node concurrently. When do we have a race?

We have a race because calling StartNode on one node makes this code look at pgURLChan from another node.

Suppose node 1 is started concurrently with node 2 being stopped then:

Contributor

@renatolabs renatolabs left a comment


:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @pawalt, @rafiss, @smg260, and @ZhouXing19)


testserver/testserver.go line 127 at r2 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

We have a race because calling StartNode on one node makes this code look at pgURLChan from another node.

Suppose node 1 is started concurrently with node 2 being stopped then:

Doh, of course!
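
For readers following along, a hedged illustration (not code from this PR) of why the closed channel resolves the race: with a plain bool, StartNode on node 1 could read node 2's flag while node 2 is concurrently being stopped and restarted, which is a data race; closing a channel and receiving from it is synchronized, and the channel can also be polled without blocking. The helper methods below are hypothetical.

package testserver

type pgURLChan struct {
	// started is closed after the start command for the node is executed.
	// close is a one-shot, race-free broadcast: any number of goroutines may
	// poll or receive from the channel concurrently, so StartNode of one node
	// can safely inspect the state of every other node.
	started chan struct{}
}

// markStarted records that the node's start command has been executed.
func (c *pgURLChan) markStarted() { close(c.started) }

// hasStarted polls the signal without blocking.
func (c *pgURLChan) hasStarted() bool {
	select {
	case <-c.started:
		return true
	default:
		return false
	}
}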

@rafiss rafiss merged commit 2c9d026 into cockroachdb:master Mar 16, 2023
2 checks passed
@rafiss
Contributor Author

rafiss commented Mar 16, 2023

tftrs!
