Allow non-hardcoded ports for multinode cluster #167
Conversation
Force-pushed from 8539e8c to c087468.
LGTM! I just left comments about adding more logs.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @pawalt, @rafiss, @renatolabs, and @smg260)
testserver/testservernode.go
line 74 at r1 (raw file):
```go
		joinAddrs = append(joinAddrs, fmt.Sprintf("localhost:%s", ts.pgURL[otherNodeID].u.Port()))
	case <-time.After(5 * time.Second):
		// If the other node hasn't started yet, don't add the join arg.
```
nit: might be helpful for future debugging if we can log something here
And should we also retry connecting to this node?
testserver/testservernode.go
line 78 at r1 (raw file):
```go
	}
	joinArg := fmt.Sprintf("--join=%s", strings.Join(joinAddrs, ","))
```
nit: can we add a log after exiting the loop, just to show which nodes will eventually be joined? Or maybe more generally, just print out the whole startCmd right before executing it, for debugging purposes.
testserver/testservernode.go
line 84 at r1 (raw file):
```go
	// The start command always requires a --join arg, so we fake one
	// if we don't have any yet.
	joinArg = "--join=localhost:0"
```
Just curious why we need this...? Can't we just not set the join flag?
Thank you for these fixes! Mostly LGTM except for a few questions.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @pawalt, @rafiss, and @smg260)
testserver/testservernode.go
line 71 at r1 (raw file):
```go
	}
	select {
	case <-ts.pgURL[otherNodeID].set:
```
What are the concurrency guarantees we attempt to make with `testserver`? I see there's some effort in these files to guard against concurrent access, but with this code, it would be problematic if we called `StartNode` and `StopNode` concurrently. Is that a problem?
testserver/testservernode.go
line 73 at r1 (raw file):
```go
	case <-ts.pgURL[otherNodeID].set:
		joinAddrs = append(joinAddrs, fmt.Sprintf("localhost:%s", ts.pgURL[otherNodeID].u.Port()))
	case <-time.After(5 * time.Second):
```
IIUC, this "timeout" is here to filter out nodes that have not been started yet, right? Is there any way we could reflect that with `ts.pgURL` instead? It seems right now we are creating a fixed-size slice on startup, meaning there's no way, at this point, to know which nodes we have attempted to start and which we have not. IMO, this would be more reliable if we knew which nodes we should wait for, and then failed if we don't hear their PGURL after some time (5s or 10s).
Another thing is that if we wanted to start a (say) 10-node server, wouldn't it mean that we would spend a lot of time sleeping in this loop?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @pawalt, @renatolabs, @smg260, and @ZhouXing19)
testserver/testservernode.go
line 71 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
What are the concurrency guarantees we attempt to make with `testserver`? I see there's some effort in these files to guard against concurrent access, but with this code, it would be problematic if we called `StartNode` and `StopNode` concurrently. Is that a problem?
There's one test called `TestRestartNodeParallel`, and I took that to be the "contract" offered by this package. It does concurrently call `StopNode` and `StartNode`, but the same node is never started and stopped concurrently.
testserver/testservernode.go
line 73 at r1 (raw file):
IIUC, this "timeout" is here to filter out nodes that have not been started yet, right?
that's right.
Is there any way we could reflect that with `ts.pgURL` instead?
hm, that makes sense. we can add another channel there
Another thing is that if we wanted to start a (say) 10-node server, wouldn't it mean that we would spend a lot of time sleeping in this loop?
Yeah, I did notice this. Initially I had this as a `default:` case, but that made the `TestRestartNodeParallel` test (see above) fail, since everything would be unable to find a join port.
testserver/testservernode.go
line 74 at r1 (raw file):
Previously, ZhouXing19 (Jane Xing) wrote…
nvm, I just saw the log in the `if currCmd.Process != nil {}` below
done
testserver/testservernode.go
line 78 at r1 (raw file):
Previously, ZhouXing19 (Jane Xing) wrote…
ditto, just saw the log below
done
testserver/testservernode.go
line 84 at r1 (raw file):
Previously, ZhouXing19 (Jane Xing) wrote…
Just curious why we need this...? Can't we just not set the join flag?
The `start` command requires the join flag.
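The fallback can be sketched like this; `joinArgFor` is a hypothetical helper name (the real code inlines this logic), but the `--join=localhost:0` placeholder is the one quoted above:

```go
package main

import (
	"fmt"
	"strings"
)

// joinArgFor builds the --join flag. `cockroach start` refuses to run
// without one, so when no peer address is known yet we fake an address
// that will never be dialed successfully but satisfies the flag.
func joinArgFor(addrs []string) string {
	if len(addrs) == 0 {
		return "--join=localhost:0"
	}
	return fmt.Sprintf("--join=%s", strings.Join(addrs, ","))
}

func main() {
	fmt.Println(joinArgFor(nil))                                       // --join=localhost:0
	fmt.Println(joinArgFor([]string{"localhost:26257", "localhost:26258"})) // --join=localhost:26257,localhost:26258
}
```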
testserver/testservernode.go
line 127 at r1 (raw file):
Previously, ZhouXing19 (Jane Xing) wrote…
Should we also include the env vars in the log?
sure, sgtm
This required the join and init args to be computed on the fly, since we can no longer infer them until after a node has started.
Force-pushed from c087468 to 94807f6.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @pawalt, @rafiss, @smg260, and @ZhouXing19)
testserver/testserver.go
line 127 at r2 (raw file):
```go
type pgURLChan struct {
	// started will be closed after the start command is executed.
	started chan struct{}
```
Wouldn't this be more clearly expressed with a `bool`? What am I missing?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @pawalt, @renatolabs, @smg260, and @ZhouXing19)
testserver/testserver.go
line 127 at r2 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Wouldn't this be more clearly expressed with a `bool`? What am I missing?
It's possible for two different goroutines to access this, e.g. in `TestRestartNodeParallel` one goroutine could be writing at the same time as another is reading the bool. Or do you mean use an `AtomicBool`?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @pawalt, @rafiss, @smg260, and @ZhouXing19)
testserver/testserver.go
line 127 at r2 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
It's possible for two different goroutines to access this, e.g. in `TestRestartNodeParallel` one goroutine could be writing at the same time as another is reading the bool. Or do you mean use an `AtomicBool`?
I meant a regular `bool`. The testserver keeps one `pgURLChan` per node, and it seems our contract is that we don't call `StartNode` and `StopNode` for the same node concurrently. When do we have a race?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @pawalt, @renatolabs, @smg260, and @ZhouXing19)
testserver/testserver.go
line 127 at r2 (raw file):
Previously, renatolabs (Renato Costa) wrote…
I meant a regular `bool`. The testserver keeps one `pgURLChan` per node, and it seems our contract is that we don't call `StartNode` and `StopNode` for the same node concurrently. When do we have a race?
We have a race because calling `StartNode` on one node makes this code look at the `pgURLChan` from another node. Suppose node 1 is started concurrently with node 2 being stopped. Then:
- node 1 could be reading the value here: https://github.com/rafiss/cockroach-go/blob/94807f6139a9b1c624e989912917aedf02f360af/testserver/testservernode.go#L73
- node 2 could be writing the value here: https://github.com/rafiss/cockroach-go/blob/94807f6139a9b1c624e989912917aedf02f360af/testserver/testservernode.go#L44
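This is why a `chan struct{}` that is closed once works where a plain `bool` would race: closing a channel is a race-free signal that any number of goroutines can observe without locking. A minimal sketch (names are illustrative, not the package's code):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	started := make(chan struct{}) // plays the role of pgURLChan.started
	var wg sync.WaitGroup
	seen := make(chan bool, 3)
	// Several StartNode calls for other nodes may all consult this node's
	// channel concurrently; reading a closed channel needs no mutex.
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			select {
			case <-started:
				seen <- true
			case <-time.After(time.Second):
				seen <- false
			}
		}()
	}
	close(started) // the start path signals exactly once; close is race-free
	wg.Wait()
	close(seen)
	up := 0
	for ok := range seen {
		if ok {
			up++
		}
	}
	fmt.Println(up) // all three readers observed the signal
}
```

With an unsynchronized `bool` in place of the channel, the same pattern reported above would be flagged by `go test -race`.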
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @pawalt, @rafiss, @smg260, and @ZhouXing19)
testserver/testserver.go
line 127 at r2 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
We have a race because calling `StartNode` on one node makes this code look at the `pgURLChan` from another node. Suppose node 1 is started concurrently with node 2 being stopped. Then:
- node 1 could be reading the value here: https://github.com/rafiss/cockroach-go/blob/94807f6139a9b1c624e989912917aedf02f360af/testserver/testservernode.go#L73
- node 2 could be writing the value here: https://github.com/rafiss/cockroach-go/blob/94807f6139a9b1c624e989912917aedf02f360af/testserver/testservernode.go#L44
Doh, of course!
tftrs!
This required the join and init args to be computed on the fly, since we can no longer infer them until after a node has started.
informs cockroachdb/cockroach#98549
informs cockroachdb/cockroach#98612
informs cockroachdb/cockroach#98594