
roachtest: tpccbench/nodes=9/cpu=4/multi-region failed [self-delegated snaps] #72083

Closed
cockroach-teamcity opened this issue Oct 28, 2021 · 188 comments · Fixed by #88641
Labels
branch-master Failures on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. S-1 High impact: many users impacted, serious risk of high unavailability or data loss sync-me-8 T-kv KV Team
Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Oct 28, 2021

roachtest.tpccbench/nodes=9/cpu=4/multi-region failed with artifacts on master @ d91fead28392841a943251842fbd43a0affb2eca:

		  | main.(*monitorImpl).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:116
		  | main.(*monitorImpl).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:124
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:1071
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:905
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:777
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 11: dead (exit status 137)
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString

	cluster.go:1300,context.go:91,cluster.go:1288,test_runner.go:866: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-3647465-1635401487-34-n12cpu4-geo --oneshot --ignore-empty-nodes: exit status 1 4: skipped
		1: 13290
		3: 12804
		2: 13472
		8: skipped
		6: 11783
		12: skipped
		7: 11853
		11: dead (exit status 137)
		9: 11501
		5: 12334
		10: 11563
		Error: UNCLASSIFIED_PROBLEM: 11: dead (exit status 137)
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1175
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:281
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:856
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:960
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:897
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:2104
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:225
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (3) 11: dead (exit status 137)
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError
Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-10940

@cockroach-teamcity cockroach-teamcity added branch-master Failures on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Oct 28, 2021
@cockroach-teamcity cockroach-teamcity added this to roachtest/unit test backlog in KV Oct 28, 2021
@AlexTalks
Contributor

It is surprising that we are still seeing OOMs on this test despite merging #71132; this is potentially related to #71802.

@AlexTalks AlexTalks removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Oct 29, 2021
@tbg
Member

tbg commented Nov 4, 2021

(screenshot: memory profile)

https://share.polarsignals.com/73a06c8/

@erikgrinaker this seems to be something we should be looking into more actively. It is "sort of" expected that we're seeing lots of memory held up by sideloaded proposals; after all, this phase of the test mostly crams lots of SSTs into our log and then asks us to send them to two followers, which are possibly also a region hop away. But something seems to have changed: we didn't use to see this, #71132 hasn't prevented it from happening, and I looked before and couldn't find any other obvious leaks. So my current expectation is that we'll find we have a lot of groups catching up followers at once, overwhelming the system. If that is the case, it would be difficult to even think of a quick fix. We would need to either delay adding new entries to the log or delay sending entries to followers. The latter happens inside of raft, so the easier choice is the former. The question then becomes: do we apply it to SSTs only, or to all proposals? SSTs are easier, since there is already a concept of delaying them, and they are not that sensitive to it. But first we need to confirm that what I'm describing is really what we're seeing.
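A minimal sketch of the "delay adding new SSTs to the log" option, purely for illustration: AddSSTable proposals acquire from a shared byte budget before being handed to raft, so a catch-up burst can only hold a bounded amount of memory. The names here (sstBudget, proposeSST, the 256 MiB cap) are hypothetical and not CockroachDB's actual implementation.

package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/semaphore"
)

// sstBudget bounds the total size of sideloaded (SST) proposals in flight at
// once. Proposals block in Acquire until budget frees up, i.e. we delay
// appending new SSTs to the log rather than delaying sends inside raft.
type sstBudget struct {
	sem *semaphore.Weighted
}

func newSSTBudget(maxBytes int64) *sstBudget {
	return &sstBudget{sem: semaphore.NewWeighted(maxBytes)}
}

func (b *sstBudget) Acquire(ctx context.Context, size int64) error {
	return b.sem.Acquire(ctx, size)
}

func (b *sstBudget) Release(size int64) {
	b.sem.Release(size)
}

func main() {
	budget := newSSTBudget(256 << 20) // hypothetical 256 MiB cap on in-flight SST bytes
	ctx := context.Background()

	proposeSST := func(size int64) {
		// Block (or fail on ctx cancellation) until the SST fits in the budget.
		if err := budget.Acquire(ctx, size); err != nil {
			fmt.Println("proposal abandoned:", err)
			return
		}
		fmt.Printf("proposing %d MiB SST\n", size>>20)
		// A real system would release only once the entry no longer holds
		// memory (e.g. after it is sent/applied); here we release immediately
		// to keep the example self-contained.
		budget.Release(size)
	}
	proposeSST(16 << 20)
}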

@erikgrinaker
Contributor

Yeah, this seems bad. We seem to be enforcing per-range size limits that should mostly prevent this, so I agree that this seems likely to be because we're catching up many groups at once.

Would it be worth bisecting this to find out what triggered it?

@erikgrinaker erikgrinaker added this to Incoming in Replication via automation Nov 4, 2021
@erikgrinaker erikgrinaker removed this from roachtest/unit test backlog in KV Nov 4, 2021
@erikgrinaker erikgrinaker changed the title roachtest: tpccbench/nodes=9/cpu=4/multi-region failed roachtest: tpccbench/nodes=9/cpu=4/multi-region failed [raft oom] Nov 4, 2021
@tbg
Member

tbg commented Nov 4, 2021

Hard to say; it sure would be nice to know the commit, if there is one. On the other hand, bisecting would likely be extremely painful. I think I did hundreds of runs when working on #69414, though, and never saw the OOM there. That was based on ab1fc34, so I think that would be our "good" commit (though it has the inconsistency). Now, when did I first see this OOM? I think it was in #71050. Note that it isn't the exact same OOM (there the memory was held in the inefficiency fixed in #71132), but I think it is fundamentally the same issue.

Hmm, maybe it's fine? It really depends on how clean the repro loop is. I think we should bisect with import/tpcc/warehouses=4000/geo, since tpccbench does lots of stuff not related to the import (assuming it even gets past the import). import/tpcc takes around an hour, so we should be able to see something. I might take this as the excuse to get #70435 back into shape and see how far we can get.

tobias@td:~/go/src/github.com/cockroachdb/cockroach$ git bisect start
tobias@td:~/go/src/github.com/cockroachdb/cockroach$ git bisect good ab1fc34
tobias@td:~/go/src/github.com/cockroachdb/cockroach$ git bisect bad d1231cff60125b397ccce6c79c9aeea771cdcca4
Bisecting: 311 revisions left to test after this (roughly 8 steps)
warning: unable to rmdir 'pkg/ui/yarn-vendor': Directory not empty
Submodule path 'vendor': checked out 'fcef703fb087367037cfd20f9576875c2cec9092'
[ecffc89299760b8bf5f966030fd524475b4095ca] kv: deflake and unskip TestPushTxnUpgradeExistingTxn

edit: test balloon launched,

BRANCH=release-21.2 SHA=$(git rev-parse HEAD) TEST=import/tpcc/warehouses=4000/geo COUNT=1 ~/roachstress-ci.sh

https://teamcity.cockroachdb.com/viewLog.html?buildId=3683316&

@tbg
Member

tbg commented Nov 4, 2021

Ok, the roachstress-CI thing seems to work. Going to log the bisect here and update as I make progress.

I'm using

BRANCH=release-21.2 SHA=$(git rev-parse HEAD) TEST=import/tpcc/warehouses=4000/geo COUNT=50 ~/roachstress-ci.sh

d1231cf (confirming starting bad commit): https://teamcity.cockroachdb.com/viewQueued.html?itemId=3683412, we expect this to produce the failure
ab1fc34 (confirming starting good commit): https://teamcity.cockroachdb.com/viewQueued.html?itemId=3683413, this should not produce the failure
ecffc89 (bisect step 1): https://teamcity.cockroachdb.com/viewLog.html?buildId=3683411&

@tbg
Member

tbg commented Nov 4, 2021

Hmm, so stressing this test (import/tpcc/warehouses=4000/geo) worked great; the problem is that all 50 runs passed on all three commits.

@tbg
Member

tbg commented Nov 4, 2021

Screw it, going to try stressing tpccbench as is. I don't have it in me to patch each commit to just do the import, etc.; let's see what we get.

@tbg
Member

tbg commented Nov 4, 2021

@tbg
Member

tbg commented Nov 5, 2021

They all passed too. We were supposed to see an OOM here.

@erikgrinaker
Contributor

Interesting, I suppose there must have been aggravating circumstances in the initial failure -- perhaps a failure mode that caused concurrent AddSSTable requests to pile up.

I had a look at the debug.zip, and noticed that we have several nodes with ~200 outbound snapshots in progress concurrently:

 $ grep 'kvserver.sendSnapshot' */stacks.txt | cut -f 1 -d / | uniq -c
      2 1
    165 4
    188 6
    195 7
    203 8

All of these appear to come via Replica.adminScatter. I'm speculating here, but it seems plausible that if this many ranges were seeing concurrent AddSSTable traffic, then after the snapshots were applied we'd have to catch up ~200 ranges with AddSSTable entries. 3 GB / 200 ranges works out to about 15 MB per range, which is in the right ballpark.

@tbg
Member

tbg commented Nov 8, 2021

Just for the record, if we wanted to limit the size of the messages, we'd have to plumb something down into raft, onto this line:

https://github.com/cockroachdb/vendored/blob/master/go.etcd.io/etcd/raft/v3/raft.go#L435

Instead of a fixed maxMsgSize, we would need to pass in an interface that dynamically limits the budget, i.e. something like:

type limiter interface {
  Request(size uint64) bool
}

and if the limiter returns false, we don't send anything else. The main new behavior is that maybeSendAppend may end up sending nothing even though there is something that should be sent (in the current implementation, it will send at least one entry in that case); I'm not sure if that causes problems for any of the (few) callers. We'd also have to think about starvation: one very busy raft group may starve out another that is "just trying to send a single SST". So the underlying implementation would have to "remember" a failed call, on the assumption that the call will happen again soon. But we also need to figure out how long to wait before trying again. It's not entirely straightforward to set all of this up.
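To make the starvation concern concrete, here is a rough sketch of what such a dynamically-budgeted limiter could look like, with a crude "remember the last denied group" rule so that a quiet group eventually gets its turn. Everything here (memoryBudget, groupLimiter, RequestFor) is hypothetical and is not the etcd/raft or CockroachDB API.

package main

import (
	"fmt"
	"sync"
)

// limiter is the interface that would replace the fixed maxMsgSize inside
// maybeSendAppend: the caller asks for `size` bytes and either gets them or
// is told to back off and try again later.
type limiter interface {
	Request(size uint64) bool
}

// memoryBudget is shared across raft groups. The group denied most recently
// gets first claim on freed-up budget, so one very busy group cannot
// permanently starve out another that just wants to send one SST.
type memoryBudget struct {
	mu        sync.Mutex
	remaining uint64
	deferred  int64 // group denied most recently; 0 means none
}

func (b *memoryBudget) RequestFor(groupID int64, size uint64) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.deferred != 0 && b.deferred != groupID {
		return false // another group is waiting; let it go first
	}
	if size > b.remaining {
		b.deferred = groupID // remember us, assuming the call repeats soon
		return false
	}
	b.deferred = 0
	b.remaining -= size
	return true
}

func (b *memoryBudget) Release(size uint64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.remaining += size
}

// groupLimiter adapts the shared budget to the per-group limiter interface
// that raft's append path would consult.
type groupLimiter struct {
	groupID int64
	budget  *memoryBudget
}

var _ limiter = groupLimiter{}

func (g groupLimiter) Request(size uint64) bool { return g.budget.RequestFor(g.groupID, size) }

func main() {
	budget := &memoryBudget{remaining: 32 << 20}
	busy, quiet := groupLimiter{1, budget}, groupLimiter{2, budget}

	fmt.Println(busy.Request(30 << 20))  // true: fits in the budget
	fmt.Println(quiet.Request(16 << 20)) // false: over budget, group 2 remembered
	budget.Release(30 << 20)             // entries sent/acked, budget returns
	fmt.Println(busy.Request(8 << 20))   // false: group 2 has first claim
	fmt.Println(quiet.Request(16 << 20)) // true: the quiet group finally sends
}

A real version would also need a way to wake the deferred group up for a retry, which is exactly the part that isn't straightforward.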

@cockroach-teamcity

This comment has been minimized.

@tbg
Member

tbg commented Nov 11, 2021

Last failure is [perm denied #72635]

@cockroach-teamcity

This comment has been minimized.

@cockroach-teamcity

This comment has been minimized.

@cockroach-teamcity

This comment has been minimized.

@cockroach-teamcity

This comment has been minimized.

@cockroach-teamcity

This comment has been minimized.

@pav-kv
Collaborator

pav-kv commented Sep 23, 2022

@andrewbaptist @erikgrinaker Thanks for the heads-up. The grpc bump to v1.47 is problematic for other reasons too (#81881), so I will upgrade to v1.49 soon. The issue you caught seems similar to grpc/grpc-go#5512, which was also fixed in v1.49.

@cockroach-teamcity
Member Author

roachtest.tpccbench/nodes=9/cpu=4/multi-region failed with artifacts on master @ 34dc56fbb5789b39be47b110bf22332c7f5654f6:

test artifacts and logs in: /artifacts/tpccbench/nodes=9/cpu=4/multi-region/run_1
	monitor.go:127,tpcc.go:1113,tpcc.go:950,test_runner.go:928: monitor failure: monitor task failed: Non-zero exit code: 1
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runTPCCBench
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:1113
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCCBenchSpec.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:950
		  | [...repeated from below...]
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	main/pkg/cmd/roachtest/monitor.go:171
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1594
		Wraps: (4) monitor task failed
		Wraps: (5) Non-zero exit code: 1
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *install.NonZeroExitCode

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

This test on roachdash | Improve this report!

craig bot pushed a commit that referenced this issue Sep 23, 2022
87533: sqlliveness: add timeouts to heartbeats r=ajwerner a=aadityasondhi

Previously, sqlliveness heartbeat operations could block on the transactions that were involved. This change introduces timeouts, equal to the heartbeat length, on the create and refresh operations.

Resolves #85541

Release note: None

Release justification: low-risk bugfix to existing functionality

88293: backupccl: elide expensive ShowCreate call in SHOW BACKUP r=stevendanna a=adityamaru

In #88376 we see the call to `ShowCreate` taking ~all the time on a cluster with
2.5K empty tables. In all cases except `SHOW BACKUP SCHEMAS` we do not
need to construct the SQL representation of the table's schema. This
results in a marked improvement in the performance of `SHOW BACKUP`
as can be seen in #88376 (comment).

Fixes: #88376

Release note (performance improvement): `SHOW BACKUP` on a backup containing
several table descriptors is now more performant

88471: sql/schemachanger: plumb context, check for cancelation sometimes r=ajwerner a=ajwerner

Fixes #87246

This will also improve tracing.

Release note: None

88557: testserver: add ShareMostTestingKnobsWithTenant option r=msbutler a=stevendanna

The new ShareMostTestingKnobs copies nearly all of the testing knobs specified for a TestServer to any tenant started for that server.

The goal here is to make it easier to write tests that depend on testing hooks that work under probabilistic tenant testing.

Release justification: non-production code change

Release note: None

88562: upgrade grpc to v.1.49.0 r=erikgrinaker a=pavelkalinnikov

Fixes #81881
Touches #72083

Release note: upgraded grpc to v1.49.0 to fix a few panics that the old version caused

88568: sql: fix panic due to missing schema r=ajwerner a=ajwerner

A schema might not exist because it has been dropped. We need to mark the lookup as required.

Fixes #87895

Release note (bug fix): Fixed a bug in pg_catalog tables which could result in an internal error if a schema is concurrently dropped.

Co-authored-by: David Hartunian <davidh@cockroachlabs.com>
Co-authored-by: Aaditya Sondhi <aadityas@cockroachlabs.com>
Co-authored-by: adityamaru <adityamaru@gmail.com>
Co-authored-by: Andrew Werner <awerner32@gmail.com>
Co-authored-by: Steven Danna <danna@cockroachlabs.com>
Co-authored-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
@andrewbaptist
Collaborator

It looks like the issue is that admin scatter takes ~1 hour on these tests. Looking at runs that succeeded and failed, they all take on the order of 55+ minutes to complete this step, and running it manually twice confirmed this number. I'm planning to change the test's

const prepareTimeout = 60 * time.Minute
to allow a longer timeout.
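
For reference, the eventual fix in #88641 ("workload: Bump prepare timeout to 90 minute") boils down to raising this constant; a trivial sketch (the real change lives in the workload code, which is elided here):

package main

import (
	"fmt"
	"time"
)

// The prepare phase (which includes the scatter) was observed to take between
// roughly 55 and 78 minutes, so the previous 60-minute ceiling left no headroom.
const prepareTimeout = 90 * time.Minute // previously 60 * time.Minute

func main() {
	fmt.Println("prepare timeout:", prepareTimeout)
}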

andrewbaptist added a commit that referenced this issue Sep 24, 2022
Relates to #72083. Allow scatter to complete.

Release note: None
@andrewbaptist andrewbaptist linked a pull request Sep 24, 2022 that will close this issue
@andrewbaptist
Collaborator

The issue was that scatter was taking between 55 minutes and 1 hour 18 minutes to complete (based on running the test 10 times). Nothing was hung, however, and the tests all completed successfully after bumping the timeout. Since this is only a test timing issue, it would make sense to backport the change to 22.2 (and probably also remove the release-blocker label).

In the grep output below, the scatter time is roughly the gap between the two timestamps in each run minus the workload's 15 minute ramp and 45 minute run duration.

$ grep -A1 "cockroach workload run tpcc" */test.log
run_1/test.log:20:19:03 cluster.go:2018: > ./cockroach workload run tpcc --warehouses=3000 --workers=3000 --max-rate=490 --wait=false --ramp=15m0s --duration=45m0s --scatter --tolerate-errors {pgurl:1-3,5-7,9-11}
run_1/test.log-22:24:00 tpcc.go:1139: initializing cluster for 2000 warehouses (search attempt: 1)
--
run_10/test.log:20:28:20 cluster.go:2018: > ./cockroach workload run tpcc --warehouses=3000 --workers=3000 --max-rate=490 --wait=false --ramp=15m0s --duration=45m0s --scatter --tolerate-errors {pgurl:1-3,5-7,9-11}
run_10/test.log-22:40:26 tpcc.go:1139: initializing cluster for 2000 warehouses (search attempt: 1)
--
run_2/test.log:20:17:44 cluster.go:2018: > ./cockroach workload run tpcc --warehouses=3000 --workers=3000 --max-rate=490 --wait=false --ramp=15m0s --duration=45m0s --scatter --tolerate-errors {pgurl:1-3,5-7,9-11}
run_2/test.log-22:35:12 tpcc.go:1139: initializing cluster for 2000 warehouses (search attempt: 1)
--
run_3/test.log:20:18:55 cluster.go:2018: > ./cockroach workload run tpcc --warehouses=3000 --workers=3000 --max-rate=490 --wait=false --ramp=15m0s --duration=45m0s --scatter --tolerate-errors {pgurl:1-3,5-7,9-11}
run_3/test.log-22:30:35 tpcc.go:1139: initializing cluster for 2000 warehouses (search attempt: 1)
--
run_4/test.log:20:26:13 cluster.go:2018: > ./cockroach workload run tpcc --warehouses=3000 --workers=3000 --max-rate=490 --wait=false --ramp=15m0s --duration=45m0s --scatter --tolerate-errors {pgurl:1-3,5-7,9-11}
run_4/test.log-22:28:04 tpcc.go:1139: initializing cluster for 2000 warehouses (search attempt: 1)
--
run_5/test.log:20:22:22 cluster.go:2018: > ./cockroach workload run tpcc --warehouses=3000 --workers=3000 --max-rate=490 --wait=false --ramp=15m0s --duration=45m0s --scatter --tolerate-errors {pgurl:1-3,5-7,9-11}
run_5/test.log-22:29:30 tpcc.go:1139: initializing cluster for 2000 warehouses (search attempt: 1)
--
run_6/test.log:20:24:04 cluster.go:2018: > ./cockroach workload run tpcc --warehouses=3000 --workers=3000 --max-rate=490 --wait=false --ramp=15m0s --duration=45m0s --scatter --tolerate-errors {pgurl:1-3,5-7,9-11}
run_6/test.log-22:27:54 tpcc.go:1139: initializing cluster for 2000 warehouses (search attempt: 1)
--
run_7/test.log:20:18:38 cluster.go:2018: > ./cockroach workload run tpcc --warehouses=3000 --workers=3000 --max-rate=490 --wait=false --ramp=15m0s --duration=45m0s --scatter --tolerate-errors {pgurl:1-3,5-7,9-11}
run_7/test.log-22:13:23 tpcc.go:1139: initializing cluster for 2000 warehouses (search attempt: 1)
--
run_8/test.log:20:17:41 cluster.go:2018: > ./cockroach workload run tpcc --warehouses=3000 --workers=3000 --max-rate=490 --wait=false --ramp=15m0s --duration=45m0s --scatter --tolerate-errors {pgurl:1-3,5-7,9-11}
run_8/test.log-22:24:49 tpcc.go:1139: initializing cluster for 2000 warehouses (search attempt: 1)
--
run_9/test.log:20:20:39 cluster.go:2018: > ./cockroach workload run tpcc --warehouses=3000 --workers=3000 --max-rate=490 --wait=false --ramp=15m0s --duration=45m0s --scatter --tolerate-errors {pgurl:1-3,5-7,9-11}
run_9/test.log-22:34:23 tpcc.go:1139: initializing cluster for 2000 warehouses (search attempt: 1)

@erikgrinaker
Contributor

The issue was that scatter was taking between 55 minutes and 1 hour 18 minutes to complete (based on running the test 10 times). Nothing was hung, however, and the tests all completed successfully after bumping the timeout.

Do we know why this started failing a month ago? Is this now expected behavior, or is the slowdown pathological?

@andrewbaptist
Collaborator

As best I can tell, the test was working consistently until about May 26th. After that it didn't run again until July 8th (although I'm not sure why). From July 8th until now, it has been failing about half the time.
https://teamcity.cockroachdb.com/test/-7667002519850730298?currentProjectId=Cockroach_Nightlies

I tried to find a related cause, but there have been a lot of changes between then and now. Even when it was succeeding before, the test ran for about the same total time as now (4-5 hours), so I'm not sure what exactly changed.

@erikgrinaker
Contributor

erikgrinaker commented Sep 26, 2022

Ok, thanks. If we're sure the 1 hour+ scatter times here are expected then I suggest we close this out with the timeout bump, and deal with any new failures separately. Thanks for looking into this!

@andrewbaptist andrewbaptist removed GA-blocker S-1 High impact: many users impacted, serious risk of high unavailability or data loss labels Sep 26, 2022
@andrewbaptist
Collaborator

Sounds good; merging this timeout change. This is definitely a good test to have running consistently. It is possible that scatter times have gotten slightly worse: scatter was completely rewritten about 6-9 months ago, so fixes to it over the past few months may be related. It is also a strange operation that should be re-examined in the near future, as it runs "out-of-band" of other things.

craig bot pushed a commit that referenced this issue Sep 26, 2022
88550: kvserver: use execution timestamps for verified when available r=erikgrinaker a=tbg

Now that "most" operations save their execution timestamps, use them
for verification.

This has the undesirable side effect of failing the entire test suite,
which didn't bother specifying timestamps for most operations.

Now they are required, and need to be present, at least for all
mutations.

I took the opportunity to also clean up the test helpers a bit,
so now we don't have to pass an `error` when it's not required.

The big remaining caveat is that units that return with an ambiguous
result don't necessarily have a commit timestamp. I *think* this is only
an implementation detail. We *could* ensure that `AmbiguousResultError`
always contains the one possible commit timestamp. This should work
since `TxnCoordSender` is always local to `kvnemesis`, and so there's
no "fallible" component between the two.

This would result in a significant simplification of `kvnemesis`, since
as is when there are ambiguous deletions, we have to materialize them
but cannot assign them a timestamp. This complicates various code paths
and to be honest I'm not even sure what exactly we verify and how it all
works when there are such "half-materialized" writes. I would rather do
away with the concept altogether. Clearly we also won't be able to
simplify the verification to simply use commit order if there are
operations that don't have a timestamp, which is another reason to keep
pushing on this.

Release note: None


88641: workload: Bump prepare timeout to 90 minute r=aayushshah15 a=andrewbaptist

Relates to #72083. Allow scatter to complete.

Release note: None

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Andrew Baptist <baptist@cockroachlabs.com>
@craig craig bot closed this as completed in #88641 Sep 26, 2022
KV automation moved this from Prioritized to Closed Sep 26, 2022
blathers-crl bot pushed a commit that referenced this issue Sep 30, 2022
Relates to #72083. Allow scatter to complete.

Release note: None
@srosenberg
Member

The issue was that scatter was taking between 55 minutes and 1 hour 18 minutes to complete (based on running the test 10 times). Nothing was hung, however, and the tests all completed successfully after bumping the timeout. Since this is only a test timing issue, it would make sense to backport the change to 22.2 (and probably also remove the release-blocker label).

Two follow-up questions:

  • Is --scatter a prerequisite for this test? I.e., the preceding step, fixtures import, already does that to a degree. Is the resulting distribution of ranges after the import not sufficiently uniform?
  • Isn't --scatter taking longer now a KV regression?

@andrewbaptist
Collaborator

These are great questions, and unfortunately I don't know the full answer to either of them.

  1. It would certainly be better to remove the call to scatter from this test, since it is artificial and bypasses much of the normal control mechanisms; however, I don't know whether that would expose other failure modes. I think it is necessary because the data is initially written from a single location, while the goal is to simulate a system whose data is evenly read and written from all locations.

  2. The scatter implementation changed dramatically about 6 months ago, and its performance isn't something we are too concerned about. We know that most "bulk snapshot" operations (e.g. decommissioning) have gotten considerably faster between 22.1 and 22.2, in particular when running in parallel with other work on the system. It is possible these improvements caused a regression here. Aayush and I did some analysis of this scatter's performance and it appears to be running efficiently, so we were not concerned about the time it took to complete given the amount of data.

@exalate-issue-sync exalate-issue-sync bot added the S-1 High impact: many users impacted, serious risk of high unavailability or data loss label Dec 9, 2022