gossip: TestGossipOrphanedStallDetection failed #81881

Closed
cockroach-teamcity opened this issue May 26, 2022 · 15 comments · Fixed by #81987 or #88562
Labels
branch-master Failures on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-kv-replication KV Replication Team

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented May 26, 2022

gossip.TestGossipOrphanedStallDetection failed with artifacts on master @ 7a17498a9679853612cb88d82a4a3952d1015f94:

=== RUN   TestGossipOrphanedStallDetection
    gossip_test.go:658: condition failed to evaluate within 45s: n2 not yet connected
--- FAIL: TestGossipOrphanedStallDetection (45.17s)
Help

See also: How To Investigate a Go Test Failure (internal)
Parameters in this failure:

  • TAGS=bazel,gss

/cc @cockroachdb/kv

This test on roachdash | Improve this report!

Jira issue: CRDB-16105

@cockroach-teamcity cockroach-teamcity added branch-master Failures on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels May 26, 2022
@cockroach-teamcity cockroach-teamcity added this to roachtest/unit test backlog in KV May 26, 2022
@blathers-crl blathers-crl bot added the T-kv KV Team label May 26, 2022
@cockroach-teamcity
Member Author

gossip.TestGossipOrphanedStallDetection failed with artifacts on master @ 7a17498a9679853612cb88d82a4a3952d1015f94:

=== RUN   TestGossipOrphanedStallDetection
    gossip_test.go:658: condition failed to evaluate within 45s: n2 not yet connected
--- FAIL: TestGossipOrphanedStallDetection (45.17s)
Help

See also: How To Investigate a Go Test Failure (internal)
Parameters in this failure:

  • TAGS=bazel,gss,deadlock

This test on roachdash | Improve this report!

@andreimatei
Contributor

happened to me in CI :S

@nvanbenschoten
Copy link
Member

nvanbenschoten commented May 27, 2022

Bisected to 1600491. The bulk of that change looks ok, but the changes to some of the networking deps look suspect.

@rhu713 I'm going to assign this to you. When you get a chance, could you investigate and determine why that change is having this effect? Thanks!

@nvanbenschoten nvanbenschoten removed this from roachtest/unit test backlog in KV May 27, 2022
@nvanbenschoten nvanbenschoten added this to Triage in Disaster Recovery Backlog via automation May 27, 2022
@blathers-crl

blathers-crl bot commented May 27, 2022

cc @cockroachdb/bulk-io

@nvanbenschoten
Member

Here's what I was running during the bisect:

dev test ./pkg/gossip --filter=TestGossipOrphanedStallDetection --stress --stress-args=--maxruns=5000

@cockroach-teamcity
Member Author

gossip.TestGossipOrphanedStallDetection failed with artifacts on master @ 1e2cc61b58dc14386bb68dca59814874648931c2:

=== RUN   TestGossipOrphanedStallDetection
    gossip_test.go:658: condition failed to evaluate within 45s: n2 not yet connected
--- FAIL: TestGossipOrphanedStallDetection (45.15s)
Help

See also: How To Investigate a Go Test Failure (internal)
Parameters in this failure:

  • TAGS=bazel,gss,deadlock

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

gossip.TestGossipOrphanedStallDetection failed with artifacts on master @ e6815947a050e32f21e983aa30dc74ab2a247af3:

=== RUN   TestGossipOrphanedStallDetection
    gossip_test.go:658: condition failed to evaluate within 45s: n2 not yet connected
--- FAIL: TestGossipOrphanedStallDetection (45.15s)
Help

See also: How To Investigate a Go Test Failure (internal)
Parameters in this failure:

  • TAGS=bazel,gss

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

gossip.TestGossipOrphanedStallDetection failed with artifacts on master @ e6815947a050e32f21e983aa30dc74ab2a247af3:

=== RUN   TestGossipOrphanedStallDetection
    gossip_test.go:658: condition failed to evaluate within 45s: n2 not yet connected
--- FAIL: TestGossipOrphanedStallDetection (45.17s)
Help

See also: How To Investigate a Go Test Failure (internal)
Parameters in this failure:

  • TAGS=bazel,gss

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

gossip.TestGossipOrphanedStallDetection failed with artifacts on master @ e6815947a050e32f21e983aa30dc74ab2a247af3:

=== RUN   TestGossipOrphanedStallDetection
    gossip_test.go:658: condition failed to evaluate within 45s: n2 not yet connected
--- FAIL: TestGossipOrphanedStallDetection (45.16s)
Help

See also: How To Investigate a Go Test Failure (internal)
Parameters in this failure:

  • TAGS=bazel,gss,deadlock

This test on roachdash | Improve this report!

@pav-kv
Collaborator

pav-kv commented Sep 23, 2022

This might hit us again. I upgraded grpc to 1.47 in #88470 to fix a crash. But then I discovered this issue, and tried the manual stress-test as in #81881 (comment). It fails with the 1.47 PR, but succeeds after rolling back.

I'll try to find a later grpc version that works, as there have been a couple of releases since then.

@pav-kv
Collaborator

pav-kv commented Sep 23, 2022

The test still fails on 1.48, but succeeds on 1.49. I will upgrade grpc to 1.49, and bisect to find the cause of this bug.

@pav-kv pav-kv self-assigned this Sep 23, 2022
@pav-kv pav-kv reopened this Sep 23, 2022
Disaster Recovery Backlog automation moved this from Done to Triage Sep 23, 2022
@erikgrinaker erikgrinaker added T-kv-replication KV Replication Team and removed T-disaster-recovery labels Sep 23, 2022
@blathers-crl

blathers-crl bot commented Sep 23, 2022

cc @cockroachdb/replication

@exalate-issue-sync exalate-issue-sync bot assigned rhu713 and unassigned pav-kv Sep 23, 2022
@exalate-issue-sync exalate-issue-sync bot added T-disaster-recovery and removed T-kv-replication KV Replication Team labels Sep 23, 2022
Disaster Recovery Backlog automation moved this from Triage to Done Sep 23, 2022
@blathers-crl

blathers-crl bot commented Sep 23, 2022

cc @cockroachdb/disaster-recovery

@exalate-issue-sync exalate-issue-sync bot assigned pav-kv and unassigned rhu713 Sep 23, 2022
@exalate-issue-sync exalate-issue-sync bot reopened this Sep 23, 2022
Disaster Recovery Backlog automation moved this from Done to Triage Sep 23, 2022
@exalate-issue-sync exalate-issue-sync bot added T-kv-replication KV Replication Team and removed T-disaster-recovery labels Sep 23, 2022
@blathers-crl

blathers-crl bot commented Sep 23, 2022

cc @cockroachdb/replication

@pav-kv
Collaborator

pav-kv commented Sep 23, 2022

The root cause was fixed in grpc/grpc-go#5503 (our stress-test fails before it, and succeeds after). This reinforces my intention to upgrade to grpc@v1.49 which includes this fix, and on which the test succeeds too.
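
The search for the fixing revision described in this thread is a bisection over an ordered list of revisions, looking for the first one where the stress test passes. Here is a minimal sketch of that search, assuming the outcome is monotone (failing up to some revision, passing from there on). The `firstPassing` helper is hypothetical, and the `observed` map stands in for actually running the stress test at each revision, using only the outcomes reported above for 1.47, 1.48, and 1.49.

```go
package main

import (
	"fmt"
	"sort"
)

// firstPassing returns the index of the first revision for which passes
// reports true, or len(revs) if none does. It assumes revs is ordered
// oldest to newest and that once the test passes it keeps passing.
func firstPassing(revs []string, passes func(rev string) bool) int {
	return sort.Search(len(revs), func(i int) bool { return passes(revs[i]) })
}

func main() {
	revs := []string{"v1.47", "v1.48", "v1.49"}
	// Stand-in for running the stress test at each revision; the
	// outcomes are the ones reported in this thread.
	observed := map[string]bool{"v1.47": false, "v1.48": false, "v1.49": true}

	i := firstPassing(revs, func(rev string) bool { return observed[rev] })
	if i < len(revs) {
		fmt.Println("first passing revision:", revs[i])
	} else {
		fmt.Println("no passing revision found")
	}
}
```

The same search works at commit granularity between two releases, provided each probe runs enough stress iterations to make a false "pass" unlikely.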

craig bot pushed a commit that referenced this issue Sep 23, 2022
87533: sqlliveness: add timeouts to heartbeats r=ajwerner a=aadityasondhi

Previously, sqlliveness heartbeat operations could block on the transactions that were involved. This change introduces some timeouts of the length of the heartbeat during the create and refresh operations.

Resolves #85541

Release note: None

Release justification: low-risk bugfix to existing functionality

88293: backupccl: elide expensive ShowCreate call in SHOW BACKUP r=stevendanna a=adityamaru

In #88376 we see the call to `ShowCreate` taking ~all the time on a cluster with
2.5K empty tables. In all cases except `SHOW BACKUP SCHEMAS` we do not
need to construct the SQL representation of the table's schema. This
results in a marked improvement in the performance of `SHOW BACKUP`
as can be seen in #88376 (comment).

Fixes: #88376

Release note (performance improvement): `SHOW BACKUP` on a backup containing
several table descriptors is now more performant

88471: sql/schemachanger: plumb context, check for cancelation sometimes r=ajwerner a=ajwerner

Fixes #87246

This will also improve tracing.

Release note: None

88557: testserver: add ShareMostTestingKnobsWithTenant option r=msbutler a=stevendanna

The new ShareMostTestingKnobs copies nearly all of the testing knobs specified for a TestServer to any tenant started for that server.

The goal here is to make it easier to write tests that depend on testing hooks that work under probabilistic tenant testing.

Release justification: non-production code change

Release note: None

88562: upgrade grpc to v.1.49.0 r=erikgrinaker a=pavelkalinnikov

Fixes #81881
Touches #72083

Release note: upgraded grpc to v1.49.0 to fix a few panics that the old version caused

88568: sql: fix panic due to missing schema r=ajwerner a=ajwerner

A schema might not exist because it has been dropped. We need to mark the lookup as required.

Fixes #87895

Release note (bug fix): Fixed a bug in pg_catalog tables which could result in an internal error if a schema is concurrently dropped.

Co-authored-by: David Hartunian <davidh@cockroachlabs.com>
Co-authored-by: Aaditya Sondhi <aadityas@cockroachlabs.com>
Co-authored-by: adityamaru <adityamaru@gmail.com>
Co-authored-by: Andrew Werner <awerner32@gmail.com>
Co-authored-by: Steven Danna <danna@cockroachlabs.com>
Co-authored-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
@craig craig bot closed this as completed in #88562 Sep 23, 2022
Disaster Recovery Backlog automation moved this from Triage to Done Sep 23, 2022