
tests linearizability: reproduce and prevent 14571 #14819

Closed

Conversation


@chaochn47 chaochn47 force-pushed the linearizability_check_issue_14571 branch from b532575 to 58a78e3 Compare November 22, 2022 02:32
@chaochn47 chaochn47 closed this Nov 22, 2022
@chaochn47 chaochn47 reopened this Nov 22, 2022
@chaochn47 chaochn47 marked this pull request as draft November 22, 2022 03:17
@ptabor (Contributor) left a comment


(I'm sorry - prematurely clicked approve)

@codecov-commenter commented Dec 19, 2022

Codecov Report

Merging #14819 (6200b22) into main (6200b22) will not change coverage.
The diff coverage is n/a.

❗ Current head 6200b22 differs from pull request most recent head 1a04dcb. Consider uploading reports for the commit 1a04dcb to get more accurate results

@@           Coverage Diff           @@
##             main   #14819   +/-   ##
=======================================
  Coverage   74.87%   74.87%           
=======================================
  Files         415      415           
  Lines       34288    34288           
=======================================
  Hits        25672    25672           
  Misses       6994     6994           
  Partials     1622     1622           
Flag Coverage Δ
all 74.87% <0.00%> (ø)

Flags with carried forward coverage won't be shown.


tests/linearizability/failpoints.go (5 outdated review threads, resolved)
@chaochn47 chaochn47 force-pushed the linearizability_check_issue_14571 branch from c15f110 to a4b6d80 Compare December 20, 2022 01:29
@chaochn47 chaochn47 force-pushed the linearizability_check_issue_14571 branch 3 times, most recently from 52baa47 to b279ce6 Compare January 6, 2023 10:26
@chaochn47 chaochn47 marked this pull request as ready for review January 6, 2023 10:26
@chaochn47 chaochn47 requested a review from ptabor January 6, 2023 10:27
chaochn47 (Member Author) commented Jan 6, 2023

Tested 60 times with the v3.5.5 binary (including the experimental-snapshot-catchup-entry commit): linearizability_test.go:277: Model is not linearizable. On mainline, it is linearizable.

@chaochn47 chaochn47 force-pushed the linearizability_check_issue_14571 branch from b279ce6 to a97756a Compare January 6, 2023 10:49
ahrtr (Member) commented Jan 6, 2023

Tested 60 times with the v3.5.5 binary (including the experimental-snapshot-catchup-entry commit): linearizability_test.go:277: Model is not linearizable. On mainline, it is linearizable.

Thanks @chaochn47, please provide detailed steps to reproduce the issue. I may take a look sometime over the weekend or next week.

@chaochn47 chaochn47 force-pushed the linearizability_check_issue_14571 branch 2 times, most recently from 0e48598 to 1a04dcb Compare January 6, 2023 11:19
chaochn47 (Member Author) commented Jan 6, 2023

Hi @ahrtr, it is not a new issue. The new linearizability test case in this PR is intended to reproduce and prevent #14571. The original issue #14571 was fixed in v3.5.6.

@chaochn47 chaochn47 requested review from serathius and removed request for ptabor January 6, 2023 11:57
@chaochn47 (Member Author)

The PR is ready for review.

@ptabor @serathius Could you please take a second look, thanks!

@serathius (Member)

Please resolve conflicts.

@chaochn47 chaochn47 force-pushed the linearizability_check_issue_14571 branch 4 times, most recently from 381854f to aa9c390 Compare January 18, 2023 22:26
chaochn47 (Member Author) commented Jan 18, 2023

TestLinearizability_ClusterOfSize3

linearizability_test.go:428: Linearization timed out

https://github.com/etcd-io/etcd/actions/runs/3953458632/jobs/6769755835

@chaochn47 chaochn47 force-pushed the linearizability_check_issue_14571 branch from aa9c390 to 797e013 Compare January 19, 2023 00:03
Signed-off-by: Chao Chen <chaochn@amazon.com>
@chaochn47 chaochn47 force-pushed the linearizability_check_issue_14571 branch from 797e013 to 635d4b3 Compare January 19, 2023 00:19
@chaochn47 (Member Author)

Conflicts have been resolved. Would you mind taking a second look? @serathius Thanks!

@@ -33,7 +36,8 @@ const (
 )
 
 var (
-	KillFailpoint Failpoint = killFailpoint{}
+	KillFailpoint           Failpoint = killFailpoint{target: AnyMember}
+	EnableAuthKillFailpoint Failpoint = killFailpoint{enableAuth: true, target: Follower}
Member:

Do you need to target a follower here? I understand that #14571 happens when a follower is killed, but we don't need to hardcode it.

Member Author:

Yeah, it's not necessary to target a follower. But if the kill randomly targets the leader, clients need to wait for leader election (1-2 seconds), which increases test duration. (It was an optimization from when the test repeat count was 60.)

@@ -33,7 +36,8 @@ const (
 )
 
 var (
-	KillFailpoint Failpoint = killFailpoint{}
+	KillFailpoint           Failpoint = killFailpoint{target: AnyMember}
+	EnableAuthKillFailpoint Failpoint = killFailpoint{enableAuth: true, target: Follower}
Member:

I don't understand why a failpoint needs to be aware of auth. It's true that at this moment it creates the client, but maybe we can move client creation somewhere else. Either do dependency injection and provide the failpoint with a client that is already authorized, or add a method to e2e.EtcdProcessCluster that provides the client.

Member Author:

Good suggestion! I can explore each option and see which is the best fit.

return nil

if f.enableAuth {
require.NoError(t, addTestUserAuth(ctx, endpoints))
Member:

I don't like that auth setup is part of failpoint injection. Those are totally separate things. Please move auth setup to cluster setup.

Member Author:

It's actually a failure injection specifically for issue #14571, where the test user is not applied on the restarted member.

In short, auth traffic can be one type of failure injection. Does that make sense?

Member:

I think that enabling auth is orthogonal to failure injection.

failpoint Failpoint
config e2e.EtcdProcessClusterConfig
traffic *trafficConfig
clientCount int
@serathius (Member) commented Jan 20, 2023:

This client count doesn't seem to be used. Please remove it and use trafficConfig.clientCount.

Member Author:

Thanks for the catch. Will remove it.

@@ -124,39 +149,40 @@ func TestLinearizability(t *testing.T) {
t.Fatal(err)
}
defer clus.Close()
lg := zaptest.NewLogger(t, zaptest.WrapOptions(zap.AddCaller())).Named(tc.name)
Member:

This is nice, but please consider moving it to a separate PR.

Member Author:

Will do. It's not a big change.

for i := 0; i < config.clientCount; i++ {
i := i
Member:

This should not be needed, as we pass i as an argument to the goroutine function.

Member Author:

Yeah, you are right. It must have been left behind by a rebase from main.

if qps < config.minimalQPS {
t.Errorf("Requiring minimal %f qps for test results to be reliable, got %f qps", config.minimalQPS, qps)
}
return operations
}

func simulatePostFailpointTraffic(ctx context.Context, wg *sync.WaitGroup, endpoints []string, clientId int, ids identity.Provider, h *model.History, mux *sync.Mutex, config trafficConfig, limiter *rate.Limiter, lm identity.LeaseIdStorage) {
Member:

I don't understand why you cannot incorporate this into normal traffic.

Member Author:

To incorporate this into normal traffic, a client with test user authorization has to be set up upfront. However, before the user is added to the cluster, client creation will fail.

After a couple of failed attempts, I adopted this workaround. It reproduces the issue more deterministically on the impacted 3.5 versions.

Member:

Please set up authorization in cluster setup.

func (t readWriteSingleKey) PreRun(ctx context.Context, c interfaces.Client, lg *zap.Logger) error {
if t.AuthEnabled() {
lg.Info("set up auth")
return setupAuth(ctx, c)
Member:

Authorization setup should be done at cluster setup, not at this point.

Member Author:

Makes sense.

chaochn47 (Member Author) commented Feb 15, 2023

Revisiting this task, I think the linearizability test does not have to reproduce the exact scenario in which #14571 happened.

#14571 uncovered an issue where auth recovery from a snapshot failed to update rangePermCache. rangePermCache is an in-memory map that, for each user, maintains an interval tree; permission checks verify that the requested key range (key start to key end) is a subset of the user's permitted intervals.

To avoid further back and forth on this PR, the newly proposed plan is:

At cluster setup stage:

  1. Set snapshot catch-up entries to 1 and snapshot count to 1 to speed up raft log compaction, so the leader always asks a follower to download a snapshot even after a brief downtime.
  2. Create a root user; the root role has access to all operations.
  3. Create a test user and test role; grant the test role RW permission on the key range foo to zoo.

Traffic generator

  1. Half of the clients use root user permissions.
  2. Half of the clients use test user permissions; all their operations act on the key range foo to zoo.
  3. The root user client periodically grants/revokes test role permissions.

Fault injector

  1. Kill a random member.

The assumption is that the root user client should never observe a key whose value is inconsistent across different etcd servers. In <=3.5.5, granted/revoked permissions are not carried over to the restarted member.

@serathius Let me know if this aligns with the overall linearizability test design principles. Thanks!

stale bot commented May 21, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label May 21, 2023
@stale stale bot closed this Jun 18, 2023

6 participants