
server/auth: invalidate range permission cache during recovering from snapshot #13920

Closed
mitake wants to merge 1 commit into main

Conversation

@mitake (Contributor) commented Apr 10, 2022

Fix #13883

The above issue reports that authStore.Recover() doesn't invalidate rangePermCache, so an etcd node that was isolated from its cluster may keep serving stale permission information after the network partition is resolved. This PR fixes the issue by invalidating the cache defensively during recovery.
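For illustration, here is a minimal, self-contained sketch of the shape of the fix, using simplified stand-ins for authStore and rangePermCache rather than the actual etcd code (the real one-line change is in the diff further down):

```go
package main

import "fmt"

// Simplified stand-in for etcd's per-user cached range permissions.
type unifiedRangePermissions struct {
	readPrefixes []string
}

type authStore struct {
	rangePermCache map[string]*unifiedRangePermissions
}

// Recover models restoring auth state from a snapshot. The defensive part of
// the fix: drop the whole cache so pre-partition entries can no longer answer
// permission checks; the cache is repopulated by later permission checks.
func (as *authStore) Recover() {
	as.rangePermCache = make(map[string]*unifiedRangePermissions)
}

func main() {
	as := &authStore{rangePermCache: map[string]*unifiedRangePermissions{
		"alice": {readPrefixes: []string{"/stale/"}}, // entry computed before the partition
	}}
	as.Recover()
	fmt.Println(len(as.rangePermCache)) // 0: the stale entry is gone
}
```

After Recover, the next permission check misses the cache and recomputes the user's permissions from the recovered auth data instead of reusing a pre-partition entry.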

cc @ptabor

@codecov-commenter commented Apr 10, 2022

Codecov Report

Merging #13920 (2e034d2) into main (0c9a4e0) will decrease coverage by 0.80%.
The diff coverage is 84.00%.

❗ Current head 2e034d2 differs from pull request most recent head 241d211. Consider uploading reports for the commit 241d211 to get more accurate results

@@            Coverage Diff             @@
##             main   #13920      +/-   ##
==========================================
- Coverage   72.71%   71.91%   -0.81%     
==========================================
  Files         469      469              
  Lines       38398    38414      +16     
==========================================
- Hits        27923    27624     -299     
- Misses       8710     9016     +306     
- Partials     1765     1774       +9     
Flag Coverage Δ
all 71.91% <84.00%> (-0.81%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
api/v3rpc/rpctypes/error.go 90.47% <ø> (ø)
server/etcdserver/api/v3rpc/util.go 74.19% <ø> (ø)
server/etcdserver/errors.go 0.00% <ø> (ø)
server/etcdserver/v3_server.go 78.17% <80.00%> (-0.21%) ⬇️
client/v3/mirror/syncer.go 76.19% <100.00%> (+1.83%) ⬆️
server/etcdserver/server.go 84.38% <100.00%> (-0.49%) ⬇️
server/proxy/httpproxy/reverse.go 0.00% <0.00%> (-63.03%) ⬇️
server/proxy/httpproxy/metrics.go 38.46% <0.00%> (-61.54%) ⬇️
client/v3/snapshot/v3_snapshot.go 0.00% <0.00%> (-54.35%) ⬇️
server/proxy/httpproxy/proxy.go 25.58% <0.00%> (-46.52%) ⬇️
... and 40 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0c9a4e0...241d211.

@serathius (Member)

Is this reproducible on isolated members with serializable requests? Could we maybe add an e2e test?

@mitake (Contributor, Author) commented Apr 12, 2022

Yeah, let me add e2e test cases.

```diff
@@ -388,6 +388,8 @@ func (as *authStore) Recover(be AuthBackend) {
 		as.tokenProvider.enable()
 	}
 	as.enabledMu.Unlock()
+
+	as.rangePermCache = make(map[string]*unifiedRangePermissions)
```
Member:

Two comments:

  1. Suggest calling clearCachedPerm instead;
  2. There is a potential race condition: some requests coming from the API (such as v3_server.go#L128 and watch.go#L235), i.e. outside of the apply workflow, may access rangePermCache concurrently. It seems we need to add a lock to protect it (see the sketch below).
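A minimal sketch of the dedicated lock being suggested here, with simplified stand-in types and a hypothetical rangePermCacheMu field (this is only an illustration of the idea, not the change that later landed in #13954): API-path readers take the read lock, while the apply/recover path takes the write lock before clearing or replacing the map.

```go
package main

import "sync"

// Simplified stand-in types; field and method names here are hypothetical.
type unifiedRangePermissions struct{ readPrefixes []string }

type authStore struct {
	// Hypothetical dedicated lock for rangePermCache, so API-path readers and
	// the apply/recover-path writer no longer rely on a backend transaction
	// lock for mutual exclusion.
	rangePermCacheMu sync.RWMutex
	rangePermCache   map[string]*unifiedRangePermissions
}

// Read path (e.g. permission checks for serializable ranges or watches).
func (as *authStore) cachedPerms(user string) (*unifiedRangePermissions, bool) {
	as.rangePermCacheMu.RLock()
	defer as.rangePermCacheMu.RUnlock()
	p, ok := as.rangePermCache[user]
	return p, ok
}

// Write path (apply workflow, auth config changes, Recover): replace the map
// wholesale under the write lock.
func (as *authStore) invalidateCachedPerms() {
	as.rangePermCacheMu.Lock()
	defer as.rangePermCacheMu.Unlock()
	as.rangePermCache = make(map[string]*unifiedRangePermissions)
}

func main() {
	as := &authStore{rangePermCache: make(map[string]*unifiedRangePermissions)}
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ { // concurrent readers racing with an invalidation
		wg.Add(1)
		go func() {
			defer wg.Done()
			as.cachedPerms("alice")
		}()
	}
	as.invalidateCachedPerms()
	wg.Wait()
}
```

Replacing the map wholesale under the write lock also means readers never observe a partially cleared cache.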

Contributor:

+1. Good catch.

In general, the existing locking strategy seems weak here:

  • Access to this cache is protected by readtx.Lock().
  • readTx is owned by the backend that we are swapping here.
  • So it seems theoretically possible for two transactions to access/modify the cache concurrently: one holding a tx on the old backend, the other on the new backend. Closing of the old backend is fully asynchronous:

        go func() {
            lg.Info("closing old backend file")
            defer func() {
                lg.Info("closed old backend file")
            }()
            if err := oldbe.Close(); err != nil {
                lg.Panic("failed to close old backend", zap.Error(err))
            }
        }()

  • Thus it seems the cache should have its own lock instead of piggybacking on the transaction lock.

Member:

> Thus it seems the cache should have its own lock instead of piggybacking on the transaction lock

Exactly! Please note that the cache can't be protected by readtx.Lock(), because a batchTx and a readTx can execute concurrently: the apply workflow touches the cache while holding a batchTx, API-path reads consult it under a readTx, and those are different locks, so they give no mutual exclusion over the cache.

Contributor (Author):

Thanks for pointing that out! I think it's an independent issue; let me open a dedicated PR for it.

Contributor (Author):

Opened an independent PR here: #13954. It would be great if you could review it.

@ptabor marked this pull request as draft on April 29, 2022, 08:21
@mitake (Contributor, Author) commented Jul 3, 2022

#13954 was merged so I'll resume this PR.

@stale (bot) commented Oct 15, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

The stale bot added the stale label on Oct 15, 2022.
@ahrtr (Member) commented Oct 15, 2022

@mitake we should have already resolved the issue. Could you double-check and confirm this?

FYI.
#14227
#14358
#14574

The stale bot removed the stale label on Oct 15, 2022.
@mitake (Contributor, Author) commented Oct 29, 2022

@ahrtr Yes, I think we can close this PR. The most reliable way to cause this issue was a membership change, as shown in #14571.
Prior to #13954, I think a stale rangePermCache could only rarely cause real issues, because the cache can become stale only if this sequence happens: 1. network partition, 2. auth config update, 3. WAL compaction, 4. rejoin.
Note that a partition caused by restarting the etcd process leaves rangePermCache empty, which helped avoid the inconsistency in the above sequence until #14571: before that PR, rangePermCache only needed to be updated during data access (see the sketch below), so an empty rangePermCache after a restart isn't harmful.
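To make the last point concrete, here is a rough, illustrative sketch (hypothetical names, not etcd's exact code) of the lazy behavior being described: a permission check consults rangePermCache first and rebuilds the entry from the store on a miss, so an empty cache after a restart repopulates itself correctly, while a cache holding stale entries keeps serving them.

```go
package main

import "fmt"

// Simplified stand-ins; names are illustrative, not etcd's exact code.
type unifiedRangePermissions struct{ readPrefixes []string }

// rebuildFromStore stands in for recomputing a user's merged permissions from
// the (recovered) auth backend.
func rebuildFromStore(user string) *unifiedRangePermissions {
	return &unifiedRangePermissions{readPrefixes: []string{"/current/"}}
}

type authStore struct {
	rangePermCache map[string]*unifiedRangePermissions
}

// Lazy population: an empty cache is harmless because a miss triggers a
// rebuild from the store, but a stale entry is returned as-is.
func (as *authStore) permsForUser(user string) *unifiedRangePermissions {
	if p, ok := as.rangePermCache[user]; ok {
		return p // a stale entry would be served here without complaint
	}
	p := rebuildFromStore(user)
	as.rangePermCache[user] = p
	return p
}

func main() {
	as := &authStore{rangePermCache: make(map[string]*unifiedRangePermissions)}
	fmt.Println(as.permsForUser("alice").readPrefixes) // rebuilt on miss: [/current/]
}
```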

@mitake closed this on Oct 29, 2022.
@ahrtr (Member) commented Oct 29, 2022

Thanks @mitake

Successfully merging this pull request may close these issues.

AuthStore::Recover is not invalidating caches.