
server/auth: invalidate range permission cache during recovering from snapshot #13920

Closed
mitake wants to merge 1 commit into main

Conversation

@mitake (Contributor) commented Apr 10, 2022

Fix #13883

The above issue reports that authStore.Recover() doesn't invalidate rangePermCache, so an etcd node that was isolated from its cluster may keep serving stale permission information after the network partition is resolved. This PR fixes the issue by invalidating the cache defensively during recovery.
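For illustration, here is a minimal, self-contained sketch of the shape of the fix, using simplified stand-ins for authStore and rangePermCache rather than the actual etcd code (the real one-line change is in the diff further down):

```go
package main

import "fmt"

// Simplified stand-in for etcd's per-user cached range permissions.
type unifiedRangePermissions struct {
	readPrefixes []string
}

type authStore struct {
	rangePermCache map[string]*unifiedRangePermissions
}

// Recover models restoring auth state from a snapshot. The defensive part of
// the fix: drop the whole cache so pre-partition entries can no longer answer
// permission checks; the cache is repopulated by later permission checks.
func (as *authStore) Recover() {
	as.rangePermCache = make(map[string]*unifiedRangePermissions)
}

func main() {
	as := &authStore{rangePermCache: map[string]*unifiedRangePermissions{
		"alice": {readPrefixes: []string{"/stale/"}}, // entry computed before the partition
	}}
	as.Recover()
	fmt.Println(len(as.rangePermCache)) // 0: the stale entry is gone
}
```

After Recover, the next permission check misses the cache and recomputes the user's permissions from the recovered auth data instead of reusing a pre-partition entry.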

cc @ptabor

@codecov-commenter commented Apr 10, 2022

Codecov Report

Merging #13920 (2e034d2) into main (0c9a4e0) will decrease coverage by 0.80%.
The diff coverage is 84.00%.

❗ Current head 2e034d2 differs from pull request most recent head 241d211. Consider uploading reports for the commit 241d211 to get more accurate results

@@            Coverage Diff             @@
##             main   #13920      +/-   ##
==========================================
- Coverage   72.71%   71.91%   -0.81%     
==========================================
  Files         469      469              
  Lines       38398    38414      +16     
==========================================
- Hits        27923    27624     -299     
- Misses       8710     9016     +306     
- Partials     1765     1774       +9     
Flag Coverage Δ
all 71.91% <84.00%> (-0.81%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
api/v3rpc/rpctypes/error.go 90.47% <ø> (ø)
server/etcdserver/api/v3rpc/util.go 74.19% <ø> (ø)
server/etcdserver/errors.go 0.00% <ø> (ø)
server/etcdserver/v3_server.go 78.17% <80.00%> (-0.21%) ⬇️
client/v3/mirror/syncer.go 76.19% <100.00%> (+1.83%) ⬆️
server/etcdserver/server.go 84.38% <100.00%> (-0.49%) ⬇️
server/proxy/httpproxy/reverse.go 0.00% <0.00%> (-63.03%) ⬇️
server/proxy/httpproxy/metrics.go 38.46% <0.00%> (-61.54%) ⬇️
client/v3/snapshot/v3_snapshot.go 0.00% <0.00%> (-54.35%) ⬇️
server/proxy/httpproxy/proxy.go 25.58% <0.00%> (-46.52%) ⬇️
... and 40 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0c9a4e0...241d211.

@serathius (Member)

Is this reproducible on isolated members with serializable requests? Could we maybe add an e2e test?

@mitake (Contributor, Author) commented Apr 12, 2022

Yeah, let me add e2e test cases.

```diff
@@ -388,6 +388,8 @@ func (as *authStore) Recover(be AuthBackend) {
 		as.tokenProvider.enable()
 	}
 	as.enabledMu.Unlock()
+
+	as.rangePermCache = make(map[string]*unifiedRangePermissions)
```
Member:

Two comments:

  1. Suggest calling clearCachedPerm instead;
  2. There is a potential race condition: some requests coming from the API (such as v3_server.go#L128 and watch.go#L235), i.e. outside of the apply workflow, may access rangePermCache concurrently. It seems we need to add a lock to protect it (see the sketch below).
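A minimal sketch of the dedicated lock being suggested here, with simplified stand-in types and a hypothetical rangePermCacheMu field (this is only an illustration of the idea, not the change that later landed in #13954): API-path readers take the read lock, while the apply/recover path takes the write lock before clearing or replacing the map.

```go
package main

import "sync"

// Simplified stand-in types; field and method names here are hypothetical.
type unifiedRangePermissions struct{ readPrefixes []string }

type authStore struct {
	// Hypothetical dedicated lock for rangePermCache, so API-path readers and
	// the apply/recover-path writer no longer rely on a backend transaction
	// lock for mutual exclusion.
	rangePermCacheMu sync.RWMutex
	rangePermCache   map[string]*unifiedRangePermissions
}

// Read path (e.g. permission checks for serializable ranges or watches).
func (as *authStore) cachedPerms(user string) (*unifiedRangePermissions, bool) {
	as.rangePermCacheMu.RLock()
	defer as.rangePermCacheMu.RUnlock()
	p, ok := as.rangePermCache[user]
	return p, ok
}

// Write path (apply workflow, auth config changes, Recover): replace the map
// wholesale under the write lock.
func (as *authStore) invalidateCachedPerms() {
	as.rangePermCacheMu.Lock()
	defer as.rangePermCacheMu.Unlock()
	as.rangePermCache = make(map[string]*unifiedRangePermissions)
}

func main() {
	as := &authStore{rangePermCache: make(map[string]*unifiedRangePermissions)}
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ { // concurrent readers racing with an invalidation
		wg.Add(1)
		go func() {
			defer wg.Done()
			as.cachedPerms("alice")
		}()
	}
	as.invalidateCachedPerms()
	wg.Wait()
}
```

Replacing the map wholesale under the write lock also means readers never observe a partially cleared cache.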

Contributor:

+1. Good catch.

In general, the existing locking strategy seems weak here:

  • Access to this cache is protected by readtx.Lock().
  • readTx is owned by the backend that we are swapping here.
  • So it seems theoretically possible for two transactions to access/modify the cache concurrently: one holding a tx on the old backend, the other on the new backend. Closing of the old backend is fully asynchronous:

        go func() {
            lg.Info("closing old backend file")
            defer func() {
                lg.Info("closed old backend file")
            }()
            if err := oldbe.Close(); err != nil {
                lg.Panic("failed to close old backend", zap.Error(err))
            }
        }()

  • Thus it seems the cache should have its own lock instead of piggybacking on the transaction lock.

Member:

> Thus it seems the cache should have its own lock instead of piggybacking on the transaction lock

Exactly! Please note that the cache can't be protected by readtx.Lock(), because a batchTx and a readTx can execute concurrently: the apply workflow touches the cache while holding a batchTx, API-path reads consult it under a readTx, and those are different locks, so they give no mutual exclusion over the cache.

Contributor (Author):

Thanks for pointing that out! I think it's an independent issue; let me open a dedicated PR for it.

Contributor (Author):

Opened an independent PR here: #13954. It would be great if you could review it.

@ptabor marked this pull request as draft on April 29, 2022, 08:21
@mitake (Contributor, Author) commented Jul 3, 2022

#13954 was merged so I'll resume this PR.

@stale (bot) commented Oct 15, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

The stale bot added the stale label on Oct 15, 2022.
@ahrtr (Member) commented Oct 15, 2022

@mitake we should have already resolved the issue. Could you double-check and confirm this?

FYI.
#14227
#14358
#14574

The stale bot removed the stale label on Oct 15, 2022.
@mitake (Contributor, Author) commented Oct 29, 2022

@ahrtr Yes, I think we can close this PR. The most reliable way to cause this issue was a membership change, as shown in #14571.
Prior to #13954, I think a stale rangePermCache could only rarely cause real issues, because the cache can become stale only if this sequence happens: 1. network partition, 2. auth config update, 3. WAL compaction, 4. rejoin.
Note that a partition caused by restarting the etcd process leaves rangePermCache empty, which helped avoid the inconsistency in the above sequence until #14571: before that PR, rangePermCache only needed to be updated during data access (see the sketch below), so an empty rangePermCache after a restart isn't harmful.
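To make the last point concrete, here is a rough, illustrative sketch (hypothetical names, not etcd's exact code) of the lazy behavior being described: a permission check consults rangePermCache first and rebuilds the entry from the store on a miss, so an empty cache after a restart repopulates itself correctly, while a cache holding stale entries keeps serving them.

```go
package main

import "fmt"

// Simplified stand-ins; names are illustrative, not etcd's exact code.
type unifiedRangePermissions struct{ readPrefixes []string }

// rebuildFromStore stands in for recomputing a user's merged permissions from
// the (recovered) auth backend.
func rebuildFromStore(user string) *unifiedRangePermissions {
	return &unifiedRangePermissions{readPrefixes: []string{"/current/"}}
}

type authStore struct {
	rangePermCache map[string]*unifiedRangePermissions
}

// Lazy population: an empty cache is harmless because a miss triggers a
// rebuild from the store, but a stale entry is returned as-is.
func (as *authStore) permsForUser(user string) *unifiedRangePermissions {
	if p, ok := as.rangePermCache[user]; ok {
		return p // a stale entry would be served here without complaint
	}
	p := rebuildFromStore(user)
	as.rangePermCache[user] = p
	return p
}

func main() {
	as := &authStore{rangePermCache: make(map[string]*unifiedRangePermissions)}
	fmt.Println(as.permsForUser("alice").readPrefixes) // rebuilt on miss: [/current/]
}
```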

@mitake closed this on Oct 29, 2022.
@ahrtr (Member) commented Oct 29, 2022

Thanks @mitake

Successfully merging this pull request may close these issues.

AuthStore::Recover is not invalidating caches.