Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clear entries in remote caches and force events on the remote site #29597

Merged
merged 3 commits into from
May 23, 2024

Conversation

ahus1
Copy link
Contributor

@ahus1 ahus1 commented May 16, 2024

Closes #29592

Handles the following cases:

  • When removing the sessions for a realm, and an entry is not in the current site's local cache cache, but in the current site's remote cache:
    • We're iterating over all entries in the external cache and remove the external ones as well.
  • When removing the sessions for a realm, and an entry is in the other site's local cache or the other site's external cache:
    • We're sending an event to all cluster sites and run the cleanup commands there as well.
  • When removing or updating a single session, but that session isn't present in the external cache, so there wouldn't be an event to be observed on the other site:
    • We're adding a dummy entry and then removing it to force an event on the other site to trigger a clear on the other site's caches (both external and internal)

@ahus1
Copy link
Contributor Author

ahus1 commented May 17, 2024

@mhajas / @pruivo - this change now includes some more aggressive clearing of sessions in the external Infinispan.

I've updated the issue's description with the cases that I think are now handled.

Closes keycloak#29592

Signed-off-by: Alexander Schwartz <aschwart@redhat.com>
@ahus1 ahus1 force-pushed the is-29592-remote-caches-cleanup branch from b1a0147 to 707f128 Compare May 17, 2024 06:53
@ahus1 ahus1 marked this pull request as ready for review May 17, 2024 07:54
@ahus1 ahus1 requested a review from a team as a code owner May 17, 2024 07:54
Copy link
Contributor

@pruivo pruivo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is another place: org.keycloak.models.sessions.infinispan.changes.InfinispanChangelogBasedTransaction#replace()
If the computeIfPresent returns null, you should invoke remove to ensure the backup owners do not keep any stale value.

@@ -537,40 +535,43 @@ public void removeLocalUserSessions(String realmId, boolean offline) {

final AtomicInteger userSessionsSize = new AtomicInteger();

localCacheStoreIgnore
removeEntriesByRealm(realmId, localCacheStoreIgnore, userSessionsSize, localCache, localClientSessionCache);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hopefully, realms are not removed very often :)

@@ -108,7 +114,24 @@ private <K, V extends SessionEntity> void runOnRemoteCache(TopologyInfo topology

switch (operation) {
case REMOVE:
remoteCache.remove(key);
if (Profile.isFeatureEnabled(Profile.Feature.PERSISTENT_USER_SESSIONS) &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not required since removal is always replicated to the remote site.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved in next commit

Comment on lines 177 to 182
logger.debugf("No existing entry for %s in the remote cache to remove, might have been evicted.", key);
// force a remove to trigger an event on the remote DC to clear those
remoteCache.put(key, new SessionEntityWrapper<>(null));
if (remoteCache.withFlags(Flag.FORCE_RETURN_VALUE).remove(key) != null) {
logger.warnf("Unable to trigger the remove in the remote cache %s for key %s", remoteCache.getName(), key);
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remoteCache.remove is enough.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolved in next commit

@@ -130,6 +130,10 @@ protected void createRemoteEntityInCache(K key) {


V remoteSession = remoteSessionVersioned.getValue().getEntity();
if (remoteSession == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not required anymore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolved in next commit

@ahus1
Copy link
Contributor Author

ahus1 commented May 21, 2024

@pruivo - pushed one commit to solve some of your comments.

I'm not sure that InfinispanChangelogBasedTransaction#replace() is needed: You describe that I should call remove() to clear the entries of backup owners.

First, this is all about the embedded caches.

This code is used in the non-persistent sessions only, and for those there should always be all entries in memory, until they eventually expire.

For persistent sessions, we only have one owner as ensured by the CacheManagerFactory, and therefore it won't be an issue IMHO.

Let me know if I missed something. Thanks!

Closes keycloak#29592

Signed-off-by: Alexander Schwartz <aschwart@redhat.com>
@ahus1 ahus1 force-pushed the is-29592-remote-caches-cleanup branch from 177fafd to fb2bcfd Compare May 21, 2024 13:26
@pruivo
Copy link
Contributor

pruivo commented May 21, 2024

@pruivo - pushed one commit to solve some of your comments.

I'm not sure that InfinispanChangelogBasedTransaction#replace() is needed: You describe that I should call remove() to clear the entries of backup owners.

First, this is all about the embedded caches.

This code is used in the non-persistent sessions only, and for those there should always be all entries in memory, until they eventually expire.

For persistent sessions, we only have one owner as ensured by the CacheManagerFactory, and therefore it won't be an issue IMHO.

Let me know if I missed something. Thanks!

@ahus1 you're right. I forgot we are setting owners=1.

@@ -537,40 +535,43 @@ public void removeLocalUserSessions(String realmId, boolean offline) {

final AtomicInteger userSessionsSize = new AtomicInteger();

localCacheStoreIgnore
removeEntriesByRealm(realmId, localCacheStoreIgnore, userSessionsSize, localCache, localClientSessionCache);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not following why we iterate over entries in localCacheStoreIgnore but remove from localCache. Since each KC node is doing it, wouldn't it be enough to iterate over localCache only?

Copy link
Contributor Author

@ahus1 ahus1 May 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose it is trying to do the following:

  • Get me all entries, but don't ask the remote store for the entries as it would be expensive
  • If I then remove items, remove them als from the store which is remote cache

Copy link
Contributor

@mhajas mhajas May 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Get me all entries, but don't ask the remote store for the entries as it would be expensive

Yes that is correct, however, the event is processed by each Keycloak node and this means that each ID is processed also by nodes that do not own the entry which results in 3 removals of the same ID (in 3 KC node cluster). Then we use localCache.removeAsync which works only on nodes that own the entry as we use the flag CACHE_MODE_LOCAL. All other nodes do nothing.

I just realized the reason why we are doing it this way may be that we need to remove also client sessions which is making things a little complicated as client sessions may be owned by other nodes than user sessions and we also use CACHE_MODE_LOCAL for removal. I am thinking about a few things here:

  1. What happens if some n1 owns clientSession1 and n2 owns userSession1. Is it possible that n2 removes userSession1, without removing clientSession1 before n1 iterates over userSession1 which would mean n1 never get to userSession1 and hence never removes clientSession1?
  2. I am not sure it makes sense to use CACHE_MODE_LOCAL for client sessions. Maybe we can use SKIP_CACHE_STORE and we can execute the removal only on primary owners for a user session.
  3. Also, maybe we don't need to be afraid of keeping client sessions in there since they will be evicted eventually and they cannot be used without the corresponding user session. WDYT?

If I then remove items, remove them als from the store which is remote cache

If we talk only about this line, without the line 541, I don't see where we remove also from the remote cache. We are using CACHE_MODE_LOCAL for all removals which remove only from the current node if I am not mistaken.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed this in a meeting and we decided not to act on this comment at the moment as it would be just a performance optimization. The important part of this PR is to fix the possible issue when the remote store contains a different set of user sessions which is addressed by iterating also over the remote store.

We can continue with the performance optimization with a lower priority later.

removeEntriesByRealm(realmId, localCacheStoreIgnore, userSessionsSize, localCache, localClientSessionCache);

// iterate over the remote entries as well to ensure they are purged
removeEntriesByRealm(realmId, getCache(offline), userSessionsSize, getCache(offline), getClientSessionCache(offline));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous line removes all sessions from the local caches for the realm. This line attempts to remove also remote sessions.

Would it make sense to use RemoteCache directly for remote store removal instead of using getCache(). Also, should we add SKIP_LISTENER_NOTIFICATION flag to avoid processing of removals by each Keycloak node again? We know the local caches are already clean of sessions for the realm so we don't need to distribute these events.

If I remember correctly SKIP_LISTENER_NOTIFICATION still sends the notification to the other site in case of CrossDC setups but it would still make the number of sent events lower.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've pushed one more commit with the changes you suggested. As the local caches and the remote caches have different interfaces, I needed to duplicate the method to make it work. Please have a look if it now looks better to you.

Closes keycloak#29592

Signed-off-by: Alexander Schwartz <aschwart@redhat.com>
@ahus1 ahus1 requested a review from mhajas May 22, 2024 16:49
Copy link

@keycloak-github-bot keycloak-github-bot bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unreported flaky test detected, please review

@keycloak-github-bot
Copy link

Unreported flaky test detected

If the flaky tests below are affected by the changes, please review and update the changes accordingly. Otherwise, a maintainer should report the flaky tests prior to merging the PR.

org.keycloak.testsuite.model.session.UserSessionPersisterProviderTest#testMigrateSession

Keycloak CI - Store Model Tests

java.lang.AssertionError: expected:<3> but was:<2>
	at org.junit.Assert.fail(Assert.java:89)
	at org.junit.Assert.failNotEquals(Assert.java:835)
	at org.junit.Assert.assertEquals(Assert.java:647)
	at org.junit.Assert.assertEquals(Assert.java:633)
...

Report flaky test

Copy link
Contributor

@mhajas mhajas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the fix @ahus1

@mhajas mhajas merged commit c6e071c into keycloak:main May 23, 2024
67 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remote caches and other site's caches might get out-of-sync when persistent sessions are used
3 participants