Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce in-memory caching package #766

Open
wants to merge 7 commits into
base: main
Choose a base branch
from
Open

Conversation

souleb
Copy link
Member

@souleb souleb commented Apr 25, 2024

This PR will allow caching the authentication credentials retrieved by pkg/oci/auth. It should also enable future usage of the caching underlying mechanism.

Part of: #642

Store design

The store is K/V store that can store arbitrary objects. The store interface is a simplified version of client go store. If an object can be cached there, it should be in our store implementations. This is desirable, because the primary envisioned place for the store usage is during a reconciliation of custom resources retrieved from a shared informer cache.

The keys are generated dynamically with deterministic function. Accepting a function instead of key strings enables user to determine themselves the keys uniqueness constraint.

They store must be thread safe, it should support concurrent programs.

There are 2 main uses cases:

    1. cache authentication credentials. A credentials key is written, read and overwritten multiple times by a single goroutine. If we plan to store a credential and refresh it every 6 hours and read every 30 minutes(reconciliation interval), the read/write ratio is 12:1. This is a likely scenario. Authentication credentials are expirable.
    1. cache helm indexes. The new Store interface and underlying implementation should allow replacing https://github.com/fluxcd/source-controller/tree/main/internal/cache which is today used to cache helm indexes. A helm index is written, read and overwritten multiple times by multiple goroutines. Usually an index is stored and read multiple times (reconciliation of all dependent custom resources * interval) before being overwritten by a new version. It is read intensive. An index is read from local storage before being cached, if the cache is full, we could have several Custom Resources loading the same index in memory, in order to avoid that, we should evict older keys to make room.

Based on the two scenario above, the store should be optimized for reads.

We also have a scenario that needs keys to expirable and another that need them be evicted based on usage.

Hence two implementations are provided:

  • An expirable store for use case 1
  • An LRU cache for use case 2

The store interface is a simplified version of client go store. If an
object can be cached there, it should be in our store implementations.

We also provide a default way to generate keys, but we allow users to
provide their own function. This should enable all needed levels of
isolation.

Signed-off-by: Soule BA <bah.soule@gmail.com>
@souleb souleb requested a review from a team as a code owner April 25, 2024 15:17
@souleb souleb marked this pull request as draft April 25, 2024 15:17
cache/cache.go Outdated Show resolved Hide resolved
Signed-off-by: Soule BA <bah.soule@gmail.com>
Signed-off-by: Soule BA <bah.soule@gmail.com>
Signed-off-by: Soule BA <bah.soule@gmail.com>
The cache instrumentation purpose is to be self contained. We want to
avoid leaking any of the logic here in consuming code.

Signed-off-by: Soule BA <bah.soule@gmail.com>
Signed-off-by: Soule BA <bah.soule@gmail.com>
cache/lru.go Outdated Show resolved Hide resolved
cache/metrics.go Outdated Show resolved Hide resolved
cache/cache.go Outdated Show resolved Hide resolved
@souleb souleb force-pushed the enable-cachin-auth-tokens branch from f79844e to f5a5a36 Compare May 3, 2024 13:27
@errordeveloper
Copy link
Contributor

errordeveloper commented May 8, 2024

Thanks for working on this, @souleb!

The store interface is a simplified version of client go store.

It would be handy to have a link in the code. At the moment this is only mentioned in a commit message and PR description.

What may need to be made more explicit is that the sore is persisted to a file.

@souleb
Copy link
Member Author

souleb commented May 18, 2024

Thanks for working on this, @souleb!

The store interface is a simplified version of client go store.

It would be handy to have a link in the code. At the moment this is only mentioned in a commit message and PR description.

What may need to be made more explicit is that the sore is persisted to a file.

There is a comment in store.go pointing to https://pkg.go.dev/k8s.io/client-go/tools/cache#Store.

Persisting to a file will be made in a follow up PR where we will outline the use case we have for it.

@souleb souleb force-pushed the enable-cachin-auth-tokens branch 4 times, most recently from 8eec6c2 to 4ddba8d Compare May 19, 2024 21:24
@souleb souleb marked this pull request as ready for review May 19, 2024 21:24
@souleb souleb force-pushed the enable-cachin-auth-tokens branch from 4ddba8d to a108669 Compare May 19, 2024 21:51
@stefanprodan stefanprodan changed the title Auth token caching Introduce in-memory caching package May 23, 2024
Copy link
Member

@stefanprodan stefanprodan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thanks @souleb 🏅

cache/store.go Outdated Show resolved Hide resolved
cache/store.go Outdated Show resolved Hide resolved
cache/store.go Show resolved Hide resolved
cache/store.go Outdated Show resolved Hide resolved
Comment on lines +111 to +114
if key, ok := any(object).(ExplicitKey); ok {
return string(key), nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the input is an ExplicitKey already, which is of string type, what's the use case for passing it to IdentifiableObjectKeyFunc to obtain a string again?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you always use the Get() function and never GetByKey() you might have some objects for which you only have a key. The ExplicitKey solve that.

Comment on lines +121 to +124
// StoreObject is a wrapper for an object with its identifying key.
// It is used to store objects in a Store.
type StoreObject[T any] struct {
Copy link
Contributor

@darkowlzz darkowlzz Jun 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be misunderstanding the usage of generics, please correct me, but this says that this is used to store objects in a Store and I do see it being passed to cache.Add() in the tests. Can't we use this StoreObject data type in the Store interface instead of any? As StoreObject in itself contains any. I'm thinking about it because I see very few usage of StoreObject data type.

cache/cache.go Show resolved Hide resolved
Signed-off-by: Soule BA <bah.soule@gmail.com>
@souleb souleb force-pushed the enable-cachin-auth-tokens branch from a108669 to c6ec0fb Compare June 4, 2024 08:50
Comment on lines +78 to +80
if opt.registerer == nil {
return nil, ErrNoRegisterer
}
Copy link
Contributor

@darkowlzz darkowlzz Jun 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this makes metrics a requirement instead of optional?
In our reconciler metrics, we don't do anything if there's no recorder/collector configured and let the component run, refer https://github.com/fluxcd/pkg/blob/runtime/v0.47.1/runtime/controller/metrics.go#L79 .

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not optional at the moment. I'll have to refactor how we call the metrics funcs. I'll do that.

Comment on lines +144 to +145
c.mu.Unlock()
return KeyError{object, ErrFull}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this returns an error and doesn't add the object in the cache, shouldn't it also have

c.metrics.incCacheRequests(StatusFailure)

?
Same in Update() below.

Comment on lines +119 to +121
key, err := c.keyFunc(object)
if err != nil {
return KeyError{object, err}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this too be recorded as a failed request in metrics?
Same in Update() below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the keyFunc is wrong, many actions will result in error due to user provided parameter. This might have a consequence of too many metrics. Also this error is not on the cache side, so I wouldn't record it.

Copy link
Contributor

@darkowlzz darkowlzz Jun 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if the key derivation is part of the cache, the cache receives a request even for bad input. Key derivation failure takes place within the cache.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the LRU cache, this seems to be implemented as suggested, except for in the Add() method.

if item.expiration > 0 {
if item.expiration < time.Now().UnixNano() {
c.mu.RUnlock()
c.metrics.incCacheRequests("succes")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
c.metrics.incCacheRequests("succes")
c.metrics.incCacheRequests(StatusSuccess)

Comment on lines +202 to +211
if c.labelsFunc != nil {
lvs, err = c.labelsFunc(object, len(c.metrics.getExtraLabels()))
if err != nil {
return res, false, KeyError{object, err}
}
}
key, err := c.keyFunc(object)
if err != nil {
return res, false, KeyError{object, err}
}
Copy link
Contributor

@darkowlzz darkowlzz Jun 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the failures in both of the above be recorded as failed cache request metrics?

Comment on lines +270 to +271
if err != nil {
return KeyError{object, err}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Failed cache request metric here too?


// Resize resizes the cache and returns the number of index removed.
func (c *cache[T]) Resize(size int) int {
overflow := len(c.items) - size
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some guard for the overflow value would help prevent panic due to out of range errors. Passing any negative number as the size results in out of bound failure.
Maybe check the overflow value with the total capacity before proceeding.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll make size an uint

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid that a uint will need to be converted to an int for subtracting with the length and the integer overflow will also result in a negative number.
But maybe you can deal with it in some other way.

cache/cache.go Show resolved Hide resolved
// GetExpiration returns the expiration for the given key.
// Returns zero if the key is not in the cache or the item
// has already expired.
func (c *Cache[T]) GetExpiration(object T) (time.Duration, error) {
Copy link
Contributor

@darkowlzz darkowlzz Jun 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the definition of this is making me wonder about the return type.
In terms of the Expirable interface, accepting a duration is easier to set an expiration for now + x duration. But for getting the expiration, if the same duration is returned, it's difficult to know the exact time as the duration was calculated relative to when it was obtained. It would be easier to use if the exact time is obtained for GetExpiration.

Looking at its usage in #776, for example for AWS, the ECR token provides an exact time, but we calculate a relative time and return that. Then use that relative time to set the cache expiration. If there's any delay between obtaining the token and caching it due to any reason, world stops due to GC or the system hangs momentarily, the relative time will be incorrect.

Comment on lines +34 to +37
type testObject struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
}
Copy link
Contributor

@darkowlzz darkowlzz Jun 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about replacing this with the PartialObjectMetadata from https://github.com/kubernetes/apimachinery/blob/v0.30.1/pkg/apis/meta/v1/types.go#L1496-L1502 ?

if c.closed {
return ErrClosed
}
c.janitor.stop <- true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If no cleanup interval is passed as an option when creating the cache, the janitor goroutine doesn't get started. But upon calling Close() on such a cache, the program hangs because the above code tries to send data to a channel that has not started receiving anything. I encountered this in TestCache test which calls Close() at the very end.
Although there may be no need to call Close in such a situation, we can handle this gracefully by either including a default interval for the cleanup to always run the janitor, or check the interval to find out if the janitor goroutine may be running before running the code above.
Since Delete() depends on the janitor to actually delete the expired items from cache, the janitor seems to be an essential part of the cache for proper functioning. This makes me favor a default cleanup interval approach that ensures that the janitor always runs.

}

// Delete an item from the cache. Does nothing if the key is not in the cache.
// It actually sets the item expiration to now, so that it will at the cleanup.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"... will be deleted at the cleanup" ?

mu sync.RWMutex
}

var _ Expirable[any] = &Cache[any]{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to also add

var _ Store[any] = &Cache[any]{}

similar to the one in lru.go to ensure it satisfies the interface.

Comment on lines +139 to +143
// if node is head or tail, do nothing
if key == c.head.key || key == c.tail.key {
c.metrics.incCacheRequests(StatusSuccess)
return nil
}
Copy link
Contributor

@darkowlzz darkowlzz Jun 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is making me question if the design is a bit accidental. Reading the LRU.add(), I'm visualizing the linked-list and the head/tail pointers as

image

Head and tail don't contain object data. They are like cursors/references pointing to the last and first element of the list. They aren't counted as cache elements. But the last and first nodes also point to these empty head and tail as their previous and next nodes, respectively. The key of head and tail are always empty as they don't contain any object.
The above code checks to see if the node that's requested to be deleted matches with head or tail. Based on my above understanding, this if condition won't be true because of how LRU.add() inserts the objects and organizes the cache.

Am I misunderstanding?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants