Stats Module #1785

Open
bradengroom opened this issue Mar 7, 2024 · 2 comments
Labels
kind/proposal Something fundamentally needs to change

Comments

@bradengroom
Contributor

Problem Statement

In order to even attempt cost-based query planning, SpiceDB needs to gather some metrics about the relationships that it is trying to traverse. This proposal is meant to more directly discuss details of a potential stats module that is alluded to in #1573.

Solution Brainstorm

It has been a while since I filed #1573, but I have been noodling on the stats module problem, and I wanted to capture some of my thoughts to see what other thoughts they might generate from other folks.

There are a few high-level problems here that I'll try to ideate on:

  • Stats to track
  • HLL storage
  • Syncing
  • Deletes

Stats to track

This one is short. The stats module should be flexible enough to track additional metrics over time, but the clearest need is to track cardinality between relationships. As such, the stats module should expose the ability to fetch relationship cardinality estimates in its interface. Tracking the cardinality is touched on in #1573, but I think the clearest path is to use a structure like HLLs to estimate cardinality in a memory-efficient way.

Whenever a relationship is created, an HLL for each end of that relationship would be updated. (More later on deletes)

Additional stats may be integrated as needed.
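
To make the shape of this interface concrete, here is a minimal Go sketch. Everything named here is an assumption for illustration rather than an existing SpiceDB API: the `CardinalityStats` interface, the `MemStats` implementation, keying sketches by a relation string, and the choice of 2^12 registers. The HLL itself is a bare-bones dense variant (no sparse mode, no bias correction tables), and the hash is FNV-1a followed by the MurmurHash3 64-bit finalizer.

```go
package main

import (
	"hash/fnv"
	"math"
	"math/bits"
)

const (
	precision = 12             // assumed: 2^12 registers, roughly 1.6% standard error
	numRegs   = 1 << precision // 4096 one-byte registers per sketch
)

// HLL is a minimal dense HyperLogLog (no sparse mode, no bias tables).
type HLL struct{ regs [numRegs]uint8 }

// hash64 hashes with FNV-1a, then applies the MurmurHash3 64-bit finalizer
// so the high bits used for register selection are well mixed.
func hash64(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	x := h.Sum64()
	x ^= x >> 33
	x *= 0xff51afd7ed558ccd
	x ^= x >> 33
	x *= 0xc4ceb9fe1a85ec53
	x ^= x >> 33
	return x
}

// Add records one item in the sketch.
func (h *HLL) Add(item string) {
	x := hash64(item)
	idx := x >> (64 - precision)
	// The sentinel bit bounds the rank when the remaining bits are all zero.
	rank := uint8(bits.LeadingZeros64(x<<precision|1<<(precision-1))) + 1
	if rank > h.regs[idx] {
		h.regs[idx] = rank
	}
}

// Estimate returns the approximate number of distinct items added.
func (h *HLL) Estimate() uint64 {
	sum, zeros := 0.0, 0
	for _, r := range h.regs {
		sum += math.Ldexp(1, -int(r))
		if r == 0 {
			zeros++
		}
	}
	m := float64(numRegs)
	est := 0.7213 / (1 + 1.079/m) * m * m / sum
	if est <= 2.5*m && zeros > 0 {
		est = m * math.Log(m/float64(zeros)) // linear counting for small sets
	}
	return uint64(est + 0.5)
}

// CardinalityStats is a hypothetical interface for the stats module.
type CardinalityStats interface {
	RecordRelationship(relation, resourceID, subjectID string)
	Cardinality(relation string) (resources, subjects uint64)
}

// relationSketches keeps one HLL per end of a relation.
type relationSketches struct{ resources, subjects HLL }

// MemStats is an in-memory CardinalityStats, for illustration only.
type MemStats struct{ byRelation map[string]*relationSketches }

func NewMemStats() *MemStats {
	return &MemStats{byRelation: map[string]*relationSketches{}}
}

// RecordRelationship updates the HLL for each end of the relationship.
func (s *MemStats) RecordRelationship(relation, resourceID, subjectID string) {
	rs, ok := s.byRelation[relation]
	if !ok {
		rs = &relationSketches{}
		s.byRelation[relation] = rs
	}
	rs.resources.Add(resourceID)
	rs.subjects.Add(subjectID)
}

// Cardinality returns distinct-resource and distinct-subject estimates.
func (s *MemStats) Cardinality(relation string) (resources, subjects uint64) {
	rs, ok := s.byRelation[relation]
	if !ok {
		return 0, 0
	}
	return rs.resources.Estimate(), rs.subjects.Estimate()
}

var _ CardinalityStats = (*MemStats)(nil)
```

A planner could then call something like `stats.Cardinality("document#viewer")` and compare the two estimates to decide which end of the relation to traverse from. Additional metrics would become extra methods on the same interface.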

HLL storage

This one might be tricky. I see a few options which I'll try to outline exhaustively:

  1. ❌ Store HLLs in a DB-native way
    Most DBs don't support truly native HLLs, and that is certainly true of the storage backends supported by SpiceDB. Some may require an extension to be installed or might not have a native option at all. Even if extensions existed for each backend, requiring users to install and upgrade special DB extensions is likely an unacceptable user experience for SpiceDB customers.
  2. ❌ Stats-specific storage
    A dedicated storage option (like Redis) could be pulled in for storing stats. While Redis has fantastic support for HLLs, I personally do not like this approach as it still complicates the SpiceDB deployment story and also comes with some persistence concerns. Adding yet another piece to deploy and maintain is likely unacceptable, so I won't spend more time on it.
  3. (Yes?) App-level HLLs stored as byte arrays
    Byte array support is reasonably table stakes for a DB, so the requirement here should be easy to meet: CockroachDB, Spanner, Postgres, and MySQL all support it. In this world, SpiceDB is responsible for operating on the HLLs and serializing them out as bytes within the DB. This is more complexity within SpiceDB itself, but likely worth the tradeoff of a simpler deploy/customer experience.
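
For option 3, the serialization side could look roughly like the Go sketch below. The layout is an assumption made up for illustration (one version byte, one precision byte, then one byte per register), as are the names `HLL`, `formatVersion`, and the 2^12 register count. The method names follow Go's standard `encoding.BinaryMarshaler` and `encoding.BinaryUnmarshaler` interfaces.

```go
package main

import "errors"

const (
	formatVersion = 1              // assumed on-disk format version
	precision     = 12             // assumed register count of 2^12
	numRegs       = 1 << precision // 4096 registers
)

// HLL holds dense HyperLogLog registers, one byte each.
type HLL struct{ regs [numRegs]uint8 }

// MarshalBinary encodes the sketch as: version, precision, registers.
func (h *HLL) MarshalBinary() ([]byte, error) {
	out := make([]byte, 2+numRegs)
	out[0] = formatVersion
	out[1] = precision
	copy(out[2:], h.regs[:])
	return out, nil
}

// UnmarshalBinary decodes a value produced by MarshalBinary and rejects
// anything with an unexpected length, version, or precision.
func (h *HLL) UnmarshalBinary(data []byte) error {
	if len(data) != 2+numRegs {
		return errors.New("hll: unexpected encoded length")
	}
	if data[0] != formatVersion || data[1] != precision {
		return errors.New("hll: unsupported version or precision")
	}
	copy(h.regs[:], data[2:])
	return nil
}
```

With this layout each sketch costs 4098 bytes per row. Packing registers into 6 bits each would bring that down to about 3 KiB at the cost of a slightly more involved codec. The leading version byte leaves room to change the encoding later without guessing at what is stored.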

Syncing

This section assumes app-level HLLs.

These stats are meant to be heuristics by nature, so it should be okay for the data to be a little stale. It may make sense to have a goroutine running on the side and receiving events over a channel. As it receives events, it will update some in-memory HLLs. Separately, on some cadence (every X minutes, every X events, etc.), the in-memory HLLs should be merged with the HLLs in the database. Since we lack DB-native HLL operations, this syncing will require a read-modify-write pattern where the affected HLLs need to be read, deserialized, merged with the in-memory HLLs, and serialized back out to the DB. There will be a balance in choosing a syncing cadence that is infrequent enough to not cause DB lock contention while also being frequent enough that the stats are not uselessly out of date.
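
The flow above might be sketched in Go as follows. All names here are hypothetical (`Event`, `Store`, `MemStore`, `RunSyncer`), the store is an in-memory stand-in for a bytes column, and events are assumed to carry a precomputed, well-mixed 64-bit hash of the object ID. Merging two HLLs is a register-wise maximum, which is what the read-modify-write step does.

```go
package main

import (
	"math/bits"
	"time"
)

const precision, numRegs = 12, 1 << 12 // assumed size: 2^12 one-byte registers

// Registers holds the dense HLL registers for one sketch.
type Registers [numRegs]uint8

// AddHash records an item given a well-mixed 64-bit hash of it.
func (r *Registers) AddHash(x uint64) {
	idx := x >> (64 - precision)
	rank := uint8(bits.LeadingZeros64(x<<precision|1<<(precision-1))) + 1
	if rank > r[idx] {
		r[idx] = rank
	}
}

// Merge folds other into r by taking the register-wise maximum.
func (r *Registers) Merge(other *Registers) {
	for i := range r {
		if other[i] > r[i] {
			r[i] = other[i]
		}
	}
}

// Event is a hypothetical "relationship written" notification.
type Event struct {
	Key  string // sketch to update, e.g. "document#viewer/subjects"
	Hash uint64 // hash of the object ID, assumed computed by the sender
}

// Store is a hypothetical byte-array store standing in for the datastore.
type Store interface {
	Load(key string) ([]byte, error) // nil data if the key is absent
	Save(key string, data []byte) error
}

// MemStore is an in-memory Store, present only to make the sketch runnable.
type MemStore map[string][]byte

func (s MemStore) Load(key string) ([]byte, error) { return s[key], nil }
func (s MemStore) Save(key string, data []byte) error {
	s[key] = append([]byte(nil), data...)
	return nil
}

// flush does the read-modify-write merge for each pending sketch. A real
// datastore would need a transaction (or row lock) around each merge.
func flush(store Store, pending map[string]*Registers) error {
	for key, mem := range pending {
		raw, err := store.Load(key)
		if err != nil {
			return err
		}
		var stored Registers
		copy(stored[:], raw) // an absent row leaves every register at zero
		stored.Merge(mem)
		if err := store.Save(key, stored[:]); err != nil {
			return err
		}
		delete(pending, key) // merged; keys that failed stay pending for a retry
	}
	return nil
}

// RunSyncer flushes every maxPending events, on each tick, and on shutdown.
func RunSyncer(events <-chan Event, tick <-chan time.Time, store Store, maxPending int) <-chan error {
	done := make(chan error, 1)
	go func() {
		pending, count := map[string]*Registers{}, 0
		for {
			select {
			case ev, ok := <-events:
				if !ok {
					done <- flush(store, pending) // final flush on shutdown
					return
				}
				if pending[ev.Key] == nil {
					pending[ev.Key] = &Registers{}
				}
				pending[ev.Key].AddHash(ev.Hash)
				if count++; count >= maxPending && flush(store, pending) == nil {
					count = 0
				}
			case <-tick:
				if flush(store, pending) == nil {
					count = 0
				}
			}
		}
	}()
	return done
}
```

A caller would pass something like the channel of a `time.Ticker` for `tick`, or `nil` to flush only by event count. Because the merge is a maximum, re-applying the same in-memory sketch after a failed or repeated flush does not change the result, which makes retries cheap to reason about. Contention between multiple SpiceDB nodes writing the same row is not addressed here.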

Deletes

Deletes get tricky and have the opportunity to throw off stats when performed with significant volume. Due to the probabilistic nature of HLLs, you can't really properly "delete" from them. Separate HLLs could be used to track deletes, but it's still ultimately a guess with potential to go wrong. This isn't really unique to SpiceDB, as all DB stats modules have edge cases where the stats tracking may go awry, but it's worth calling out, being aware of, and having some sort of a plan for.
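
To illustrate why the separate-delete-HLL approach degrades with delete volume, here is a small Go sketch. It assumes two independent sketches (one for adds, one for deletes) that each have the same relative standard error `eps`, and uses ordinary error propagation for the difference of two independent estimates. The function names and the rebuild threshold are hypothetical.

```go
package main

import "math"

// NetCardinality estimates live items from separate add and delete
// estimates, clamping at zero because the two estimates can cross.
func NetCardinality(added, deleted uint64) uint64 {
	if deleted >= added {
		return 0
	}
	return added - deleted
}

// NetRelativeError approximates the relative standard error of the net
// estimate, assuming independent sketches with relative error eps each.
func NetRelativeError(added, deleted uint64, eps float64) float64 {
	net := NetCardinality(added, deleted)
	if net == 0 {
		return math.Inf(1)
	}
	// The absolute errors eps*added and eps*deleted add in quadrature.
	return eps * math.Hypot(float64(added), float64(deleted)) / float64(net)
}

// NeedsRebuild reports whether the net estimate has become too noisy and
// the sketches should be rebuilt from the stored relationships.
func NeedsRebuild(added, deleted uint64, eps, maxErr float64) bool {
	return NetRelativeError(added, deleted, eps) > maxErr
}
```

With `eps` at 1.6%, 1000 adds and 100 deletes give a net error of under 2%, while 1000 adds and 900 deletes give roughly 21%, since the absolute error stays about the same while the net count shrinks. This also does not handle an item that is deleted and later re-added: both sketches count it once, so it is still subtracted from the net.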

Other Thoughts?

Is it worth it to consider some equivalent of the ANALYZE command to allow SpiceDB users to force their stats to be updated after major operations? This could be useful for scenarios where large deletes have occurred and thrown off stats.

@bradengroom bradengroom added the kind/proposal Something fundamentally needs to change label Mar 7, 2024
@bradengroom
Contributor Author

Copying over some great context from @josephschorr in the Discord:

could be a process which uses the watch API and updates in the background and out of band
That would prevent multiple writers
Could even be sharded
And if it provided its own grpc API, SpiceDB could call that to get the stats in-memory
Likely a great way to prototype
