Stats Module #1785

bradengroom · 2024-03-07T04:10:17Z

Problem Statement

In order to even attempt cost-based query planning, SpiceDB needs to gather some metrics about the relationships that it is trying to traverse. This proposal is meant to more directly discuss details of a potential stats module that is alluded to in #1573.

Solution Brainstorm

It has been a while since I filed #1573, but I have been noodling on the stats module problem, and I wanted to capture some of my thoughts to see what other thoughts they might generate from other folks.

There are a few high-level problems here that I'll try to ideate on:

Stats to track
HLL storage
Syncing
Deletes

Stats to track

This one is short. The stats module should be flexible enough to track additional metrics over time, but the clearest need is to track cardinality between relationships. As such, the stats module should expose the ability to fetch relationship cardinality estimates in its interface. Tracking the cardinality is touched on in #1573, but I think the clearest path is to use a structure like HLLs to estimate cardinality in a memory-efficient way.

Whenever a relationship is created, an HLL for each end of that relationship would be updated. (More later on deletes)

Additional stats may be integrated as needed.
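
To make the "one HLL per end" idea concrete, here is a minimal sketch in Go. Everything in it is an assumption for illustration rather than existing SpiceDB code: the `HLL`, `RelationStats`, and `Stats` types, the `type#relation` key format, the precision of 12, and the choice of FNV-1a plus a mixing step as the hash. It uses the textbook HyperLogLog estimator with the small-range (linear counting) correction, not a tuned production variant.

```go
package main

import (
	"hash/fnv"
	"math"
	"math/bits"
)

// precision is an assumed setting: 2^12 registers, roughly 1.6% standard error.
const precision = 12
const numRegisters = 1 << precision

// HLL is a minimal HyperLogLog sketch with one byte per register.
type HLL struct {
	registers [numRegisters]uint8
}

// hash64 hashes with FNV-1a, then applies a 64-bit finalizer so the high
// bits used for register selection are well mixed.
func hash64(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	x := h.Sum64()
	x ^= x >> 33
	x *= 0xff51afd7ed558ccd
	x ^= x >> 33
	x *= 0xc4ceb9fe1a85ec53
	x ^= x >> 33
	return x
}

// Add records one element in the sketch.
func (h *HLL) Add(element string) {
	x := hash64(element)
	idx := x >> (64 - precision) // top bits pick the register
	// Rank is the position of the first 1 bit in the remaining bits. The OR
	// sets a sentinel bit so the rank is capped at 64-precision+1.
	rank := uint8(bits.LeadingZeros64(x<<precision|1<<(precision-1))) + 1
	if rank > h.registers[idx] {
		h.registers[idx] = rank
	}
}

// Estimate returns the approximate number of distinct elements added.
func (h *HLL) Estimate() float64 {
	m := float64(numRegisters)
	sum, zeros := 0.0, 0
	for _, r := range h.registers {
		sum += math.Ldexp(1, -int(r)) // 2^-r
		if r == 0 {
			zeros++
		}
	}
	raw := (0.7213 / (1 + 1.079/m)) * m * m / sum
	if raw <= 2.5*m && zeros > 0 {
		return m * math.Log(m/float64(zeros)) // small-range correction
	}
	return raw
}

// RelationStats holds one sketch per end of a relation (an assumed layout).
type RelationStats struct {
	Resources HLL // distinct resource IDs seen on the relation
	Subjects  HLL // distinct subject IDs seen on the relation
}

// Stats maps a hypothetical "type#relation" key to its sketches.
type Stats map[string]*RelationStats

// RecordCreate updates both ends when a relationship is written.
func (s Stats) RecordCreate(relation, resourceID, subjectID string) {
	rs := s[relation]
	if rs == nil {
		rs = &RelationStats{}
		s[relation] = rs
	}
	rs.Resources.Add(resourceID)
	rs.Subjects.Add(subjectID)
}
```

A caller would invoke something like `stats.RecordCreate("document#viewer", "doc:readme", "user:alice")` on each write, and a planner could later compare `Resources.Estimate()` against `Subjects.Estimate()` to guess which direction of a traversal is cheaper.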

HLL storage

This one might be tricky. I see a few options which I'll try to outline exhaustively:

❌ Store HLLs in a DB-native way
Most DBs don't support truly native HLLs, and that is certainly true of the storage backends supported by SpiceDB. Some may require an extension to be installed or might not have a native option at all. Even if extensions existed for each backend, requiring users to install and upgrade special DB extensions is likely an unacceptable user experience for SpiceDB customers.
❌ Stats-specific storage
A dedicated storage option (like Redis) could be pulled in for storing stats. While Redis has fantastic support for HLLs, I personally do not like this approach as it still complicates the SpiceDB deployment story and also comes with some persistence concerns. Adding yet another piece to deploy and maintain is likely unacceptable, so I won't spend more time on it.
(Yes?) App-level HLLs stored as byte arrays
Byte array support is reasonable table stakes for a DB, so the requirement here should be easy to meet: CockroachDB, Spanner, Postgres, and MySQL all support it. In this world, SpiceDB is responsible for operating on the HLLs and serializing them out as bytes within the DB. This adds complexity within SpiceDB itself, but it is likely worth the tradeoff for a simpler deploy/customer experience.
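
For the app-level option, the serialized form could be as simple as a small header followed by the raw registers. The sketch below is one possible layout, not a proposed wire format: the version byte, the precision byte, the accepted precision range of 4 to 18, and the `HLL` type itself are all assumptions made for illustration. It implements Go's standard `encoding.BinaryMarshaler` and `encoding.BinaryUnmarshaler` method shapes so the result can be written to any bytes column.

```go
package main

import "errors"

// encodingVersion lets the layout change later without misreading old rows.
const encodingVersion = 1

// HLL is a stand-in sketch type: a precision plus 2^precision registers.
type HLL struct {
	Precision uint8
	Registers []uint8
}

// NewHLL allocates an empty sketch for the given precision.
func NewHLL(precision uint8) *HLL {
	return &HLL{Precision: precision, Registers: make([]uint8, int(1)<<precision)}
}

// MarshalBinary encodes the sketch as [version][precision][registers...].
func (h *HLL) MarshalBinary() ([]byte, error) {
	out := make([]byte, 0, 2+len(h.Registers))
	out = append(out, encodingVersion, h.Precision)
	return append(out, h.Registers...), nil
}

// UnmarshalBinary decodes a payload produced by MarshalBinary and rejects
// anything whose header and length do not agree.
func (h *HLL) UnmarshalBinary(data []byte) error {
	if len(data) < 2 {
		return errors.New("hll: payload too short")
	}
	if data[0] != encodingVersion {
		return errors.New("hll: unknown encoding version")
	}
	precision := data[1]
	if precision < 4 || precision > 18 {
		return errors.New("hll: precision out of range")
	}
	if len(data)-2 != int(1)<<precision {
		return errors.New("hll: register count does not match precision")
	}
	h.Precision = precision
	h.Registers = append([]uint8(nil), data[2:]...) // copy; do not alias the input
	return nil
}
```

At precision 12 this is 4098 bytes per sketch, which fits comfortably in a `bytea`, `BYTES`, or `BLOB` column. A denser encoding (for example 6 bits per register) is possible but is left out here to keep the layout obvious.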

Syncing

This section assumes app-level HLLs.

These stats are meant to be heuristic by nature, so it should be okay for the data to be a little stale. It may make sense to have a goroutine running on the side and receiving events over a channel. As it receives events, it will update some in-memory HLLs. Separately, on some cadence (X minutes, X events, etc), the in-memory HLLs should be merged with the HLLs in the database. Since we lack DB-native HLL operations, this syncing will require a read-modify-write pattern where the stored HLLs need to be read, deserialized, merged with the in-memory HLLs, and serialized back out to the DB. There will be a balance in choosing a syncing cadence: infrequent enough to not cause DB lock contention, but frequent enough that the stats are not uselessly out of date.
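
The flow described above might look roughly like the following. This is a sketch under stated assumptions, not a design: `Sketch`, `Event`, `Store`, `MemStore`, `Run`, and `defaultFlushInterval` are hypothetical names, the flush is time-based only, and `MemStore` stands in for a real datastore where `Load` and `Save` would deserialize and serialize bytes inside a transaction. The key property is that merging two HLLs is a register-wise maximum, so the read-modify-write never loses information that another writer already stored.

```go
package main

import (
	"hash/fnv"
	"math/bits"
	"time"
)

const precision = 10 // assumed: 1024 registers per sketch

// defaultFlushInterval is a placeholder cadence, not a recommendation.
const defaultFlushInterval = 5 * time.Minute

// Sketch holds HLL registers.
type Sketch [1 << precision]uint8

// Add records one element (FNV-1a plus a mixing step, as an assumed hash).
func (s *Sketch) Add(element string) {
	h := fnv.New64a()
	h.Write([]byte(element))
	x := h.Sum64()
	x ^= x >> 33
	x *= 0xff51afd7ed558ccd
	x ^= x >> 33
	idx := x >> (64 - precision)
	rank := uint8(bits.LeadingZeros64(x<<precision|1<<(precision-1))) + 1
	if rank > s[idx] {
		s[idx] = rank
	}
}

// Merge folds other into s by taking the register-wise maximum.
func (s *Sketch) Merge(other *Sketch) {
	for i, r := range other {
		if r > s[i] {
			s[i] = r
		}
	}
}

// Event says that Element was seen for the sketch identified by Key.
type Event struct{ Key, Element string }

// Store abstracts the datastore. A real implementation would run Load and
// Save for a key inside one transaction to keep the read-modify-write safe.
type Store interface {
	Load(key string) (Sketch, error)
	Save(key string, s Sketch) error
}

// MemStore is an in-memory stand-in for the datastore, for demonstration only.
type MemStore map[string]Sketch

func (m MemStore) Load(key string) (Sketch, error) { return m[key], nil }
func (m MemStore) Save(key string, s Sketch) error { m[key] = s; return nil }

// flush merges every pending in-memory sketch into its stored counterpart.
func flush(store Store, pending map[string]*Sketch) error {
	for key, local := range pending {
		stored, err := store.Load(key) // read
		if err != nil {
			return err
		}
		stored.Merge(local) // modify
		if err := store.Save(key, stored); err != nil { // write
			return err
		}
		delete(pending, key)
	}
	return nil
}

// Run consumes events until the channel is closed, flushing on every tick
// and once more on shutdown.
func Run(events <-chan Event, store Store, interval time.Duration) error {
	pending := map[string]*Sketch{}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case ev, ok := <-events:
			if !ok {
				return flush(store, pending)
			}
			if pending[ev.Key] == nil {
				pending[ev.Key] = &Sketch{}
			}
			pending[ev.Key].Add(ev.Element)
		case <-ticker.C:
			if err := flush(store, pending); err != nil {
				return err
			}
		}
	}
}
```

Writers would start this once with `go Run(events, store, defaultFlushInterval)` and send an `Event` per created relationship end. An event-count trigger could be added as a second condition next to the ticker if time alone proves too coarse.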

Deletes

Deletes get tricky and have the opportunity to throw off stats when performed with significant volume. Due to the probabilistic nature of HLLs, you can't really "delete" an element from one. Separate HLLs could be used to track deletes, but it's still ultimately a guess with potential to go wrong. This isn't really unique to SpiceDB, as all DB stats modules have edge cases where the stats tracking may go awry, but it's worth calling out, being aware of, and having some sort of a plan for.
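
One way to picture the "separate HLL for deletes" idea is as a subtraction of two estimates, with a guard for when the result stops being trustworthy. The function names and the 25% threshold below are hypothetical and only meant to show the shape of such a plan.

```go
package main

// maxDeleteRatio is a hypothetical threshold: once deletes exceed this
// fraction of creates, the net estimate is treated as unreliable.
const maxDeleteRatio = 0.25

// NetEstimate subtracts the delete sketch's estimate from the create
// sketch's estimate, clamping at zero because both inputs are approximate.
func NetEstimate(created, deleted float64) float64 {
	if deleted >= created {
		return 0
	}
	return created - deleted
}

// NeedsRebuild reports whether the stats should be recomputed from the
// stored relationships instead of trusting the subtraction.
func NeedsRebuild(created, deleted float64) bool {
	if created <= 0 {
		return deleted > 0
	}
	return deleted/created > maxDeleteRatio
}
```

The subtraction is wrong in at least one easy case: if a relationship is deleted and then written again, the create HLL does not change (it had already seen that element) while the delete HLL still counts it, so the net estimate is too low. That is the kind of drift a rebuild would be needed to correct.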

Other Thoughts?

Is it worth considering some equivalent of the ANALYZE command to allow SpiceDB users to force their stats to be updated after major operations? This could be useful for scenarios where large deletes have occurred and thrown off stats.

bradengroom · 2024-03-07T14:04:17Z

Copying over some great context from @josephschorr in the Discord

could be a process which uses the watch API and updates in the background and out of band
That would prevent multiple writers
Could even be sharded
And if it provided its own grpc API, SpiceDB could call that to get the stats in-memory
Likely a great way to prototype

bradengroom · 2024-03-15T03:44:20Z

More context in thread here
https://discord.com/channels/844600078504951838/900405749405089812/1215155499264385045

bradengroom added the kind/proposal label (Something fundamentally needs to change) Mar 7, 2024

bradengroom mentioned this issue Mar 7, 2024

Proposal: Query Planner #1573

Open

josephschorr mentioned this issue Apr 19, 2024

Proposal: Add relationship count API #1860

Open
