
Proposal: Query Planner #1573

Open
bradengroom opened this issue Oct 10, 2023 · 9 comments
Labels
area/perf (Affects performance or scalability) · kind/proposal (Something fundamentally needs to change) · state/needs discussion (This can't be worked on yet)

Comments

@bradengroom
Contributor

bradengroom commented Oct 10, 2023

Problem Statement

Please redirect me if this isn't the best place to gather feedback on an idea that is not fully formed yet. I'm looking for input from deeper SpiceDB experts on an idea I might be interested in exploring, if there's interest.

The first question worth answering is "why would SpiceDB need a query planner at all?". At the end of the day, SpiceDB's queries are just trying to determine if a path exists between two points. Currently, SpiceDB performs all path finding starting from resource->subject even though there may be some queries that are more performant when traversing subject->resource. While SpiceDB tuples represent a directed graph, that does not preclude SpiceDB's query engine from traversing the graph backwards via database indexes.

Why would we ever want to traverse backwards? (subject->resource)
I think it comes down to the cardinality of relationships and how much data must be loaded from the data store. Let's consider a simple scenario where there's a many-to-many relationship between entities A, B, and C:
*(diagram: many-to-many relationships between A, B, and C)*

We want to know if there's a path from A:1 to C:1 through B. We have two options for fulfilling this query:

  1. Traverse A->B

    a_to_b_tuples = """
        SELECT ... FROM tuples
        WHERE res_type = 'A' AND res_id = '1'
        AND relation = 'a_to_b_relation'
        AND subj_type = 'B';
    """
    
  2. Traverse B->C

    // in batches of 100
    b_to_c_tuples = """
        SELECT ... FROM tuples
        WHERE res_type = 'B' AND res_id IN (a_to_b_tuples_subject_ids)
        AND relation = 'b_to_c_relation'
        AND subj_type = 'C'
        AND subj_id = '1';
    """
    
  3. Access granted if b_to_c_tuples is not empty.

But what happens if A:1 has many a_to_b_relation tuples? What if there are a million? Then step 2 will chug along, making many requests (with some concurrency factor) to sift through the many A->B tuples.

What if the cardinality happens to be much better going the other way? Each A has many Bs, but each B only has 1 A. It could potentially be much faster to:

  1. Traverse (backwards) C->B

    b_to_c_tuples = """
        SELECT ... FROM tuples
        WHERE res_type = 'B'
        AND relation = 'b_to_c_relation'
        AND subj_type = 'C'
        AND subj_id = '1';
    """
    
    
  2. Traverse (backwards) B->A

    // in batches of 100
    a_to_b_tuples = """
        SELECT ... FROM tuples
        WHERE res_type = 'A' AND res_id = '1'
        AND relation = 'a_to_b_relation'
        AND subj_type = 'B' and subj_id IN (b_to_c_tuples_resource_ids);
    """
    
  3. Access granted if a_to_b_tuples is not empty.

This could potentially be much faster if the relationships have lower cardinality when traversing backwards.
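To make the asymmetry concrete, here's a toy round-trip count (illustrative only: the batch size of 100 matches the pseudocode above, and the cardinalities are assumed, not measured):

```python
import math

BATCH_SIZE = 100  # matches the "in batches of 100" in the pseudocode above

def forward_round_trips(a_to_b_fanout: int) -> int:
    """Round trips for A->B->C: one A->B query, then batched B->C queries."""
    return 1 + math.ceil(a_to_b_fanout / BATCH_SIZE)

def backward_round_trips(c_to_b_fanin: int) -> int:
    """Round trips for C->B->A: one C->B query, then batched B->A queries."""
    return 1 + math.ceil(c_to_b_fanin / BATCH_SIZE)

# Assumed cardinalities: A:1 points at 1,000,000 Bs, but only 5 Bs point at C:1.
print(forward_round_trips(1_000_000))  # 10001
print(backward_round_trips(5))         # 2
```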

If a proper query-planning layer existed, it could try to solve this problem, and it could potentially be extended to perform other optimizations that we think of later.

Open Questions / Problems

Determining Relation Cardinality

One idea would be to use HLLs (HyperLogLog sketches) to cheaply maintain an estimate of how many unique entries we see on each side of a relation. So if two tuples are written, A:1#a_to_b_relation@B:1 and A:1#a_to_b_relation@B:2, we would end up with two HLLs:
a_to_b_relation_a_hll.Len() => 1
a_to_b_relation_b_hll.Len() => 2

At scale, these numbers could provide insight on the cardinality between the two and hint at an optimal traversal direction. Using these cardinality values between relations, we can weigh our options. Some heuristic would need to be established if there are multiple paths from A->C.

You'd want to persist this information in the datastore somehow and have nodes only leverage the information once a substantial number of data points have been gathered. All nodes need to be on the same page.
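As a sketch of the HLL idea, here is a minimal HyperLogLog (a toy implementation for illustration; a real deployment would use a proper library and persist the sketches as described above):

```python
import hashlib
import math

class HLL:
    """Toy HyperLogLog: estimates distinct count using 2^p small registers."""

    def __init__(self, p: int = 10):
        self.p = p
        self.m = 1 << p              # number of registers
        self.registers = [0] * self.m

    def add(self, value: str) -> None:
        h = int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                      # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining 64-p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self) -> int:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:             # small-range correction
            est = self.m * math.log(self.m / zeros)
        return round(est)

# Two tuples written: A:1#a_to_b_relation@B:1 and A:1#a_to_b_relation@B:2
a_hll, b_hll = HLL(), HLL()
for res_id, subj_id in [("1", "1"), ("1", "2")]:
    a_hll.add(res_id)    # resource side of the relation
    b_hll.add(subj_id)   # subject side of the relation
print(a_hll.count(), b_hll.count())  # roughly 1 and 2, as in the example above
```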

Query Planning Latency

How would a query planning step be introduced without adding significant latency to the queries? We have to be able to pay off the time we spend planning!

Fortunately, SpiceDB is purpose-built and doesn't have an infinite number of queries to serve up. A given schema only captures so many paths and clients likely only use a subset of paths that a schema is capable of. The first time SpiceDB encounters a query for some A:X#a_to_b_relation@B:Y path, the plan may be cached and reused for all queries of its shape.
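Sketching that caching idea (all names here are hypothetical, not SpiceDB's actual internals): the cache key is the query's shape with concrete ids stripped, so planning cost is paid once per shape:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QueryShape:
    """Hypothetical cache key: a check's shape, with concrete ids stripped."""
    resource_type: str
    relation: str
    subject_type: str

class PlanCache:
    def __init__(self, plan_fn):
        self._plan_fn = plan_fn   # the (expensive) planner
        self._plans = {}
        self.planner_calls = 0

    def plan_for(self, shape):
        if shape not in self._plans:
            self.planner_calls += 1            # planning cost paid once per shape
            self._plans[shape] = self._plan_fn(shape)
        return self._plans[shape]

cache = PlanCache(lambda s: f"traverse {s.resource_type}->{s.subject_type}")
shape = QueryShape("A", "a_to_b_relation", "B")
cache.plan_for(shape)
cache.plan_for(shape)       # all later queries of this shape reuse the plan
print(cache.planner_calls)  # 1
```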

Cache Hit Rate

This is likely the biggest concern. Once a query plan is chosen for a particular query type, we likely need to stick to that query plan (unless there's a REALLY good reason not to). Otherwise the cache hit rate will likely be terrible. We can't be too dynamic in our execution paths. Even so, this would require some experimentation to evaluate how the idea behaves at scale.

Indexing & Data Size

I would need to investigate further, but my hope is that much of the indexing required may already exist, since SpiceDB already offers APIs for traversing both directions (LookupResources & LookupSubjects). If not, the data-size increase from additional indexes could be a concern.

Query Hints?

An MVP of this could involve allowing customers to specify query hints in their schema somehow (query hints shouldn't be accepted at query time because of the cache hit rate problem; we don't want people flipping back and forth).

Other Crazy Ideas

Wouldn't it be crazy if we traversed from both sides at once and tried to meet in the middle? I have no idea how that might work right now, but the idea is an interesting future convo. 🙂 (out of scope here)
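For what it's worth, the textbook version of meeting in the middle is bidirectional BFS. A purely illustrative sketch over plain adjacency maps (not SpiceDB tuples):

```python
from collections import deque

def bidirectional_reachable(fwd, rev, source, target):
    """Alternate BFS frontiers from both ends; stop when they meet.

    fwd maps node -> outgoing neighbors, rev maps node -> incoming neighbors.
    """
    if source == target:
        return True
    seen_s, seen_t = {source}, {target}
    q_s, q_t = deque([source]), deque([target])
    while q_s and q_t:
        # expand the smaller frontier first
        if len(q_s) <= len(q_t):
            q, seen, other, adj = q_s, seen_s, seen_t, fwd
        else:
            q, seen, other, adj = q_t, seen_t, seen_s, rev
        for _ in range(len(q)):
            node = q.popleft()
            for nxt in adj.get(node, ()):
                if nxt in other:
                    return True    # the two frontiers met in the middle
                if nxt not in seen:
                    seen.add(nxt)
                    q.append(nxt)
    return False

graph = {"A:1": ["B:1", "B:2"], "B:1": ["C:1"]}
reverse = {"B:1": ["A:1"], "B:2": ["A:1"], "C:1": ["B:1"]}
print(bidirectional_reachable(graph, reverse, "A:1", "C:1"))  # True
```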

Other optimizations?...

@bradengroom bradengroom added the kind/proposal Something fundamentally needs to change label Oct 10, 2023
@ecordell
Contributor

Nice write-up, @bradengroom!

A query planner is an idea that we've thrown around a few times but has never made it into a concrete proposal. This is a great place to start jotting ideas.

Other places we've talked about query planning being potentially valuable:

  • Any time there is an intersection or exclusion, knowing cardinality to plan a query can short-circuit the bigger branches
  • Preferring branches that are known to be caveat-free when no context is provided
  • Preferring paths that are more likely to be cached
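The first bullet could look roughly like this (a hypothetical helper, assuming per-branch cardinality estimates are already available):

```python
def check_intersection(branches, estimated_size, is_member):
    """Evaluate intersection branches smallest-estimate first.

    If a small branch already rules the subject out, the bigger
    branches are never evaluated at all.
    """
    for branch in sorted(branches, key=estimated_size):
        if not is_member(branch):
            return False      # short-circuit: skip the remaining branches
    return True

evaluated = []
def is_member(branch):
    evaluated.append(branch)    # record which branches we had to touch
    return branch != "small"    # the subject is missing from "small"

sizes = {"huge": 1_000_000, "small": 3}
result = check_intersection(["huge", "small"], sizes.get, is_member)
print(result, evaluated)  # False ['small'] -- "huge" was never evaluated
```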

Once a query plan is chosen for a particular query type, we likely need to stick to that query plan (unless there's a REALLY good reason not to).

Ideally we could decompose query plans into small subproblems that would be shared between many types of queries / plans so that this is not as much of an issue.

Wouldn't it be crazy if we traversed from both sides at once and tried to meet in the middle?

Have you considered starting in the middle and going outward? 😄

@bradengroom
Contributor Author

bradengroom commented Oct 10, 2023

Ideally we could decompose query plans into small subproblems that would be shared between many types of queries / plans so that this is not as much of an issue.

If that's true then that's much more interesting. You could potentially change your course of action halfway through a query.

"I started traversing from the subject, but my estimate was off by an order of magnitude. Maybe I should start traversing from the resource and try to meet myself where I've made it so far"

That also opens the door for request-time query hints (if we like the idea of query hints at all)

@bradengroom
Contributor Author

bradengroom commented Oct 10, 2023

@ecordell @josephschorr

Thanks for the feedback!

As I think on this, I believe this might be too large a change to hash out in a single thread. I'm seeing 3 distinct threads that this work can be split into:

  • Stats Module (#1785)
    • What stats do we track?
    • How do we track them efficiently?
    • How do we store them in the data stores?
    • How does this piece fit into the rest of the code?
  • Physical Plan
    • What is the data structure used to represent the query plan?
    • How do we pass it through dispatch?
  • Cost Model
    • After we've generated multiple physical plans for a given query, how do we assign each a cost?

There's also likely a thread to spin up around an optimization layer. How do we take a physical plan and transform it into something more efficient? That can likely come later though once the code base has the concept of a query plan in place at all.

Since it sounds like the idea of a query planner already has buy-in from your internal discussions, I'd like to take a piece and start designing it with more concrete details if that's alright. Do y'all have any issues with me spinning up a new, concrete proposal for the stats module? I don't want to overwhelm folks with proposal docs if this isn't the usual flow. I'm just eager to contribute 😄 I'm still exploring the code, so this will likely take some time.

@josephschorr
Member

There's also likely a thread to spin up around an optimization layer. How do we take a physical plan and transform it into something more efficient? That can likely come later though once the code base has the concept of a query plan in place at all.

Designs have been thrown around on an IR to do just that, but it's still very much in the ideation phase.

Since it sounds like the idea of a query planner already has buy-in from your internal discussions, I'd like to take a piece and start designing it with more concrete details if that's alright. Do y'all have any issues with me spinning up a new, concrete proposal for the stats module?

No issue with doing so, but it is not a priority right now either, so just be aware that response times can and likely will be slower on any such designs or PRs.

Stats is probably the best place to start with a goal of getting good stats without impacting read or write performance, which likely means a background worker updating them async.
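That background-worker shape might be sketched like this (purely illustrative; the real version would persist stats to the datastore and coordinate across nodes):

```python
import threading

class StatsWorker:
    """Recompute relation stats off the hot path, on a fixed interval."""

    def __init__(self, compute_stats, interval_s=60.0):
        self._compute_stats = compute_stats
        self._interval_s = interval_s
        self._stop = threading.Event()
        self.latest = None   # reads/writes never wait; they just see self.latest

    def run_once(self):
        self.latest = self._compute_stats()

    def _loop(self):
        # Event.wait doubles as an interruptible sleep
        while not self._stop.wait(self._interval_s):
            self.run_once()

    def start(self):
        threading.Thread(target=self._loop, daemon=True).start()

    def stop(self):
        self._stop.set()

worker = StatsWorker(lambda: {"a_to_b_relation": {"a_side": 1, "b_side": 2}})
worker.run_once()  # in production this runs in the background thread instead
print(worker.latest)
```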

@josephschorr
Member

@bradengroom Any updates on the stats side? Just curious :)

@bradengroom
Contributor Author

@bradengroom Any updates on the stats side? Just curious :)

I have not, unfortunately. The demands of my day job have picked up as we've gone deeper into Q4.

@jzelinskie jzelinskie added the area/perf Affects performance or scalability label Mar 6, 2024
@jzelinskie
Member

There are two types of operations that query planners can do:

  • Heuristic/Rule based rewriting
  • Cost-based search

So far we've mostly been discussing cost-based planning, but I actually think it makes sense to first consolidate the few places where we're already performing heuristic/rule rewrites into a clean interface that can be the foundation of the planner.
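A consolidated rule interface could be as simple as pure plan-to-plan functions. A sketch with a hypothetical tuple-based plan representation (not SpiceDB's actual internals):

```python
# Plans modeled as nested tuples: ("union", *children) or ("leaf", name).
# Example rule: collapse union(union(a, b), c) into union(a, b, c).
def flatten_unions(plan):
    if plan[0] != "union":
        return plan
    children = []
    for child in map(flatten_unions, plan[1:]):   # rewrite bottom-up
        if child[0] == "union":
            children.extend(child[1:])            # splice nested union in place
        else:
            children.append(child)
    return ("union", *children)

def apply_rules(plan, rules):
    """Each rule is a pure plan -> plan function; apply them in order."""
    for rule in rules:
        plan = rule(plan)
    return plan

nested = ("union", ("union", ("leaf", "a"), ("leaf", "b")), ("leaf", "c"))
print(apply_rules(nested, [flatten_unions]))
# ('union', ('leaf', 'a'), ('leaf', 'b'), ('leaf', 'c'))
```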

@jzelinskie jzelinskie added the state/needs discussion This can't be worked on yet label Mar 6, 2024
@bradengroom
Contributor Author

@jzelinskie agreed that might be a good first step to refactor towards the larger change. Is there a good way of hunting these rewrites down? Or does someone just need to trace the code and compile a list of such rewrites? I assume it’d all be in the check dispatch path.

@croemmich

This would be a big win for us. Nearly all of our evaluations, especially LookupResources, would be significantly faster with the traversal direction inverted.

We loosely followed the https://authzed.com/blog/google-cloud-iam-modeling example when creating our model. In some cases we have hundreds of thousands or millions of granted role_binding relationships, which causes things to really chug. It's compounded when the role_binding user is a group which needs to be resolved as well. We had to remove a layer from our model to make things perform "well-enough", but it would be significantly faster with a query planner inverting the traversal.

@bradengroom bradengroom changed the title from "Query Planner (Feedback Requested)" to "Proposal: Query Planner" May 8, 2024