
Proposal: Add relationship count API #1860

Open
ryaneorth opened this issue Apr 8, 2024 · 7 comments
Labels
kind/proposal Something fundamentally needs to change

Comments

@ryaneorth
Contributor

ryaneorth commented Apr 8, 2024

See below for the current proposal


Problem Statement

We are migrating our data from a legacy authorization system into SpiceDB and would like a way to easily verify the relationship counts in order to validate the accuracy of our data. Specifically, we would like to be able to do this in the following scenarios:

  1. After hydrating data to verify counts match our legacy system
  2. Periodically (e.g. daily) to ensure the real-time updates flowing from our legacy system to our SpiceDB instance have kept the two in sync, allowing us to investigate further if this is not the case

We can currently solve scenario 1 above by subscribing to the Watch API while performing a hydration and maintaining the counts internally. For scenario 2 we can issue a filtered ReadRelationships call for each relationship filter, but this requires retrieving more data than necessary from the database since we only care about the counts in the end. The ReadRelationships API also has the limitation of requiring us to make the calls in chunks of 1000, which forces us to process things serially.

In both cases, having a relationship count API that simply performs a SELECT count(*) FROM relation_tuple WHERE ... query against the database would be much more efficient and user friendly.
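To make the shape of that query concrete, here is a minimal sketch using an in-memory SQLite stand-in for the relation_tuple table (column names are taken from the EXPLAIN output later in this thread; the real datastore schemas differ per backend):

```python
import sqlite3

# Toy stand-in for SpiceDB's relation_tuple table; illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE relation_tuple (
        namespace TEXT, object_id TEXT, relation TEXT,
        userset_namespace TEXT, userset_object_id TEXT, userset_relation TEXT
    )
""")
rows = [
    ("user", "abc", "manager", "user", "u1", "..."),
    ("user", "abc", "manager", "user", "u2", "..."),
    ("user", "xyz", "viewer", "user", "u3", "..."),
    ("document", "d1", "viewer", "user", "u1", "..."),
]
conn.executemany("INSERT INTO relation_tuple VALUES (?, ?, ?, ?, ?, ?)", rows)

# The proposed API would boil down to a parameterized count over the
# columns addressed by a RelationshipFilter.
count = conn.execute(
    "SELECT count(*) FROM relation_tuple WHERE namespace = ? AND object_id = ?",
    ("user", "abc"),
).fetchone()[0]
print(count)  # → 2
```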

Solution Brainstorm

Implement a CountRelationships API with the following spec:

rpc CountRelationships(CountRelationshipsRequest)
  returns (stream CountRelationshipsResponse) {
    option (google.api.http) = {
      post: "/v1/relationships/count"
      body: "*"
    };
}

message CountRelationshipsRequest {
  Consistency consistency = 1;

  // relationship_filter defines the filter to be applied to the relationships
  // to be counted.
  RelationshipFilter relationship_filter = 2
      [ (validate.rules).message.required = true ];
}

message CountRelationshipsResponse {
  // read_at is the ZedToken at which the relationship count was performed
  ZedToken read_at = 1 [ (validate.rules).message.required = true ];

  uint64 relationship_count = 2;
}
@ryaneorth ryaneorth added the kind/proposal Something fundamentally needs to change label Apr 8, 2024
@ryaneorth ryaneorth changed the title Add relationship count API Data Validation: Add relationship count API Apr 10, 2024
@josephschorr
Member

josephschorr commented Apr 16, 2024

My concerns with this idea are two-fold:

  1. count(*) operations can be extremely heavy on databases, especially in CRDB where they are essentially linear. I believe this would be somewhat mitigated by the query only hitting the index(es) we already have, but I'm not convinced it wouldn't become a large issue, especially if the API is invoked multiple times a minute (or worse, many times per second)
  2. What is the intention if the found count differs from that expected? At that point, there is some sort of data inconsistency, but unlike the watch solution, there is no means to determine exactly what is missing/broken/wrong. Would the intent then be to delete all matching relationships and re-hydrate?

If the intention is to only have this used during the initial hydration and subsequently for a small window, I believe having a service which consumes the Watch API and keeps a real-time count (bucketed by whatever filters are useful) might be a better solution. It could be an open source solution as well: configurable from the command line to take in filters, keeping a real-time count and, using our ability to go backwards in time with Watch, able to "pick up" where it left off if the watcher goes down
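The watch-consuming counter described above could be sketched roughly as follows (the event and relationship shapes here are invented for illustration and are much simpler than the real Watch API):

```python
from collections import defaultdict

class WatchCounter:
    """Illustrative in-memory counter fed by watch-style events.

    Filters are named predicates over a relationship; 'touch' adds a
    relationship and 'delete' removes one. Not SpiceDB code.
    """
    def __init__(self, filters):
        self.filters = filters          # name -> predicate over a relationship
        self.counts = defaultdict(int)  # name -> current count
        self.revision = None            # last revision seen, to resume from

    def apply(self, event):
        delta = 1 if event["op"] == "touch" else -1
        for name, predicate in self.filters.items():
            if predicate(event["relationship"]):
                self.counts[name] += delta
        self.revision = event["revision"]

counter = WatchCounter({"user_rels": lambda r: r["resource_type"] == "user"})
counter.apply({"op": "touch", "revision": 1,
               "relationship": {"resource_type": "user", "id": "abc"}})
counter.apply({"op": "touch", "revision": 2,
               "relationship": {"resource_type": "document", "id": "d1"}})
print(counter.counts["user_rels"])  # → 1
```

Because the last revision is stored with the counts, a restarted watcher can resume the watch from that revision rather than recounting from scratch, which is the "pick up" behavior described above.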

@ryaneorth
Contributor Author

Thanks for the response @josephschorr. I'll do my best to answer your questions:

count(*) operations can be extremely heavy on databases, especially in CRDB where they are essentially linear (cockroachdb/cockroach#7825). I believe this would be somewhat mitigated by the query only hitting the index(es) we already have, but I'm not convinced it wouldn't become a large issue, especially if the API is invoked multiple times a minute (or worse, many times per second)

This is a legitimate concern, but I do think the existing indexes will help here quite a bit. I ran several different count(*) queries that could be formulated from the different RelationshipFilters that might theoretically be provided to this API:

  • RelationshipFilter: resource_type: user

    > explain select count(*) from relation_tuple where namespace='user';                                                                                     
                                               info
    ------------------------------------------------------------------------------------------
      distribution: full
      vectorized: true
    
      • group (scalar)
      │ estimated row count: 1
      │
      └── • scan
            estimated row count: 1,495,925 (0.75% of the table; stats collected 21 days ago)
            table: relation_tuple@pk_relation_tuple
            spans: [/'user' - /'user']
    (10 rows)
    
     > select count(*) from relation_tuple where namespace='user';                                                                                             
       count
    -----------
      1371089
    (1 row)
    
    Time: 757ms total (execution 708ms / network 49ms)
    
  • RelationshipFilter: resource_type: user, optional_resource_id: abc

      > explain select count(*) from relation_tuple where namespace='user' AND object_id='abc';   
                                                                                                  info
      --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
        distribution: full
        vectorized: true
    
        • group (scalar)
        │ estimated row count: 1
        │
        └── • scan
              estimated row count: 0 (<0.01% of the table; stats collected 21 days ago)
              table: relation_tuple@pk_relation_tuple
              spans: [/'user'/'abc' - /'user'/'abc']
      (10 rows)
    
      Time: 32ms total (execution 1ms / network 31ms)
    
      > select count(*) from relation_tuple where namespace='user' AND object_id='a';           
        count
      ---------
            2
      (1 row)
    
      Time: 35ms total (execution 3ms / network 32ms)
    
  • RelationshipFilter: resource_type: user, optional_resource_id_prefix: a

      > explain select count(*) from relation_tuple where namespace='user' AND object_id LIKE 'a%';                              
                                                              info
    --------------------------------------------------------------------------------------------------------------------------
      distribution: full
      vectorized: true
    
      • group (scalar)
      │ estimated row count: 1
      │
      └── • scan
            estimated row count: 17,790 (<0.01% of the table; stats collected 21 days ago)
            table: relation_tuple@pk_relation_tuple
            spans: [/'user'/'a' - /'user'/'b')
    (10 rows)
    
    Time: 35ms total (execution 1ms / network 34ms)
    
    > select count(*) from relation_tuple where namespace='user' AND object_id LIKE 'a%';                                      
      count
    ---------
        380
    (1 row)
    
    Time: 35ms total (execution 4ms / network 31ms)
    
    
  • RelationshipFilter: resource_type: user, optional_resource_id_prefix: a, relation: manager

      > explain select count(*) from relation_tuple where namespace='user' AND object_id LIKE 'a%' AND relation='manager';                                     
                                              info
    --------------------------------------------------------------------------------------------
    distribution: full
    vectorized: true
    
    • group (scalar)
    │ estimated row count: 1
    │
    └── • filter
        │ estimated row count: 0
        │ filter: relation = 'manager'
        │
        └── • scan
              estimated row count: 206,272 (0.10% of the table; stats collected 21 days ago)
              table: relation_tuple@pk_relation_tuple
              spans: [/'user'/'a'/'manager' - /'user'/'b')
    
    index recommendations: 1
    1. type: index creation
      SQL command: CREATE INDEX ON "spicedb-production".public.relation_tuple (relation);
    (18 rows)
    
    Time: 38ms total (execution 6ms / network 32ms)
    
    > select count(*) from relation_tuple where namespace='user' AND object_id LIKE 'a%' AND relation='manager';                                             
    count
    ---------
      106
    (1 row)
    
    Time: 56ms total (execution 3ms / network 53ms)
    
  • RelationshipFilter: resource_type: user, optional_resource_id_prefix: a, relation: manager, subject_type: user

      > explain select count(*) from relation_tuple where namespace='user' AND object_id LIKE 'a%' AND relation='manager' AND userset_namespace='user'; 
                                                              info
      ---------------------------------------------------------------------------------------------------------------
      distribution: full
      vectorized: true
    
      • group (scalar)
      │ estimated row count: 1
      │
      └── • filter
          │ estimated row count: 3
          │ filter: (relation = 'manager') AND (userset_namespace = 'user')
          │
          └── • scan
                estimated row count: 1,353,438 (0.68% of the table; stats collected 21 days ago)
                table: relation_tuple@pk_relation_tuple
                spans: [/'user'/'a'/'manager'/'user' - /'user'/'b')
    
      index recommendations: 1
      1. type: index creation
        SQL command: CREATE INDEX ON "spicedb-production".public.relation_tuple (namespace, relation, userset_namespace);
    (18 rows)
    
    Time: 47ms total (execution 6ms / network 41ms)
    
    > select count(*) from relation_tuple where namespace='user' AND object_id LIKE 'a%' AND relation='manager' AND userset_namespace='user';         
      count
    ---------
      1850
    (1 row)
    
    Time: 46ms total (execution 15ms / network 31ms)
    

As a note, there are a total of 202,626,032 relations in this CockroachDB database. While I recognize that the frequency of different values may lead to varying performance, overall the indexes already in place vastly reduced the search space.

Furthermore, it would be acceptable in my mind to execute these queries with AS OF SYSTEM TIME follower_read_timestamp(), ensuring the request can be served by any available replica. Because this is more of a "stats" API, I personally don't think exact, up-to-date numbers are strictly required.

If this really does become a problem where people are abusing this API, I think adding some form of rate limiting is reasonable, and that is probably something that should be considered for the APIs in general at some point.

What is the intention if the found count differs from that expected? At that point, there is some sort of data inconsistency, but unlike the watch solution, there is no means to determine exactly what is missing/broken/wrong. Would the intent then be to delete all matching relationships and re-hydrate?

It depends on the scenario. If it is a systemic issue then a bulk export on a subset of the data is likely required to investigate the root cause. If it is a one-off issue, then dehydration/rehydration of the data is likely the simplest fix.

@josephschorr
Member

So it occurred to us that issue #1785 might be an avenue to solve this problem:

  1. We add a registration API called TrackRelationshipCount (and UntrackRelationshipCount) that takes in a relationships filter
  2. The CountRelationships API can only be called once the tracking/registration API has been invoked.
  3. We start with the CountRelationships API temporarily invoking count(*), but with registration still required.
  4. We build the stats service (Stats Module #1785) to provide not only internal stats stored using HLLs but also the real counts for any registered filters using the above API. Once the stats service is ready, we transition the CountRelationships API from the temporary count(*) implementation to the stats API
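The register-before-count gate in steps 1–3 could look roughly like this (class and method names are illustrative, not the real implementation; the configurable limit anticipates the guardrail discussed later in the thread):

```python
class CounterRegistry:
    """Sketch of the register-before-count gate described above."""
    def __init__(self, max_counters=100):  # hypothetical configurable guardrail
        self.max_counters = max_counters
        self.registered = set()

    def register(self, filter_key):
        if len(self.registered) >= self.max_counters:
            raise RuntimeError("too many registered counters")
        self.registered.add(filter_key)

    def unregister(self, filter_key):
        self.registered.discard(filter_key)

    def count(self, filter_key, run_count_query):
        # CountRelationships is only valid for a pre-registered filter.
        if filter_key not in self.registered:
            raise LookupError("filter not registered")
        # First pass: fall through to a count(*) against the datastore;
        # later, this would dispatch to the stats service instead.
        return run_count_query(filter_key)

registry = CounterRegistry(max_counters=2)
registry.register("user-manager")
total = registry.count("user-manager", lambda key: 42)  # stubbed datastore count
print(total)  # → 42
```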

@josephschorr josephschorr changed the title Data Validation: Add relationship count API Proposal: Add relationship count API May 2, 2024
@josephschorr
Member

Codifying this into a proposal:

CountRelationships Proposal

Add the ability to track and retrieve the count of relationships based on registered filters in SpiceDB. This will be used both by external users for data validation purposes and, eventually, by SpiceDB itself for use in the eventual query planner (see #1785 and #1573)

Proposal

SpiceDB will add the following APIs:

// RegisterRelationshipCounter registers a new filter for counting relationships. A filter must be registered before
// a count can be requested.
rpc RegisterRelationshipCounter(RegisterRelationshipCounterRequest) returns (RegisterRelationshipCounterResponse)

// CountRelationships returns the count of relationships for a *pre-registered* filter.
rpc CountRelationships(CountRelationshipsRequest) returns (CountRelationshipsResponse)

// UnregisterRelationshipCounter unregisters an existing filter for counting relationships.
rpc UnregisterRelationshipCounter(UnregisterRelationshipCounterRequest) returns (UnregisterRelationshipCounterResponse)

message RegisterRelationshipCounterRequest {
  // relationship_filter defines the filter to be applied to the relationships
  // to be counted.
  RelationshipFilter relationship_filter = 1
      [ (validate.rules).message.required = true ];
}

message RegisterRelationshipCounterResponse {}

message CountRelationshipsRequest {
  Consistency consistency = 1;

  // relationship_filter defines the filter to be applied to the relationships
  // to be counted.
  RelationshipFilter relationship_filter = 2
      [ (validate.rules).message.required = true ];
}

message CountRelationshipsResponse {
  // read_at is the ZedToken at which the relationship count was performed
  ZedToken read_at = 1 [ (validate.rules).message.required = true ];

  uint64 relationship_count = 2;
}

message UnregisterRelationshipCounterRequest {
  RelationshipFilter relationship_filter = 1
      [ (validate.rules).message.required = true ];
}

message UnregisterRelationshipCounterResponse {}

Workflow and Implementation

The workflow will consist of registering one (or more) relationship filters, which will be stored in the datastore.

For the first-pass, naive implementation, the CountRelationshipsRequest will issue a count(*) with the filter to the underlying datastore.

For the second-pass, proper implementation, a distinct service will be added to SpiceDB that uses the Watch API to keep the counts up to date in memory. CountRelationshipsRequest will dispatch to this service, and the service will store the counts in the backing datastore on an interval (5m? 10m? 30m?), alongside the last revision at which they were updated
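The checkpointing half of that second-pass design could be sketched as follows (storage shape and names are invented for illustration):

```python
class CheckpointingCounter:
    """Stores (count, revision) snapshots so a restarted watcher can resume.

    'store' stands in for the backing datastore; in the proposal the
    checkpoint would run on an interval (e.g. every 5-30 minutes).
    """
    def __init__(self, store):
        self.store = store     # filter_key -> (count, revision)
        self.counts = {}
        self.revision = None

    def load(self, filter_key):
        count, revision = self.store.get(filter_key, (0, None))
        self.counts[filter_key] = count
        self.revision = revision  # resume the watch from this revision

    def checkpoint(self, filter_key):
        self.store[filter_key] = (self.counts[filter_key], self.revision)

store = {}
counter = CheckpointingCounter(store)
counter.load("user_rels")
counter.counts["user_rels"] += 5  # pretend five matching watch events arrived
counter.revision = 42             # pretend this was the last revision seen
counter.checkpoint("user_rels")
print(store["user_rels"])  # → (5, 42)
```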

@ecordell
Contributor

ecordell commented May 2, 2024

a distinct service will be added to SpiceDB to use the watch API to keep the counts up-to-date in-memory

Instead of a distinct service, we could use the DB to coordinate a leader election among SpiceDB replicas and choose one to hold the watch / do the counts.
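One common way to coordinate such an election through the database is an expiring lease row; here is a minimal sketch using SQLite upsert semantics (table name, TTL, and lease logic are all invented for illustration, and a production version would need to handle clock skew and transactional fencing):

```python
import sqlite3

# Single-row lease table: whoever holds the unexpired lease is leader.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE leader ("
             "id INTEGER PRIMARY KEY CHECK (id = 1), "
             "holder TEXT, expires_at REAL)")

def try_acquire(conn, me, ttl, now):
    """Become (or remain) leader if the seat is empty, the current lease
    has expired, or we already hold it (renewal)."""
    cur = conn.execute(
        "INSERT INTO leader (id, holder, expires_at) VALUES (1, ?, ?) "
        "ON CONFLICT (id) DO UPDATE SET "
        "  holder = excluded.holder, expires_at = excluded.expires_at "
        "  WHERE leader.expires_at < ? OR leader.holder = excluded.holder",
        (me, now + ttl, now),
    )
    conn.commit()
    return cur.rowcount == 1  # a row was inserted/updated iff we won

print(try_acquire(conn, "replica-a", ttl=30, now=100.0))  # seat empty: acquired
print(try_acquire(conn, "replica-b", ttl=30, now=110.0))  # lease held until 130
print(try_acquire(conn, "replica-b", ttl=30, now=140.0))  # lease expired: taken over
```

The winning replica would hold the watch and do the counts; the others simply retry acquisition on a timer and take over if the leader stops renewing its lease.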

@ryaneorth
Contributor Author

ryaneorth commented May 3, 2024

Thanks for putting together this official proposal @josephschorr! Do you anticipate having any kind of guardrails in terms of the number of relationship counters that can be registered at a given time?

Also, the RelationshipFilter allows the user to specify an optional_resource_id_prefix. While this will be very useful, I could see this causing a fair amount of load on the backing datastore when performing a count. Do you have any concerns with that? To be clear: for our use case we'd want to have this optional_resource_id_prefix and are willing to accept the load it may have on the database for the initial implementation.

@josephschorr
Member

Do you anticipate having any kind of guardrails in terms of the number of relationship counters that can be registered at a given time.

Yes, there will likely be a (configurable) limit

Do you have any concerns with that?

Yes, but fortunately it should be an index scan in most cases and once there is a real counting service, it will be handled in memory
