Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Added a CQ ID Salt to the CQ ID generation to support overlapping sync #1141

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

yvardhineni
Copy link

ISSUE

cloudquery/cloudquery#14590
Right now the behaviour of overlapping syncs of same resource is undefined as they always refer to the same primary keys while writing

PROPOSAL

Accept a custom hash salt as a source spec which can be used as a custom salt while generating deterministic cq_ids.
This way we can get different cq_ids across different syncs even for the same resource.

This holds true when we use deterministic_cq_id: true && pk_mode: _cq_id_only

TODO

Need to update source spec and other parts of Plugin SDKs to support this

@github-actions github-actions bot added the feat label Aug 10, 2023
@yvardhineni yvardhineni changed the title feat: Added a CQ ID Salt to the CQ ID generation feat: Added a CQ ID Salt to the CQ ID generation to support overlapping sync Aug 10, 2023
@github-actions github-actions bot added feat and removed feat labels Aug 10, 2023
Copy link
Contributor

@erezrokah erezrokah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @yvardhineni.
Do you mind sharing example config files with overlapping resources, and example queries to extract the data once it's synced with different CQId values?

I understand this makes the behavior of overlapping syncs consistent, but I'm not sure how one can use the data without dedicated queries for the overlapping resources (those will get duplicated correct?)

@yvardhineni
Copy link
Author

kind: source
spec:
  # Source spec section
  name: "azure-<salt-1>"
  path: "cloudquery/azure"
  version: "v9.3.0"
  destinations: ["postgresql"]
  tables: ["azure_compute_virtual_machines"]
  deterministic_id: true
  cq_id_salt: salt-1 #proposed
  spec:
    # Optional parameters
   subscriptions: [1,2,3]
--
kind: source
spec:
  # Source spec section
  name: "azure-<salt-2>"
  path: "cloudquery/azure"
  version: "v9.3.0"
  destinations: ["postgresql"]
  deterministic_id: true
  cq_id_salt: salt-1 #proposed
  tables: ["azure_compute_virtual_machines"]
  spec:
    # Optional parameters
    subscriptions: [2]

Here we can see that subscription id is an overlapping resource across configs, here the duplication of data is expected as that can be one of the use case of User and it will be consistent across parallel runs too as we are taking pk_mode: cq_id_only

and also to note it won't alter the existing behaviour

how one can use the data without dedicated queries for the overlapping resources (those will get duplicated correct?)

We can make the queries scoped to _cq_source_name as it is different for the different syncs as defined here

this way we can bring the consistency for overlapping syncs

@erezrokah
Copy link
Contributor

this way we can bring the consistency for overlapping syncs

Won't it be better in this case to drop the second source configuration? What's the reason to have it?

@yvardhineni
Copy link
Author

Agreed, but when we want to run as them as two separate sync jobs and when it runs concurrently they are overriding the source names which is leading to undesired behaviour.

@yvardhineni
Copy link
Author

This is highly inconsistent when we have the granular permission on resources and those permissions are on overlapped resources.

For e.g.
We have a set of creds-1 which have the access to azure resources R1, R2, R3
Then we have a set of creds-2 which have access to azure resources R3,R4,R5

When we run with these creds we will end up in inconsistency

@erezrokah
Copy link
Contributor

Ah OK I think I get it. So the use case for having overlapping syncs is having overlapping credentials?

@yvardhineni
Copy link
Author

Yeah, thats one such case

@yvardhineni
Copy link
Author

@erezrokah @yevgenypats could you please help with further steps on this

@yevgenypats
Copy link
Member

yevgenypats commented Aug 11, 2023

@erezrokah @yevgenypats could you please help with further steps on this

Hi, @yvardhineni, thanks for this PR! This sprint we are working on other priorities at the moment (SDKs for Python/Javascript/Java), but we will review this as soon as we are available. What is the urgency of this? Is it for internal use or commercial offering? If this is urgent - take a look at our support so we can prioritize some of the PRs reviews especially like this one that needs additional care and time. Alternatively, you can maintain a private/public fork so you don't get blocked until we merge it in.

@codecov
Copy link

codecov bot commented Aug 11, 2023

Codecov Report

Patch and project coverage have no change.

Comparison is base (70d12e4) 48.66% compared to head (c72d99e) 48.66%.

❗ Current head c72d99e differs from pull request most recent head 9bf5699. Consider uploading reports for the commit 9bf5699 to get more accurate results

Additional details and impacted files
@@           Coverage Diff            @@
##             main    cloudquery/plugin-sdk#1141    +/-   ##
========================================
  Coverage   48.66%   48.66%            
========================================
  Files          86       85     -1     
  Lines        8012     7841   -171     
========================================
- Hits         3899     3816    -83     
+ Misses       3760     3690    -70     
+ Partials      353      335    -18     
Files Changed Coverage Δ
scheduler/scheduler.go 53.22% <0.00%> (-0.88%) ⬇️
scheduler/scheduler_dfs.go 60.89% <0.00%> (ø)
schema/resource.go 0.00% <0.00%> (ø)

... and 5 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@yvardhineni
Copy link
Author

@yevgenypats Could you please update your ETA to look at this PR? Please let us know if there is any concerns with the current approach

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants