New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Added a CQ ID Salt to the CQ ID generation to support overlapping sync #1141
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR @yvardhineni.
Do you mind sharing example config files with overlapping resources, and example queries to extract the data once it's synced with different CQId values?
I understand this makes the behavior of overlapping syncs consistent, but I'm not sure how one can use the data without dedicated queries for the overlapping resources (those will get duplicated correct?)
kind: source
spec:
# Source spec section
name: "azure-<salt-1>"
path: "cloudquery/azure"
version: "v9.3.0"
destinations: ["postgresql"]
tables: ["azure_compute_virtual_machines"]
deterministic_id: true
cq_id_salt: salt-1 #proposed
spec:
# Optional parameters
subscriptions: [1,2,3]
--
kind: source
spec:
# Source spec section
name: "azure-<salt-2>"
path: "cloudquery/azure"
version: "v9.3.0"
destinations: ["postgresql"]
deterministic_id: true
cq_id_salt: salt-1 #proposed
tables: ["azure_compute_virtual_machines"]
spec:
# Optional parameters
subscriptions: [2] Here we can see that subscription id is an overlapping resource across configs, here the duplication of data is expected as that can be one of the use case of User and it will be consistent across parallel runs too as we are taking pk_mode: cq_id_only and also to note it won't alter the existing behaviour
We can make the queries scoped to _cq_source_name as it is different for the different syncs as defined here this way we can bring the consistency for overlapping syncs |
Won't it be better in this case to drop the second source configuration? What's the reason to have it? |
Agreed, but when we want to run as them as two separate sync jobs and when it runs concurrently they are overriding the source names which is leading to undesired behaviour. |
This is highly inconsistent when we have the granular permission on resources and those permissions are on overlapped resources. For e.g. When we run with these creds we will end up in inconsistency |
Ah OK I think I get it. So the use case for having overlapping syncs is having overlapping credentials? |
Yeah, thats one such case |
@erezrokah @yevgenypats could you please help with further steps on this |
Hi, @yvardhineni, thanks for this PR! This sprint we are working on other priorities at the moment (SDKs for Python/Javascript/Java), but we will review this as soon as we are available. What is the urgency of this? Is it for internal use or commercial offering? If this is urgent - take a look at our support so we can prioritize some of the PRs reviews especially like this one that needs additional care and time. Alternatively, you can maintain a private/public fork so you don't get blocked until we merge it in. |
Codecov ReportPatch and project coverage have no change.
Additional details and impacted files@@ Coverage Diff @@
## main cloudquery/plugin-sdk#1141 +/- ##
========================================
Coverage 48.66% 48.66%
========================================
Files 86 85 -1
Lines 8012 7841 -171
========================================
- Hits 3899 3816 -83
+ Misses 3760 3690 -70
+ Partials 353 335 -18
☔ View full report in Codecov by Sentry. |
c72d99e
to
9bf5699
Compare
@yevgenypats Could you please update your ETA to look at this PR? Please let us know if there is any concerns with the current approach |
ISSUE
cloudquery/cloudquery#14590
Right now the behaviour of overlapping syncs of same resource is undefined as they always refer to the same primary keys while writing
PROPOSAL
Accept a custom hash salt as a source spec which can be used as a custom salt while generating deterministic cq_ids.
This way we can get different cq_ids across different syncs even for the same resource.
This holds true when we use deterministic_cq_id: true && pk_mode: _cq_id_only
TODO
Need to update source spec and other parts of Plugin SDKs to support this