[dont merge to main yet] Don't flush after step completes #977

Open
wants to merge 12 commits into
base: main

Gonzalo-Avalos-Ribas
Contributor

@Gonzalo-Avalos-Ribas Gonzalo-Avalos-Ribas commented Oct 5, 2023

The motivation for this PR is to stop flushing every time a step completes. We should flush to disk/upload in only two cases:

  1. When the data held in memory reaches graphObjectBufferThresholdInBytes.
  2. When all steps are done.

Also, I don't think it is optimal to divide the data we upload by step; that generates many unnecessary uploads.

Tried it on dev, on the jupiterone-integration-dev instance: we went from 432 uploads in a single job to 3.
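
For illustration, the two flush triggers described above could be sketched roughly as follows. This is a minimal sketch, not the SDK's implementation: `BufferedUploader`, `add`, `stepCompleted`, `allStepsCompleted`, and the `upload` callback are hypothetical names; only `graphObjectBufferThresholdInBytes` comes from this PR. The upload callback is synchronous here to keep the sketch short (a real persister call would be asynchronous), and the size estimate is a rough character count standing in for bytes.

```typescript
// Hypothetical sketch; only graphObjectBufferThresholdInBytes is a name from this PR.
type Entity = { _key: string };

class BufferedUploader {
  private buffer: Entity[] = [];
  private bufferedSize = 0;
  private readonly graphObjectBufferThresholdInBytes: number;
  private readonly upload: (batch: Entity[]) => void;

  constructor(
    graphObjectBufferThresholdInBytes: number,
    upload: (batch: Entity[]) => void,
  ) {
    this.graphObjectBufferThresholdInBytes = graphObjectBufferThresholdInBytes;
    this.upload = upload;
  }

  // Trigger 1: flush only once the in-memory buffer reaches the threshold.
  add(entity: Entity): void {
    this.buffer.push(entity);
    // Rough estimate: serialized length in characters, standing in for bytes.
    this.bufferedSize += JSON.stringify(entity).length;
    if (this.bufferedSize >= this.graphObjectBufferThresholdInBytes) {
      this.flush();
    }
  }

  // Completing a step intentionally does NOT flush.
  stepCompleted(_stepId: string): void {}

  // Trigger 2: flush whatever is left once every step has finished.
  allStepsCompleted(): void {
    this.flush();
  }

  private flush(): void {
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    this.bufferedSize = 0;
    this.upload(batch);
  }
}
```

With this shape, the number of uploads depends on how much data is buffered rather than on how many steps ran.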

@Gonzalo-Avalos-Ribas Gonzalo-Avalos-Ribas requested a review from a team as a code owner October 5, 2023 21:00
@Gonzalo-Avalos-Ribas Gonzalo-Avalos-Ribas marked this pull request as draft October 5, 2023 22:02
@Gonzalo-Avalos-Ribas Gonzalo-Avalos-Ribas changed the title Don't flush after steps Don't flush after step completes Oct 5, 2023
name: IntegrationErrorEventName.UnexpectedError,
description: 'Upload to persister failed',
});
//How can we fail gracefully here?
Contributor Author

Today, if an upload fails, we fail that step. That is really an incorrect decision, since the upload usually contains data from other steps as well. Should we just fail the job if an upload fails? Note that this is not the only place where we upload, but we can change the behavior there too.

Comment on lines +247 to +250
await this.lockOperation(async () => {
const entitiesByStep = this.localGraphObjectStore.collectEntitiesByStep();
let entitiesToUpload: Entity[] = [];
for (const [stepId, entities] of entitiesByStep) {
Member

Why did we move this from pMap to a plain function?

Contributor Author

There might be no reason to split the uploads by step, although there might be a point to doing so for writes to disk.
Say a flushed 6MB batch of entities reaches this function: do we want to generate 100 uploads just because the entities come from 100 different steps?
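
To make the trade-off concrete, here is a rough sketch comparing the two strategies by upload count. Everything in it is an assumption for illustration: `chunk`, `countUploadsPerStep`, `countUploadsMerged`, and `uploadBatchSize` are hypothetical names, and a plain object keyed by step ID stands in for whatever `collectEntitiesByStep()` actually returns.

```typescript
// Hypothetical sketch; none of these helpers are part of the SDK.
type Entity = { _key: string };

// Splits a list into consecutive chunks of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Per-step strategy: each step's entities are chunked and uploaded separately,
// so every step with data costs at least one upload.
function countUploadsPerStep(
  entitiesByStep: Record<string, Entity[]>,
  uploadBatchSize: number,
): number {
  let uploads = 0;
  for (const stepId of Object.keys(entitiesByStep)) {
    uploads += chunk(entitiesByStep[stepId], uploadBatchSize).length;
  }
  return uploads;
}

// Merged strategy: entities from every step are collected into one list first,
// then chunked, so the upload count depends only on the total volume.
function countUploadsMerged(
  entitiesByStep: Record<string, Entity[]>,
  uploadBatchSize: number,
): number {
  let entitiesToUpload: Entity[] = [];
  for (const stepId of Object.keys(entitiesByStep)) {
    entitiesToUpload = entitiesToUpload.concat(entitiesByStep[stepId]);
  }
  return chunk(entitiesToUpload, uploadBatchSize).length;
}
```

For example, with 100 steps of 10 entities each and an assumed batch size of 250, the per-step strategy needs 100 uploads while the merged one needs 4.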

Member

That makes sense. In that case, we may want to change how this whole flushing process works in the future. We could remove the concept of steps from it entirely. That can be future work, though.

@Gonzalo-Avalos-Ribas Gonzalo-Avalos-Ribas marked this pull request as ready for review October 9, 2023 18:28
@zemberdotnet
Member

Looks good to me. Let's make an alpha version.

@Gonzalo-Avalos-Ribas Gonzalo-Avalos-Ribas changed the title Don't flush after step completes [dont merge to main yet] Don't flush after step completes Dec 21, 2023
2 participants