
perf(misconf): Improve cause performance #6586

Merged
merged 2 commits into main from improve-cause-perf on May 3, 2024

Conversation

simar7 (Member) commented May 1, 2024

Description

We only need to get the offending cause if the result is a failure.

Today we end up computing a cause for every single result type, which does unnecessary work.

As a side note, I also verified that the JSON output (when the --include-non-failures flag is used) does not contain any cause info (code excerpts) for results that are a PASS.
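
Conceptually, the change is just a guard in front of the cause lookup. Below is a minimal sketch of that idea using simplified stand-in types; Result, attachCauses, and the getCause callback are illustrative names, not Trivy's actual identifiers.

package misconf

import "io/fs"

// Result is a simplified stand-in for a misconfiguration check result.
type Result struct {
  Failed             bool
  Filename           string
  StartLine, EndLine int
  Cause              string // offending code excerpt shown in the report
}

// attachCauses fills in Cause only for failed results, so passing results
// never trigger the file read and line extraction.
func attachCauses(fsys fs.FS, results []Result, getCause func(fs.FS, string, int, int) (string, error)) []Result {
  for i, r := range results {
    if !r.Failed {
      continue // passes don't need a code excerpt
    }
    cause, err := getCause(fsys, r.Filename, r.StartLine, r.EndLine)
    if err != nil {
      continue // best effort: a missing cause should not fail the scan
    }
    results[i].Cause = cause
  }
  return results
}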

Related issues

Related PRs

Checklist

  • I've read the guidelines for contributing to this repository.
  • I've followed the conventions in the PR title.
  • I've added tests that prove my fix is effective or that my feature works.
  • I've updated the documentation with the relevant information (if needed).
  • I've added usage information (if the PR introduces new options).
  • I've included a "before" and "after" example to the description (if the PR is a user interface change).

We only need to get the offending cause if the result is a failure.

Signed-off-by: Simar <simar@linux.com>
@simar7 simar7 self-assigned this May 1, 2024
simar7 (Member, Author) commented May 1, 2024

There's probably still room for improvement here, as this ultimately doesn't address the fact that we will still end up parsing the files that are responsible for causing failures. We also do quite a bit of string manipulation on raw file content, for example, which could be improved further. To some extent this is necessary, as it is part of the PostAnalyze step.

My benchmark so far to evaluate this as an improvement has been scanning the minikube repo. It's a fairly large repo with a lot of things to scan for.

Previously the scan didn't finish (in a reasonable time), as shown in the issue this PR resolves, but after this change the scan takes on average 2 min to finish on my setup and uses roughly 35-40 MB of memory.

knqyf263 (Collaborator) commented May 1, 2024

Previously the scan didn't finish (in a reasonable time), as shown in the issue this PR resolves, but after this change the scan takes on average 2 min to finish on my setup and uses roughly 35-40 MB of memory.

Great! I have one more question. Even if successful files were processed, would parsing a plaintext file use 9.5 GB of memory? JSON and YAML files are usually a few megabytes at most, so it is doubtful they would consume that much memory unless there is a memory leak. Are there any huge plaintext files in Minikube?

simar7 (Member, Author) commented May 2, 2024

Are there any huge plaintext files in Minikube?

Yes, see here; it's around 7 MB.

Without this file, the entire scan finishes in 30s.

As far as processing is concerned, we do quite a bit of it internally. There are a couple of places we can improve, for instance:

  1. We currently return a reference to a local variable here. This ends up escaping onto the heap, and it also contains the file content. We should return by value so that at least we aren't allocating space for this local variable on the heap (see the sketch after this list). Regardless, since we have to write the output to stdout (or a file), we keep the results in memory until the entire scan is finished.
  2. We ultimately have to read the entire file for every result here, since there can be multiple checks that get flagged within the same file.
    With smaller files this isn't an issue, but with large files like the one above it becomes slow, as we're using string manipulation. I'd like to know if you have any ideas on how we can improve this part.
  3. In general, we should also refactor this function. It's quite complex as it is and not easy to read.
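
To make point 1 concrete, here is a minimal sketch of the return-by-value idea; the Code type and function names are stand-ins for illustration, not Trivy's actual GetCode.

package misconf

// Code is a stand-in for the struct holding the extracted lines.
type Code struct {
  Lines []string
}

// getCodePtr returns a pointer to a local variable, so escape analysis moves
// the whole struct to the heap.
func getCodePtr(lines []string) *Code {
  code := Code{Lines: lines}
  return &code // escapes to the heap
}

// getCodeValue returns by value: the struct is copied to the caller and needs
// no separate heap allocation, though the backing array of Lines still lives
// on the heap.
func getCodeValue(lines []string) Code {
  return Code{Lines: lines}
}

Whether a given value escapes can be inspected with go build -gcflags='-m'. Returning by value only saves the allocation for the struct itself; the line data it references still stays in memory for as long as the results are held, as noted above.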

As for point 2 above, I would say such situations with big files can occur in other repos as well, and users might not realize it. Therefore, adding an option to disable causes completely (in addition to only computing them for failures, which is this PR) can help until we have a better way to solve point 2.

@simar7 simar7 marked this pull request as ready for review May 2, 2024 04:22
@simar7 simar7 requested a review from knqyf263 as a code owner May 2, 2024 04:22
simar7 (Member, Author) commented May 2, 2024

@nikpivkin since you are back, I'd welcome any ideas you may have as well.

@simar7 simar7 requested a review from nikpivkin May 2, 2024 04:26
knqyf263 (Collaborator) commented May 2, 2024

Yes, see here; it's around 7 MB.

I think the file is not huge—less than 10 MB is small. Does parsing 7 MB of YAML consume several GB of memory? Even if there were a large number of 7 MB YAML files in a repository, if they were processed linearly, the memory consumption would not be that high. Is there another factor in memory consumption?

We currently return a reference back to a local variable here. This ends up escaping on to the heap. This also contains the file content. We should return by value so at least we aren't allocating space for this local variable on the heap. Regardless, since we have to output stdout (or a file), we keep the results in memory until the entire scan is finished.

I took a quick look at GetCode. It looks like the local variable contains only the relevant lines, not the entire content. Why is it so significant, then? I'm just curious.

simar7 (Member, Author) commented May 2, 2024

Yes, see here; it's around 7 MB.

I think the file is not huge—less than 10 MB is small. Does parsing 7 MB of YAML consume several GB of memory? Even if there were a large number of 7 MB YAML files in a repository, if they were processed linearly, the memory consumption would not be that high. Is there another factor in memory consumption?

I also agree with you. I will keep looking.

simar7 (Member, Author) commented May 2, 2024

Here's the profile data if anyone's interested in taking a look:
profile.pb.gz

[profile screenshots]

nikpivkin (Contributor)

@simar7 This solves the memory usage problem.

// rawLines := strings.Split(string(content), "\n")
var rawLines []string
bs := bufio.NewScanner(bytes.NewReader(content))
for bs.Scan() {
  rawLines = append(rawLines, bs.Text())
}

if err := bs.Err(); err != nil {
  return nil, fmt.Errorf("failed to scan file: %w", err)
}

simar7 (Member, Author) commented May 2, 2024

@simar7 This solves the memory usage problem.

// rawLines := strings.Split(string(content), "\n")
var rawLines []string
bs := bufio.NewScanner(bytes.NewReader(content))
for bs.Scan() {
  rawLines = append(rawLines, bs.Text())
}

if err := bs.Err(); err != nil {
  return nil, fmt.Errorf("failed to scan file: %w", err)
}

@nikpivkin that does help! The heap does grow over time, but not at the same rate as before; it settles at around ~800 MB. See below.

Before: [memory profile screenshot]

After: [memory profile screenshot]

The scan (with your patch but without this PR's changes) took around 20 min on my machine to finish.

With my changes plus your patch, it takes around 2.5 min and uses less memory.

I'll update this PR to add your changes, as I think both 1) getting the cause only for failures and 2) using a bytes.NewReader will help.

knqyf263 (Collaborator) commented May 2, 2024

I tried not to load the entire content into memory, but only the necessary lines.
knqyf263@f365780

It reduces memory consumption.

Before (3483943): [memory profile screenshot]

After (knqyf263@f365780): [memory profile screenshot]

It would be effective when processing 50 MB, 100 MB, or even bigger files, but it is unlikely that there will be JSON or YAML files that large 😆. This may be premature optimisation.
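
The actual patch is in the linked commit; as a rough sketch of the same idea, assuming the start and end line of the excerpt are already known, one could scan the file and keep only the requested range (readLines here is an illustrative name, not the patch's function):

package misconf

import (
  "bufio"
  "fmt"
  "io/fs"
)

// readLines returns only lines startLine..endLine (1-based, inclusive) of the
// file, without keeping the rest of the content in memory.
func readLines(fsys fs.FS, path string, startLine, endLine int) ([]string, error) {
  f, err := fsys.Open(path)
  if err != nil {
    return nil, err
  }
  defer f.Close()

  var lines []string
  sc := bufio.NewScanner(f)
  for n := 1; sc.Scan(); n++ {
    if n < startLine {
      continue // skip lines before the excerpt
    }
    if n > endLine {
      break // stop reading as soon as the excerpt is complete
    }
    lines = append(lines, sc.Text())
  }
  if err := sc.Err(); err != nil {
    return nil, fmt.Errorf("failed to scan %s: %w", path, err)
  }
  return lines, nil
}

Note that bufio.Scanner has a default 64 KiB line-length limit; files with longer single lines would need a larger buffer via Scanner.Buffer.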

simar7 (Member, Author) commented May 2, 2024

I tried not to load the entire content into memory, but only the necessary lines. knqyf263@f365780

It reduces memory consumption.

Before (3483943): [memory profile screenshot]

After (knqyf263@f365780): [memory profile screenshot]

It would be effective when processing 50 MB, 100 MB, or even bigger files, but it is unlikely that there will be JSON or YAML files that large 😆. This may be premature optimisation.

Impressive! Do you think we should add this change in as well? I don't have a strong opinion either way.

knqyf263 (Collaborator) commented May 3, 2024

If we see a memory issue again, we can come back to my patch.

knqyf263 (Collaborator) commented May 3, 2024

@simar7 Can we merge the PR now so it will be included in v0.51.0?

simar7 (Member, Author) commented May 3, 2024

@simar7 Can we merge the PR now so it will be included in v0.51.0?

Okay sounds good!

@simar7 simar7 added this pull request to the merge queue May 3, 2024
Merged via the queue into main with commit 770b141 May 3, 2024
12 checks passed
@simar7 simar7 deleted the improve-cause-perf branch May 3, 2024 05:29
fl0pp5 pushed a commit to altlinux/trivy that referenced this pull request May 6, 2024
Development

Successfully merging this pull request may close these issues.

perf(misconf): High memory usage (9.5 GB) and long scan time (45 min) on some repos
3 participants