
perf(misconf): Improve cause performance #6586

Merged
merged 2 commits into main from improve-cause-perf on May 3, 2024

Conversation

simar7 (Member) commented May 1, 2024

Description

We only need to get the offending cause if the result is a failure.

Today we end up computing a cause for every single result type, which does unnecessary work.

As a side note, I also verified that the JSON output (when the --include-non-failures flag is used) does not contain any cause info (code excerpts) for results that are a PASS.
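
Conceptually, the change is just a guard in front of the cause lookup. Below is a minimal sketch of that idea using simplified stand-in types; Result, attachCauses, and the getCause callback are illustrative names, not Trivy's actual identifiers.

package misconf

import "io/fs"

// Result is a simplified stand-in for a misconfiguration check result.
type Result struct {
  Failed             bool
  Filename           string
  StartLine, EndLine int
  Cause              string // offending code excerpt shown in the report
}

// attachCauses fills in Cause only for failed results, so passing results
// never trigger the file read and line extraction.
func attachCauses(fsys fs.FS, results []Result, getCause func(fs.FS, string, int, int) (string, error)) []Result {
  for i, r := range results {
    if !r.Failed {
      continue // passes don't need a code excerpt
    }
    cause, err := getCause(fsys, r.Filename, r.StartLine, r.EndLine)
    if err != nil {
      continue // best effort: a missing cause should not fail the scan
    }
    results[i].Cause = cause
  }
  return results
}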

Related issues

Related PRs

Checklist

  • I've read the guidelines for contributing to this repository.
  • I've followed the conventions in the PR title.
  • I've added tests that prove my fix is effective or that my feature works.
  • I've updated the documentation with the relevant information (if needed).
  • I've added usage information (if the PR introduces new options).
  • I've included a "before" and "after" example to the description (if the PR is a user interface change).

We only need to get the offending cause if the result is a failure.

Signed-off-by: Simar <simar@linux.com>
@simar7 simar7 self-assigned this May 1, 2024
simar7 (Member, Author) commented May 1, 2024

There's probably still room for improvement here, as this ultimately doesn't address the fact that we will still end up parsing the files that are responsible for causing failures. We also do quite a bit of string manipulation on raw file content, for example, which could be improved further. To some extent this is necessary, as it is part of the PostAnalyze step.

My benchmark so far to evaluate this as an improvement has been scanning the minikube repo. It's a fairly large repo with a lot of things to scan for.

Previously the scan didn't finish (in a reasonable time), as shown in the issue this PR resolves, but after this change the scan takes on average 2 min to finish on my setup and uses roughly 35-40 MB of memory.

knqyf263 (Collaborator) commented May 1, 2024

Previously the scan didn't finish (in a reasonable time), as shown in the issue this PR resolves, but after this change the scan takes on average 2 min to finish on my setup and uses roughly 35-40 MB of memory.

Great! I have one more question. Even if successful files were processed, would parsing a plaintext file use 9.5 GB of memory? JSON and YAML files are usually a few megabytes at most, so it is doubtful they would consume that much memory unless there is a memory leak. Are there any huge plaintext files in Minikube?

simar7 (Member, Author) commented May 2, 2024

Are there any huge plaintext files in Minikube?

Yes, see here; it's around 7 MB.

Without this file, the entire scan finishes in 30s.

As far as processing is concerned, we do quite a bit of it internally. There are a couple of places we can improve, for instance:

  1. We currently return a reference to a local variable here. This ends up escaping onto the heap, and it also contains the file content. We should return by value so that at least we aren't allocating space for this local variable on the heap (see the sketch after this list). Regardless, since we have to write the output to stdout (or a file), we keep the results in memory until the entire scan is finished.
  2. We ultimately have to read the entire file for every result here, since there can be multiple checks that get flagged within the same file.
    With smaller files this isn't an issue, but with large files like the one above it becomes slow, as we're using string manipulation. I'd like to know if you have any ideas on how we can improve this part.
  3. In general, we should also refactor this function. It's quite complex as it is and not easy to read.
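
To make point 1 concrete, here is a minimal sketch of the return-by-value idea; the Code type and function names are stand-ins for illustration, not Trivy's actual GetCode.

package misconf

// Code is a stand-in for the struct holding the extracted lines.
type Code struct {
  Lines []string
}

// getCodePtr returns a pointer to a local variable, so escape analysis moves
// the whole struct to the heap.
func getCodePtr(lines []string) *Code {
  code := Code{Lines: lines}
  return &code // escapes to the heap
}

// getCodeValue returns by value: the struct is copied to the caller and needs
// no separate heap allocation, though the backing array of Lines still lives
// on the heap.
func getCodeValue(lines []string) Code {
  return Code{Lines: lines}
}

Whether a given value escapes can be inspected with go build -gcflags='-m'. Returning by value only saves the allocation for the struct itself; the line data it references still stays in memory for as long as the results are held, as noted above.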

As for point 2 above, I would say such situations with big files can occur in other repos as well, and users might not realize it. Therefore, adding an option to disable causes completely (in addition to only computing them for failures, which is this PR) can help until we have a better way to solve point 2.

@simar7 simar7 marked this pull request as ready for review May 2, 2024 04:22
@simar7 simar7 requested a review from knqyf263 as a code owner May 2, 2024 04:22
simar7 (Member, Author) commented May 2, 2024

@nikpivkin since you are back, I'd welcome any ideas you may have as well.

@simar7 simar7 requested a review from nikpivkin May 2, 2024 04:26
knqyf263 (Collaborator) commented May 2, 2024

Yes, see here; it's around 7 MB.

I think the file is not huge—less than 10 MB is small. Does parsing 7 MB of YAML consume several GB of memory? Even if there were a large number of 7 MB YAML files in a repository, if they were processed linearly, the memory consumption would not be that high. Is there another factor in memory consumption?

We currently return a reference back to a local variable here. This ends up escaping on to the heap. This also contains the file content. We should return by value so at least we aren't allocating space for this local variable on the heap. Regardless, since we have to output stdout (or a file), we keep the results in memory until the entire scan is finished.

I took a quick look at GetCode. It looks like the local variable contains only the relevant lines, not the entire content. Why is it so significant, then? I'm just curious.

simar7 (Member, Author) commented May 2, 2024

Yes, see here; it's around 7 MB.

I think the file is not huge—less than 10 MB is small. Does parsing 7 MB of YAML consume several GB of memory? Even if there were a large number of 7 MB YAML files in a repository, if they were processed linearly, the memory consumption would not be that high. Is there another factor in memory consumption?

I also agree with you. I will keep looking.

simar7 (Member, Author) commented May 2, 2024

Here's the profile data if anyone's interested in taking a look:
profile.pb.gz

[profile screenshots]

nikpivkin (Contributor)

@simar7 This solves the memory usage problem.

// rawLines := strings.Split(string(content), "\n")
var rawLines []string
bs := bufio.NewScanner(bytes.NewReader(content))
for bs.Scan() {
  rawLines = append(rawLines, bs.Text())
}

if err := bs.Err(); err != nil {
  return nil, fmt.Errorf("failed to scan file: %w", err)
}

simar7 (Member, Author) commented May 2, 2024

@simar7 This solves the memory usage problem.

// rawLines := strings.Split(string(content), "\n")
var rawLines []string
bs := bufio.NewScanner(bytes.NewReader(content))
for bs.Scan() {
  rawLines = append(rawLines, bs.Text())
}

if err := bs.Err(); err != nil {
  return nil, fmt.Errorf("failed to scan file: %w", err)
}

@nikpivkin that does help! The heap does grow over time, but not at the same rate as before; it settles at around ~800 MB. See below.

Before: [memory profile screenshot]

After: [memory profile screenshot]

The scan (with your patch but without this PR's changes) took around 20 min on my machine to finish.

With my changes plus your patch, it takes around 2.5 min and uses less memory.

I'll update this PR to add your changes, as I think both 1) getting the cause only for failures and 2) using a bytes.NewReader will help.

knqyf263 (Collaborator) commented May 2, 2024

I tried not to load the entire content into memory, but only the necessary lines.
knqyf263@f365780

It reduces memory consumption.

Before (3483943): [memory profile screenshot]

After (knqyf263@f365780): [memory profile screenshot]

It would be effective when processing 50 MB, 100 MB, or even bigger files, but it is unlikely that there will be JSON or YAML files that large 😆. This may be premature optimisation.
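
The actual patch is in the linked commit; as a rough sketch of the same idea, assuming the start and end line of the excerpt are already known, one could scan the file and keep only the requested range (readLines here is an illustrative name, not the patch's function):

package misconf

import (
  "bufio"
  "fmt"
  "io/fs"
)

// readLines returns only lines startLine..endLine (1-based, inclusive) of the
// file, without keeping the rest of the content in memory.
func readLines(fsys fs.FS, path string, startLine, endLine int) ([]string, error) {
  f, err := fsys.Open(path)
  if err != nil {
    return nil, err
  }
  defer f.Close()

  var lines []string
  sc := bufio.NewScanner(f)
  for n := 1; sc.Scan(); n++ {
    if n < startLine {
      continue // skip lines before the excerpt
    }
    if n > endLine {
      break // stop reading as soon as the excerpt is complete
    }
    lines = append(lines, sc.Text())
  }
  if err := sc.Err(); err != nil {
    return nil, fmt.Errorf("failed to scan %s: %w", path, err)
  }
  return lines, nil
}

Note that bufio.Scanner has a default 64 KiB line-length limit; files with longer single lines would need a larger buffer via Scanner.Buffer.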

simar7 (Member, Author) commented May 2, 2024

I tried not to load the entire content into memory, but only the necessary lines. knqyf263@f365780

It reduces memory consumption.

Before (3483943): [memory profile screenshot]

After (knqyf263@f365780): [memory profile screenshot]

It would be effective when processing 50 MB, 100 MB, or even bigger files, but it is unlikely that there will be JSON or YAML files that large 😆. This may be premature optimisation.

Impressive! Do you think we should add this change in as well? I don't have a strong opinion either way.

knqyf263 (Collaborator) commented May 3, 2024

If we see a memory issue again, we can come back to my patch.

knqyf263 (Collaborator) commented May 3, 2024

@simar7 Can we merge the PR now so it will be included in v0.51.0?

simar7 (Member, Author) commented May 3, 2024

@simar7 Can we merge the PR now so it will be included in v0.51.0?

Okay sounds good!

@simar7 simar7 added this pull request to the merge queue May 3, 2024
Merged via the queue into main with commit 770b141 May 3, 2024
12 checks passed
@simar7 simar7 deleted the improve-cause-perf branch May 3, 2024 05:29
fl0pp5 pushed a commit to altlinux/trivy that referenced this pull request May 6, 2024
Development

Successfully merging this pull request may close these issues.

perf(misconf): High memory usage (9.5 GB) and long scan time (45 min) on some repos
3 participants