Cataloging large images is taking too long #696

Closed
mverleun opened this issue Dec 15, 2021 · 8 comments · Fixed by #698
Labels
bug Something isn't working

Comments

@mverleun

What happened:
While cataloging Docker images with a script, syft got stuck on one image and kept consuming a lot of CPU.
The image that caused trouble is gitlab/gitlab-ce:latest

What you expected to happen:
I would expect syft to catalog this image as well as it catalogs any other image.

How to reproduce it (as minimally and precisely as possible):
Download the image and catalog it:

docker pull gitlab/gitlab-ce:latest 
syft -vv -o json --file gitlab_gitlab-ce:latest.sbom.json gitlab/gitlab-ce:latest
...
[0000] DEBUG no new syft update available
[0000] DEBUG image: source=DockerDaemon location=gitlab/gitlab-ce:latest from-lib=stereoscope
[0083] DEBUG image metadata: digest=sha256:a2bf5ef04c22b5530d9c57aa5f20b55601c85fa2393d2d81e120d235f2a39ce4 mediaType=application/vnd.docker.distribution.manifest.v2+json tags=[gitlab/gitlab-ce:latest] from-lib=stereoscope
...
[0150] DEBUG cataloging with "apkdb-cataloger"
[0153] DEBUG discovered 0 packages
[0153] DEBUG cataloging with "go-module-binary-cataloger"
[0164] DEBUG discovered 1490 packages

<stuck here>

Anything else we need to know?:
Scanning the same image with grype works fine.

Environment:

@mverleun mverleun added the bug Something isn't working label Dec 15, 2021
@westonsteimel
Contributor

This seems to happen specifically while generating output in the JSON format, which possibly explains why grype works: in that case syft isn't actually persisting the output.

@westonsteimel
Contributor

@wagoodman, just a random thought here, but is it possible it's getting into some sort of loop with relationships or something?

@luhring
Contributor

luhring commented Dec 15, 2021

That's a good find! When I run with the default output format, it takes a bit of time (this is a large image, and Syft is thorough), but it definitely doesn't get stuck. I believe this started in v0.31.0.

@mverleun
Author

It is indeed related to the JSON output format. Running the same scan with output in the cyclonedx format works fine.

After reverting to v0.30.1, the JSON output works fine for this image.

@luhring
Contributor

luhring commented Dec 15, 2021

Thanks for confirming!

We're looking at this now. Our current theory is that the new hash-based method of IDing packages carries a performance cost... 🤔

@mverleun
Author

mverleun commented Dec 15, 2021 via email

@wagoodman wagoodman self-assigned this Dec 15, 2021
@wagoodman
Contributor

wagoodman commented Dec 15, 2021

> ...is it possible it's getting into some sort of loop with relationships

@westonsteimel we've ruled out any infinite loops of relationships for this particular case, and I think the way we use relationship objects internally would prevent infinite loops here.

> This seems to happen while generating output in the json format specifically...

Indeed! Something that the format.Encode() path for the JSON format and your earlier relationship hunch have in common is their use of Package.ID(), which seems to be the culprit here (as @luhring mentioned earlier):

[Screenshot: excerpt of the pprof graph]

(I included only the section from the pprof graph that "stood out" since it was pretty large)

We recently stabilized the package ID based on the contents of a package (#363), which leverages the hashstructure lib to dynamically compute a hash of the package object (we use that hash as the ID of the object). After that initial change I added more features around relationship support (#607 and #634), which call Package.ID() a lot more. I think this is a worthwhile feature to keep because it makes SBOMs easier to reproduce.
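
As a rough illustration of a content-derived ID, here is a minimal sketch. It is not syft's implementation: the `Package` struct and the `contentID` function are hypothetical, and SHA-256 over a JSON encoding is an assumption standing in for the hashstructure lib. It shows that equal contents yield equal IDs, and that every call walks the whole object again, which adds up when the ID is requested often.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// Package is a hypothetical, simplified stand-in for a cataloged package.
type Package struct {
	Name      string
	Version   string
	Locations []string
}

// contentID derives an ID from the package contents. Hashing the JSON
// encoding with SHA-256 is an assumption made for this sketch only.
func contentID(p Package) string {
	b, err := json.Marshal(p) // re-serializes every field on every call
	if err != nil {
		panic(err)
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:8]) // first 8 bytes, 16 hex characters
}

func main() {
	a := Package{Name: "openssl", Version: "1.1.1", Locations: []string{"/usr/lib"}}
	b := Package{Name: "openssl", Version: "1.1.1", Locations: []string{"/usr/lib"}}
	c := Package{Name: "openssl", Version: "1.1.2", Locations: []string{"/usr/lib"}}

	fmt.Println(contentID(a) == contentID(b)) // same contents, same ID
	fmt.Println(contentID(a) == contentID(c)) // changed version, different ID
}
```

The larger the package object (many locations, nested metadata), the more work each call does, so the cost scales with both object size and the number of ID lookups.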

After chatting it through with the team I think the best path forward is to implement memoization around the package ID after a certain point in processing. We already have the convention that packages after a certain point in processing do not change, so this change would leverage that assumption.
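
The memoization idea can be sketched roughly as follows. This is an illustrative sketch, not the actual fix: the `Package` fields, `computeID`, and the `hashCalls` counter are hypothetical, and FNV-1a stands in for the real content hash. It relies on the same convention mentioned above, namely that the package no longer changes once its ID has been requested.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Package is a hypothetical, simplified stand-in for a cataloged package.
type Package struct {
	Name    string
	Version string

	id       string // cached ID, valid only when idCached is true
	idCached bool
}

// hashCalls counts how often the expensive hash is actually computed.
var hashCalls int

// computeID stands in for the expensive content hash (FNV-1a is an
// assumption for this sketch).
func (p *Package) computeID() string {
	hashCalls++
	h := fnv.New64a()
	h.Write([]byte(p.Name))
	h.Write([]byte{0}) // separator so ("ab", "c") differs from ("a", "bc")
	h.Write([]byte(p.Version))
	return fmt.Sprintf("%016x", h.Sum64())
}

// ID returns the content-based ID, computing it on first use only.
// This is only safe if the package is not modified afterwards.
func (p *Package) ID() string {
	if !p.idCached {
		p.id = p.computeID()
		p.idCached = true
	}
	return p.id
}

func main() {
	p := &Package{Name: "openssl", Version: "1.1.1"}
	for i := 0; i < 1000; i++ {
		_ = p.ID()
	}
	fmt.Println("hash computations:", hashCalls) // prints "hash computations: 1"
}
```

The trade-off is that a package mutated after its first `ID()` call would keep a stale ID, so the cache should only be populated once packages are treated as immutable.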

I'll get this change in ASAP. Thanks for reporting, @mverleun, and for investigating, @westonsteimel @luhring @spiffcs.

@JonZeolla

We are also seeing this issue. A commit yesterday hung for over 2 hours before I stopped it. Unfortunately the logs aren't great quality, but that step essentially runs syft docker-archive:file -o spdx-json --file sbom.thing.spdx.json.

@wagoodman wagoodman changed the title from "Unable to catalog image gitlab/gitlab-ce:latest" to "Cataloging large images is taking too long" Dec 16, 2021
5 participants