Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Abstract upstream package before matching #607

Merged
merged 8 commits into from Feb 10, 2022
Merged

Conversation

wagoodman
Copy link
Contributor

@wagoodman wagoodman commented Jan 26, 2022

This PR makes the following adjustments:

  • attempts to extract upstream OS package information from pURLs when rich metadata is not available, which aids in package indirect matching for APK, RPM, and DPKG package types. based on PR review, this is being moved further back in processing into syft, probably in Improve SPDX decoding functionality syft#738.
  • adds the concept of an "upstream package" with a name and optional version. Any package can now indicate one or more source packages via pkg.Package.Upstreams[].
  • Source package information is extracted from metadata in advance of the matching process. This is better for two reasons: 1) we can remove multiple pkg.*Metadata structs since they can be represented as pkg.UpstreamPackage with no loss of information, and 2) simplifies the matching process for matchers that do matching by source-package indirection by sharing similar code (since they now share the same UpstreamPackage abstraction).
  • removes the use of the Syft pkg.MetadataType in favor of a new Grype pkg.Metadata type. This further enforces the bound context between Syft and Grype and in this case is better since the MetadataType hints at the shape of the pkg.Package.Metadata field, which no longer has any shared struct definitions with Syft.
  • exposes MetadataType in the json output

Fixes #395

@wagoodman wagoodman added the enhancement New feature or request label Jan 26, 2022
@wagoodman wagoodman requested a review from a team January 26, 2022 21:18
@wagoodman wagoodman self-assigned this Jan 26, 2022
grype/pkg/package.go Outdated Show resolved Hide resolved
const (
// this is the full set of data shapes that can be represented within the pkg.Package.Metadata field

UnknownMetadataType MetadataType = "UnknownMetadata"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be just serialized during output based on the actual metadata data type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's how this originally worked, yes. I decided to make a more firm division of responsibilities here between syft and grype since the data shapes represented by these constants are different depending which application being referred to.

Copy link
Contributor

@kzantow kzantow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I don't see anything wrong here but a few general questions:

  1. Is there a reason to use MetadataType instead of relying on the Go types, like using:
switch metadata.(type) {
case JavaMetadata:
...
}

This tripped me up a bit and I had to debug a small amount to realize just setting the metadata was not enough. I could not really understand why a variable is used to denote a type when the type information is already available (and could be serialized/deserialized without issue based on the Go type, I suspect).

  1. Would it be better to keep all the PURL parsing in the Syft decoding process? It seems wrong that we might generate a format and not be able to decode it in a meaningful manner but then handle it in Grype. That said, I don't really see a problem having some extra handling to just try to handle whatever we've been given. For Syft, SPDX, CycloneDX and any other supported formats, I would hope we're decoding these properly and won't need it, though.

  2. Is the "upstream" abstraction shared across enough matchers that it should be top-level data? It seems as though it is specific to Apk, Dpkg, and Rpm matchers but not the others. Looking at the dataFromPkg function, it seems like it's really all just different types of metadata much like it was factored before. Might the upstreams list just replace the specific metadata types for those matchers instead of being a top-level thing? Or keep each of the metadata types but add a list of upstreams so the functions to deal with them can remain consistent for the aforementioned 3 matchers?

// version = "2.17.2" (or, if there's an epoch, we'd expect a value like "4:2.17.2")
// release = "12.28.el6_9.2"
// arch = "src"
var rpmPackageNamePattern = regexp.MustCompile(`^(?P<name>.*)-(?P<version>.*)-(?P<release>.*)\.(?P<arch>[a-zA-Z][^.]+)(\.rpm)$`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like there are a lot of type specific things added here; maybe this should predominantly be factored out to the subpackages under the matcher package or something similar?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like there are a lot of type specific things added here

Indeed, this was intentional. The observation was that by moving type specific handling as much as possible upstream in processing (in this case always during package creation) downstream processing becomes simplified.

The reason for the type-specific logic is to craft the UpstreamPackage (which is agnostic to the package type) as early as possible. Today (before this PR) we push a lot of that processing into matcher sub-packages, which was causing lots of type assertions to get specific upstream package info out in order to do a fairly standard DB lookup. It became apparent that making the Upstream Package agnostic to the package type (information is inferred from the parent package) made it easier to do this logic when we are creating the Grype Package based off of the Syft package just after SBOM decoding or Syft cataloging (thereby making downstream processing in the matchers simpler).

I don't think I'd be able to put the rpm-specific logic for parsing into the RPM matcher package since that would introduce a package cycle. However, I could move this logic into another new package, but it isn't clear what that new package's purpose (thus name) would be other than something like rpm-utils. For that reason I left this logic unexported in the pkg package (closest to where it's used).

@wagoodman
Copy link
Contributor Author

wagoodman commented Jan 28, 2022

@kzantow :

  1. Would it be better to keep all the PURL parsing in the Syft decoding process?...

absolutely, I'll make that update (good call)

  1. ... Is the "upstream" abstraction shared across enough matchers that it should be top-level data? It seems as though it is specific to Apk, Dpkg, and Rpm matchers but not the others ...

It's true that this abstraction is not universal to all package types in grype, however, it is only these ecosystems (+nvd) where grype cares about looking for upstream packages. It may be one day that we add matching for additional upstream packages for other matchers, but we just haven't done so yet.

  1. Looking at the dataFromPkg function, it seems like it's really all just different types of metadata much like it was factored before.

Yes, however, the upstreams processing is new and metadata processing for DPKG and APK was removed. This functionality was decomposed into a separate function was all.

  1. ...Might the upstreams list just replace the specific metadata types for those matchers instead of being a top-level thing?...

That's exactly what this change aims to do. This PR already removes all of the package metadata's where possible (DPKG and APK metadata structs were flat out deleted). I could have gone further and removed the metadata for RPMdb, I can give that a shot and see what shakes out? I think this would mean a Syft update (your PR would be a good spot, I can add a commit?). I didn't look at the possibility of removing the Java metadata struct, but I feel that could be reserved for it's own PR.

@wagoodman wagoodman requested a review from a team February 1, 2022 19:52
@wagoodman wagoodman changed the title Support source packages from pURLs Abstract upstream package before matching Feb 1, 2022
Copy link
Contributor

@kzantow kzantow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - I had a thought I'll bring up elsewhere about handling the metadata across things in a more uniform manner, I think this moves toward that goal 👍

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
@wagoodman wagoodman force-pushed the add-search-by-purl branch 2 times, most recently from 8a73518 to 34df3b3 Compare February 10, 2022 18:58
@wagoodman
Copy link
Contributor Author

wagoodman commented Feb 10, 2022

@kzantow I've made a couple of additions to the PR after the initial review, the diff for just these changes is here: https://github.com/anchore/grype/pull/607/files/816944a06810eb92ebc8121ee74b58f4b0bcf860..8dca42152af350d9fec8f2f5bf0800dffb7aa6d4

Additions:

  • Added a test to ensure there is matching parity for syft generated SBOMs of supported formats (Syft json, SPDX json, and SPDX tag-value)
  • Updated the java namespace generator function to additionally extract groupID and artifactID from the package pURL, which is useful in cases where we cannot recreate the Syft-specific metadata.
  • Since SPDX doesn't support source encoding, we no longer bail when processing a source of an unknown type for the JSON presenter.

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Copy link
Contributor

@kzantow kzantow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -16,6 +16,8 @@ SUCCESS := $(BOLD)$(GREEN)
# the quality gate lower threshold for unit test total % coverage (by function statements)
COVERAGE_THRESHOLD := 47
BOOTSTRAP_CACHE="c7afb99ad"
INTEGRATION_CACHE_BUSTER="894d8ca"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this cache busting for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ported this functionality from syft --essentially gives us the ability to bust the CI cache ourselves instead of needing to change the underlying integration test fixtures fictitiously. In this case I wanted to ensure that the images used for integration tests was refreshed.

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
@wagoodman wagoodman enabled auto-merge (squash) February 10, 2022 21:39
@wagoodman wagoodman merged commit c9f2716 into main Feb 10, 2022
@wagoodman wagoodman deleted the add-search-by-purl branch February 10, 2022 21:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow for ingestion of SPDX SBOM documents as input
2 participants