Abstract upstream package before matching #607

wagoodman · 2022-01-26T21:18:55Z

This PR makes the following adjustments:

- ~~attempts to extract upstream OS package information from pURLs when rich metadata is not available, which aids indirect package matching for APK, RPM, and DPKG package types.~~ Based on PR review, this is being moved further back in processing into syft, probably in "Improve SPDX decoding functionality" (syft#738).
- adds the concept of an "upstream package" with a name and optional version. Any package can now indicate one or more source packages via pkg.Package.Upstreams[].
- Source package information is extracted from metadata in advance of the matching process. This is better for two reasons: 1) we can remove multiple pkg.*Metadata structs, since they can be represented as pkg.UpstreamPackage with no loss of information, and 2) matchers that match by source-package indirection become simpler and can share similar code, since they now share the same UpstreamPackage abstraction.
- removes the use of the Syft pkg.MetadataType in favor of a new Grype pkg.Metadata type. This further enforces the bounded context between Syft and Grype, and is better in this case since the MetadataType hints at the shape of the pkg.Package.Metadata field, which no longer shares any struct definitions with Syft.
- exposes MetadataType in the JSON output.
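
As a rough illustration of the "upstream package" idea described above, here is a minimal sketch. The type and field names are assumptions modeled on the description (pkg.UpstreamPackage, pkg.Package.Upstreams), and `dpkgFields` and `dpkgUpstreams` are hypothetical helpers, not code from this PR.

```go
package main

import "fmt"

// UpstreamPackage is an illustrative stand-in for the PR's pkg.UpstreamPackage:
// a source package name with an optional version.
type UpstreamPackage struct {
	Name    string
	Version string // empty when the source version is unknown
}

// Package is a trimmed-down, illustrative package record.
type Package struct {
	Name      string
	Version   string
	Upstreams []UpstreamPackage
}

// dpkgFields mimics the source fields a DPKG entry may carry (assumed shape).
type dpkgFields struct {
	Source        string
	SourceVersion string
}

// dpkgUpstreams converts the source fields into upstream packages once,
// at package-creation time, so matchers never need the raw metadata.
func dpkgUpstreams(m dpkgFields) []UpstreamPackage {
	if m.Source == "" {
		return nil
	}
	return []UpstreamPackage{{Name: m.Source, Version: m.SourceVersion}}
}

func main() {
	p := Package{Name: "libssl1.1", Version: "1.1.1n-0+deb11u3"}
	p.Upstreams = dpkgUpstreams(dpkgFields{Source: "openssl"})
	fmt.Printf("%s has upstreams %+v\n", p.Name, p.Upstreams)
}
```

The point of the sketch is only that the type-specific fields collapse into one package-type-agnostic shape before matching starts.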

Fixes #395

grype/pkg/package.go

kzantow · 2022-01-27T21:18:23Z

grype/pkg/metadata.go

+const (
+	// this is the full set of data shapes that can be represented within the pkg.Package.Metadata field
+
+	UnknownMetadataType MetadataType = "UnknownMetadata"


Should this be just serialized during output based on the actual metadata data type?

That's how this originally worked, yes. I decided to make a firmer division of responsibilities here between syft and grype, since the data shapes represented by these constants differ depending on which application is being referred to.

grype/pkg/upstream_package.go

kzantow

Overall I don't see anything wrong here but a few general questions:

1. Is there a reason to use MetadataType instead of relying on the Go types, like using:

       switch metadata.(type) {
       case JavaMetadata:
       ...
       }

   This tripped me up a bit, and I had to do a little debugging to realize that just setting the metadata was not enough. I could not really understand why a variable is used to denote a type when the type information is already available (and, I suspect, could be serialized/deserialized without issue based on the Go type).

2. Would it be better to keep all the PURL parsing in the Syft decoding process? It seems wrong that we might generate a format and not be able to decode it in a meaningful manner but then handle it in Grype. That said, I don't really see a problem having some extra handling to just try to handle whatever we've been given. For Syft, SPDX, CycloneDX and any other supported formats, I would hope we're decoding these properly and won't need it, though.

3. Is the "upstream" abstraction shared across enough matchers that it should be top-level data? It seems as though it is specific to the Apk, Dpkg, and Rpm matchers but not the others. Looking at the dataFromPkg function, it seems like it's really all just different types of metadata, much like it was factored before. Might the upstreams list just replace the specific metadata types for those matchers instead of being a top-level thing? Or keep each of the metadata types but add a list of upstreams, so the functions to deal with them can remain consistent for the aforementioned 3 matchers?
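
For reference, the alternative raised in the first question (deriving the type label from the Go type instead of storing it separately) could look roughly like the sketch below. Only `UnknownMetadataType` and its value come from the diff above; the other constants, the struct fields, and `metadataTypeOf` are assumptions for illustration, and this is not how the PR itself works.

```go
package main

import "fmt"

// MetadataType labels the shape of the value stored in a package's Metadata field.
type MetadataType string

const (
	UnknownMetadataType MetadataType = "UnknownMetadata"
	// The two constants below are assumed names for illustration.
	JavaMetadataType  MetadataType = "JavaMetadata"
	RpmdbMetadataType MetadataType = "RpmdbMetadata"
)

// Illustrative metadata shapes (fields are assumptions).
type JavaMetadata struct{ VirtualPath string }
type RpmdbMetadata struct{ Epoch *int }

// metadataTypeOf derives the label from the dynamic Go type, so it can
// never disagree with the value actually stored in the field.
func metadataTypeOf(metadata interface{}) MetadataType {
	switch metadata.(type) {
	case JavaMetadata, *JavaMetadata:
		return JavaMetadataType
	case RpmdbMetadata, *RpmdbMetadata:
		return RpmdbMetadataType
	default:
		return UnknownMetadataType
	}
}

func main() {
	fmt.Println(metadataTypeOf(JavaMetadata{VirtualPath: "a.jar"}))
	fmt.Println(metadataTypeOf(nil))
}
```

The trade-off is that a derived label only works while the Go value is in hand; once the document is serialized, a reader still needs an explicit field to know which shape to decode.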

kzantow · 2022-01-27T22:03:26Z

grype/pkg/package.go

+//		version = "2.17.2" (or, if there's an epoch, we'd expect a value like "4:2.17.2")
+//		release = "12.28.el6_9.2"
+//		arch = "src"
+var rpmPackageNamePattern = regexp.MustCompile(`^(?P<name>.*)-(?P<version>.*)-(?P<release>.*)\.(?P<arch>[a-zA-Z][^.]+)(\.rpm)$`)


It seems like there are a lot of type specific things added here; maybe this should predominantly be factored out to the subpackages under the matcher package or something similar?

> It seems like there are a lot of type specific things added here

Indeed, this was intentional. The observation was that moving type-specific handling as far upstream in processing as possible (in this case, always during package creation) simplifies downstream processing.

The reason for the type-specific logic is to craft the UpstreamPackage (which is agnostic to the package type) as early as possible. Today (before this PR) we push a lot of that processing into the matcher sub-packages, which caused lots of type assertions just to get specific upstream package info out in order to do a fairly standard DB lookup. It became apparent that, with the UpstreamPackage agnostic to the package type (that information is inferred from the parent package), it is easier to do this logic when creating the Grype package from the Syft package, just after SBOM decoding or Syft cataloging, which in turn makes downstream processing in the matchers simpler.

I don't think I'd be able to put the rpm-specific logic for parsing into the RPM matcher package since that would introduce a package cycle. However, I could move this logic into another new package, but it isn't clear what that new package's purpose (thus name) would be other than something like rpm-utils. For that reason I left this logic unexported in the pkg package (closest to where it's used).
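
To make the RPM-specific parsing under discussion concrete, here is a small sketch that applies the pattern from the diff to a source RPM filename. The regular expression is copied from the hunk above; `parseSourceRPM` and the sample filename are hypothetical, chosen to line up with the values in the diff comments.

```go
package main

import (
	"fmt"
	"regexp"
)

// Pattern copied from the diff: name-version-release.arch.rpm
var rpmPackageNamePattern = regexp.MustCompile(`^(?P<name>.*)-(?P<version>.*)-(?P<release>.*)\.(?P<arch>[a-zA-Z][^.]+)(\.rpm)$`)

// parseSourceRPM returns the named capture groups for a source RPM filename,
// or nil when the filename does not match the pattern.
func parseSourceRPM(filename string) map[string]string {
	match := rpmPackageNamePattern.FindStringSubmatch(filename)
	if match == nil {
		return nil
	}
	fields := make(map[string]string)
	for i, name := range rpmPackageNamePattern.SubexpNames() {
		if name != "" { // skip the whole match and unnamed groups
			fields[name] = match[i]
		}
	}
	return fields
}

func main() {
	fields := parseSourceRPM("util-linux-ng-2.17.2-12.28.el6_9.2.src.rpm")
	fmt.Println(fields["name"], fields["version"], fields["release"], fields["arch"])
}
```

Because the leading groups are greedy, the name absorbs every hyphenated segment except the last two, which become the version and release.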

wagoodman · 2022-01-28T17:40:41Z

@kzantow :

> Would it be better to keep all the PURL parsing in the Syft decoding process?...

absolutely, I'll make that update (good call)

> ... Is the "upstream" abstraction shared across enough matchers that it should be top-level data? It seems as though it is specific to Apk, Dpkg, and Rpm matchers but not the others ...

It's true that this abstraction is not universal to all package types in grype. However, these ecosystems (+nvd) are the only ones where grype cares about looking for upstream packages. One day we may add matching against upstream packages for other matchers, but we just haven't done so yet.

> Looking at the dataFromPkg function, it seems like it's really all just different types of metadata much like it was factored before.

Yes; however, the upstreams processing is new, and metadata processing for DPKG and APK was removed. The functionality was simply decomposed into a separate function, that's all.

> ...Might the upstreams list just replace the specific metadata types for those matchers instead of being a top-level thing?...

That's exactly what this change aims to do. This PR already removes the package metadata structs where possible (the DPKG and APK metadata structs were flat out deleted). I could have gone further and removed the metadata for RPMdb; I can give that a shot and see what shakes out. I think this would mean a Syft update (your PR would be a good spot, I can add a commit?). I didn't look at the possibility of removing the Java metadata struct, but I feel that could be reserved for its own PR.
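
A hedged sketch of why a shared upstreams list simplifies the matchers: with a package-type-agnostic list, the indirection step needs no type assertions on metadata. The types and `indirectCandidates` below are illustrative assumptions, not the PR's actual API.

```go
package main

import "fmt"

// UpstreamPackage is an illustrative source package reference.
type UpstreamPackage struct {
	Name    string
	Version string
}

// Package is a trimmed-down, illustrative package record.
type Package struct {
	Name      string
	Version   string
	Upstreams []UpstreamPackage
}

// indirectCandidates returns one copy of p per upstream, renamed to the
// upstream package. The upstream version replaces the package version only
// when it is known. A matcher could run its usual lookup on each candidate.
func indirectCandidates(p Package) []Package {
	var out []Package
	for _, u := range p.Upstreams {
		c := p // copy; the original package is left untouched
		c.Name = u.Name
		if u.Version != "" {
			c.Version = u.Version
		}
		c.Upstreams = nil
		out = append(out, c)
	}
	return out
}

func main() {
	p := Package{
		Name:      "libssl1.1",
		Version:   "1.1.1n-0+deb11u3",
		Upstreams: []UpstreamPackage{{Name: "openssl"}},
	}
	for _, c := range indirectCandidates(p) {
		fmt.Printf("search by upstream: %s@%s\n", c.Name, c.Version)
	}
}
```

The same helper would serve the APK, DPKG, and RPM matchers alike, which is the consistency the review question asks about.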

kzantow

LGTM - I had a thought I'll bring up elsewhere about handling the metadata across things in a more uniform manner, I think this moves toward that goal 👍


wagoodman · 2022-02-10T19:03:32Z

@kzantow I've made a couple of additions to the PR after the initial review; the diff for just these changes is here: https://github.com/anchore/grype/pull/607/files/816944a06810eb92ebc8121ee74b58f4b0bcf860..8dca42152af350d9fec8f2f5bf0800dffb7aa6d4

Additions:

- Added a test to ensure there is matching parity for Syft-generated SBOMs of supported formats (Syft JSON, SPDX JSON, and SPDX tag-value).
- Updated the Java namespace generator function to additionally extract groupID and artifactID from the package pURL, which is useful in cases where we cannot recreate the Syft-specific metadata.
- Since SPDX doesn't support source encoding, we no longer bail when processing a source of an unknown type for the JSON presenter.
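
The pURL-based extraction mentioned in the second item can be sketched as follows. This is a deliberately simplified parser that assumes the `pkg:maven/<groupID>/<artifactID>@<version>` layout and performs no percent-decoding; `mavenIDsFromPURL` is a hypothetical helper, not the function added in this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// mavenIDsFromPURL pulls the groupID (namespace) and artifactID (name) out of
// a Maven pURL. ok is false when the input does not look like a Maven pURL.
func mavenIDsFromPURL(purl string) (groupID, artifactID string, ok bool) {
	const prefix = "pkg:maven/"
	if !strings.HasPrefix(purl, prefix) {
		return "", "", false
	}
	rest := strings.TrimPrefix(purl, prefix)
	// Drop qualifiers (?...) and subpath (#...), then the version (@...).
	if i := strings.IndexAny(rest, "?#"); i >= 0 {
		rest = rest[:i]
	}
	if i := strings.LastIndex(rest, "@"); i >= 0 {
		rest = rest[:i]
	}
	parts := strings.Split(rest, "/")
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return "", "", false
	}
	return parts[0], parts[1], true
}

func main() {
	g, a, ok := mavenIDsFromPURL("pkg:maven/org.apache.commons/commons-lang3@3.12.0?type=jar")
	fmt.Println(g, a, ok)
}
```

A production implementation would more likely rely on a dedicated pURL library so that encoding and edge cases in the specification are handled.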


kzantow

LGTM

kzantow · 2022-02-10T18:59:31Z

Makefile

@@ -16,6 +16,8 @@ SUCCESS := $(BOLD)$(GREEN)
 # the quality gate lower threshold for unit test total % coverage (by function statements)
 COVERAGE_THRESHOLD := 47
 BOOTSTRAP_CACHE="c7afb99ad"
+INTEGRATION_CACHE_BUSTER="894d8ca"


what is this cache busting for?

I ported this functionality from syft. It essentially gives us the ability to bust the CI cache ourselves instead of needing to change the underlying integration test fixtures artificially. In this case I wanted to ensure that the images used for integration tests were refreshed.


wagoodman added the enhancement New feature or request label Jan 26, 2022

wagoodman requested a review from a team January 26, 2022 21:18

wagoodman self-assigned this Jan 26, 2022

kzantow reviewed Jan 27, 2022

kzantow reviewed Jan 27, 2022

kzantow reviewed Jan 27, 2022

wagoodman mentioned this pull request Feb 1, 2022

Improve SPDX decoding functionality anchore/syft#738

Merged

wagoodman requested a review from a team February 1, 2022 19:52

wagoodman changed the title ~~Support source packages from pURLs~~ Abstract upstream package before matching Feb 1, 2022

wagoodman force-pushed the add-search-by-purl branch from 7b60580 to e62f69a Compare February 9, 2022 19:16

kzantow approved these changes Feb 9, 2022

wagoodman added 6 commits February 10, 2022 13:53

add metadata extraction from pURLs

c419208

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

extract upstream packages before matching

50307b3

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

put pkg.UpstreamPackages under test

7e493a2

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

remove pURL related processing

5681b57

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

pull in syft spdx decoding

816944a

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

allow for more flexible GHSA namespace and source extraction

e1a1444

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

wagoodman force-pushed the add-search-by-purl branch 2 times, most recently from 8a73518 to 34df3b3 Compare February 10, 2022 18:58

add matching parity integration tests for all supported formats

8dca421

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

wagoodman force-pushed the add-search-by-purl branch from 34df3b3 to 8dca421 Compare February 10, 2022 19:04

kzantow approved these changes Feb 10, 2022

bump syft to get spdx tv fix

b7ebc9b

Signed-off-by: Alex Goodman <alex.goodman@anchore.com>

wagoodman enabled auto-merge (squash) February 10, 2022 21:39

wagoodman merged commit c9f2716 into main Feb 10, 2022

wagoodman deleted the add-search-by-purl branch February 10, 2022 21:43
