New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.
Already on GitHub? Sign in to your account
Migrate SPDX-JSON relationships to SBOM model #634
Conversation
Benchmark Test ResultsBenchmark results from the latest changes vs base branch
|
7492f06
to
3333709
Compare
Merging #623 has caused a lot of merge conflicts and rebasing that is needed (note the base branch changed, so the diff is way off now). I'll adjust shortly. |
e117cc9
to
cde16e1
Compare
ok, rebased and cleaned up! 馃 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First pass - Stopped at mimetype_helper
and will dig in more after lunch -
To answer questions on the PR:
-
I think we should implement backwards compatibility before bumping the schema version. It's easy to lose that thread and let changes get ahead of the opportunity for us to do it now.
-
If there is metadata from a package manager that suggests and file exists, but we have read the entire tree and concluded it doesn't exist because some future build action had removed it then including it in the document might cause a kind of SBOM false positive. On the other hand, if we support annotations I think we can include that message saying -
package manager says x should exist, but we concluded it has been removed
I think the incorrect path is including it with no context. Either do what is implemented now, or explore annotation land.
@@ -76,19 +90,12 @@ func documentNamespace(name string, srcMetadata source.Metadata) string { | |||
return path.Join(anchoreNamespace, identifier) | |||
} | |||
|
|||
func extractFromCatalog(catalog *pkg.Catalog) ([]model.Package, []model.File, []model.Relationship) { | |||
func toPackages(catalog *pkg.Catalog, relationships []artifact.Relationship) []model.Package { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the new pattern for building packages.
As we build more relationships is this going to get more complex?
Right now relationships are homogenous in that they are all file relationships so fileIDSForPackage
is pretty clean. As we pass more and more types are we going to have to segment them out or will relationships always be this large slice we pass around and ask different parsing functions to figure out what they care about?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Passing around a slice of relationships isn't a large problem --even if it's a lot of them, it's s pointer to an underlying array, so I feel pretty OK with this. When it comes to code complexity, I also feel like we're OK here... specifically fileIDSForPackage
is filtering out the set of relationships to only those that this package takes a part of and with the appropriate relationship type.
I do feel like there may be room for improvement as to where the "relationship helper function" go. This one is ad-hoc in a format-related package. That can be dealt with in the future after this PR.
return nil | ||
} | ||
|
||
mimeTypePrefix := strings.Split(metadata.MIMEType, "/")[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ooo is this always consistent? Adding a note to come back and check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indeed --it will either match up with a prefix, or not. Either way there will be at least one value in the slice and we're taking the first value.
func lookupRelationship(ty artifact.RelationshipType) (bool, model.RelationshipType, string) { | ||
switch ty { | ||
case artifact.ContainsRelationship: | ||
return true, model.ContainsRelationship, "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to add the comment here from the current documentation?
ContainsRelationship is to be used when SPDXRef-A contains SPDXRef-B.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other
was highly suggested by the SPDX spec to have a comment, however, it doesn't seem like adding comments that copy-paste from the SPDX docs will help here (programmatically it does, but from a user/consumer sense it doesn't). Maybe in the future we could provide more context in the comments field, but for now should probably leave it out.
@@ -7,7 +7,7 @@ import ( | |||
"github.com/anchore/syft/syft/sbom" | |||
) | |||
|
|||
func encoder(output io.Writer, s sbom.SBOM) error { | |||
func encoder(output io.Writer, s sbom.SBOM, _ interface{}) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it a smell that we updated the encoder signature to include this 3rd argument that only a couple of the types use?
Would it be better if this was not a function
type and instead an interface that a struct could implement so things specific to that struct can be injected into the method without polluting the larger signature?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great suggestion and points out a problem that we have already started to get rid of in the code base... I'll take a closer look at options here 馃憤
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TL;DR: I ended up adding an sbom.Descriptor
that is added to sbom.SBOM
that contains the application configuration, the tool name, and tool version.
I needed to refresh myself on the input going into the design for format encoding and decoding. The first approach taken (but not used) was to introduce Encoder
, Decoder
, and Validator
interfaces (I re-mocked it up here 5fb0235). This approach looked great, however, the drawback was that you couldn't leverage the pro's of this approach. Specifically, you could push format-specific fields (such as application config) down below the interface and let the concrete implementation deal with these difference for each instance. However, that requires that this data is passed at object construction or between construction and the call to Encode()
, neither of which is ideal since there would be a need to have a list of all formatters to reference... needing to have the data at this time would be difficult.
Function types that are assigned to the Format struct are the next best thing, since it enforces that state must be passed in at encode-time, and not earlier.
I then started to look the other way: since injection of this state via behavior seemed wrong (due to the above notes) I started looking for ways to encapsulate the app config. I had initially thrown out encoding it in sbom.SBOM
and instead started looking into making a new envelope (something like format.Context
that would contain sbom.SBOM
and the application config), however, the more I poked at this the more sane it seemed to put the application config into the SBOM itself.
In syft json, CycloneDX, and SPDX, there are sections that describe the tool that created the SBOM. Now that there are encode and decode workflows we must persist this information in the model or else we will loose it in decoding operations.
Consider an encode-decode-encode workflow with an old version of syft to a new version of syft... we encode the original SBOM with the old version of syft followed by a decode and encode with the new version of syft. This would be how you convert from an older schema version to a new schema version. Without persisting the original tool name, version, and configuration, we would have no record on how the information was created to begin with. It should only be the schema
section that changes (to indicate the new format) but not the tool information.
For these reasons I went the route of adding a new sbom.Descriptor
section that persists name, version, and optional configuration for the tool that created the SBOM.
Distro Distro `json:"distro"` // Distro represents the Linux distribution that was detected from the source | ||
Descriptor Descriptor `json:"descriptor"` // Descriptor is a block containing self-describing information about syft | ||
Schema Schema `json:"schema"` // Schema is a block reserved for defining the version for the shape of this JSON document and where to find the schema document to validate the shape | ||
Files []File `json:"files,omitempty"` // note: must have omitempty |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's happening 馃
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's see if we can sync and nail down how/why Files
should be a sibling field to Artifacts
- I think I understand, but it would be good to codify it somewhere so we can be very explicit on why they are different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are all File
Artifact
, but not all Artifact
File
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -26,6 +30,8 @@ func ToFormatModel(s sbom.SBOM, applicationConfig interface{}) model.Document { | |||
return model.Document{ | |||
Artifacts: toPackageModels(s.Artifacts.PackageCatalog), | |||
ArtifactRelationships: toRelationshipModel(s.Relationships), | |||
Files: toFile(s), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More to my point about nailing the distinction. A little weird that Files
takes the entire sbom, but Secrets
below it has a subsection from Artifacts
that we lift to the top level of this struct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A little weird that Files takes the entire sbom...
you're right, though it's still necessary. The short answer from our offline conversation is that the Artifacts
struct is not guaranteed to have all of the nodes in the graph that can be created from sbom.SBOM
, some of the edges in Relationships
describe nodes that a cataloger may not have directly observed (e.g. a package manager claiming that a package owns a file that does not exist on the filesystem).
At a minimum to describe all files one needs sbom.Artifacts
and sbom.Relationships
to get a full picture of all files that will be in the SBOM. Since this is a helper function of the sbom
package it seemed right from a consumer point of view to only need to pass an SBOM to get the answer to the question "what files are in my SBOM" (as opposed to needing to pass elements of an SBOM).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More comments
To Review:
- to_format_model.go (each implementation)
- new mimetype_helper
- sbom.go coordinate additions
- CoordinateSet and its use
@@ -15,7 +15,7 @@ func TestEncodeDecodeCycle(t *testing.T) { | |||
originalSBOM := testutils.ImageInput(t, testImage) | |||
|
|||
var buf bytes.Buffer | |||
assert.NoError(t, encoder(&buf, originalSBOM)) | |||
assert.NoError(t, encoder(&buf, originalSBOM, map[string]string{"config": "value"})) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need a comment here? I think this ties into the config portion existing now within the encoder signature
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I'll play with this while working on #634 (comment) ... at the very least I'll provide a comment
@@ -40,6 +46,85 @@ func ToFormatModel(s sbom.SBOM, applicationConfig interface{}) model.Document { | |||
} | |||
} | |||
|
|||
func toSecrets(data map[source.Coordinates][]file.SearchResult) []model.Secrets { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
map[source.Coordinates][]file.SearchResult
Is there any value in codifying these as concreate types beyond the assembly of their basic types?
Found as part of Sbom here:
Lines 17 to 25 in 4f00995
type Artifacts struct { | |
PackageCatalog *pkg.Catalog | |
FileMetadata map[source.Coordinates]source.FileMetadata | |
FileDigests map[source.Coordinates][]file.Digest | |
FileClassifications map[source.Coordinates][]file.Classification | |
FileContents map[source.Coordinates]string | |
Secrets map[source.Coordinates][]file.SearchResult | |
Distro *distro.Distro | |
} |
map[source.Coordinates][]file.SearchResult => someIndex
?
Only asking to see if additional methods could be added which makes working with these types easier
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From our offline conversation: I'll try out making types for each of these "map" fields but probably won't add any helper functions for those types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added the new types didn't make any behavioral changes --the new types at the very least abstract what is in the type so if it changes later (say to a struct) then we don't need to update a bunch of function signatures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, well. When I updated the file
package to use the new sbom.File*
types, I got this:
package command-line-arguments
imports github.com/anchore/syft/cmd
imports github.com/anchore/syft/internal/anchore
imports github.com/anchore/syft/internal/formats/syftjson
imports github.com/anchore/syft/internal/formats/syftjson/model
imports github.com/anchore/syft/syft/file
imports github.com/anchore/syft/syft/sbom
imports github.com/anchore/syft/syft/file: import cycle not allowed
I don't have a good answer for this (not why, that is known, but on the path forward with these new types). For now I'm going to leave the primitives instead of introducing the new types. This can be a problem we solve later.
@@ -67,6 +69,14 @@ func Catalog(resolver source.FileResolver, theDistro *distro.Distro, catalogers | |||
// generate PURL | |||
p.PURL = generatePackageURL(p, theDistro) | |||
|
|||
// create file-to-package relationships for files owned by the package | |||
owningRelationships, err := packageFileOwnershipRelationships(p, resolver) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know catalog is being given the responsibility of building these relationships so are we keeping it around going forward? Seems like an ok spot to land for relationship generation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cataloging has the ability to raise up relationships when it makes sense (shifting that responsibility "left" towards the catalogers when it has the "most knowledge" about the relationship in question, say, tracking a graph of transitive dependency for a package... only the cataloger has enough information to capture these relationships). Though there will be other times when crafting relationships are better served on the "right" after all cataloging is completed (say when elements from different catalogers may be related).
This will probably shake out over time and we can adjust where this lives as we see better places for it. (for now I'll leave it where it is --though shout out if you see a better spot for it)
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
eadb555
to
aade54d
Compare
@spiffcs I've addressed all of your comments --ready for re-review! 馃帀 馃尞 |
* remove power-user document shape Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * add power-user specific fields to syft-json format Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * port remaining spdx-json relationships to sbom model Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * add coordinate set Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * add SBOM file path helper Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * use internal mimetype helper in go binary cataloger Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * add new package-of relationship Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * update json schema to v2 Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * replace power-user presenter with syft-json format Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * fix tests and linting Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * remove "package-of" relationship (in favor of "contains") Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * add tests for spdx22json format encoding enhancements Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * update TODO and log entries Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * introduce sbom.Descriptor Signed-off-by: Alex Goodman <alex.goodman@anchore.com> Signed-off-by: fsl <1171313930@qq.com>
* remove power-user document shape Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * add power-user specific fields to syft-json format Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * port remaining spdx-json relationships to sbom model Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * add coordinate set Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * add SBOM file path helper Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * use internal mimetype helper in go binary cataloger Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * add new package-of relationship Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * update json schema to v2 Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * replace power-user presenter with syft-json format Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * fix tests and linting Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * remove "package-of" relationship (in favor of "contains") Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * add tests for spdx22json format encoding enhancements Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * update TODO and log entries Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * introduce sbom.Descriptor Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
This is a follow up to #607 and #623
Closes #556
Changes made:
spdx22json
format encoder and moves file-ownership relationship creation back to the mainCatalog
function.sbom.AllCoordinates
that returns a slice of all coordinates referenced throughout all SBOM artifacts and relationships. This is needed to guarantee inclusion of all referenced files as nodes in the SBOM.syftjson
format model and thepoweruser.Presenter
has been deleted (as well as the remainder ofinternal/presenters
馃帀 馃尞 )fileMetadata
,fileClassifications
, andfileContents
power-user JSON fields have been removed and replaced with a singlefiles
section. This requires bumping the JSON schema version from 1 to 2 .packages
output (forsyftjson
format). This now matches the same behavior as power-user.Notes:
schema/json/schema-2.0.0.json
is not intended for manual review (~1000 lines).Questions:
syft1json
andsyft2json
formats for backwards compatibility, though this is not planned at this time. Should we do this before bumping to schema version 2?