-
Notifications
You must be signed in to change notification settings - Fork 518
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
First class relationships #607
Conversation
Benchmark Test ResultsBenchmark results from the latest changes vs base branch
|
bf6a2cd
to
4757c25
Compare
4757c25
to
95bd1fd
Compare
19b9013
to
197c27b
Compare
I'm going to rebase since #606 was merged and has made this PR look a little crazy. |
e476f35
to
69d2b1b
Compare
Since the package fingerprinting and SBOM data structure PRs have landed I think I'm ready to start tackling replacing ID assignment with the fingerprinting method. This implies considering possible removal of the |
One complexity that has been introduced that I want to note before I get too far: since the package ID is a function of the bytes on the This extra complexity should remain only in the |
Question: Are we treating the static package ID values on any formats (that Syft can decode) as "okay to discard" during decoding? Specifically, I'm asking about more than using the IDs for things like discovering relationships, I'm asking about preserving the literal ID values themselves. This is a really interesting perspective, and it's making me question an early opinion I had of not retaining static ID values on native Syft packages (in the business layer). As we think about using Syft as a library, I'm wondering if consumers of the library might have a reasonable expectation that they can find the original ID values (say, from there SPDX document that they Syft-decoded), within the business objects that Syft exposes in the library, for some task they're trying to accomplish. (Just wondering... I don't have an exact scenario in mind. Maybe some sort of policy logic.) |
d486512
to
52adfcb
Compare
With the last few commits in to prove out the PoC, I've been wondering the same thing. As a We've changed ID to be behavior of a package and not a static (random) data value on a package. One nicety of a stabilizing ID based on contents is that the ID matches between multiple runs of analyzing the same source. The downside is that if the package data is lossy from converting from format-to-format (in the near future) then the ID will be different between documents. I think we may have reached two mutually exclusive use cases here: one use case wanting ID stability based on "the definition of a package" and another use case wanting ID stability across different descriptions of the same basic package. We cannot use the same field to handle both use cases. Either we will need to declare that one of these use cases is unsupported or we will need to split up responsibilities between two different fields. (anyone see a third option?) |
That
Well said. My two cents: in general, I still favor the |
I landed on the same side of the fence as you @luhring --I think this is good enough for now, the new behavior is better than before, and when we have a specific use case that is raising a problem we can solve it then. |
I don't think bifurcating to other designated spots for an
If the definition of a package (before enoding to a format) is the same at the bottom of syft, then I believe there is a scenario where we fulfill both cases here so long as the generated ID happens BEFORE any format/presentation specific structs have a possibility of being hashed. |
} | ||
} | ||
|
||
func mergeResults(cs ...<-chan artifact.Relationship) <-chan artifact.Relationship { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: When dealing with more complex types like <-chan artifact.Relationship
I find it easier to reason about the function signature if the return value is also declared (results <-chan artifact.Relationship)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this breaks a convention we have in the codebase where we don't construct signatures that way feel free to ignore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also try to do the same thing you are suggesting for small functions. Though in this case it isn't possible to keep the same signature and do this suggestion. The implementation creates a bidirectional channel that is encapsulated within mergeResults
but a receive-only channel is returned (restricting callers from sending into the channel). If I were to use the signature to declare the return variable then I must change the return type to be a bidirectional channel, which isn't ideal.
re: #607 (comment) , I agree, the suggestion of keeping
I'd have to think on this; it appears that what you're saying is true within the scope of generating a single document, but not necessarily true when converting from one format to another (something that isn't supported by syft yet anyway, so isn't a concern today, but could be in the future). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comments - I'll dig in again later
} | ||
} | ||
|
||
func mergeResults(cs ...<-chan artifact.Relationship) <-chan artifact.Relationship { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this breaks a convention we have in the codebase where we don't construct signatures that way feel free to ignore
syft/artifact/id.go
Outdated
ID() ID | ||
} | ||
|
||
func DeriveID(obj interface{}) (ID, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of the larger comments on this PR talk about formalizing the ID across formats. I think a more strict signature might help with this.
If we have this as interface
what are all possible types within this project that could be passed here?
Can we limit it to some common field shared among all types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I though about doing something along those lines, where there would be a IDFromLocation(...)
and IDFromPackage(...)
in the artifact
package. The first problem is that this would introduce a cyclic dependency between package and artifact --ideally it should be the artifact package that is at the "root". I then realized that all of the ID implementations would be the same anyway (pass an object into Hash
). For this reason I went the direction of making a single utility function (also, I renamed this in a later commit to IDFromHash(...)
).
The largest con against this decision I see is that it would be possible to get an artifact.ID
for something that doesn't represent an artifact. However, since artifact.ID is a string, there is nothing stopping anyone from creating artifact.IDs --that is, there is no enforcement through encapsulation for this.
It could be that we move this function to a separate package... maybe internal/hash
? Then at least there wouldn't be any problems with cyclic dependencies.
@@ -38,21 +41,26 @@ func (c *Catalog) PackageCount() int { | |||
} | |||
|
|||
// Package returns the package with the given ID. | |||
func (c *Catalog) Package(id ID) *Package { | |||
func (c *Catalog) Package(id artifact.ID) *Package { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we return a copy now because we want to force the immutability of the catalog?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exactly 👍 I just updated the PR description a minute ago (after you made these comments) to reflect these thoughts
@@ -21,7 +23,7 @@ var _ common.ParserFn = parseApkDB | |||
|
|||
// parseApkDb parses individual packages from a given Alpine DB file. For more information on specific fields | |||
// see https://wiki.alpinelinux.org/wiki/Apk_spec . | |||
func parseApkDB(_ string, reader io.Reader) ([]pkg.Package, error) { | |||
func parseApkDB(_ string, reader io.Reader) ([]pkg.Package, []artifact.Relationship, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we adding relationships as a placeholder for now? On parse what relationships are we expecting to be able to generate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, a necessary placeholder for now 👍
Since the generic cataloger has been enhanced to allow for parser functions to raise up artifact relationships, and the APK cataloger uses the generic cataloger, that meant that this function signature needed to be updated as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There may not be any relationships that the APK DB parser will raise up, but other parser functions may have the opportunity to raise up relationships (such as gemfile.lock
and rpmdb
parsers). I haven't double checked the underlying data available for APK yet, but APK might have package dependency data for us to leverage.
@@ -775,7 +775,7 @@ func TestMultiplePackages(t *testing.T) { | |||
} | |||
}() | |||
|
|||
pkgs, err := parseApkDB(file.Name(), file) | |||
pkgs, _, err := parseApkDB(file.Name(), file) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably TODO for future
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went back and forth on this. There are a lot of places that ignore relationship data in test results now that parser functions raise up artifact relationships. I made certain to wire up anywhere that could raise up relationship data in the future in the cataloging execution path even if the cataloger in question does not surface relationships. However, I did not update the tests since this PR doesn't actually add any new relationships.
My take: any PR that is enhancing one or more catalogers to discover relationships should come with tests that prove the new functionality works. The lack of tests on a PR I think would be more of a signal than anything else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
disregard... TODO statements here we come...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 628c2e4
@@ -83,7 +83,7 @@ func candidateVendors(p pkg.Package) []string { | |||
// allow * as a candidate. Note: do NOT allow Java packages to have * vendors. | |||
switch p.Language { | |||
case pkg.Ruby, pkg.JavaScript: | |||
vendors.addValue("*") | |||
vendors.addValue(wfn.Any) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice find
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks... that one took a little while to find and is one of the main reasons why I beefed up the encode-decode cycle tests for all of this.
From an offline conversation: it looks like |
I'll break this out into its own PR |
1d26197
to
628c2e4
Compare
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
628c2e4
to
77406a7
Compare
* migrate pkg.ID and pkg.Relationship to artifact package Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * return relationships from tasks Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * fix more tests Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * add artifact.Identifiable by Identity() method Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * fix linting Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * remove catalog ID assignment Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * adjust spdx helpers to use copy of packages Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * stabilize package ID relative to encode-decode format cycles Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * rename Identity() to ID() Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * use zero value for nils in ID generation Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * enable source.Location to be identifiable Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * hoist up package relationship discovery to analysis stage Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * update ownership-by-file-overlap relationship description Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * add test reminders to put new relationships under test Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * adjust PHP composer.lock parser function to return relationships Signed-off-by: Alex Goodman <alex.goodman@anchore.com> Signed-off-by: fsl <1171313930@qq.com>
* migrate pkg.ID and pkg.Relationship to artifact package Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * return relationships from tasks Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * fix more tests Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * add artifact.Identifiable by Identity() method Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * fix linting Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * remove catalog ID assignment Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * adjust spdx helpers to use copy of packages Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * stabilize package ID relative to encode-decode format cycles Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * rename Identity() to ID() Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * use zero value for nils in ID generation Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * enable source.Location to be identifiable Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * hoist up package relationship discovery to analysis stage Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * update ownership-by-file-overlap relationship description Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * add test reminders to put new relationships under test Signed-off-by: Alex Goodman <alex.goodman@anchore.com> * adjust PHP composer.lock parser function to return relationships Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
This PR primarily introduces the concept of an
artifact.Relationship
(instead of apkg.Relationship
) and adds a new element to thesbom.SBOM
struct (as a root level element, a sibling of theArtifact
field).As a result the following additions/changes were also made to support introducing
artifact.Relationship
:artifact.ID
as a typeIdentifiable
interface for objects that can return a stableartifact.ID
(and thereby be able to be part of aartifact.Relationship
)artifact
packageownership-by-file-overlap
relationship to now be discovered/created just after the cataloging phase (instead of within thesyftjson
format encoder as it is today)splitThis work has been broken out into a separate PRsource.Location
intosource.Coordinates
. This new object has the minimal information needed to get to a path within a source. This newsource.Coodinates
implementsartifact.Identifiable
such that files can form relationships with packages (setting up for Port the SPDX JSON files relationships in a separate PR)Additional considerations:
pkg.Catalog
altogether, however, elected to leave it for now to keep this PR smaller and increment towards a final state in a future TBD PR. This keeps the same semantics as the catalog does today in the sense that when a package is added to the catalog it is assumed that it will not be mutated. This is further enforced now by never exposing references to packages that are contained within the catalog.Partially addresses #556 ; remaining work: