Disallow the use of "OTHER" as a declared license #7836

Open
sschuberth opened this issue Oct 30, 2020 · 19 comments

Comments

@sschuberth
Member

sschuberth commented Oct 30, 2020

Several curations use "OTHER" as the declared license, e.g.

First of all, there is a general problem as "OTHER" is not a valid SPDX expression. Secondly, in the concrete example of jsonify, consuming the ClearlyDefined curation degrades the metadata from "Public Domain", as declared in its package.json, to "OTHER", which is even less telling and causes ORT (which has a mapping from "Public Domain" to "LicenseRef-scancode-public-domain-disclaimer") to run into issues.

That's why I'd like to propose to not use "OTHER" at all. What do you think @capfei @fossygirl?

@fossygirl
Member

@sschuberth I absolutely see your points. This was a request from our curation community. I'll let @ariel11 weigh in.

@ariel11
Contributor

ariel11 commented Oct 30, 2020

@sschuberth and @fossygirl - ClearlyDefined uses several non-SPDX identifiers: OTHER, NONE, and NOASSERTION. These are further explained in the curation guidelines here.

The challenge we have is that a fair number of projects/components do not have a license with an SPDX match. When there is license information but the tooling is not able to tell what the license is, it says NOASSERTION. We want a way to be able to say "hey, a human looked at this, there is license text/language here, but there's not an SPDX identifier for it." That's when we use "OTHER."

I agree, "OTHER" becomes a large bucket that can cover multiple scenarios - it could be for projects that say they are in the public domain, or projects with proprietary or commercial licenses, or projects that modified their license text so it no longer fits the SPDX matching guidelines.

Another goal of the project is to resolve as many NOASSERTIONs as we can. Sometimes we look at these and we are able to identify an SPDX identifier for the license, so we update the definition. Other times, we find license info but there's no SPDX match, so we put OTHER. If we just left these as NOASSERTION, we would have no way to track which definitions we've already looked at.

In my opinion, we need OTHER or an equivalent (or better) solution. Maybe we want to add a "public domain" option, knowing it does not equate to any specific text - rather, it just means the author(s) have dedicated their project in some fashion to the public domain.

We've also talked about creating ClearlyDefined specific extensions to SPDX. I think that would be potentially great. I think @pombredanne raised this idea.

Thanks

@sschuberth
Member Author

sschuberth commented Oct 31, 2020

ClearlyDefined uses several non-SPDX identifiers: OTHER, NONE, and NOASSERTION.

Please note that out of these only OTHER is not specified at all in the SPDX standard, but NONE, and NOASSERTION are valid to be used instead of a "real" SPDX expression.

When there is license information but the tooling is not able to tell what the license is, it says NOASSERTION. We want a way to be able to say "hey, a human looked at this, there is license text/language here, but there's not a SPDX for it." That's when we use "OTHER."

I must say I'm having a bit of a hard time following that rationale. I guess when you say "tooling" you mean license scanners, like ScanCode. But in my view ClearlyDefined data is not supposed to correct license findings from tooling like ScanCode, but it's supposed to amend a license that was (or not) declared as part of a software package's meta data. So, either there is license meta data in a software package, or there is not (at the example of NPM, either the license was filled out as part of package.json, or it was not). This determination should never return NOASSERTION.

If a license was declared in package meta data where there is no SPDX ID for it, in my view ClearlyDefined has two options: Not curate that license at all, or come up with a LicenseRef for it. Of course then the issue of how to name the LicenseRef pops up, but I believe it was actually @jeffmcaffer who suggested to use a hash of the license text as the LicenseRef name.

I agree, "OTHER" becomes a large bucket that can cover multiple scenarios

Exactly, and in my particular example using data from ClearlyDefined ("OTHER") actually gives you less specific information than just looking at the original package ("Public Domain"). To be quite frank, I believe this is not acceptable for the goal that (I thought that) ClearlyDefined has.

In my opinion, we need OTHER or an equivalent (or better) solution.

I agree that we need an equivalent that is fully SPDX compliant.

We've also talked about creating ClearlyDefined specific extensions to SPDX. I think that would be potentially great.

Here, I disagree. Creating "extensions" to a standard that is not really meant to be dynamically extended is not a good idea IMO. But in my view we also don't need to do that, as SPDX already provides all the means to provide an equivalent to "OTHER" by using LicenseRef. Either you generate the names via hashes on license files as mentioned above, or you use something like "LicenseRef-clearlydefined-public-domain", i.e. LicenseRefs with a ClearlyDefined namespace and a custom suffix.

@royaljust

royaljust commented Oct 31, 2020

@sschuberth thanks for the feedback. As @ariel11 mentioned, the OTHER field is distinguished from NOASSERTION by the human curation process. OTHER gives more specific information than NOASSERTION - it tells you that a human confirmed that there was no SPDX license available. This was put in place so that human curators could distinguish between packages that are confirmed NOASSERTION by a human and those that need human attention and confirmation. Without it, human curators get stuck in an infinite loop 🏭 .

Are there other ways to do this? Absolutely. If this is a pain point, I think working with the engineering side of the project to meet both the human curator needs (essentially answering the question - have we looked at this before?) and preserve valid SPDX expressions might be helpful. Maybe a flag somewhere in the data? The LicenseRef idea is a good one too. Happy to be a part of that conversation.

A few responses to your specific points below. Happy to continue the conversation.

But in my view ClearlyDefined data is not supposed to correct license findings from tooling like ScanCode, but it's supposed to amend a license that was (or not) declared as part of a software package's meta data.

It does both - scan tools sometimes misidentify licenses or erroneously throw a NOASSERTION.

So, either there is license meta data in a software package, or there is not (at the example of NPM, either the license was filled out as part of package.json, or it was not). This determination should never return NOASSERTION.

Some package ecosystems don't have good metadata, don't use metadata (like a git repo) or use a license link that is proprietary. The curators are not just solving for NPM, they're also solving for a swath of ecosystems - gits, NuGets, etc. As Ariel pointed out, you can review the curation community's process here. Like everything, it's not perfect. I know the community has been very happy to take feedback on that process.

Exactly, and in my particular example using data from ClearlyDefined ("OTHER") actually gives you less specific information than just looking at the original package ("Public Domain"). To be quite frank, I believe this is not acceptable for the goal that (I thought that) ClearlyDefined has.

I would love for SPDX to have a PUBLIC DOMAIN flag or extension specific to ClearlyDefined as you propose. As I describe above, OTHER should give you more certainty than a return of NOASSERTION because it means a human has looked at it.

@jeffmcaffer
Member

This is a great conversation. I'm a fan of using LicenseRef if we can but not a fan of LicenseRef-clearlydefined-... (or any other variation there). I'd much rather have a hash-based strategy that would work across all the tools. Some rationale

  • hashes can be shared across all the tools. If ScanCode and ClearlyDefined both get trained to see the same license but through quirks of naming decide to call it differently, then we have a confusing scenario. It would be better if the two tools both resolved to the same id.
  • This may start involving "judgement". For the most part ClearlyDefined has tried to avoid making judgement or having opinions. Rather we seek to collect and present the data. Where it's clear, great, where it's fuzzy, the user has to make their own judgement.
  • ClearlyDefined curators would be put in a position of naming licenses. Certainly we can do that but that's a whole new world to develop. What's the naming strategy, who approves, ... Could be done, just more work
  • If we had a portfolio of license refs that matched license text then we'd need to have matching tech and start teasing apart matching templates etc. All the work that SPDX does. Would rather not have two naming authorities.
  • ClearlyDefined currently does not have any matching tech, so that would need to be developed, or we train ScanCode (et al) but then have the id collision where we called it Foo but ScanCode already calls it Bar.

IMO the best solution here is to use the SPDX text normalization and then hash in a standard way that all tools can use. Then all licenses can have a hash and some licenses will also have a more readable name/id as we see today in SPDX.
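As a rough illustration of the normalize-then-hash idea, here is a minimal sketch. It makes several assumptions that are not settled in this thread: the normalization shown (lowercasing and collapsing whitespace) is only a simplified stand-in for the SPDX matching rules, and both the function names and the `LicenseRef-sha256-` naming scheme are hypothetical.

```javascript
const { createHash } = require('crypto');

// Simplified stand-in for SPDX text normalization (assumption): the real
// matching guidelines also cover punctuation, copyright lines, and more.
function normalizeLicenseText(text) {
  return text.toLowerCase().replace(/\s+/g, ' ').trim();
}

// Hypothetical naming scheme: "LicenseRef-sha256-<hex digest>".
// Hex digits and "-" are allowed characters in a LicenseRef idstring.
function licenseRefFromText(text) {
  const digest = createHash('sha256')
    .update(normalizeLicenseText(text), 'utf8')
    .digest('hex');
  return `LicenseRef-sha256-${digest}`;
}
```

With something along these lines, two tools that see the same text with different line wrapping or capitalization would arrive at the same id without having to agree on a name.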

@sschuberth
Member Author

IMO the best solution here is to use the SPDX text normalization and then hash in a standard way that all tools can use.

I totally agree here. But I'm not aware of any "SPDX text normalization". Does SPDX define an algorithm on how to normalize text? If so, would you have a link?

I'd be very happy if you could implement this rather today than tomorrow. Because if we had this, we could avoid having OTHER as part of curated licenses and implement a strict check that a curated license has to be a valid SPDX expression, which would also avoid mistakes like the ones I'm fixing here.
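To make the idea of a strict check concrete, a minimal sketch could look like the following. It is an assumption-laden illustration, not ClearlyDefined code: `KNOWN_IDS` is a tiny sample standing in for the full SPDX license list, the function name is hypothetical, and the check only validates tokens and parenthesis balance rather than the full expression grammar (so `WITH` exceptions, the `+` suffix, and dangling operators are not handled).

```javascript
// Assumption: a tiny sample of SPDX ids; a real check would load the full list.
const KNOWN_IDS = new Set(['MIT', 'Apache-2.0', 'GPL-2.0-only', 'CC0-1.0']);
// LicenseRef syntax: idstring may contain letters, digits, "." and "-".
const LICENSE_REF = /^(DocumentRef-[A-Za-z0-9.-]+:)?LicenseRef-[A-Za-z0-9.-]+$/;
const OPERATORS = new Set(['AND', 'OR']);

// Token-level check only; not a full SPDX expression parser.
function looksLikeSpdxExpression(declared) {
  if (declared === 'NONE' || declared === 'NOASSERTION') return true;
  // Put spaces around parentheses so they become separate tokens.
  const tokens = declared.replace(/[()]/g, ' $& ').split(/\s+/).filter(Boolean);
  if (tokens.length === 0) return false;
  let depth = 0;
  for (const token of tokens) {
    if (token === '(') {
      depth++;
    } else if (token === ')') {
      depth--;
      if (depth < 0) return false; // closing parenthesis without an opener
    } else if (!OPERATORS.has(token) && !KNOWN_IDS.has(token) && !LICENSE_REF.test(token)) {
      return false; // e.g. "OTHER" or "Public" end up here
    }
  }
  return depth === 0;
}
```

Under these assumptions, `OTHER` and `Public Domain` are rejected, while `MIT AND LicenseRef-CD.Other` is accepted.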

@jeffmcaffer
Member

Agreed. IIRC, while not rocket science, it also was not quite as easy as trim, hash, go. There were lots of corners/edges to think about technically as well as coordination with the SPDX community and an aliasing strategy for when licenses eventually do get an ID. All to say, I doubt that it's going to happen soon unless some people pitch in and help drive it.

@nellshamrell
Contributor

Hello everyone!

I'm the new Microsoft Principal Engineer responsible for Clearly Defined. Lots of good discussion here, and I'm going to attempt to summarize it as requirements. I would love your feedback on whether they sound correct or not.

Problem

Whenever possible, Clearly Defined matches project licenses with an SPDX license expression. However, Clearly Defined must sometimes curate projects that have license information, but the license information does not match an SPDX expression. In this case, our tooling currently defines the license as "NOASSERTION."

When a project has a license identifier of "NOASSERTION" in the Clearly Defined database, human curators attempt to discover the license and manually update it to an SPDX identifier. When there is a matching SPDX identifier, the curator updates the license for the project in the Clearly Defined DB. However, there is not always a matching SPDX identifier. Sometimes this is because the project is using a non-OSI approved license, sometimes it is because the license is unclear, etc.

In that case, the human curator needs a way to indicate that the project has been reviewed and no SPDX identifier match exists. At the moment, the human curator updates the project license to "OTHER" to indicate that a human has reviewed it, but no SPDX matching license identifier exists.

The problem is that "OTHER" is not specified in the SPDX standard, which can cause confusion when Clearly Defined data is consumed by other tools.

Requirements for a Solution

  1. Clearly Defined must have a way of indicating that a project does not have a license which matches an SPDX identifier.
  2. When a project does not have a license that matches an SPDX identifier, human curators must have a way to mark the project as having been reviewed by a human curator.

Possible solution

I don't believe tracking non-SPDX licenses is within the scope of Clearly Defined at this time or within the near future. We could certainly look into doing it at some point, but this strikes me as a big undertaking - not necessarily on the technical level (that is pretty straightforward) but on the coordination level with OSI, other tools we use, etc.

Something we could do is provide another way to indicate that a project has been reviewed by a human curator and confirmed to not have a matching SPDX license expression. I propose leaving those as "NOASSERTION", but adding a boolean field which indicates whether it has been reviewed by a curator, then surfacing that field wherever appropriate in the project.

@jeffmcaffer
Member

Thanks Nell. Great summary of the situation. Your discussion triggered a thought that I'm not sure why we didn't have before. We can make up our own LicenseRef for "other". For example, LicenseRef-CD.Other (or some such). That is a valid SPDX license identifier and it captures/signals exactly what we want -- ClearlyDefined has looked and determined the license to be "other".

I like this over a separate boolean because it

  • allows for "other" in expressions (so MIT AND LicenseRef-CD.Other),
  • avoids adding a concept as well as various special case code paths,
  • avoids the broader notion of "curated". Curated has been problematic in the past as it implies that definitions which have not been curated are somehow lesser than those which have. Perhaps true in some contexts but perhaps also true that the need for the definition to have been curated is a bit of a smell in and of itself. So the boolean would need to be cast as "we looked and couldn't figure it out" rather than "curated". Either way, the LicenseRef path nicely skirts that topic.

@jeffmcaffer
Member

BTW, the normalizer I was thinking is essentially an embodiment of https://wiki.spdx.org/view/Legal_Team/Templatizing/tags-matching which talks about the significance (or not) of various parts of a license text.

Then the SPDX tools (e.g., https://github.com/spdx/tools/blob/master/src/org/spdx/tools/MatchingStandardLicenses.java) use that via https://github.com/spdx/tools/blob/master/src/org/spdx/compare/CompareHelper.java to normalize the text during templatization/compare. At least that's my understanding.

@jeffmcaffer
Member

One note on the LicenseRef approach. It's entirely possible that there are code paths that do not support LicenseRef. To date, ClearlyDefined has only dealt with "standard"/defined licenses. In effect LicenseRef is there to capture these non-standard licenses, so it would not have been in scope. A first step at this approach might be something like a string replace of OTHER with LicenseRef-CD.Other (again, whatever). So ClearlyDefined may still not know about LicenseRefs and not support them in expressions etc., but at least the license value would be SPDX-compliant.
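A minimal sketch of such a replacement step might look like this. The function name is hypothetical and `LicenseRef-CD.Other` is just the placeholder id floated in this thread. The sketch replaces whole tokens only, so an identifier that merely contains the letters "OTHER" is left alone.

```javascript
// Placeholder id from the discussion; the final name is undecided.
const OTHER_REF = 'LicenseRef-CD.Other';

// Replace the bare token OTHER in a declared license expression.
function replaceOther(declared) {
  return declared
    .split(/(\s+|[()])/) // the capture group keeps whitespace and parentheses
    .map((part) => (part === 'OTHER' ? OTHER_REF : part))
    .join('');
}
```

For example, `replaceOther('MIT AND OTHER')` yields `MIT AND LicenseRef-CD.Other`, and all other tokens pass through unchanged.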

@sschuberth
Member Author

Your discussion triggered a thought that I'm not sure why we didn't have before. We can make up our own LicenseRef for "other". For example, LicenseRef-CD.Other (or some such).

While that, also to me, sounded like an elegant solution at first, it unfortunately does not solve the "data quality" problem I've mentioned initially.

In ORT, we "blindly" consume curations from all our configured (=trusted) curation providers. That is, whenever we come across jsonify 0.0.0 (in this example) in an NPM project's dependency tree, the ORT analyzer already knows its license is "Public Domain", and we would pass that string down to ORT's evaluator. However, after consuming the ClearlyDefined curation, the license that the ORT evaluator sees is changed to "OTHER" (or "LicenseRef-CD.Other"). And from the perspective of someone who writes ORT policy rules, "Public Domain" is a meaningful string we can act on, but "OTHER" is not; we no longer even know that this is some public domain license.

That's why I believe @nellshamrell's proposal of a separate flag is the better solution.

But going one step back, I'm asking myself why ClearlyDefined even has a curation for jsonify 0.0.0 at all. Looking at the file it actually adds no valuable information (in terms of automating license compliance checks) and at least for ORT it would have been better if that curation wasn't there to begin with, so we wouldn't consume it, and thus not shade "Public Domain" with anything else.

BTW, if you have some comments about how ORT is supposed to consume ClearlyDefined curations, or if you believe ORT currently does it in the wrong way, please tell me. ORT came up with its own concept of curations about the same time when ClearlyDefined started. So we ended up adding ClearlyDefined just as another (compatible) provider of curations for ORT. But maybe the concept of ClearlyDefined is not (fully) compatible with ORT's ideas after all.

@jeffmcaffer
Member

I'm likely missing something about the ORT scenario and how you are thinking about curated and non-curated data. For us, we don't really draw a distinction. Some definitions are completely automated, some had to be fixed up. Perhaps the gap here is looking at ClearlyDefined "curations" rather than just ClearlyDefined definitions. For example, one could view ClearlyDefined as a backstop such that where ORT can't figure things out, use ClearlyDefined. Or inverted. Or as more authoritative (e.g., if ClearlyDefined has different data then favor it). It might be that CD runs some tools that ORT does not and gets a "better" answer even without human curation.

Put another way, how ClearlyDefined gets to a license determination is, to a certain degree, an implementation detail. From a licensing/compliance point of view NOASSERTION, NOASSERTION + curated=true, OTHER and LicenseRef-CD.Other are all the same. As a compliance officer, I need to do more work. The additional information carried by OTHER, the boolean, or LicenseRef-CD.Other is that a human did some more work and verified that indeed the license could not be figured out. As a compliance officer I can skip that part and dive right into what to do about my team using components with unknown licensing.

Ideally we'd never have to use this and there would be proper ids for all the licenses and the tools or humans would be able to figure it out and assign an id. For example, Public Domain came up a few times there. In that scenario is the core issue that SPDX doesn't have a generic public domain identifier (at least I don't think it does) or the specific ones needed?

What would/do you do differently if you see NOASSERTION with or without knowing that it is curated (recall that a curator can assign NOASSERTION as the value as well)?

@sschuberth
Member Author

Put another way, how ClearlyDefined gets to a license determination is, to a certain degree, an implementation detail.

That sentence probably is the important bit, at least for me. I was always believing that CD would "simply" be a database of human-curated metadata for software packages. Emphasis on "human" here because I'm not interested in metadata collected by some tool here. (Note that I'm not talking about scan results from a license scanner here, that's a totally different story.)

And when I say "metadata", I primarily mean source code location and declared license. So, when the ORT analyzer has determined all transitive dependencies of a project, and then fails to download one of the source packages because the URL is wrong, ORT would query CD to see if it knows better where the source code is located.

Similarly, if CD has a declared license for a package, we "blindly" take that one and it overrides the original license declared in the package's metadata, if any. That license is then basically fed into the ORT evaluator which checks against policy rules. Next, there might be rules that know how to handle a license of "Public Domain", as for a human (who has written the rules) there are semantics attached to that string. And that's where we run into problems when we let CD override "Public Domain" with "OTHER", because "OTHER" cannot be handled in rules in a meaningful way.

That's why, sticking to the "Public Domain" example for jsonify, I actually would have expected there to be no curation at all for this package in CD, because CD simply cannot turn "Public Domain" into something that is a better / standardized representation of a Public Domain license, as there is no SPDX identifier for it. In this case, saying nothing would have been better than saying "OTHER", "NOASSERTION", or anything else that is less telling than "Public Domain".

@jeffmcaffer
Member

Ideally ClearlyDefined would never need human intervention. All the tools would run perfectly and discover the required info with 100% certainty and accuracy. It would still serve a purpose as a one stop shop for all that info that originated in disparate forms and locations. So it's better to look at ClearlyDefined as a "really good source of compliance info" rather than a place for curating the data. As it is, only a vanishingly small fraction of the 10+ M definitions in ClearlyDefined have human curations.

In your scenarios it's still not clear why you draw a distinction between machine or human generated info. With the above stated goal, ClearlyDefined would have better and better tools, the input projects would be better and better, and fewer and fewer humans would be involved. The information is still (potentially) better than what you have. If ORT goes to ClearlyDefined for a source location or a declared license, does it matter to your scenarios if that was determined through automation or human intervention?

As to the specifics of OTHER and Public Domain, if the value weren't OTHER, it would be NOASSERTION. Either way it's not Public Domain. This is a consequence of us deciding only to traffic in SPDX ids (with the now regrettable exception of OTHER). Since NOASSERTION and OTHER (or LicenseRef-CD.Other) are essentially ways of us saying "I dunno", would it make sense for you to filter those out and basically say, "If ClearlyDefined doesn't have a definitive answer, ignore them".

In essence that's our intention. NOASSERTION is a flag to humans saying "we don't know, you better figure it out". OTHER is a flag saying "we don't know and a human tried to figure it out but couldn't. You better look."

I still prefer the LicenseRef approach over adding a boolean

  • Fits in the whole SPDX model of expressions and allows for NOASSERTION AND LicenseRef-CD.Other, a case where you have both
  • Avoids having a boolean flag. (98% of the time these seem to be regretted later)
  • Keeps the semantics of "we looked and still couldn't figure it out"
  • Fewer user concepts

Either way it seems you'll have code to the effect

if (definition.licensed.declared === 'NOASSERTION' || definition.licensed.declared === 'LicenseRef-CD.Other') 
  // ignore definition
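On the consuming side, a slightly fuller version of that check could look like the sketch below. The function name and its parameters are hypothetical. Only the `definition.licensed.declared` path comes from the snippet above, and `OTHER` is included so that existing data is handled as well.

```javascript
// Values treated as "ClearlyDefined has no definitive answer".
const UNDETERMINED = new Set(['NOASSERTION', 'OTHER', 'LicenseRef-CD.Other']);

// packageDeclared: license string from the package's own metadata (may be undefined).
// definition: object shaped like a definition; only licensed.declared is read.
function effectiveDeclaredLicense(packageDeclared, definition) {
  const curated = definition?.licensed?.declared;
  // Keep the package's own value when there is nothing definitive to use.
  if (!curated || UNDETERMINED.has(curated)) return packageDeclared;
  return curated;
}
```

In the jsonify example, this keeps "Public Domain" from the package metadata instead of overriding it with an undetermined value.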

@sschuberth
Member Author

If ORT goes to ClearlyDefined for a source location or a declared license, does it matter to your scenarios if that was determined through automation or human intervention?

Indeed the data source would not matter in the end if the data gathered through tool automation was of the same quality as the data gathered by humans. But it isn't. The ORT analyzer already is an automation tool to gather data, but it fails sometimes, for example if no license is declared in package metadata, but only in prose on the project's home page. Sometimes it's really forensic effort to determine the license, and that currently requires a human. ORT is looking for such human-created curations to fix up its automatically determined metadata. We have our own ORT-specific database with human-created curations of high quality for that purpose, but I was hoping that CD would be another source in this regard.

As to the specifics of OTHER and Public Domain, if the value weren't OTHER, it would be NOASSERTION.

Again my question is: Why does it have to be anything? Can't it be just nothing?

I believe I start to grasp my own confusion here: CD seems to have the goal to "comment on" / "review" each and every software package out there, and needs a way to "mark it as reviewed". Whereas ORT's curation database only contains entries for those packages where metadata needed to be fixed up by a human in order to make it further processable by automation.

it make sense for you to filter those out and basically say, "If ClearlyDefined doesn't have a definitive answer, ignore them".

That's exactly what we started doing recently.

@jeffmcaffer
Member

We do have a goal of broad coverage (automated or manual). We do not have a goal of "reviewing". Curating is only done when it is needed and we really hope that it's not. The whole issue here is that we want to capture the work done when someone investigates a NOASSERTION state and is unable to resolve it, so that others don't repeat the work (unknowingly).

Why does it have to be anything? Can't it be just nothing?

A given license value will be "nothing" (undefined) if there truly is nothing there. NOASSERTION comes about from the tools when they see there is something "licensey" or in a "licensey spot" but can't figure it out. IIRC this is pretty common across all the tools including likely the tools you're already using.

@nellshamrell
Contributor

Hello again all!

I was just catching up on the comments in this issue.

Here is my summary of what we have established so far:

License Expressions

  • Clearly Defined only recognizes SPDX license expressions (this includes NONE and NOASSERTION) for use in definitions (with the exception of OTHER, explained more down below)
  • To my knowledge, "Public Domain" is not an SPDX recognized license
  • When Clearly Defined detects that there is license information for a project, but cannot determine what the license is, it marks the definition's license as NOASSERTION in the curation

Human Curations

  • When a license is NOASSERTION for a definition, a human curator may review it
  • Human curations are a very small percentage of all of Clearly Defined's curations - the goal of the project is to automate license gathering as much as possible, humans are only brought in when needed
  • When a human reviews a NOASSERTION curation and is unable to determine what the actual license is, they mark the license as OTHER

Problems with our current approach

  • Use of the word OTHER is confusing - we use it to indicate an undeterminable license, but that's not obvious to someone consuming the definition
  • OTHER is not an SPDX recognized license expression - this makes it difficult for someone consuming Clearly Defined's data to filter only on SPDX recognized expressions

Proposed solutions

If a license cannot be determined, why can't it be nothing/undefined?

  • A definition's license is sometimes left as undefined - but only in cases where the tooling truly cannot find anything that looks like a license of any kind in the software being defined by the definition.
  • In the context of Clearly Defined, an undefined license means "we looked and could not find anything related to licenses in this project"
  • When Clearly Defined detects that there is something "licensey" or in a "licensey spot", but cannot determine what the license is, it marks the license as NOASSERTION
  • In the context of Clearly Defined, a NOASSERTION value for a license means "We looked and there is something that looks like a license, but we can't determine what it is, a human had better check it out"

Can we add a boolean to a definition so, when its license is NOASSERTION, we can still tell whether it has been reviewed by a human?

  • In this case, rather than marking a definition's license as OTHER, we would leave it as NOASSERTION
  • We would add a human_reviewed? (or something similar) boolean field to a definition which would indicate whether a human had looked at the definition and attempted to determine the license information
  • The advantage is, with no longer using "OTHER", we would only be using SPDX recognized license expressions for definitions ("NOASSERTION" is a recognized SPDX expression)
  • Correcting historical data should be straightforward with this approach as well - we could find all definitions with a license of "OTHER", correct them to NOASSERTION, and set the human_reviewed? boolean to true
  • Then, when flagging definitions that require human curation, we would not flag a definition for review if the human_reviewed? boolean was true
  • The disadvantage of the approach is we would not be tracking anything about the "licensey" information in Clearly Defined's database, though it may be limiting Clearly Defined's scope in a way we would prefer (tracking only determinable SPDX license expressions)
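The boolean approach in the list above could be sketched roughly as follows. This assumes a hypothetical `humanReviewed` field standing in for `human_reviewed?`, placed at the top level of the definition. Neither the field name nor its location is decided in this thread, and the function names are illustrative only.

```javascript
// Migrate a historical definition: OTHER becomes NOASSERTION plus the flag.
// Returns a new object and leaves the input untouched.
function migrateOtherToReviewed(definition) {
  const licensed = definition.licensed || {};
  if (licensed.declared !== 'OTHER') return definition;
  return {
    ...definition,
    licensed: { ...licensed, declared: 'NOASSERTION' },
    humanReviewed: true, // hypothetical field name
  };
}

// A definition is flagged for curation only if nobody has looked at it yet.
function needsHumanReview(definition) {
  return definition.licensed?.declared === 'NOASSERTION' && definition.humanReviewed !== true;
}
```

A migrated definition would then no longer show up in the review queue, while an untouched NOASSERTION definition still would.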

Use a LicenseRef

  • For more information on LicenseRef, please see this SPDX documentation
  • Rather than using OTHER, we could use LicenseRef-ClearlyDefined.Other - which would be recognized as a valid SPDX license identifier
  • This would serve the purpose of indicating that someone did look at the definition (which was determined by our tooling as NOASSERTION), but we could not determine an SPDX license
  • Correcting historical data should be straightforward with this approach - we could find all definitions with a license of OTHER and correct them to LicenseRef-ClearlyDefined.Other

My Current Conclusion

Using a LicenseRef seems to be the approach most in line with Clearly Defined's intentions - a machine curated definition should not have higher weight than a human curated definition. If we were to add a human_reviewed? boolean we would be implying that we expect all definitions to be human reviewed, which is not the case.

Using a LicenseRef with Clearly Defined does need to be scoped out (and that is work I will do next if there are no objections, the work itself will likely not start until next year). However, should it be achievable, it would allow us to indicate that a definition has been reviewed and the license determined to be OTHER (rather than NOASSERTION) but still only use SPDX recognized expression for licenses on definitions.

Does this make sense?

@jeffmcaffer
Member

LGTM. Thanks for recapping all the various discussions.

Off the top of my head, here are some of the points to be looked at in implementation

  • LicenseRef-* values have never been tested in the code. There may well be places where some level of parsing or license reduction/expression handling fails on that syntax. Should be relatively straightforward but there may be several places to look.
  • We may even have code that explicitly rejects LicenseRef values as we originally said we would only handle licenses with SPDX ids. I think we can remain true to that ideal (if we want) and still use LicenseRef-ClearlyDefined.Other under the same model that we have an exception for OTHER.
  • We can remove the special case code for OTHER
  • Data migration is already pointed out. This only needs to happen for definitions. You can consider upping the definition schema version which will invalidate all existing definitions and force them to be recomputed. Note that we likely should proactively recompute as their existence (or absence) may affect some API responses (e.g., list).
  • Keep in mind "search" and "mongo" scenarios. These should be automatically updated when a new iteration of a definition is computed. Worth validating. I believe that these should continue to work with existing data even if the service's schema version is updated as the data is still there etc. That is, I think the only version check is in GET /definition/...


6 participants