
Add CVE data from AOSP #206

Open
DarkaMaul opened this issue Jun 22, 2021 · 9 comments

@DarkaMaul

As part of my work, I have CVE data coming from the Android Open Source Project that I'm willing to contribute to the OSSF CVE Benchmark.

Those CVEs target compiled languages (both C/C++ and Java), but I think they would still be relevant for static analysis (even before compilation), and there are already open issues about supporting compiled languages (#125, #126, ...).

However, due to the number of vulnerabilities, I don't have data for all the fields that are present in the benchmark.

Below is a CVE already in the benchmark:

{
    "CVE": "CVE-2017-1000219",
    "state": "PUBLISHED",
    "repository": "https://github.com/KyleRoss/windows-cpu.git",
    "prePatch": {
        "commit": "da656c1a9d5edbf4e8bf0640f349aeb714a4f1a0",
        "weaknesses": [
            {
                "location": {
                    "file": "index.js",
                    "line": 82
                },
                "explanation": "Unsafe shell command constructed from library input"
            }
        ]
    },
    "postPatch": {
        "commit": "b75e19aa2f7459a9506bceb577ba2341fe273117"
    },
    "CWEs": [
        "CWE-078",
        "CWE-088"
    ]
}

And here is an example of what I could contribute:

{
    "CVE": "CVE-2015-1538",
    "state": "PUBLISHED",
    "repository": "https://android.googlesource.com/platform/frameworks/av",
    "prePatch": {
        "commit": "22412164dee9f658272a25833aebe83d05dfe4ff",
        "weaknesses": [
            {
                "location": {
                    "file": "media/libstagefright/SampleTable.cpp",
                    "line": 330
                },
                "explanation": ""
            },
            {
                "location": {
                    "file": "media/libstagefright/SampleTable.cpp",
                    "line": 376
                },
                "explanation": ""
            },
            {
                "location": {
                    "file": "media/libstagefright/SampleTable.cpp",
                    "line": 426
                },
                "explanation": ""
            }
        ]
    },
    "postPatch": {
        "commit": "cf1581c66c2ad8c5b1aaca2e43e350cf5974f46d"
    },
    "CWEs": [
        "CWE-189"
    ]
}

The only missing field is the "explanation" part, as it is not possible to automate it.
I checked the schema definition in the repository and it was not a required field (here).

So to wrap up this issue:

  • Is the missing field really important?
  • Am I missing something important in the JSON?
  • Do you already accept submissions for other languages (namely C/C++ and Java)?
@esbena
Contributor

esbena commented Jun 22, 2021

We would love to have benchmark entries for your CVEs (this is highly related to #67, #68, by the way), and your example looks reasonably close to being useful (see comments below).

Out of interest, I suppose you have some raw data that you intend to generate the benchmark entries from. Is that data publicly accessible somewhere? (secondarily: is the information restricted by a usage license?)

I'll try to answer your summarized questions.

Do you already accept submissions for other languages (namely C/C++ and Java)?

Yes, although your submission would be the first, and there are currently no tool drivers that support those languages (which may be seen as a bootstrapping problem. Does your raw data perhaps have information about analysis tools that flag or should have flagged the weaknesses?).

(I am personally looking into adding proper support for Java and Python CVEs (#125, #126) through some additional optional data in the benchmark entries, and will open a PR when I am satisfied with the design. Realistically, I will have it ready in the early fall.)

Is the missing field really important?

Correction: the explanation field is currently required (your link is for another data type, you probably meant ts-defs.schema.json#L405).

The intention is that the benchmark entries should be self-contained: it should ideally be possible to manually confirm that an analysis tool flags a weakness for the right reason without having to dig through various CVE discussions and CWE descriptions. And secondarily, to confirm that the proposed postPatch indeed addresses the weakness. But the explanations are indeed the least valuable information in a benchmark entry.

We could consider making explanation optional, and then relying on the community to fill in the blanks for benchmark entries that are too hard to understand without explanation.

I am now wondering if a dedicated UI could help produce useful explanations (I'll type up an issue shortly): present the user with as much information as possible (weakness source code, patch, CVE description, CWE titles, commit messages, ...), and then have them type or paste in an explanation.

I suppose the raw data that you are exporting from contains the References links found in the official CVE entry https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2015-1538. Or do you have additional links that could help subsequent triaging in a UI?

Am I missing something important in the JSON?

Two thoughts: git host and correctness.

I note that the source code for the example CVE is hosted as git on android.googlesource.com. As per #2, some of the tooling still only supports the github.com host (I think download rate limits would be the primary feature we would be missing). Perhaps the easiest way to add android.googlesource.com support is to teach the implementation that github.com/aosp-mirror can be used as a mirror for that host.
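
For illustration, a minimal sketch of the kind of host-mapping this could involve (the function name and the underscore naming convention are assumptions based on how the aosp-mirror organization appears to name its repositories, not existing benchmark tooling):

// Hypothetical mapping from android.googlesource.com project URLs to their
// github.com/aosp-mirror counterparts, assuming slashes map to underscores
// (e.g. platform/frameworks/av appears there as platform_frameworks_av).
function rewriteToMirror(repository: string): string {
  const prefix = "https://android.googlesource.com/";
  if (!repository.startsWith(prefix)) return repository;
  const project = repository.slice(prefix.length).replace(/\.git$/, "");
  return `https://github.com/aosp-mirror/${project.replace(/\//g, "_")}.git`;
}

// rewriteToMirror("https://android.googlesource.com/platform/frameworks/av")
//   => "https://github.com/aosp-mirror/platform_frameworks_av.git"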

Is the example line information correct? I am interpreting your example weaknesses to point to lines 330, 376, and 426 of media/libstagefright/SampleTable.cpp at the prePatch commit.

I do not think any of those lines match the CWE-189 title "Numeric Errors", although I see some multiplications close to those lines.

@DarkaMaul
Author

Out of interest, I suppose you have some raw data that you intend to generate the benchmark entries from. Is that data publicly accessible somewhere? (secondarily: is the information restricted by a usage license?)

The data is indeed coming from the Android Security Bulletins. The website is under a Creative Commons Attribution 4.0 license (here) and the code is under an Apache 2 license (to be taken with a grain of salt, as I'm definitely not a lawyer).

Correction: the explanation field is currently required

My bad, I misread the definitions... My contribution would require a modification to the schema itself. Even if the field is nice to have, it does not work well with automated tooling, as it requires manual input.
On a side note, I really like the idea of crowd-sourcing the explanation field.

I suppose the raw data that you are exporting from contains the References links found in the official CVE entry https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2015-1538. Or do you have additional links that could help subsequent triaging in a UI?

I'm actually doing exactly the opposite: I'm primarily using the bulletin information, and then correlating with the CIRCL CVE API to retrieve missing fields (like the CWE).
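
For reference, a rough sketch of the kind of lookup being described (only the endpoint itself is documented at https://cve.circl.lu/; the response field name is an assumption):

// Hypothetical CWE lookup against the public CIRCL CVE API; the "cwe" field
// name in the response is an assumption, not verified against the API docs.
async function lookupCwe(cveId: string): Promise<string | undefined> {
  const res = await fetch(`https://cve.circl.lu/api/cve/${cveId}`);
  if (!res.ok) return undefined;
  const data = await res.json();
  return data?.cwe; // e.g. "CWE-189" for CVE-2015-1538
}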

Perhaps the easiest way to add android.googlesource.com support is to teach the implementation that github.com/aosp-mirror can be used as a mirror for that host.

It is even easier on my side to replace the original android.googlesource.com link with the mirror hosted on GitHub. Since we already mentioned modifying the schema, would it be possible to store additional URLs in the repository key so we could also link the original source?

Is the example line information correct?

In short, no... I had a slight offset in my data that I did not check before posting. The referenced line is the beginning of the patch hunk (the context around the patch), as seen here. The actual modified line is at +3 for these snippets.
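
To make the offset concrete, here is a minimal sketch of the correction, assuming git's default of three leading context lines per hunk (the Hunk shape and helper are illustrative, not my actual tooling):

// Given a unified-diff hunk, return the first old-file line that is actually
// modified, rather than the hunk start (which includes leading context lines).
interface Hunk {
  oldStart: number;  // from the "@@ -oldStart,oldCount +newStart,newCount @@" header
  lines: string[];   // hunk body lines, prefixed with " ", "-" or "+"
}

function firstModifiedOldLine(hunk: Hunk): number | undefined {
  let oldLine = hunk.oldStart;
  for (const line of hunk.lines) {
    // A deletion or an insertion point is the first real change.
    if (line.startsWith("-") || line.startsWith("+")) return oldLine;
    oldLine++; // context line: advance through the old file
  }
  return undefined; // hunk contained no changes (malformed)
}

// With 3 leading context lines this returns hunk.oldStart + 3,
// e.g. 333 instead of the 330 recorded in the example entry above.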

@esbena
Contributor

esbena commented Jun 22, 2021

The data is indeed coming from the Android Security Bulletins.

Oh wow, and you have the commit information for most/all of them? Do you have a ballpark estimate of how many benchmark entries you could end up adding?

Even if the field is nice to have, it does not work well with automated tooling, as it requires manual input.
On a side note, I really like the idea of crowd-sourcing the explanation field.

I think that is an excellent way to view the problem: there's a high likelihood that all of the other data already exists in structured form, so it is a shame if that subjective explanation limits us. On the other hand, an explanation ensures that the data has been through some vetting process, ensuring that such off-by-three errors do not sneak in in bulk.

I think I am close to convincing myself that it would be fine to have the data without an explanation, but that such benchmark entries shouldn't be considered complete, meaning that by default they will not be used in evaluations. Once the entries get their explanation through some form of crowd-sourced process, they will automatically be considered complete (perhaps analysis vendors that support a specific class of CVEs will be eager to add some explanations).

Since we already mentioned modifying the schema, would it be possible to store additional URLs in the repository key so we could also link the original source?

I think that would make sense. The type could be generalized to something like repository: UrlString | [UrlString, ...UrlString[]], which would allow the implementation to choose which repository to use when there are multiple. In this case, the implementation would simply choose the github.com repository, since there's support for that host already.
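
A minimal sketch of what that generalization and the selection logic could look like (pickRepository is illustrative, not an existing function in the implementation):

type UrlString = string;
type Repository = UrlString | [UrlString, ...UrlString[]];

// Prefer a github.com entry while that is the only fully supported host,
// falling back to the first listed URL otherwise.
function pickRepository(repository: Repository): UrlString {
  const candidates = Array.isArray(repository) ? repository : [repository];
  return candidates.find(url => new URL(url).hostname === "github.com") ?? candidates[0];
}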

@DarkaMaul
Author

Oh wow, and you have the commit information for most/all of them? Do you have a ballpark estimate of how many benchmark entries you could end up adding?

A bit more than 1800 as of today, but we keep increasing it with new bulletins.

I think I am close to convincing myself that it would be fine to have the data without an explanation, but that such benchmark entries shouldn't be considered complete, meaning that by default they will not be used in evaluations.

I see your point here. Another option would be to have a fast/complete set of CVEs where all the information is present for quick experiments, while a bigger dataset with less accurate data could be used for further analyses.

I think that would make sense, the type could be generalized to something like repository: UrlString | [UrlString, ...UrlString[]]

Looks good to me ;)

@esbena
Contributor

esbena commented Jun 23, 2021

I see your point here. Another option would be to have a fast/complete set of CVEs where all the information is present for quick experiments, while a bigger dataset with less accurate data could be used for further analyses.

I think we are talking about the same thing here. There's already an expressive grep-like selection mechanism for picking the CVEs of interest.
I am just talking about the default behaviour of the mechanism.

Concretely, I would prefer if the following two commands behaved identically:

# selects all CVEs from 2020, except for the ones that are incomplete (e.g. the CVE is disputed or the patch is insufficient)
./bin/cli list year:2020

# selects all CVEs from 2020 that have an "explanation" (NB: the has-explanation selector does not exist yet)
./bin/cli list year:2020 | ./bin/cli filter has-explanation
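
For what it's worth, a sketch of the predicate such a has-explanation selector could use (the BenchmarkEntry shape is abbreviated from the JSON examples earlier in this issue):

interface Weakness {
  location: { file: string; line: number };
  explanation?: string; // assuming this becomes optional, per the discussion above
}
interface BenchmarkEntry {
  CVE: string;
  prePatch: { weaknesses: Weakness[] };
}

// An entry is "complete" only if it has weaknesses and each one is explained.
function hasExplanation(entry: BenchmarkEntry): boolean {
  return entry.prePatch.weaknesses.length > 0 &&
    entry.prePatch.weaknesses.every(w => (w.explanation ?? "").trim().length > 0);
}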

I think we are left with the following action items before we can consume your CVEs:

Next:

  • @DarkaMaul: fix the hunk-indexing mistake to get the right line numbers (it is fine to import the cases with simple hunks initially; CVEs with hunks of non-standard sizes can be imported later)
  • @DarkaMaul: add the github.com mirror to the repository entries
  • @esbena: extend the repository type to allow multiple entries
  • @esbena: make explanation optional, and treat benchmark entries without explanation as incomplete
  • @DarkaMaul: Open a PR with enough (10, 20, 50?) randomly selected CVEs that we can be confident in the general data quality

Later:

  • @DarkaMaul Open a PR with the remaining CVEs (perhaps in chunks; the current implementation was not designed with 10x scalability in mind so soon)
  • @??? (probably @esbena) implement a UI for crowd-sourcing explanations and other missing data
  • @DarkaMaul document somewhere how new CVEs can be imported from AOSP (a ping to @DarkaMaul is fine with me)

Does that sound reasonable?

@DarkaMaul
Author

* [ ]  @DarkaMaul: fix the hunk-indexing mistake to get the right line numbers (it is fine to import the cases with simple hunks initially; CVEs with hunks of non-standard sizes can be imported later)

I'm on it.

* [ ]  @DarkaMaul: add the github.com mirror to the `repository` entries

Actually, I started a small PoC for it, but there are some repositories missing from the mirror. Any chance you know who is in charge of the mirror so they could be pinged?

For the record, I have close to a hundred missing repositories.

List of missing projects
* [ ]  @DarkaMaul: Open a PR with enough (10, 20, 50?) randomly selected CVEs that we can be confident in the general data quality

Do you prefer a PR for each CVE (e.g. 20 PRs) or a single PR with "all" CVEs (e.g. 1 PR)?
Moreover, the organization also has specific repositories for each CVE (like this one); do we also need to create one for each new entry?

* [ ]  @DarkaMaul Open a PR with the remaining CVEs (perhaps in chunks; the current implementation was not designed with 10x scalability in mind so soon)

I could initially provide an additional personal repository with the rest of the CVEs, so anyone interested could clone it separately and copy them into the official one.
We could discuss a bit later how to merge both sets so we still end up with a usable tool.

* [ ]  @DarkaMaul document somewhere how new CVEs can be imported from AOSP (a ping to @DarkaMaul is fine with me)

I don't know if the code responsible for the automation can be open-sourced on my side, but I will check ;)

@esbena
Contributor

esbena commented Jun 24, 2021

Actually, I started a small PoC for it, but there are some repositories missing from the mirror. Any chance you know who is in charge of the mirror so they could be pinged?

No. I found https://aosp-mirror.github.io/, which hints that it may be deliberate:

Best effort
Finally, the freshness of the GitHub mirrors are provided on a "best effort" basis (though they're generally less than an hour behind). Additionally, not all repositories may be mirrored to GitHub. To see the most up-to-date list of Android repositories, visit https://android.googlesource.com/.

I could initially provide an additional personal repository with the rest of the CVEs, so anyone interested could clone it separately and copy them into the official one.

I would prefer if the benchmark entries only contained the official sources.

All in all, it is probably best to keep it simple and just submit a single non-mirror repository per benchmark entry (https://android.googlesource.com).
I'll add rudimentary support for that host.


Do you prefer a PR for each CVE (e.g. 20 PRs) or a single PR with "all" CVEs (e.g. 1 PR)?

I think I would prefer 1 PR with multiple CVEs; let's see how that works out.


Moreover, the organization also has specific repositories for each CVE (like this one); do we also need to create one for each new entry?

Yes. bin/cli upload-commits can take care of that once we merge the PRs.

@DarkaMaul
Author

I've started PR #212.

However, there are some issues that will prevent a fully "automated" import. The benchmark's model assigns vulnerabilities to lines of code, but in my case I must extrapolate them from the fix commit. This is not always easy, as it may be hard to identify the precise defect from a patch, and especially which lines should be flagged.

For example, the patch for CVE-2015-5310 adds a check to verify whether Protected Management Frames are enabled (I'm not familiar with this vuln, I just chose it as an example). How do we pinpoint a specific line of code that should be flagged by a static analysis tool for this CVE?

@esbena
Contributor

esbena commented Jun 29, 2021

How do we pinpoint a specific line of code that should be flagged by a static analysis tool for this CVE?

Oh, this is a problem. I thought the AOSP dataset already had that precise information, but I guess that the listed hunks are just small patch diffs. The location of a weakness is the hardest information to obtain in this project, and it is impossible to automate: it needs the eyes of a security analyst.

For every CVE, you need to ask yourself: "On which line would I have liked an automated tool to highlight a weakness that is relevant to this CVE?" or "How could an automated tool have identified that the prePatch was vulnerable?"

For CVE-2015-5310, I doubt that this is possible with a general tool. But a specialized tool could perhaps have flagged https://android.googlesource.com/platform/external/wpa_supplicant_8/+/fdb708a37d8f7f1483e3cd4e8ded974f53fedace/wpa_supplicant/wnm_sta.c#188 with the message:

end = ptr + key_len_total;
wpa_hexdump_key(MSG_DEBUG, "WNM: Key Data", ptr, key_len_total);
while (ptr + 1 < end) {
/*               ^^^ Use of key_len_total without associated check for `wpa_sm_pmf_enabled` */

Note that the hunk may not even be near the location of the weakness. For path-injection vulnerabilities, analysis tools typically highlight the source code line where a file is accessed, but the fix is typically to add a sanitizer many lines above the access itself. So indexing into hunks is not guaranteed to give you any useful weakness locations.
This fact makes me quite sceptical of the weaknesses that are being proposed without an associated explanation. I fear that it will be all too likely that the weakness location is not something that should have been flagged by an analysis tool.
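
To illustrate the point with a made-up Node.js example (not from any benchmark entry):

import * as fs from "fs";
import * as path from "path";

function readUserFile(root: string, name: string): string {
  // A typical patch adds the sanitizer here, far above the flagged line:
  // if (name.includes("..")) throw new Error("invalid path");
  const resolved = path.join(root, name);
  return fs.readFileSync(resolved, "utf8"); // <- tools flag this access line
}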


At the meta-level: if we do not have any weaknesses for a benchmark entry, then there's nothing to benchmark on, and the behaviour of two analysis tools can no longer be compared automatically. If we have wrong weakness locations (inferred from the hunks), then the benchmark's value will degrade significantly.
We could add CVEs without known weaknesses to this repository and hope that they get triaged one day (perhaps through the crowd-sourcing UI I dreamt up above). The prePatch and postPatch information is still valuable, but it will not contribute any direct value to the benchmark suites.
