Add new workflow approval logic #261

holly-cummins · 2022-08-17T19:23:18Z

Resolves #256, although we could go further and add more complex checks.

Things we could add next:

More complex tests for users
** Are they a member of a large org?
** How old is their account?
** How long have they been contributing to this repository?
Comments on work items for approvals and non-approvals.
- Tagging of people to have a look at work items which were not approved.

Dependencies

Before this PR can be merged, we need

Implement workflow run approval hub4j/github-api#1504 to be merged
A release of the github-api project with that change
A version bump of the dependency and release of the quarkus github-extension
A version bump and release of the quarkus-github-app

I had to add the lenient() keyword to avoid failures with mvn install although mvn quarkus:dev passed clean, which I think may be related to #260.

gsmet · 2022-08-18T19:46:37Z

@holly-cummins I upgraded Quarkus GitHub App with the latest GitHub API and released it. It's all upgraded in main now so you just need to rebase this.

holly-cummins · 2022-08-23T10:16:09Z

Thanks for pulling that all through so quickly, @gsmet - I've rebased, and also fixed some test issues that crept in when I ran mvn net.revelc.code.formatter:formatter-maven-plugin:2.20.0:format because yaml.

gsmet

Interesting. I added some comments inline.

src/main/java/io/quarkus/bot/ApproveWorkflow.java

gsmet · 2022-08-24T12:31:49Z

src/main/java/io/quarkus/bot/ApproveWorkflow.java

+                GHRepositoryStatistics statistics = workflowPayload.getRepository().getStatistics();
+                if (statistics != null) {
+                    PagedIterable<GHRepositoryStatistics.ContributorStats> contributors = statistics.getContributorStats();
+                    for (GHRepositoryStatistics.ContributorStats contributor : contributors) {
+                        if (workflowPayload.getSender().getLogin().equals(contributor.getAuthor().getLogin())) {
+                            return contributor;
+                        }
+                    }
+                }


I'm worried about this. With a repository the size of Quarkus, it's going to trigger quite a lot of queries if you're not lucky. Also this triggers a statistics computation on the GH side if not already cached and it takes quite a lot of time (and I have no idea of the refresh rate).

I think we need a cache for this with a list of the already validated user logins.

Good point. I'll update with some caching. How long-lived is the app process? Will a static map do, or will that be reloaded too often?

For now, I think a static map would do. My guess is that at some point we could have this stored into a repository file and have commands to actually add someone to this file. But that's something for another day I would say.

Oh just to be clear: you need to add to the Map only the ones who qualifies for it to work if you're going fully static. Because once they qualify, we can be sure they always qualify.

Also, I'm wondering if we should use https://docs.github.com/en/rest/search#search-issues-and-pull-requests instead. With this API, we can search by author IIRC (it's the API used in the pull requests page so author:someone should work). And you could count the pull requests merged instead of relying on the statistics.

If we go this path, I think a cache with the ones who qualify will be good enough. If we don't, then we might have to have a periodically refreshed cache for the ones who don't.

Something like is:pr is:merged author:gsmet and then you would define the pageSize to the limit, call nextPage() on the PagedIterator and check if you have at least limit elements.

I think that's probably the most efficient way to check the number of contributions. Also better to base this thing on the number of PR merged rather than the number of commits so it's all win.

I've implemented caching based on the stats, so we can see what it looks like.

I think no matter which way we go, we'd need to cache a count of contributions for each user, not just a yes/no per user, because the rules could change. For example, if we say a user needs 5 contributions, and then we realise that bar is too low and all sorts of dubious stuff is getting through so we change the bar to 20, we don't want all the pre-approved people to stay pre-approved. The whole reason we changed the rule was to de-approve them. :) It would be similar in the opposite direction, if we realised we had the bar too high and wanted to lower it to reduce admin load, we'd want that change to apply with immediate effect to people who previously hadn't qualified. And it's so easy to cache the count I don't think there's any particular downside.

Whether we do repository stats or query for merged PRs is a harder call, I think. There’s some advantages to keeping the stats:

It’s already written, I am lazy, and mocking against the github api is tedious :)

It allows us to extend to add new rules like “how long has this person been contributing", which is information that's already in the stats

In the stats, ‘total’ is an instance /top-level json variable so there’s no further API requests to get it

It’s exactly the information we need, so it’s possible we get the benefit of github's server-side caching for it in a way we wouldn’t if we tried to assemble the information ourself (https://api.github.com/repos/quarkusio/quarkus/contributors appears to be near-instantaneous). Asking for the PRs for a user would mean a call per user, whereas with the repository stats a smaller number of calls that would allow us to cache the stats for every user.

On the other hand, I think the point about merged PRs being more relevant than contributions is a good one. So I don't know.

One other argument in favour of doing the request for each user is that what I had to implement was kind of ugly, with two levels of caching. In an ideal world, we would fetch the paged iterable of stats, and cache individual contributors as we traverse it. That way we might entirely warm the cache after we retrieve just one or two users. However, we can't do that until we have quarkusio/quarkus#27488 in our Quarkus.

Without the ability to manually populate the cache, I ended up doing one cache for individual users, and a second cache for the results of fetching the repository stats. The first cache is perhaps redundant, but I left it in since it's semantically closer to what we actually want to cache.

gsmet · 2022-08-24T12:34:24Z

README.adoc

+        unless:
+           directories:
+             - ./github
+           files:
+             - bad.xml


Not sure how useful this specific part will be in practice. You can execute whatever you want on CI by adjusting the Java code of particular test.

Not saying you should remove it, it's there, let's keep it. But in the Quarkus repo, I think we won't be able to use it if we really want to be safe.

I know what you mean. I was imagining people might want to restrict the .github folder, since that's where the workflows live, and the various levels of pom.xml, since changed/added dependencies is a common vector for malware. I agree someone determined enough could write something in Java without extra dependencies, but a check on pom.xml might increase the bar for how determined they'd have to be. The downside of the restrictions on pom.xml is that many non-trivial features would also change it, so restricting it would mean more manual approval work.

It might be worth changing the example to explicitly suggest pom.xml (realistic) rather than bad.xml (clear, but useless).

Well you can change it to pom.xml but really no files are safe as long as they are executed. So pom, any Java file being actually run, any workflow file and maybe others I'm not thinking about.

gsmet · 2022-08-24T12:34:41Z

README.adoc

+              minContributions: 5
+        unless:
+           directories:
+             - ./github


Suggested change

- ./github

- .github

Oh good spot, thank you. The matching on the directory name does need the beginning dot-slash (which is annoying), so I think it should actually be ./.github. That looks a bit like bad ascii art collided with a product name, but I think it's correct. Do you agree?

Hmmm. I don't think so. But I think it's the same issue that the one below about endsWith/startsWith so let's talk there.

I'll make a PR and do some logging to confirm the filenames we get from the github api are fully qualified. It's not in the test json in the repo, annoyingly, but I'm sure I saw it go by in my test logging (because I was a bit surprised it was the way it was).

Yes they are fully qualified. And I understand now what you’re trying to do.
I’ll get the PR and have a look at the code tomorrow.

I missed why you differentiated files and directories.
My intent with these filters was: either you are pushing a path from the root or you are using a glob.
With files you’re doing something else. But I’m not sure it’s such a good idea because I’m not sure you will be able to be consistent if you want to say exclude all Java files. You will then have to use a full glob with the directory present (well **/*.java) which is not obvious.

I’ll have a look tomorrow and think a bit about it. Don’t change anything yet.

I agree for *.java you need a glob anyway and then the benefit of having a bit less globbing is marginal. Probably the best use case is for pom.xml where the filename is so standardised.

Which use case is more likely probably depends on what kind of attacks are more likely. Secret and credential extfiltration would be super-easy with any java file change. Something like bitcoin mining would probably pull in some new dependencies and change the pom.xml (or add something into the workflow, but that's covered by directories). https://www.trendmicro.com/vinfo/us/security/news/cybercrime-and-digital-threats/github-action-runners-analyzing-the-environment-and-security-in-action mentions the threats being

printing secrets (github filters logs to guard against it)

cryptomining (github sometimes detects)
... and then it goes very deep and a bit over my head. :)

OK so I thought about it a bit more and I think having very different behaviors for files and directories is confusing.

I think I would rather have only one files entries with the current directories behavior and rely on glob patterns for files. That would be more consistent and less confusing.
At some point, I would also rename directories to files for the other rule but it needs careful deployment so I'll do that later specifically.
Also probably a good idea to cache the compiled patterns in a map if we end up using them a lot.
I can do it if you want as it's something I should probably have done before and is a bit orthogonal to your work.

Ok, that makes lots of sense! I think renaming to 'files' is the un-locker that makes the behaviour more clear.

I guess the one thing that wouldn't be entirely obvious in the current/proposed behaviour is

files: - ./somedir

would match ./somedir/myfile even though there's no wildcard in it. I can just about make that make sense (to me), though, since a directory is a file so somedir is a file, and if we include it, of course we include its descendants.

It makes less sense that

files: - ./somed

would match ./somedir/myfile even though somed is not somedir. That's orthogonal to this change, though, since it's the existing behaviour. If we were really bothered we could enforce globbing and remove startsWith, but I'm not sure it's worth the effort. :)

I'll make the model change to my rules and see how I go with caching the patterns, since I'm on a roll with caching.

edda867 has the update (reversion of 'files' except that 'directories is now 'files'), and also the caching.
If nothing else, it's a lot less code. :)

gsmet · 2022-08-24T12:36:28Z

src/main/java/io/quarkus/bot/util/PullRequestFilesMatcher.java

+            for (String file : files) {
+
+                if (!file.contains("*")) {
+                    if (changedFile.getFilename().endsWith(file)) {


Hmmm, it used to be startsWith in the original implementation you copied but you changed it to endsWith. I don't think that's what we want.

I had to use endsWith because getFilename returns the fully qualified name of the file, including the path. So for directories it's startsWith and for files it's endsWith. Doing proper path parsing to extract the basename would be safer, though. (And on the directory path, doing path parsing to extract the path and drop the basename might be a good idea, except that it would change the existing semantics for Triage.)

Looking at it again, I also see that the glob path is common across the directory and file methods. If it's truly common it should be extracted, but I suspect it shouldn't be common. At the very least, the basename should be passed in in the file path.

I'm not sure I understand what you're saying. The existing code is working very well for pull requests triaging so I'm not sure why it became a problem.
Do you have an actual example of what's not working?

The existing code still works fine, I just added a new requirement.

Existing config:

triage: rules: - directories: - foo/* - bar/*

It matches on directories, using startsWith().

Additional config:

workflows: rules: - allow: directories: - dir1 files: - file1

It matches on directories, using startsWith(), and on files, using endsWith() (because to match someDir/someOtherDir/file1 against 'file1' you have to use endsWith). The startsWith() logic for directories is the same as what's used by the triage rule, so that codepath is shared across the two rules. It hasn't changed except by being pulled out to its own file. The files rule is unique to workflows so it's new code.

That's where we disagree then. For matching a file that is a directory and not at the root, you have to use a glob pattern and make it **/your-file.

So the above example would have to be:

workflows: rules: - allow: directories: - ./dir1 [forgot the leading slash in the above example!] - **/file1

It seems not entirely user-friendly, and like it would need extra explanation in the docs. ("If you want to specify a filename, put it in the 'directories' rule with globbing.") At the moment, with what I've implemented, the above all works (exactly as it did before), but as a bonus, there's a more-concise files: option.

Co-authored-by: Guillaume Smet <guillaume.smet@gmail.com>

I will do it for the existing triage features at some point but it needs to be properly coordinated.

gsmet · 2022-09-06T14:42:29Z

@holly-cummins I pushed a couple of small changes. The most significant one being the renaming of directories to files. I think it's less confusing and that's what I had in mind.
I put in place a compatibility layer for the existing triaging feature.

If you have the set up all in place, could you build the bot in native (you can start it in native, it will start the smee.io proxy as always) and make sure things work. My main issue is with the Caffeine caches: depending on the configuration, they trigger the usage of different classes and not all cache implementations are included in Quarkus.
From what I can see, it starts, but I'm not sure all the caches are set up at startup.

If you can just check the feature works in native, that should be enough.

As for the stats, let's start this way and see how it goes.

holly-cummins · 2022-09-07T12:20:33Z

Thanks so much for catching the directories/files mixup. If you'd asked me what I had done, I would sworn it was files - but the evidence of the PR says otherwise. :) I agree that files is clearer, and it's what I'd intended.

I'm working on testing on native, but even before getting there, I've hit an issue. I'm seeing an intermittent failure fetching all contributors:

Exception: java.lang.NullPointerException
        at java.base/java.util.Objects.requireNonNull(Objects.java:208)
        at org.kohsuke.github.PagedIterator.nextPageArray(PagedIterator.java:140)
        at org.kohsuke.github.PagedIterable.toArray(PagedIterable.java:78)
        at org.kohsuke.github.PagedIterable.toArray(PagedIterable.java:106)
        at org.kohsuke.github.PagedIterable.toList(PagedIterable.java:118)
        at io.quarkus.bot.ApproveWorkflow.getContributorStats(ApproveWorkflow.java:189)

I can't reproduce it reliably enough to debug it – was it a network issue on my side? a glitch in the github service? Either way, it's troubling. So I will keep looking at that.

holly-cummins · 2022-09-07T12:44:56Z

Some evidence for the "a glitch in the github api" theory is that I just got an error in the UI, while creating a pull request.

holly-cummins force-pushed the holly-workflow-approver branch from c8bce98 to 56a11ca Compare August 17, 2022 19:27

holly-cummins force-pushed the holly-workflow-approver branch 3 times, most recently from 7957f47 to 3706de5 Compare August 23, 2022 10:09

gsmet requested changes Aug 24, 2022

View reviewed changes

holly-cummins requested a review from gsmet August 31, 2022 17:27

holly-cummins and others added 8 commits September 6, 2022 16:17

Add new workflow approval logic

82a24dc

Update src/main/java/io/quarkus/bot/ApproveWorkflow.java

bfc6d31

Co-authored-by: Guillaume Smet <guillaume.smet@gmail.com>

Cache contributor stats

7a7e2df

Simplify file matching to not distinguish between files and directories

0ca03c7

Add cache configuration for glob-cache

44b6c3c

Rename directories to files to avoid confusion

ac5c84e

I will do it for the existing triage features at some point but it needs to be properly coordinated.

Initialize files rules consistently

422f8fa

Deprecate directories config for triaging in favor of files

40467f4

gsmet force-pushed the holly-workflow-approver branch from edda867 to 40467f4 Compare September 6, 2022 14:39

gsmet approved these changes Sep 6, 2022

View reviewed changes

holly-cummins added 2 commits September 7, 2022 17:06

Narrow scope of listened events, and handle NPEs (a little bit)

1743622

Correct whitespace in docs, because yaml

8792ff2

gsmet approved these changes Sep 9, 2022

View reviewed changes

gsmet merged commit 2dad127 into quarkusio:main Sep 9, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add new workflow approval logic #261

Add new workflow approval logic #261

holly-cummins commented Aug 17, 2022

gsmet commented Aug 18, 2022

holly-cummins commented Aug 23, 2022

gsmet left a comment

gsmet Aug 24, 2022

holly-cummins Aug 24, 2022

gsmet Aug 24, 2022

gsmet Aug 24, 2022 •

edited

gsmet Aug 24, 2022

holly-cummins Aug 25, 2022 •

edited

holly-cummins Aug 25, 2022

gsmet Aug 24, 2022

holly-cummins Aug 24, 2022

gsmet Aug 24, 2022

gsmet Aug 24, 2022

holly-cummins Aug 24, 2022

gsmet Aug 24, 2022

holly-cummins Aug 24, 2022

gsmet Aug 24, 2022

gsmet Aug 24, 2022

holly-cummins Aug 24, 2022

gsmet Aug 25, 2022

holly-cummins Aug 25, 2022

holly-cummins Aug 25, 2022

gsmet Aug 24, 2022 •

edited

holly-cummins Aug 24, 2022

gsmet Aug 24, 2022 •

edited

holly-cummins Aug 24, 2022 •

edited

gsmet Aug 24, 2022

holly-cummins Aug 24, 2022

gsmet commented Sep 6, 2022

holly-cummins commented Sep 7, 2022

holly-cummins commented Sep 7, 2022

Add new workflow approval logic #261

Add new workflow approval logic #261

Conversation

holly-cummins commented Aug 17, 2022

Things we could add next:

Dependencies

gsmet commented Aug 18, 2022

holly-cummins commented Aug 23, 2022

gsmet left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gsmet Aug 24, 2022 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

holly-cummins Aug 25, 2022 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gsmet Aug 24, 2022 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gsmet Aug 24, 2022 • edited

Choose a reason for hiding this comment

holly-cummins Aug 24, 2022 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gsmet commented Sep 6, 2022

holly-cummins commented Sep 7, 2022

holly-cummins commented Sep 7, 2022

gsmet Aug 24, 2022 •

edited

holly-cummins Aug 25, 2022 •

edited

gsmet Aug 24, 2022 •

edited

gsmet Aug 24, 2022 •

edited

holly-cummins Aug 24, 2022 •

edited