Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new workflow approval logic #261

Merged
merged 10 commits into from Sep 9, 2022

Conversation

holly-cummins
Copy link
Contributor

Resolves #256, although we could go further and add more complex checks.

Things we could add next:

  • More complex tests for users
    ** Are they a member of a large org?
    ** How old is their account?
    ** How long have they been contributing to this repository?

  • Comments on work items for approvals and non-approvals.

    • Tagging of people to have a look at work items which were not approved.

Dependencies

Before this PR can be merged, we need

I had to add the lenient() keyword to avoid failures with mvn install although mvn quarkus:dev passed clean, which I think may be related to #260.

@gsmet
Copy link
Member

gsmet commented Aug 18, 2022

@holly-cummins I upgraded Quarkus GitHub App with the latest GitHub API and released it. It's all upgraded in main now so you just need to rebase this.

@holly-cummins holly-cummins force-pushed the holly-workflow-approver branch 3 times, most recently from 7957f47 to 3706de5 Compare August 23, 2022 10:09
@holly-cummins
Copy link
Contributor Author

Thanks for pulling that all through so quickly, @gsmet - I've rebased, and also fixed some test issues that crept in when I ran mvn net.revelc.code.formatter:formatter-maven-plugin:2.20.0:format because yaml.

Copy link
Member

@gsmet gsmet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. I added some comments inline.

src/main/java/io/quarkus/bot/ApproveWorkflow.java Outdated Show resolved Hide resolved
Comment on lines 184 to 192
GHRepositoryStatistics statistics = workflowPayload.getRepository().getStatistics();
if (statistics != null) {
PagedIterable<GHRepositoryStatistics.ContributorStats> contributors = statistics.getContributorStats();
for (GHRepositoryStatistics.ContributorStats contributor : contributors) {
if (workflowPayload.getSender().getLogin().equals(contributor.getAuthor().getLogin())) {
return contributor;
}
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried about this. With a repository the size of Quarkus, it's going to trigger quite a lot of queries if you're not lucky. Also this triggers a statistics computation on the GH side if not already cached and it takes quite a lot of time (and I have no idea of the refresh rate).

I think we need a cache for this with a list of the already validated user logins.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I'll update with some caching. How long-lived is the app process? Will a static map do, or will that be reloaded too often?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, I think a static map would do. My guess is that at some point we could have this stored into a repository file and have commands to actually add someone to this file. But that's something for another day I would say.

Copy link
Member

@gsmet gsmet Aug 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh just to be clear: you need to add to the Map only the ones who qualifies for it to work if you're going fully static. Because once they qualify, we can be sure they always qualify.

Also, I'm wondering if we should use https://docs.github.com/en/rest/search#search-issues-and-pull-requests instead. With this API, we can search by author IIRC (it's the API used in the pull requests page so author:someone should work). And you could count the pull requests merged instead of relying on the statistics.

If we go this path, I think a cache with the ones who qualify will be good enough. If we don't, then we might have to have a periodically refreshed cache for the ones who don't.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like is:pr is:merged author:gsmet and then you would define the pageSize to the limit, call nextPage() on the PagedIterator and check if you have at least limit elements.

I think that's probably the most efficient way to check the number of contributions. Also better to base this thing on the number of PR merged rather than the number of commits so it's all win.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've implemented caching based on the stats, so we can see what it looks like.

I think no matter which way we go, we'd need to cache a count of contributions for each user, not just a yes/no per user, because the rules could change. For example, if we say a user needs 5 contributions, and then we realise that bar is too low and all sorts of dubious stuff is getting through so we change the bar to 20, we don't want all the pre-approved people to stay pre-approved. The whole reason we changed the rule was to de-approve them. :) It would be similar in the opposite direction, if we realised we had the bar too high and wanted to lower it to reduce admin load, we'd want that change to apply with immediate effect to people who previously hadn't qualified. And it's so easy to cache the count I don't think there's any particular downside.

Whether we do repository stats or query for merged PRs is a harder call, I think. There’s some advantages to keeping the stats:

  • It’s already written, I am lazy, and mocking against the github api is tedious :)
  • It allows us to extend to add new rules like “how long has this person been contributing", which is information that's already in the stats
  • In the stats, ‘total’ is an instance /top-level json variable so there’s no further API requests to get it
  • It’s exactly the information we need, so it’s possible we get the benefit of github's server-side caching for it in a way we wouldn’t if we tried to assemble the information ourself (https://api.github.com/repos/quarkusio/quarkus/contributors appears to be near-instantaneous). Asking for the PRs for a user would mean a call per user, whereas with the repository stats a smaller number of calls that would allow us to cache the stats for every user.

On the other hand, I think the point about merged PRs being more relevant than contributions is a good one. So I don't know.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One other argument in favour of doing the request for each user is that what I had to implement was kind of ugly, with two levels of caching. In an ideal world, we would fetch the paged iterable of stats, and cache individual contributors as we traverse it. That way we might entirely warm the cache after we retrieve just one or two users. However, we can't do that until we have quarkusio/quarkus#27488 in our Quarkus.

Without the ability to manually populate the cache, I ended up doing one cache for individual users, and a second cache for the results of fetching the repository stats. The first cache is perhaps redundant, but I left it in since it's semantically closer to what we actually want to cache.

README.adoc Outdated
Comment on lines 218 to 222
unless:
directories:
- ./github
files:
- bad.xml
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how useful this specific part will be in practice. You can execute whatever you want on CI by adjusting the Java code of particular test.

Not saying you should remove it, it's there, let's keep it. But in the Quarkus repo, I think we won't be able to use it if we really want to be safe.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know what you mean. I was imagining people might want to restrict the .github folder, since that's where the workflows live, and the various levels of pom.xml, since changed/added dependencies is a common vector for malware. I agree someone determined enough could write something in Java without extra dependencies, but a check on pom.xml might increase the bar for how determined they'd have to be. The downside of the restrictions on pom.xml is that many non-trivial features would also change it, so restricting it would mean more manual approval work.

It might be worth changing the example to explicitly suggest pom.xml (realistic) rather than bad.xml (clear, but useless).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well you can change it to pom.xml but really no files are safe as long as they are executed. So pom, any Java file being actually run, any workflow file and maybe others I'm not thinking about.

README.adoc Outdated
minContributions: 5
unless:
directories:
- ./github
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- ./github
- .github

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh good spot, thank you. The matching on the directory name does need the beginning dot-slash (which is annoying), so I think it should actually be ./.github. That looks a bit like bad ascii art collided with a product name, but I think it's correct. Do you agree?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm. I don't think so. But I think it's the same issue that the one below about endsWith/startsWith so let's talk there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll make a PR and do some logging to confirm the filenames we get from the github api are fully qualified. It's not in the test json in the repo, annoyingly, but I'm sure I saw it go by in my test logging (because I was a bit surprised it was the way it was).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes they are fully qualified. And I understand now what you’re trying to do.
I’ll get the PR and have a look at the code tomorrow.

I missed why you differentiated files and directories.
My intent with these filters was: either you are pushing a path from the root or you are using a glob.
With files you’re doing something else. But I’m not sure it’s such a good idea because I’m not sure you will be able to be consistent if you want to say exclude all Java files. You will then have to use a full glob with the directory present (well **/*.java) which is not obvious.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ll have a look tomorrow and think a bit about it. Don’t change anything yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree for *.java you need a glob anyway and then the benefit of having a bit less globbing is marginal. Probably the best use case is for pom.xml where the filename is so standardised.

Which use case is more likely probably depends on what kind of attacks are more likely. Secret and credential extfiltration would be super-easy with any java file change. Something like bitcoin mining would probably pull in some new dependencies and change the pom.xml (or add something into the workflow, but that's covered by directories). https://www.trendmicro.com/vinfo/us/security/news/cybercrime-and-digital-threats/github-action-runners-analyzing-the-environment-and-security-in-action mentions the threats being

  • printing secrets (github filters logs to guard against it)
  • cryptomining (github sometimes detects)
    ... and then it goes very deep and a bit over my head. :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK so I thought about it a bit more and I think having very different behaviors for files and directories is confusing.

I think I would rather have only one files entries with the current directories behavior and rely on glob patterns for files. That would be more consistent and less confusing.
At some point, I would also rename directories to files for the other rule but it needs careful deployment so I'll do that later specifically.
Also probably a good idea to cache the compiled patterns in a map if we end up using them a lot.
I can do it if you want as it's something I should probably have done before and is a bit orthogonal to your work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, that makes lots of sense! I think renaming to 'files' is the un-locker that makes the behaviour more clear.

I guess the one thing that wouldn't be entirely obvious in the current/proposed behaviour is

files:
  - ./somedir

would match ./somedir/myfile even though there's no wildcard in it. I can just about make that make sense (to me), though, since a directory is a file so somedir is a file, and if we include it, of course we include its descendants.

It makes less sense that

files:
  - ./somed

would match ./somedir/myfile even though somed is not somedir. That's orthogonal to this change, though, since it's the existing behaviour. If we were really bothered we could enforce globbing and remove startsWith, but I'm not sure it's worth the effort. :)

I'll make the model change to my rules and see how I go with caching the patterns, since I'm on a roll with caching.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

edda867 has the update (reversion of 'files' except that 'directories is now 'files'), and also the caching.
If nothing else, it's a lot less code. :)

for (String file : files) {

if (!file.contains("*")) {
if (changedFile.getFilename().endsWith(file)) {
Copy link
Member

@gsmet gsmet Aug 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, it used to be startsWith in the original implementation you copied but you changed it to endsWith. I don't think that's what we want.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to use endsWith because getFilename returns the fully qualified name of the file, including the path. So for directories it's startsWith and for files it's endsWith. Doing proper path parsing to extract the basename would be safer, though. (And on the directory path, doing path parsing to extract the path and drop the basename might be a good idea, except that it would change the existing semantics for Triage.)

Looking at it again, I also see that the glob path is common across the directory and file methods. If it's truly common it should be extracted, but I suspect it shouldn't be common. At the very least, the basename should be passed in in the file path.

Copy link
Member

@gsmet gsmet Aug 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand what you're saying. The existing code is working very well for pull requests triaging so I'm not sure why it became a problem.
Do you have an actual example of what's not working?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing code still works fine, I just added a new requirement.

Existing config:

                                    triage:
                                      rules:
                                        - directories:
                                            - foo/*
                                            - bar/*

It matches on directories, using startsWith().

Additional config:

                            workflows:
                                  rules:
                                    - allow:
                                        directories:
                                            - dir1
                                        files:
                                            - file1

It matches on directories, using startsWith(), and on files, using endsWith() (because to match someDir/someOtherDir/file1 against 'file1' you have to use endsWith). The startsWith() logic for directories is the same as what's used by the triage rule, so that codepath is shared across the two rules. It hasn't changed except by being pulled out to its own file. The files rule is unique to workflows so it's new code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's where we disagree then. For matching a file that is a directory and not at the root, you have to use a glob pattern and make it **/your-file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the above example would have to be:

                            workflows:
                                  rules:
                                    - allow:
                                        directories:
                                            - ./dir1 [forgot the leading slash in the above example!]
                                            - **/file1

It seems not entirely user-friendly, and like it would need extra explanation in the docs. ("If you want to specify a filename, put it in the 'directories' rule with globbing.") At the moment, with what I've implemented, the above all works (exactly as it did before), but as a bonus, there's a more-concise files: option.

@gsmet
Copy link
Member

gsmet commented Sep 6, 2022

@holly-cummins I pushed a couple of small changes. The most significant one being the renaming of directories to files. I think it's less confusing and that's what I had in mind.
I put in place a compatibility layer for the existing triaging feature.

If you have the set up all in place, could you build the bot in native (you can start it in native, it will start the smee.io proxy as always) and make sure things work. My main issue is with the Caffeine caches: depending on the configuration, they trigger the usage of different classes and not all cache implementations are included in Quarkus.
From what I can see, it starts, but I'm not sure all the caches are set up at startup.

If you can just check the feature works in native, that should be enough.

As for the stats, let's start this way and see how it goes.

@holly-cummins
Copy link
Contributor Author

Thanks so much for catching the directories/files mixup. If you'd asked me what I had done, I would sworn it was files - but the evidence of the PR says otherwise. :) I agree that files is clearer, and it's what I'd intended.

I'm working on testing on native, but even before getting there, I've hit an issue. I'm seeing an intermittent failure fetching all contributors:

Exception: java.lang.NullPointerException
        at java.base/java.util.Objects.requireNonNull(Objects.java:208)
        at org.kohsuke.github.PagedIterator.nextPageArray(PagedIterator.java:140)
        at org.kohsuke.github.PagedIterable.toArray(PagedIterable.java:78)
        at org.kohsuke.github.PagedIterable.toArray(PagedIterable.java:106)
        at org.kohsuke.github.PagedIterable.toList(PagedIterable.java:118)
        at io.quarkus.bot.ApproveWorkflow.getContributorStats(ApproveWorkflow.java:189)

I can't reproduce it reliably enough to debug it – was it a network issue on my side? a glitch in the github service? Either way, it's troubling. So I will keep looking at that.

@holly-cummins
Copy link
Contributor Author

Some evidence for the "a glitch in the github api" theory is that I just got an error in the UI, while creating a pull request.

image

@gsmet gsmet merged commit 2dad127 into quarkusio:main Sep 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Finer-grained of workflow runs, between 'anyone but first-time contributors' and 'only committers'
2 participants