CI fails are hard to view #2591

Open

tcharding opened this issue Mar 13, 2024 · 9 comments

Comments

@tcharding
Member

We now have 187 CI jobs. When a job fails, one has to scroll through the list to find it; this is wearing out my mouse wheel and my patience. Is there a better way to view the failed jobs?

(I did try a web search, unsuccessfully.)

@apoelstra
Member

You can enable GitHub notifications on CI jobs. (Somehow I did this for rust-bitcoin, but I forget how.)

On failing PRs you can also click the "Actions" tab up top, which I think gives you a more accessible view.
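For example, a rough sketch using the GitHub CLI (gh), assuming it is installed and authenticated; the PR number below is a placeholder:

```bash
# List the checks on a PR, then show only the logs of failed steps from the
# most recent failed workflow run. The PR number 1234 is a placeholder.
gh pr checks 1234
RUN_ID=$(gh run list --status failure --limit 1 --json databaseId --jq '.[0].databaseId')
gh run view "$RUN_ID" --log-failed
```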

@dpc
Contributor

dpc commented Mar 14, 2024

GitHub Actions CI, not unlike most CI systems, breaks down the moment things are no longer trivial.

Generally, no matter where I go, sooner or later I end up with just a handful of parallel CI machines/workflows for system-level parallelism, each with a disjoint workload. Each one then executes a bash script that actually takes care of running stuff with e.g. GNU Parallel, outputting only the things that failed, and doing all sorts of things that are too fancy to express in YAML.

In short: IMO the way to go is to avoid having to touch any YAML, and use real programming to handle the problem in a scalable and portable way.
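A minimal sketch of that pattern (the ./ci/run_task.sh script and the jobs.txt task list are made-up names, not an existing layout): each task's output is buffered and printed only if it failed, and GNU Parallel's exit status fails the surrounding CI step when anything failed.

```bash
#!/usr/bin/env bash
# Sketch: run every task listed in jobs.txt through GNU Parallel, buffering
# each task's output and printing it only when the task fails.
set -euo pipefail

run_one() {
    local task="$1"
    local log
    log="$(mktemp)"
    if ./ci/run_task.sh "$task" >"$log" 2>&1; then
        echo "ok   $task"
    else
        echo "FAIL $task"
        cat "$log"
        return 1
    fi
}
export -f run_one

# One task name per line in jobs.txt; parallel exits non-zero if any task
# failed, which fails the CI step under `set -e`.
parallel --jobs "$(nproc)" run_one :::: jobs.txt
```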

E.g. just recently I changed our backward-compatibility matrix test, which runs all the tests against a matrix of component versions, to print only the output of the tests that failed:

fedimint/fedimint#4526

because otherwise we would have to download the logs and analyze them manually.

At this point I'm tempted to have a whole project with Rust command line tools for scripting CIs.

@apoelstra
Member

We have mostly done this - our GitHub CI pretty much just sets variables and calls our shell script. But we use it to set the variables because then we get parallelism in the GitHub UI and (I think) we get more compute than we would if we were doing local parallelism within a single CI runner.
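Roughly, the shape is something like the following sketch (the variable names and task split are illustrative, not the actual script): the YAML matrix only exports a few environment variables per job, and the shell script does all the real work.

```bash
#!/usr/bin/env bash
# Illustrative only: TOOLCHAIN/DO_LINT/DO_DOCS are made-up variable names.
# The workflow's matrix sets them per job and then just runs this script.
set -euo pipefail

: "${TOOLCHAIN:=stable}"
: "${DO_LINT:=false}"
: "${DO_DOCS:=false}"

rustup run "$TOOLCHAIN" cargo test --all-features

if [ "$DO_LINT" = "true" ]; then
    rustup run "$TOOLCHAIN" cargo clippy --all-targets -- -D warnings
fi

if [ "$DO_DOCS" = "true" ]; then
    rustup run "$TOOLCHAIN" cargo doc --no-deps
fi
```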

@dpc
Contributor

dpc commented Mar 14, 2024

GitHub will only allocate a fixed maximum number of VMs running at a time for a project (I think 20?), so above a certain number (depending on how many PRs typically run at the same time) things just start queuing and one needs to wait longer. In addition, every VM has an initialization cost, so too many of them just makes things slower. But yeah - some VM-level (workflows, jobs) parallelism is great.

I now see rust-bitcoin has lots of jobs. IMO these would be better expressed as one GNU Parallel run, with the added benefit that it would run in parallel locally as well.

See the output from a GNU Parallel run that failed: https://github.com/fedimint/fedimint/actions/runs/8269976680/job/22626478246 . We get the output from only the jobs that failed, plus a summary. We could make the summary prettier or output only the things that failed, but no one has complained about it yet. :D
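For reference, a rough sketch of how such a summary can be produced from GNU Parallel's --joblog (reusing the run_one and jobs.txt names from the earlier sketch; the joblog column layout is per parallel's man page):

```bash
# Keep a job log so a summary can be printed even when some tasks fail.
parallel --joblog ci-joblog.tsv run_one :::: jobs.txt || CI_FAILED=1

# The joblog is tab-separated: column 7 is the exit value, column 9 the command.
echo "== Summary =="
awk -F'\t' 'NR > 1 { print ($7 == 0 ? "PASS" : "FAIL"), $9 }' ci-joblog.tsv

exit "${CI_FAILED:-0}"
```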

Another question is - do you really need to test so many of these on every PR? Possibly a PR could run a subset, and the MQ (merge queue) could run the whole thing, or something like that.

@tcharding
Member Author

Anecdotally, I've found CI way slower since we added the 180-odd jobs.

@tcharding
Member Author

Another question is - do you really need to test so many of these on every PR? Possibly a PR could run a subset, and the MQ (merge queue) could run the whole thing, or something like that.

I'd really like to avoid getting green CI and then having @apoelstra tell me that his local CI fails while he is trying to merge; that slows the process down. But I'd also like to run a more minimal set of CI jobs quickly to get faster feedback. That is why I wrote just sane, and it works pretty well, but it's missing some pieces (e.g. using the correct nightly version). This does beg the question: if devs are using local builds to check their work and the primary merge guy is using local builds to do pre-merge checks, what is CI doing?
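(A rough sketch of the missing nightly piece, assuming the pinned toolchain is recorded in a nightly-version file at the repo root:)

```bash
# Read the pinned nightly from a file (here assumed to be ./nightly-version,
# e.g. containing "nightly-2024-03-01") so local checks use the same
# toolchain as CI.
NIGHTLY="$(cat nightly-version)"
rustup run "$NIGHTLY" cargo fmt --all -- --check
rustup run "$NIGHTLY" cargo doc --no-deps
```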

@junderw
Contributor

junderw commented Mar 15, 2024

what is CI doing?

Proving to onlookers that the commit passes CI.

If I look back on a PR from years ago where some strange change to cargo or my OS causes attempts to build that commit to fail, but I see GitHub CI passed at that time... I am more likely to write it off as "old stuff breaks sometimes" instead of making "DID U GAIZ ACKSHUALLY TEST THIS GARBAGE!" accusations.

@junderw
Contributor

junderw commented Mar 15, 2024

Also, it catches the case where someone looks at the code, thinks "this should not affect anything" and tries to merge it with just a utACK.

Which doesn't really matter for this project in particular, but it's nice.

@apoelstra
Member

CI covers several things that my local CI doesn't -- notably stuff like msan/asan and testing on other architectures. I might also not be testing examples; IIRC that was hard/manual to set up.

Conversely my local CI tests every commit and has a larger feature matrix for the unit tests.

I've seen many cases where one would pass but not the other.

As for having CI run a random subset ... Murphy's law says that literally the first PR after we do that will exhibit some intermittent CI failure that isn't detected and then master will be broken :).

Anecdotally, I've found CI way slower since we added the 180-odd jobs.

Yeah, I think this is true. Certainly, waiting for any specific job to run is way slower. I think we should re-consolidate some of the jobs and parallelize them using parallel or something.
