CI fails are hard to view #2591
Comments
You can enable GitHub notifications on CI jobs. (Somehow I did this for rust-bitcoin, but I forget how.) On failing PRs you can also click the "Actions" tab up top, which I think gives you a more accessible view.
GitHub Actions CI, like most CI systems, breaks down the moment things are no longer trivial. Generally, wherever I go, sooner or later I end up with just a handful of parallel CI machines/workflows for system-level parallelism, each with a disjoint workload; each one then executes a bash script that actually takes care of running things with e.g. GNU Parallel, outputting only what failed, and doing all sorts of things that are too fancy to express in YAML. In short: IMO the way to go is to avoid touching YAML as much as possible, and use real programming to handle the problem in a scalable and portable way. E.g. just recently I changed our backward-compatibility matrix test, which runs all the tests against a matrix of component versions, to print only the output of the tests that failed: otherwise we would have to download the logs and analyze them manually. At this point I'm tempted to start a whole project of Rust command-line tools for scripting CIs.
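The "print only what failed" pattern described above can be sketched in plain bash (the real setup uses GNU Parallel; this sequential sketch uses a made-up task list just to show the buffering trick):

```shell
#!/usr/bin/env bash
# Minimal sketch: buffer each task's output and print it only on failure.
# The task list here is hypothetical; in a real CI script each entry
# would be a test command.
set -u

tasks=("true" "false" "echo hello")   # hypothetical workloads
summary=""
failed=0

for t in "${tasks[@]}"; do
    log="$(mktemp)"
    if bash -c "$t" >"$log" 2>&1; then
        summary+="PASS: $t"$'\n'
    else
        summary+="FAIL: $t"$'\n'
        cat "$log"                    # show output only for failing tasks
        failed=1
    fi
    rm -f "$log"
done

printf '%s' "$summary"
```

With GNU Parallel the loop body becomes a function run via `parallel`, which buffers per-job output the same way by default.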
We have mostly done this: our GitHub CI pretty much just sets variables and calls our shell script. But we use it to set the variables because then we get parallelism in the GitHub UI, and (I think) we get more compute than we would with local parallelism within a single CI runner.
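That "YAML sets variables, script does the work" split looks roughly like this workflow fragment (the script path and task names are hypothetical):

```yaml
# Hypothetical workflow fragment: the YAML only fans out a matrix and
# sets a variable; all real logic lives in ./contrib/ci.sh (made-up path).
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        task: [lint, unit, integration]   # one GitHub job per entry
    steps:
      - uses: actions/checkout@v4
      - run: ./contrib/ci.sh
        env:
          CI_TASK: ${{ matrix.task }}
```

Each matrix entry shows up as its own job in the GitHub UI, while the script stays runnable locally.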
GitHub will only allocate a fixed maximum number of VMs running at a time for a project (I think 20?), so above a certain number (depending on how many PRs typically run at the same time) things just start queuing and one needs to wait longer. In addition, every VM has an initialization cost, so too many of them just makes things slower. But yeah, some VM-level (workflows, jobs) parallelism is great. I now see rust-bitcoin has lots of jobs. IMO these would be better expressed as one GNU Parallel run, with the benefit that it would run in parallel locally as well. See the output from a GNU Parallel run that failed: https://github.com/fedimint/fedimint/actions/runs/8269976680/job/22626478246 . We get the output only from the ones that failed, plus a summary. We could make the summary prettier or output only the things that failed, but no one has complained about it yet. :D Another question is: do you really need to test so many of these on every PR? Possibly a PR could run a subset, and the MQ (merge queue) could run the whole thing, or something like that.
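The "subset on PRs, everything in the merge queue" idea can be expressed with the `merge_group` trigger (a sketch with made-up job names and script arguments):

```yaml
# Hypothetical sketch: run a quick subset on every PR, and the full
# matrix only in the merge queue (the `merge_group` event fires there).
on:
  pull_request:
  merge_group:

jobs:
  quick:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./ci.sh quick          # made-up script/argument
  full-matrix:
    if: github.event_name == 'merge_group'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./ci.sh full
```

The `quick` job runs everywhere; `full-matrix` is skipped on PRs and only gates the actual merge.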
Anecdotally, I've found CI way slower since we added the 180-odd jobs.
I'd really like to avoid getting green CI and then having @apoelstra tell me that his local CI fails while he is trying to merge; that slows the process down. But I'd also like to run a more minimal set of CI jobs quickly to get better feedback. That is why I wrote
Proving to onlookers that the commit passes CI. If I look back at a PR from years ago where some strange change to cargo or my OS causes attempts to build that commit to fail, but I see GitHub CI passed at the time... I am more likely to write it off as "old stuff breaks sometimes" instead of making "DID U GAIZ ACKSHUALLY TEST THIS GARBAGE!" accusations.
Also, it catches the case where someone looks at the code, thinks "this should not affect anything" and tries to merge it with just a utACK. Which doesn't really matter for this project in particular, but it's nice. |
CI covers several things that my local CI doesn't -- notably stuff like MSan/ASan and testing on other architectures. I might also not be testing examples; IIRC that was hard/manual to set up. Conversely, my local CI tests every commit and has a larger feature matrix for the unit tests. I've seen many cases where one would pass but not the other. As for having CI run a random subset... Murphy's law says that literally the first PR after we do that will exhibit some intermittent CI failure that goes undetected, and then master will be broken :).
Yeah, I think this is true. Certainly, waiting for any specific job to run is way slower. I think we should re-consolidate some of the jobs and parallelize them using
We now have 187 CI jobs. When a job fails, one has to scroll through the list to find it; this is wearing out my mouse wheel and my patience. Is there a better way to view the failed jobs?
(I did try a web search, unsuccessfully.)
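For the original question, one option (assuming the GitHub CLI `gh` is installed and authenticated) is to skip the web UI entirely; `gh run view --log-failed` prints only the logs of failed steps:

```shell
# List recent workflow runs for the current repo, then show
# only the failed-step logs for a given run.
gh run list --limit 10
gh run view <run-id> --log-failed
```

Here `<run-id>` is a placeholder for the run ID shown by `gh run list`.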