CI fails are hard to view #2591
Comments
You can enable GitHub notifications on CI jobs. (Somehow I did this for rust-bitcoin, but I forget how.) On failing PRs you can also click the "Actions" tab up top, which I think gives you a more accessible view.
GitHub Actions CI, like most CI systems, breaks down the moment things are no longer trivial. Generally, wherever I go, sooner or later I end up with just a handful of parallel CI machines/workflows for system-level parallelism, each with a disjoint workload; each one then executes a bash script that actually takes care of running things with e.g. GNU Parallel, outputting only what failed, and doing all sorts of things that are too fancy to express in YAML. In short: IMO the way to go is to avoid touching YAML as much as possible, and use real programming to handle the problem in a scalable and portable way. E.g. just recently I changed our backward-compatibility matrix test, which runs all the tests against a matrix of component versions, to print only the output of the tests that failed: otherwise we would have to download the logs and analyze them manually. At this point I'm tempted to start a whole project of Rust command-line tools for scripting CIs.
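The "print only what failed" pattern described above can be sketched in plain bash (the real setup uses GNU Parallel; this sequential sketch uses a made-up task list just to show the buffering trick):

```shell
#!/usr/bin/env bash
# Minimal sketch: buffer each task's output and print it only on failure.
# The task list here is hypothetical; in a real CI script each entry
# would be a test command.
set -u

tasks=("true" "false" "echo hello")   # hypothetical workloads
summary=""
failed=0

for t in "${tasks[@]}"; do
    log="$(mktemp)"
    if bash -c "$t" >"$log" 2>&1; then
        summary+="PASS: $t"$'\n'
    else
        summary+="FAIL: $t"$'\n'
        cat "$log"                    # show output only for failing tasks
        failed=1
    fi
    rm -f "$log"
done

printf '%s' "$summary"
```

With GNU Parallel the loop body becomes a function run via `parallel`, which buffers per-job output the same way by default.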
We have mostly done this: our GitHub CI pretty much just sets variables and calls our shell script. But we use it to set the variables because then we get parallelism in the GitHub UI, and (I think) we get more compute than we would with local parallelism within a single CI runner.
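That "YAML sets variables, script does the work" split looks roughly like this workflow fragment (the script path and task names are hypothetical):

```yaml
# Hypothetical workflow fragment: the YAML only fans out a matrix and
# sets a variable; all real logic lives in ./contrib/ci.sh (made-up path).
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        task: [lint, unit, integration]   # one GitHub job per entry
    steps:
      - uses: actions/checkout@v4
      - run: ./contrib/ci.sh
        env:
          CI_TASK: ${{ matrix.task }}
```

Each matrix entry shows up as its own job in the GitHub UI, while the script stays runnable locally.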
GitHub will only allocate a fixed maximum number of VMs running at a time for a project (I think 20?), so above a certain number (depending on how many PRs typically run at the same time) things just start queuing and one needs to wait longer. In addition, every VM has an initialization cost, so too many of them just makes things slower. But yeah, some VM-level (workflows, jobs) parallelism is great. I now see rust-bitcoin has lots of jobs. IMO these would be better expressed as one GNU Parallel run, with the benefit that it would run in parallel locally as well. See the output from a GNU Parallel run that failed: https://github.com/fedimint/fedimint/actions/runs/8269976680/job/22626478246 . We get the output only from the ones that failed, plus a summary. We could make the summary prettier or output only the things that failed, but no one has complained about it yet. :D Another question is: do you really need to test so many of these on every PR? Possibly a PR could run a subset, and the MQ (merge queue) could run the whole thing, or something like that.
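The "subset on PRs, everything in the merge queue" idea can be expressed with the `merge_group` trigger (a sketch with made-up job names and script arguments):

```yaml
# Hypothetical sketch: run a quick subset on every PR, and the full
# matrix only in the merge queue (the `merge_group` event fires there).
on:
  pull_request:
  merge_group:

jobs:
  quick:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./ci.sh quick          # made-up script/argument
  full-matrix:
    if: github.event_name == 'merge_group'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./ci.sh full
```

The `quick` job runs everywhere; `full-matrix` is skipped on PRs and only gates the actual merge.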
Anecdotally, I've found CI way slower since we added the 180-odd jobs.
I'd really like to avoid getting green CI and then having @apoelstra tell me that his local CI fails while he is trying to merge; that slows the process down. But I'd also like to run a more minimal set of CI jobs quickly to get better feedback. That is why I wrote
Proving to onlookers that the commit passes CI. If I look back at a PR from years ago where some strange change to cargo or my OS causes attempts to build that commit to fail, but I see GitHub CI passed at the time... I am more likely to write it off as "old stuff breaks sometimes" instead of making "DID U GAIZ ACKSHUALLY TEST THIS GARBAGE!" accusations.
Also, it catches the case where someone looks at the code, thinks "this should not affect anything" and tries to merge it with just a utACK. Which doesn't really matter for this project in particular, but it's nice. |
CI covers several things that my local CI doesn't -- notably stuff like MSan/ASan and testing on other architectures. I might also not be testing examples; IIRC that was hard/manual to set up. Conversely, my local CI tests every commit and has a larger feature matrix for the unit tests. I've seen many cases where one would pass but not the other. As for having CI run a random subset... Murphy's law says that literally the first PR after we do that will exhibit some intermittent CI failure that goes undetected, and then master will be broken :).
Yeah, I think this is true. Certainly, waiting for any specific job to run is way slower. I think we should re-consolidate some of the jobs and parallelize them using
We now have 187 CI jobs. When a job fails, one has to scroll through the list to find it; this is wearing out my mouse wheel and my patience. Is there a better way to view the failed jobs?
(I did try a web search, unsuccessfully.)
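For the original question, one option (assuming the GitHub CLI `gh` is installed and authenticated) is to skip the web UI entirely; `gh run view --log-failed` prints only the logs of failed steps:

```shell
# List recent workflow runs for the current repo, then show
# only the failed-step logs for a given run.
gh run list --limit 10
gh run view <run-id> --log-failed
```

Here `<run-id>` is a placeholder for the run ID shown by `gh run list`.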