Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update s390x dockerfile #1716

Merged
merged 3 commits into from May 17, 2024
Merged

Update s390x dockerfile #1716

merged 3 commits into from May 17, 2024

Conversation

Dead2
Copy link
Member

@Dead2 Dead2 commented Apr 24, 2024

s390x VM has been upgraded from RHEL 8 to Rocky Linux 9.3.
Rocky 9 comes with Clang-16 and GCC-12 by default, among other significant updates.

This PR updates the podman image and removes the need for running actions-runner using qemu x86-64 emulation.

  • New dockerfile
    • Using native actions-runner instead of relying on qemu.
    • To support s390x, we include patches to actions-runner, these might need to be updated following actions-runner changes.
    • Compiles latest tagged version of actions-runner.
    • Using Almalinux 9 instead of Ubuntu, with functional .Net.
  • Update CI workflow.
    • Split into separate jobs for s390x native and s390x emulated.
    • Now all s390x jobs will run on native when run from zlib-ng/zlib-ng.
    • Update llvm version used for MSAN jobs.
  • Update readme guide.

@Dead2
Copy link
Member Author

Dead2 commented Apr 24, 2024

I got actions-runner updated and working (tested with actual CI jobs), but then managed to break qemu-user-static.
@iii-i Help!? Your current qemu-user-static image complains it is for x86-64 instead of s390x.

I compiled my own (only took 6 hours to figure out how), but actions-runner now fails when attempting to run CI jobs.
Apr 24 16:10:23 zlib-ng01 actions-runner[28644]: not a dynamic executable

@Dead2 Dead2 requested a review from iii-i April 24, 2024 20:16
@Dead2
Copy link
Member Author

Dead2 commented Apr 24, 2024

It is kind of weird, because it seems like actions-runner sometimes succeeds at running the whole compile and many tests, but fails to report that something in the tests fails (and with it the relevant part of the test logs). The actions-runner logs show that the job failed and everything looks ok, but github shows that it is timing out waiting for data.

The not a dynamic executable error seems to only happen during actions-runner startup, but no details about what executable that would or what it is attempting to do. Not sure how that can cause the job to fail uploading later on, when it uploads the logs from the start (long after those errors), so it might be unrelated.

Might try to downgrade actions-runner to a previous version tomorrow to see whether that fixes the problem.

@Dead2
Copy link
Member Author

Dead2 commented Apr 25, 2024

Just tested downgrading actions-runner to 2.315.0, still crashes.
But I noticed now that Runner.Worker coredumps when it fails. I am guessing it probably is related to qemu-s390x emulating x86-64 to run actions-runner. Perhaps related to the bug @iii-i mentioned where qemu had problems running .Net or somesuch, but IIRC that was supposed to be fixed years ago.

I wish actions-runner supported s390x natively, unfortunately the PR to add that to actions-runner seems to have stalled 1.5 years ago.

Apr 25 05:24:19 zlib-ng01 systemd-coredump[2850]: Resource limits disable core dumping for process 2549 (Runner.Worker).
Apr 25 05:24:19 zlib-ng01 systemd-coredump[2850]: [🡕] Process 2549 (Runner.Worker) of user 1001 dumped core.
░░ Subject: Process 2549 (Runner.Worker) dumped core
░░ Defined-By: systemd
░░ Support: https://wiki.rockylinux.org/rocky/support
░░ Documentation: man:core(5)
░░
░░ Process 2549 (Runner.Worker) crashed and dumped core.
░░
░░ This usually indicates a programming error in the crashing program and
░░ should be reported to its vendor as a bug.
Apr 25 05:24:19 zlib-ng01 systemd[1]: systemd-coredump@1-2849-0.service: Deactivated successfully.
░░ Subject: Unit succeeded
░░ Defined-By: systemd
░░ Support: https://wiki.rockylinux.org/rocky/support
░░
░░ The unit systemd-coredump@1-2849-0.service has successfully entered the 'dead' state.

@Dead2
Copy link
Member Author

Dead2 commented Apr 25, 2024

Found an Actions Runner that should work on System Z here:
https://github.com/anup-kodlekere/gaplib

Did not test it yet, as it'd likely take me several hours to figure out how to convert our current setup to use this instead, but it is a possibility.

@iii-i
Copy link
Member

iii-i commented Apr 29, 2024

eBPF CI has a similar setup as zlib-ng (https://github.com/libbpf/ci), and they migrated to aptman/qus:d7.1 (https://github.com/libbpf/ci/blob/5618ba5cc00e40916ecddccbd5f885b29b83b68e/ansible/roles/qemu-user-static/tasks/main.yml, https://github.com/libbpf/ci/blob/5618ba5cc00e40916ecddccbd5f885b29b83b68e/ansible/roles/qemu-user-static/defaults/main.yml). I suggest to give it a try - it should be a drop-in replacement for the current image (https://github.com/dbhi/qus?tab=readme-ov-file#setup).

gaplib is also an option, but I didn't have a change to experiment with it yet.

@Dead2 Dead2 force-pushed the s390x-CI-update branch 2 times, most recently from 621bcfb to 8a9ea6f Compare April 30, 2024 17:32
@Dead2
Copy link
Member Author

Dead2 commented Apr 30, 2024

I got gaplib working, so actions-runner is running natively instead of depending on qemu.

Just needs some more iterations for fixing the last of the compile errors (missing deps).

@Dead2 Dead2 force-pushed the s390x-CI-update branch 10 times, most recently from fae1e53 to 40398d0 Compare May 1, 2024 20:14
@Dead2
Copy link
Member Author

Dead2 commented May 2, 2024

@iii-i Things are working pretty well now. Could you please do a review?

The only thing I could not figure out was how to use a volume for keeping data between runs. That means actions runner currently fails to unregister (since it has lost the identity files). When it "registers" again, it just gets the same runner-id as previously, so at least we don't end up with thousands of "runners" in the github system.
Not sure whether that needs to be fixed or not.

BTW, I noticed a few other things that we might want to improve at some point, so if anyone wants to tackle these, feel free:

  1. Cache the compiled MSAN libc++ in the github system, so we don't have to recompile it twice for every PR/push. It practically never changes anyway, so having it cached per-arch, perhaps with version-number+revision as an id so we can force an update.
  2. When compiling MSAN libc++, we currently do a shallow git clone of llvm, that takes about a whole minute. Might be faster to download and extract a tar file instead.
  3. For writing files during cmake tests, it would be nice if we could choose the path for the temporary files. This could let us write them to /tmp for example, useful for example if it is mounted as tmpfs to avoid disk writes.

@Dead2 Dead2 requested review from nmoinvaz and mtl1979 May 2, 2024 09:33
@Dead2
Copy link
Member Author

Dead2 commented May 2, 2024

I notice the conditionals for workflow jobs don't work in a matrix. Will try to fix that later, found a slightly ugly solution that supposedly works.
See: https://github.com/orgs/community/discussions/9044

What I am wondering is whether the name: line can come after the new - if: line, that example removes the name: tag entirely but I am not sure whether that is required or an unintended change.

+ <MSBuild Targets="Publish" Projects="@(ProjectFiles)" BuildInParallel="false" StopOnFirstFailure="true" Properties="Configuration=$(BUILDCONFIG);PackageRuntime=$(PackageRuntime);Version=$(RunnerVersion);$(PublishRuntimeIdentifier);PublishDir=$(MSBuildProjectDirectory)/../_layout/bin" />
<Exec Command="%22$(DesktopMSBuild)%22 Runner.Service/Windows/RunnerService.csproj /p:Configuration=$(BUILDCONFIG) /p:PackageRuntime=$(PackageRuntime) /p:OutputPath=%22$(MSBuildProjectDirectory)/../_layout/bin%22" ConsoleToMSBuild="true" Condition="'$(PackageRuntime)' == 'win-x64' Or '$(PackageRuntime)' == 'win-x86' Or '$(PackageRuntime)' == 'win-arm64'" />
</Target>

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding empty line at end of file here... Some file formats don't like empty lines at end of a file. Lint warns about having more than one empty line at end of file.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed, but it is not really something I can do anything about without manually editing the patch file.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The easiest solution would be to disable linting for patch files... That would also mean that patch files would need to be manually reviewed for any style errors that we actually care about.

Real solution would be that Lint would need to ignore any line not starting with `+ ´ (plus sign followed by single ASCII 32 aka SPACE character).

Copy link
Collaborator

@mtl1979 mtl1979 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to disable Lint CI run for *.patch files as those can have white-space at end of a line.

@Dead2
Copy link
Member Author

Dead2 commented May 3, 2024

We might want to disable Lint CI run for *.patch files as those can have white-space at end of a line.

I agree, ideally the linter would have understood patch file, but ignoring them would work.

@mtl1979
Copy link
Collaborator

mtl1979 commented May 3, 2024

We might want to disable Lint CI run for *.patch files as those can have white-space at end of a line.

I agree, ideally the linter would have understood patch file, but ignoring them would work.

Because patch files use white-space to separate + or - at start of line and the actual line of text in input or output file, and the line can be empty, and because last line of last patch context can be a empty line, both tests would need to be disabled.

@Dead2
Copy link
Member Author

Dead2 commented May 3, 2024

I notice the conditionals for workflow jobs don't work in a matrix. Will try to fix that later, found a slightly ugly solution that supposedly works. See: github.com/orgs/community/discussions/9044

What I am wondering is whether the name: line can come after the new - if: line, that example removes the name: tag entirely but I am not sure whether that is required or an unintended change.

I tested it, and no, it does not work either. And I had misread that link, that is actually two suggested syntaxes for the proposed possibility to be able to filter in a matrix. Can't find any information about whether it made it in or not, so probably not.
One alternative would be to have a separate workflow file for self-hosted jobs (i hope to add some more arches), and have the whole thing ignored on other repos.

I am also tempted to look at utilizing reusable workflows to allow us to have steps defined one place and included into other files, so we don't have to duplicate things. But that is way too much to do in this PR, so I think I'll have to back out some of the changes and hopefully get them properly implemented in another PR.

@Dead2 Dead2 force-pushed the s390x-CI-update branch 3 times, most recently from 0d1f1d4 to bc2b9e5 Compare May 3, 2024 16:50
@iii-i
Copy link
Member

iii-i commented May 3, 2024

Hi, thanks for handling the update - the PR looks good to me. One nit: as far as I can see it doesn't use gaplib directly, so perhaps instead of saying "based on gaplib" say "inspired by gaplib" or something along these lines?

Regarding the volume, I think it's better to drop it altogether. Ideally there should be 0 persistence across runs, in order to limit the impact of malicious PRs.

Arm,
- Arm64
+ Arm64,
+ S390x
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorrect indentation, mix of spaces and tabs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the patch file we copy from gaplib, so such changes to it should instead be suggested upstream.

@nmoinvaz
Copy link
Member

nmoinvaz commented May 6, 2024

So I can't really tell, but does actions-runner now use docker for s390x?
What do forks use.. still emulation, I assume?

@Dead2
Copy link
Member Author

Dead2 commented May 6, 2024

So I can't really tell, but does actions-runner now use docker for s390x? What do forks use.. still emulation, I assume?

actions-runner is running in a docker, yes. The host is s390x, and everything now runs native s390x.

Previously actions-runner was also running in a docker, and the host was s390x, but the actions-runner application itself only supported x86_64 so it had to run with qemu emulation, while the builds it started ran natively on s390x.. It was a special kind of confusing.

@Dead2
Copy link
Member Author

Dead2 commented May 6, 2024

Rebased

@Dead2
Copy link
Member Author

Dead2 commented May 7, 2024

@iii-i I notice that after #1717 was merged, the s390x MSAN test has started failing. (It previously failed to run at all, this PR fixes that, but it now fails on uninitialized bytes.
I don't think that this should prevent this PR from being merged though, better to have a working test that fails than a test that fails to even compile.
But I am not 100% sure whether it now fails because #1717 has better testing or because clang was updated in this PR.

Uninitialized bytes in __msan_check_mem_is_initialized at offset 14 inside [0x03ffd5270506, 122)
==12255==WARNING: MemorySanitizer: use-of-uninitialized-value
    #0 0x2aa30258e4b in dfltcc /home/actions-runner/_work/zlib-ng/zlib-ng/arch/s390/dfltcc_detail.h:217:9
    #1 0x2aa30258e4b in dfltcc_xpnd /home/actions-runner/_work/zlib-ng/zlib-ng/arch/s390/dfltcc_inflate.c:53:10
    #2 0x2aa30258e4b in zng_dfltcc_inflate /home/actions-runner/_work/zlib-ng/zlib-ng/arch/s390/dfltcc_inflate.c:104:14
    #3 0x2aa3022ab93 in zng_inflate /home/actions-runner/_work/zlib-ng/zlib-ng/inflate.c:664:13
    #4 0x2aa301847ed in dictionary_basic_Test::TestBody() /home/actions-runner/_work/zlib-ng/zlib-ng/test/test_dict.cc:68:15
    #5 0x2aa303fb0ed in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/actions-runner/_work/zlib-ng/zlib-ng/_deps/googletest-src/googletest/src/gtest.cc:2599:10
    #6 0x2aa303fb0ed in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/actions-runner/_work/zlib-ng/zlib-ng/_deps/googletest-src/googletest/src/gtest.cc:2635:14
    #7 0x2aa30321ba1 in testing::Test::Run() /home/actions-runner/_work/zlib-ng/zlib-ng/_deps/googletest-src/googletest/src/gtest.cc:2674:5
    #8 0x2aa30328159 in testing::TestInfo::Run() /home/actions-runner/_work/zlib-ng/zlib-ng/_deps/googletest-src/googletest/src/gtest.cc:2853:11
    #9 0x2aa3032dd9b in testing::TestSuite::Run() /home/actions-runner/_work/zlib-ng/zlib-ng/_deps/googletest-src/googletest/src/gtest.cc:3012:30
    #10 0x2aa303b2923 in testing::internal::UnitTestImpl::RunAllTests() /home/actions-runner/_work/zlib-ng/zlib-ng/_deps/googletest-src/googletest/src/gtest.cc:5870:44
    #11 0x2aa303ff49d in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/actions-runner/_work/zlib-ng/zlib-ng/_deps/googletest-src/googletest/src/gtest.cc:2599:10
    #12 0x2aa303ff49d in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/actions-runner/_work/zlib-ng/zlib-ng/_deps/googletest-src/googletest/src/gtest.cc:2635:14
    #13 0x2aa303af94b in testing::UnitTest::Run() /home/actions-runner/_work/zlib-ng/zlib-ng/_deps/googletest-src/googletest/src/gtest.cc:5444:10
    #14 0x2aa301f5a5f in RUN_ALL_TESTS() /home/actions-runner/_work/zlib-ng/zlib-ng/_deps/googletest-src/googletest/include/gtest/gtest.h:2293:73
    #15 0x2aa301f5a5f in main /home/actions-runner/_work/zlib-ng/zlib-ng/test/test_main.cc:21:10
    #16 0x3ff8e2b08b1 in __libc_start_call_main (/lib64/libc.so.6+0x308b1) (BuildId: 9cbb98078e41ce3e6140f3de8720a2fe85111015)
    #17 0x3ff8e2b098f in __libc_start_main@GLIBC_2.2 (/lib64/libc.so.6+0x3098f) (BuildId: 9cbb98078e41ce3e6140f3de8720a2fe85111015)
    #18 0x2aa300623ef in _start (/home/actions-runner/_work/zlib-ng/zlib-ng/gtest_zlib+0x623ef) (BuildId: 487c78ff2e532f6981a0a35a3c860ca1218d42b9)

  Uninitialized value was created by an allocation of 'compr' in the stack frame
    #0 0x2aa30182b15 in dictionary_basic_Test::TestBody() /home/actions-runner/_work/zlib-ng/zlib-ng/test/test_dict.cc:26:5

SUMMARY: MemorySanitizer: use-of-uninitialized-value /home/actions-runner/_work/zlib-ng/zlib-ng/arch/s390/dfltcc_detail.h:217:9 in dfltcc
Exiting

test 747
        Start 747: fuzzer_example_flush

@nmoinvaz
Copy link
Member

nmoinvaz commented May 7, 2024

We might want to disable Lint CI run for *.patch files as those can have white-space at end of a line.

I agree, ideally the linter would have understood patch file, but ignoring them would work.

You could possibly add this to lint.yaml:

- name: Ignore files
  shell: bash
  run: echo "*.patch" >> .gitignore

I suggest doing it in this PR, so we can see if it works.

@iii-i
Copy link
Member

iii-i commented May 7, 2024

My bad, I tested the MSan build of #1717 without gtest. Would it be reasonable to adjust the test as follows?

--- a/test/test_dict.cc
+++ b/test/test_dict.cc
@@ -50,6 +50,8 @@ TEST(dictionary, basic) {
     err = PREFIX(deflate)(&c_stream, Z_FINISH);
     EXPECT_EQ(err, Z_STREAM_END);
 
+    compr_len = c_stream.next_out - (z_const unsigned char *)compr;
+
     err = PREFIX(deflateEnd)(&c_stream);
     EXPECT_EQ(err, Z_OK);

The zlib manual does not say anything about this, but should deflate() stop at the last valid byte and never even look at trailing garbage?

Dead2 added 2 commits May 15, 2024 13:57
- New dockerfile
  - Using native actions-runner instead of relying on qemu.
  - To support s390x, we include patches to actions-runner.
  - Using Almalinux 9 instead of Ubuntu, with functional .Net.
- Update CI workflow.
- Update readme guide.
Disable unneccessary compilation of tests, benchmarks, docs.
@Dead2
Copy link
Member Author

Dead2 commented May 15, 2024

We might want to disable Lint CI run for *.patch files as those can have white-space at end of a line.

I agree, ideally the linter would have understood patch file, but ignoring them would work.

You could possibly add this to lint.yaml:

- name: Ignore files
  shell: bash
  run: echo "*.patch" >> .gitignore

I suggest doing it in this PR, so we can see if it works.

This didn't work, not sure what went wrong there.

Edit: Well, no wonder that didn't help, *.patch is already listed in that file, so adding it another time won't help.

@Dead2
Copy link
Member Author

Dead2 commented May 15, 2024

My bad, I tested the MSan build of #1717 without gtest. Would it be reasonable to adjust the test as follows?

--- a/test/test_dict.cc
+++ b/test/test_dict.cc
@@ -50,6 +50,8 @@ TEST(dictionary, basic) {
     err = PREFIX(deflate)(&c_stream, Z_FINISH);
     EXPECT_EQ(err, Z_STREAM_END);
 
+    compr_len = c_stream.next_out - (z_const unsigned char *)compr;
+
     err = PREFIX(deflateEnd)(&c_stream);
     EXPECT_EQ(err, Z_OK);

The zlib manual does not say anything about this, but should deflate() stop at the last valid byte and never even look at trailing garbage?

I'm not sure about this, but if an application passes us a buffer and says the length is X, then we should be fine to access the whole thing. So I guess either the bug here is that the buffer is not fully initialized, or that it is specified to be too big. Perhaps initialize it so we test the case of garbage remainder?

@Dead2
Copy link
Member Author

Dead2 commented May 16, 2024

Lint fixed.
Btw, Windows MSVC CI now fail due to what looks like image changes.

@nmoinvaz
Copy link
Member

nmoinvaz commented May 16, 2024

Lint fixed.

Interesting solution.

@Dead2 Dead2 merged commit 4af7963 into develop May 17, 2024
283 of 289 checks passed
@phprus
Copy link
Contributor

phprus commented May 17, 2024

@Dead2
Fix for MSVC: #1725
Fix for MSAN: #1726

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants