Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Ruby 3.2] having runtime issues on darwin #87

Closed
flavorjones opened this issue Dec 19, 2022 · 40 comments
Closed

[Ruby 3.2] having runtime issues on darwin #87

flavorjones opened this issue Dec 19, 2022 · 40 comments

Comments

@flavorjones
Copy link
Collaborator

@larskanis I'm hoping you can help me out here, I'm not sure what's going wrong in the toolchain.

I've built a version of Nokogiri that precompiles Ruby 3.2 shared objects using the docker images published here. (The pipeline to build and test Nokogiri is here if you're curious, and the PR is here).

For both x86_64-darwin and arm64-darwin (the latter I run manually on my dev machine), I see this runtime error:

/Users/runner/hostedtoolcache/Ruby/3.2.0-rc1/x64/lib/ruby/gems/3.2.0+3/gems/nokogiri-1.15.test.2022.1219.1745-x86_64-darwin/lib/nokogiri/extension.rb:7:in `require_relative': dlopen(/Users/runner/hostedtoolcache/Ruby/3.2.0-rc1/x64/lib/ruby/gems/3.2.0+3/gems/nokogiri-1.15.test.2022.1219.1745-x86_64-darwin/lib/nokogiri/3.2/nokogiri.bundle, 0x0009): Symbol not found: (_rb_cObject) (LoadError)
  Referenced from: '/Users/runner/hostedtoolcache/Ruby/3.2.0-rc1/x64/lib/ruby/gems/3.2.0+3/gems/nokogiri-1.15.test.2022.1219.1745-x86_64-darwin/lib/nokogiri/3.2/nokogiri.bundle'
  Expected in: '/Users/runner/hostedtoolcache/Ruby/3.2.0-rc1/x64/bin/ruby' - /Users/runner/hostedtoolcache/Ruby/3.2.0-rc1/x64/lib/ruby/gems/3.2.0+3/gems/nokogiri-1.15.test.2022.1219.1745-x86_64-darwin/lib/nokogiri/3.2/nokogiri.bundle

You can see this in the CI logs here. You can download this gem yourself if you like from the "Artifacts" section of this pipeline run:

Comparing the 3.2 dylib and the 3.1 dylib (which still works), both have _rb_cObject listed as Undefined, so I'm not actually sure the problem is in how the shared object is built. Do you have any ideas?

@flavorjones
Copy link
Collaborator Author

Interesting: when I install 3.2.0-rc1 using ruby-build then that version of Ruby is able to run this gem fine. So maybe this is a problem with how ruby/setup-ruby is building 3.2.0-rc1?

@larskanis
Copy link
Member

The _rb_cObject should be in libruby, but libruby isn't mentioned in the above error output. I guess it has something to do with this commit ruby/ruby@50d81bf . On Macos we use the same trick like on Linux, that unresolved symbols are automatically resolved to already loaded libraries. Therefore the nokogiri.bundle doesn't link to libruby explicit, but is still resolved at runtime.

@flavorjones
Copy link
Collaborator Author

Looks like if I build ruby with --enable-shared then this happens at runtime. When I build without that flag, ruby loads the bundle fine.

@larskanis
Copy link
Member

Here was a similar issue with nokogiri on Macos: sparklemotion/nokogiri#2063 (comment)

@flavorjones
Copy link
Collaborator Author

Wow, thanks for refreshing my memory, I have very little recollection of doing that work. 🤷

So at this point, I think the 3.2 bundle is fine ... do you agree?

I want to understand better what's changed about the Ruby build system, but as long as --disable-shared (which is implicit by default) is set, users should be fine. Do you agree?

@flavorjones
Copy link
Collaborator Author

Tagging @kateinoigakukun in case he spots something that's wrong with either the rake-compiler-dock build process or Ruby's makefiles.

@kateinoigakukun
Copy link

kateinoigakukun commented Dec 20, 2022

@flavorjones Unfortunately, an extension library compiled with a --enable-shared'ed ruby is not compatible with a --disable-shared'ed ruby, and vice versa.

I think the compatibility has not been guaranteed but it just worked so far. The bundle_loader change changed the situation, so we can't assume that fat gems are portable among rubies configured with different options.

@flavorjones
Copy link
Collaborator Author

@kateinoigakukun Thanks for confirming.

@larskanis Do you know why the 3.2.0-rc1 ruby binaries installed by ruby/setup-ruby aren't compatible? I looked in the repo but nothing jumped out at me.

@larskanis
Copy link
Member

@flavorjones Which ruby binaries aren't compatible with what? Do you have a CI link?

@flavorjones
Copy link
Collaborator Author

flavorjones commented Dec 20, 2022

@larskanis Darwin 3.2.0-rc1 binaries aren't compatible with the extensions being built by rake-compiler-dock.

CI link: https://github.com/sparklemotion/nokogiri/actions/runs/3735058984/jobs/6350173407

The CI config is:

  cruby-x86_64-darwin-install:
    needs: ["cruby-package"]
    strategy:
      fail-fast: false
      matrix:
        ruby: ["2.7", "3.0", "3.1", "3.2.0-rc1"]
    runs-on: macos-latest
    steps:
      - uses: actions/checkout@v3
        with:
          submodules: true
      - uses: ruby/setup-ruby@v1
        with:
          ruby-version: "${{matrix.ruby}}"
      - uses: actions/download-artifact@v3
        with:
          name: cruby-x86_64-darwin-gem
          path: gems
      - run: ./scripts/test-gem-install gems

and it's passing for the other Ruby versions.

@flavorjones
Copy link
Collaborator Author

#90 reproduces this error in rake-compiler-dock's CI pipeline

@flavorjones
Copy link
Collaborator Author

@larskanis I confirmed that the rubies built by ruby-builder (and downloadable from here) are built with --enable-shared.

Not sure what you think we should do here. I just pushed #91 to see what will break if we set --enable-shared on darwin.

@flavorjones
Copy link
Collaborator Author

Unfortunately, an extension library compiled with a --enable-shared'ed ruby is not compatible with a --disable-shared'ed ruby, and vice versa

@kateinoigakukun This change may have unintended consequences. Over the past two years, nokogiri precompiled for darwin has been downloaded 12,062,659 times, so I'm very worried that if we can no longer ship precompiled binaries for darwin that this is going to be a burden on both users and gem maintainers.

Is the recommended workaround to ship two bundles for darwin? One for rubies configured with --enable-shared and one for rubies configured with --disable-shared?

@kateinoigakukun
Copy link

@flavorjones I've committed a patch to restore the compatibility nokogiri has been expecting so far: "An extension built with --disable-shared Ruby is loadable from --enable-shared Ruby".
ruby/ruby@c5eefb7

I think it would repair your build and you don't need to distribute two dylibs for darwin now :)

@flavorjones
Copy link
Collaborator Author

@kateinoigakukun Thank you! I will run that change through the rake-compiler-dock test suite!

@flavorjones
Copy link
Collaborator Author

flavorjones commented Dec 23, 2022

Here's how I tested: I'm still using 3.2.0-rc1 as the cross-compiling Ruby, but added append_ldflags("-Wl,-flat_namespace") to Nokogiri's extconf.rb.

The good news is that a gem precompiled with that flag does run on darwin both with a "enable-shared ruby" and a "disable-shared ruby".

However, the bad news is that the libxml2 and libxslt symbols are no longer being resolved from within the bundle -- they are resolving against system libraries! I see these warnings:

WARNING: Nokogiri was built against libxml version 2.10.3, but has dynamically loaded 2.9.13
WARNING: Nokogiri was built against libxslt version 1.1.37, but has dynamically loaded 1.1.35
Here's the complete output from `nokogiri -v`
$ nokogiri -v
WARNING: Nokogiri was built against libxml version 2.10.3, but has dynamically loaded 2.9.13
WARNING: Nokogiri was built against libxslt version 1.1.37, but has dynamically loaded 1.1.35
# Nokogiri (1.14.0.dev)
    ---
    warnings:
    - Nokogiri was built against libxml version 2.10.3, but has dynamically loaded 2.9.13
    - Nokogiri was built against libxslt version 1.1.37, but has dynamically loaded 1.1.35
    nokogiri:
      version: 1.14.0.dev
      cppflags:
      - "-I/Users/flavorjones/.gem/ruby/3.2.0/gems/nokogiri-1.14.0.dev-arm64-darwin/ext/nokogiri"
      - "-I/Users/flavorjones/.gem/ruby/3.2.0/gems/nokogiri-1.14.0.dev-arm64-darwin/ext/nokogiri/include"
      - "-I/Users/flavorjones/.gem/ruby/3.2.0/gems/nokogiri-1.14.0.dev-arm64-darwin/ext/nokogiri/include/libxml2"
      ldflags: []
    ruby:
      version: 3.2.0
      platform: arm64-darwin22
      gem_platform: arm64-darwin-22
      description: ruby 3.2.0dev (2022-12-06T05:55:05Z v3_2_0_rc1 81e274c990) [arm64-darwin22]
      engine: ruby
    libxml:
      source: packaged
      precompiled: true
      patches:
      - 0001-Remove-script-macro-support.patch
      - 0002-Update-entities-to-remove-handling-of-ssi.patch
      - 0003-libxml2.la-is-in-top_builddir.patch
      - '0009-allow-wildcard-namespaces.patch'
      libxml2_path: "/Users/flavorjones/.gem/ruby/3.2.0/gems/nokogiri-1.14.0.dev-arm64-darwin/ext/nokogiri"
      memory_management: ruby
      iconv_enabled: true
      compiled: 2.10.3
      loaded: 2.9.13
    libxslt:
      source: packaged
      precompiled: true
      patches:
      - 0001-update-automake-files-for-arm64.patch
      datetime_enabled: true
      compiled: 1.1.37
      loaded: 1.1.35
    other_libraries:
      zlib: 1.2.13
      libiconv: '1.17'
      libgumbo: 1.0.0-nokogiri

These warnings are generated by comparing the version of libxml2 present in LIBXML_DOTTED_VERSION at compile time against the value present in the global variable xmlParserVersion at runtime.

They should always match, but in gems built this way, xmlParserVersion is being resolved from system libraries.

Looking at the symbols present in the bundle, the libxml2 symbols are properly in the text (code) section, denoted with a "T":

/opt/osxcross/target/bin/arm64-apple-darwin-nm nokogiri.bundle | fgrep xmlParserVersion
00000000000c21d0 T ___xmlParserVersion
00000000002aaa10 D _xmlParserVersion

@kateinoigakukun Do you understand why the bundle isn't being used as the primary location for resolving symbols? Should I play with -fvisibility=hidden and __attribute__((visibility("default")))?

flavorjones added a commit to sparklemotion/nokogiri that referenced this issue Dec 23, 2022
@flavorjones
Copy link
Collaborator Author

I've pushed the workaround mentioned in my previous comment here, in case anyone needs to reproduce:

sparklemotion/nokogiri#2732

I'll link to the specific warnings when CI gets there.

flavorjones added a commit to sparklemotion/nokogiri that referenced this issue Dec 23, 2022
@flavorjones
Copy link
Collaborator Author

Hmm, very strange, it works fine in the actions runner: https://github.com/sparklemotion/nokogiri/actions/runs/3767005588/jobs/6404353192#step:5:36

Maybe this is close enough for me to cut a release candidate of Nokogiri and get feedback from people.

@flavorjones
Copy link
Collaborator Author

OK, I'm not sure why I can't reproduce this in github runners, but I've reproduced now the symbol-loading error on three different macs.

When I run vmmap on a ruby process, I can see a /usr/lib/libxml2.2.dylib entry which I believe is mapped to /Library/Developer/CommandLineTools/SDKs/MacOSX12.3.sdk/usr/lib/ (the xcode CLT libraries). (This is also true for 3.1 rubies, so nothing new there.)

But I believe what is new is that now function calls from the nokogiri.bundle are no longer resolved to symbols that exist in the bundle, but to symbols in the ruby process. This makes sense given the explanation of the -flat_namespace change.

I'm not sure how to get unblocked from here.

@flavorjones
Copy link
Collaborator Author

flavorjones commented Dec 23, 2022

OK, I made some progress with trying -load_hidden when I add the static libraries for libxml2 and libxslt to the bundle. I might be unblocked.

See sparklemotion/nokogiri@1714e77 for what seems to work.

@flavorjones
Copy link
Collaborator Author

flavorjones commented Dec 24, 2022

Well, -load_hidden doesn't completely work, either -- it means all symbols are hidden and that breaks the ABI of the extensions which some other gems (like nokogiri-xmlsec) depend on (nokogiri-xmlsec needs to use the same libxml2 library that nokogiri uses).

@kateinoigakukun This all seems really bad. See some conversation with another nokogiri maintainer at sparklemotion/nokogiri#2732 where we're proposing to try to undo these changes when we build the extension.

@flavorjones
Copy link
Collaborator Author

@kateinoigakukun Please see https://github.com/stevecheckoway/bundle_test for a working example of why -flat_namespace is not a sufficient solution.

@nurse
Copy link

nurse commented Dec 25, 2022

I reverted katei's fix with -flat_namespace from Ruby 3.2.0. I continue to discuss how to fix dynamic_lookup problem toward Ruby 3.2.1.

Please see https://github.com/stevecheckoway/bundle_test for a working example of why -flat_namespace is not a sufficient solution.

Thank you for clarifying the problem. What I am wondering now is that is the problem is really the problem?

I fixed the files as below and its result is as the same as flat_namespace. Therefore if a gem cause a problem with flat_namespace, it doesn't work on Linux.

foo_version(): 1
init()
function_in_main()
foo_version(): 1
diff --git a/Makefile b/Makefile
index 9f6f601..8df8164 100644
--- a/Makefile
+++ b/Makefile
@@ -10,29 +10,29 @@ endif

 .PHONY: all clean

-all: main main_shared extension.bundle
+all: main main_shared extension.so

-main_shared: main.c libmain.dylib libfoo1.dylib
+main_shared: main.c libmain.so libfoo1.so
        $(CC) -o $@ -DENABLE_SHARED=1 main.c -L. -lmain -lfoo1

-main: main.c libfoo1.dylib
-       $(CC) -o $@ main.c -L. -lfoo1
+main: main.c libfoo1.so
+       $(CC) -o $@ main.c -L. -rdynamic -ldl -lfoo1

-libmain.dylib: main.c libfoo1.dylib
-       $(CC) -dynamiclib -DLIBMAIN=1 -o $@ main.c -L. -lfoo1
+libmain.so: main.c libfoo1.so
+       $(CC) -shared -fPIC -DLIBMAIN=1 -o $@ main.c -L. -ldl -lfoo1

-libfoo1.dylib: foo.c
-       $(CC) -dynamiclib -DFOO_VERSION=1 -o $@ $<
+libfoo1.so: foo.c
+       $(CC) -shared -fPIC -DFOO_VERSION=1 -o $@ $<

-extension.bundle: extension.c libfoo2.a
-       $(CC) -o $@ -bundle $(NAMESPACE_OPTS) extension.c libfoo2.a
+extension.so: extension.c libfoo2.a
+       $(CC) -o $@ -shared -fPIC extension.c libfoo2.a

 foo2.o: foo.c
-       $(CC) -c -o $@ -DFOO_VERSION=2 $^
+       $(CC) -c -fPIC -o $@ -DFOO_VERSION=2 $^

 libfoo2.a: foo2.o
        $(RM) $@
        ar -cr $@ $^

 clean:
-       $(RM) main main_shared *.bundle *.dylib *.o *.a
+       $(RM) main main_shared *.bundle *.so *.o *.a
diff --git a/main.c b/main.c
index 30ba3fe..946fe9c 100644
--- a/main.c
+++ b/main.c
@@ -14,7 +14,7 @@ static
 #endif
 int main_implementation() {
     // Load the library.
-    void *handle = dlopen("extension.bundle", RTLD_LAZY | RTLD_GLOBAL);
+    void *handle = dlopen("extension.so", RTLD_LAZY | RTLD_GLOBAL);
     if (!handle) {
         fprintf(stderr, "dlopen(): %s", dlerror());
         return 1;

@kateinoigakukun
Copy link

Thank you all 🙏 Will have a closer look next month.

@flavorjones
Copy link
Collaborator Author

I've written up a brief summary of how Nokogiri is working around this at nokogiri/2022-12-darwin-symbol-resolution.md at main · sparklemotion/nokogiri · GitHub

@flavorjones
Copy link
Collaborator Author

flavorjones commented Jan 2, 2023

OK, after shipping an RC for nokogiri native gems, and working on sqlite3-ruby native gems, it looks like everybody who precompiles for Darwin and Ruby 3.2 will need to add something like this to their extconf:

          if cross_build? &&
             darwin? &&
             RbConfig::CONFIG["ruby_version"] >= "3.2"
            append_ldflags("-Wl,-flat_namespace")
          end

@kateinoigakukun Do you have any ideas on how to make this situation better, potentially in a future Ruby 3.2.1 release?

@larskanis Do you think we should try to add this flag into the link line for users from within rake-compiler-dock? Possibly by modifying the DLDFLAGS in the rake-compiler rbconfig.rb files?

@larskanis
Copy link
Member

@larskanis Do you think we should try to add this flag into the link line for users from within rake-compiler-dock? Possibly by modifying the DLDFLAGS in the rake-compiler rbconfig.rb files?

Yes, I think so. We shouldn't demand these lines.

@flavorjones
Copy link
Collaborator Author

@larskanis Do you think we should try to add this flag into the link line for users from within rake-compiler-dock? Possibly by modifying the DLDFLAGS in the rake-compiler rbconfig.rb files?

Yes, I think so. We shouldn't demand these lines.

OK, will put together a PR

@flavorjones
Copy link
Collaborator Author

If we merge #94, then I think we should cut a release soon.

Then I can confidently document how to deal with global symbol resolution in https://github.com/flavorjones/ruby-c-extensions-explained. (I think we need to start recommending that everyone build their extension (and all libraries that are statically linked into it) with -fvisibility=hidden and then only export the Init_* function, like I did in sparklemotion/sqlite3-ruby#371.)

@flavorjones
Copy link
Collaborator Author

I've merged #94 and I think we can cut a release. I'm building another rc for sqlite3-ruby based on images generated from ce619f2 and they look good.

@larskanis I think you're still the only owner of the dockerhub account, but if there is anything I can do to help with a release, please let me know?

@larskanis
Copy link
Member

@flavorjones As far as I can see, there is no way to add other contributors to the docker hub account. The only way is to convert "larskanis" to an organization, which doesn't feel like a good solution. Instead I think it makes sense to create a new organization and publish the images there? Maybe the name "rake-compiler" like on Github? Then we can change the name in the gem as well.

@larskanis
Copy link
Member

Nevertheless I can do the release and publish the images on the "larskanis" account.

I did some tests on my side (with fxruby, libusb, ffi and ruby-pg) and the latest images worked well. I found only one issue with RbConfig::CONFIG['host'], which returns x64-w64-mingw32 on ruby-3.2 on x64-mingw-ucrt instead of the x86_64-w64-mingw32 of ruby-3.1 and so it differs from the compiler name prefix. But I found that the host variable is already different on the linux images, so that I changed the code in fxruby to work with the latest images.

@flavorjones
Copy link
Collaborator Author

@larskanis I'd like to again propose that we use ghcr.io instead of dockerhub as the future home of these images. I started this work to demonstrate it could work with the weekly snapshots. We could similarly automate creation of the versioned images (and we can cache layers to ensure that image diffs stay small between patch releases).

@flavorjones
Copy link
Collaborator Author

@larskanis Would it be helpful if I did the work to cut over to ghcr.io to be the primary registry for RCD images so that I'm able to cut a release? You could then sync those images to dockerhub for convenience for folks. I should have some time this week to do that if you're not able to cut a release.

@larskanis
Copy link
Member

I didn't work with ghcr.io so far, so I'd be happy if you could prepare it. I don't care too much about the platform we use to distribute our images. So if ghcr.io fulfills our needs, I think we can switch the gem to it with no need to mirror the images on docker hub.

@flavorjones
Copy link
Collaborator Author

See #95! ❤️

@flavorjones
Copy link
Collaborator Author

Closing this, I think we have a path forward for Ruby 3.2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants
@flavorjones @nurse @larskanis @kateinoigakukun and others