Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Add full compatibility to Fiber.scheduler of Ruby-3.0 #397

Merged
merged 62 commits into from
Oct 1, 2021

Conversation

larskanis
Copy link
Collaborator

@larskanis larskanis commented Aug 25, 2021

The aim of this PR is to make PG::Connection fully Fiber.scheduler compatible. That means that all methods that possibly could block, block through the scheduler io_wait callback. Internally we switched to async-methods in pg-1.1 for all exec methods. This was a good preparation and half of the work for this PR.

This PR follows the strategy of pg-1.1 and adds more async_* methods, renames the blocking methods to sync_* and uses PG.async_api to switch between them. async_* versions will be the default in the future.

Compatibility to ruby < 3.0 is not expected for Fiber.scheduler or Async gem compatibility (but for the rest of the gem). Planned to be merged to pg-1.3.

  • Use a triggered interceptor for the TCP connection that acts as a gate to transfer data to the PostgreSQL server and back. This shall detect any functions that don't properly trigger the scheduler.
  • Establish connection through the scheduler
  • Works on Windows (IO#readpartial doesn't call the scheduler -> use read_nonblock instead, PG::Connection#block implemented in ruby)
  • keep external nonblocking API fully compatible (while switching to nonblocking API for all internals in this PR)
  • add tests for external nonblocking API to verify compatibility
  • async_send methods (async_send_query_prepared etc.)
  • async_exec methods (async_prepare etc.)
  • async_put_copy_data, async_put_copy_end
  • async_get_copy_data
  • async_get_result, async_get_last_result
  • discard_results
  • async_reset reset the backend connection
  • async_set_client_encoding
  • async_cancel
  • PQencryptPasswordConn can block if no algorithm is given
  • PQping - impossible to implement with nonblocking IO -> uses thread and releases GVL
  • PQconnectStartParams can block if option hostaddr isn't set
  • remove GVL unlocking: since every blocking operation is in ruby space now, we could remove GVL unlocking from libpq-functions. One downside is, that the TcpGateScheduler can not abort a blocking operation in case of a CI failure. -> There's a gem-install-switch now, which allows to disable on quest: --disable-gvl-unlock
  • compare performance without scheduler -> doesn't show any performance regressions
  • move new ruby code to C to avoid performance regressions

Fixes #342

/cc @ioquatix

if Fiber.respond_to?(:scheduler) && Fiber.scheduler
# If a scheduler is set use it directly.
# This is necessary since IO.select isn't passed to the scheduler.
events = Fiber.scheduler.io_wait(socket_io, IO::READABLE | IO::WRITABLE, nil)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to use socket_io.wait here? You shouldn't need to multiplex (i.e. Fiber.respond_to?(:scheduler)) here if you use the general method.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried IO#wait, but although the documentation says that it returns the event mask, it returns the IO object instead. Since I have to differentiate between writable and readable, this interface is bad. Also I noticed some differences between Ruby versions that I didn't investigated deeper.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I think it should return the event mask, I’ll need to check.

lib/pg/connection.rb Outdated Show resolved Hide resolved
end

module Helpers
class Scheduler
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While copying the implementation is okay, I'd suggest just using a gem like async here instead.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like neither of them. Copying is bad, since bugs like this are distributed and fixed several times in different ways. And on the other hand Async is a big gem with a whole ecosystem and ruby-2.x compatibility, although I just need a simple scheduler of around 250 lines that I'm able to understand in case of failures.

IMHO it's a faulty design to add a scheduler callback API without adding a simple reference implementation to ruby. Don't get me wrong - it's a great design to have a simple callback API which allows varying implemenations and Async is a great production ready foundation. But there should be a simple scheduler class in the stdlib of ruby. I added my thought to: https://bugs.ruby-lang.org/issues/18004#note-4

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Matz wanted to incorporate Async into Ruby as a default gem but I said no. I want everyone to have equal opportunity to implement the scheduler. Maybe it was a bad choice but I want to encourage other engineers (including yourself) to create their own schedulers.

We could try to make a reference design but I think it would create soft-engineering challenges, including encouraging people to use it in a way it would not be suitable, also ossifying a particular implementation into Ruby core.

# ----------------------------------------

module Helpers
class TcpGateScheduler < Scheduler
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's an interesting idea.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And it works pretty well. It showed for instance that IO#readpartial on Windows doesn't trigger the scheduler here in this failing test case. It's a cool feature that the scheduler callback interface allows "abuse" like this.

In fact, not having a reliable test method that shows me potential blocking issues in a timing insensitive way, was a blocker for me to implement this PR. How do you test for blocking issues?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you cannot know for certain that some code does or does not block. But we do have one "metric" for blocking, rb_nogvl. But this isn't reliable, even if it can be useful.

ruby/ruby#4779

If we are comfortable with assuming "blocking" = "release GVL", then this will catch those cases.

Copy link
Contributor

@ioquatix ioquatix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is some room for improvement but it looks good so far.

One high level design issue for your consideration. The goal of async is to make asynchronous usage transparent to the user. i.e. there should not need to be a toggle or flag to control whether you use async or sync implementations. Essentially, when the fiber scheduler is not active, the operations naturally block, but when running in an asynchronous context, the operations become non-blocking. I see that you have some kind of flag to control this as well as separate implementations of the methods.

I do understand there can be performance trade-offs, but I believe it would be better if you have a single design which can handle both ways. It is almost certainly a bug to use the asynchronous interface in a non-scheduler context, and equally it's almost certainly a bug to use the synchronous interface in an asynchronous context. Even in that case, you want to avoid calling into a FFI which performs blocking operations without releasing the GVL - so even in synchronous operations, you are better off using things like IO#wait before invoking the FFI since it will allow the Ruby VM to do a better job of scheduling threads.

@larskanis
Copy link
Collaborator Author

@ioquatix Thank you for reviewing my PR!

I see that you have some kind of flag to control this as well as separate implementations of the methods.

I think this is misunderstood. There is only one implementation that is intended to be active - the fiber compatible one. This flag is only for debugging and comparison (performance and semantics) of nonblocking vs. blocking implementations. It is not intended for any production use (that's why it is a big global switch). I'll add some documentation to make that clear.

This design of async/sync method pairs is well proved and tested, since it was first released 3 years ago. And it helped to track down some issues with the async implementation and get it in line with the sync behavior. This PR completes the work for all potentially blocking methods beyond just exec and co.

@@ -107,15 +105,19 @@ def write(amount=5)
read_str = @internal_io.read_nonblock(len)
print_data("write fd:#{@internal_io.fileno}->#{@external_io.fileno}", read_str)
@external_io.write(read_str)
if until_writeable
res = IO.select(nil, [until_writeable], nil, 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IO.select is not compatible with the fiber scheduler FYI.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, but the timeout is zero here, so that it's always nonblocking. The mechanism is so, that in the write direction only as much data is transferred as necessary to make the observed IO writable again. This should trigger any blocking issues while writing at best.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could just use until_writeable.wait_writable(0)?

# Use pure Ruby address resolver to avoid blocking of the scheduler.
# `IPSocket.getaddress` isn't fiber aware before ruby-3.1.
require "resolv"
Resolv.getaddress(host)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is interesting for compatibility.

larskanis and others added 9 commits September 20, 2021 21:56
As happend on Windows:
  1) with a Fiber scheduler can send lots of data per put_copy_data
     Failure/Error: super
     ThreadError:
       deadlock; recursive locking
      ./spec/helpers/tcp_gate_scheduler.rb:209:in `write'
      ./spec/helpers/tcp_gate_scheduler.rb:209:in `puts'
      ./spec/helpers/tcp_gate_scheduler.rb:209:in `puts'
      ./spec/helpers/tcp_gate_scheduler.rb:209:in `puts'
      ./spec/helpers/tcp_gate_scheduler.rb:213:in `io_wait'
      ./spec/helpers/tcp_gate_scheduler.rb:209:in `write'
      ./spec/helpers/tcp_gate_scheduler.rb:209:in `puts'
      ./spec/helpers/tcp_gate_scheduler.rb:209:in `puts'
      ./spec/helpers/tcp_gate_scheduler.rb:209:in `puts'
      ./spec/helpers/tcp_gate_scheduler.rb:213:in `io_wait'
      ./lib/pg/connection.rb:405:in `wait_for_flush'
      ./lib/pg/connection.rb:405:in `async_put_copy_data'
      ./spec/pg/scheduler_spec.rb:185:in `block (5 levels) in <top (required)>'
      ./spec/pg/scheduler_spec.rb:184:in `times'
      ./spec/pg/scheduler_spec.rb:184:in `block (4 levels) in <top (required)>'
      ./lib/pg/connection.rb:242:in `copy_data'
      ./spec/pg/scheduler_spec.rb:181:in `block (3 levels) in <top (required)>'
      ./spec/pg/scheduler_spec.rb:52:in `block (2 levels) in run_with_scheduler'
... and respect them after accept.

It can happen, that a connection is established and data is sent, before the TcpGateScheduler accepts the internal_io.
In this case the events got lost and data wasn't transferred.
It is questionable how valueable Fiber.scheduler is on Windows, given that most IO doesn't go through the scheduler.
But this way we satisfy our test suite at least.
This is a workaround for truffleruby < 21.3.0.
The proposed upstream fix is here: oracle/truffleruby#2444

Although it works with any of the Socket classes, revert to BasicSocket, since this is the common base class of TCPSocket and UNIXSocket.
It retrieves the passowrd algorithm in a scheduler compatible way, if it isn't passed as the third parameter.
Option --disable-gvl-unlock is more understandable.

Benchmarks didn't show a measurable difference between unlocking enabled/disabled, so keeping it enabled is safer.
Should there still be any blocking function calls, GVL unlocking allows to run ruby threads in parallel.
larskanis and others added 3 commits September 20, 2021 22:05
It happens on Windows sometimes, when writing to stdout per puts.
In this case `io` isn't a socket and treating it as such results in EBADF:

     Errno::EBADF:
       Bad file descriptor - not a socket file descriptor
       ./spec/helpers/tcp_gate_scheduler.rb:223:in `for_fd'
       ./spec/helpers/tcp_gate_scheduler.rb:223:in `io_wait'
       ./spec/helpers/tcp_gate_scheduler.rb:214:in `write'
       ./spec/helpers/tcp_gate_scheduler.rb:214:in `puts'
The resolv library resolves differently than the system resolver.
In general it prefers IPv4 over IPv6.
This leads to the situation, that the server socket of the TcpGateScheduler is bound to IPv6, but PG.connect tries to use IPv4 instead.

This happened on Windows-10 and Windows-Server 2016 with Ruby-3.0 and with no entries in the /etc/hosts file.

I tried to align the results of resolv and system library, but didn't find out how the system resolver decides between IPv4 and IPv6.

So the workaround is to use a second thread and use the system resolver on Ruby-3.0.
Truffleruby-21.1.0 currently fails on Github Actions like here: https://github.com/larskanis/ruby-pg/runs/3766520041?check_suite_focus=true
However it works with the same version on my local laptop, on travis-ci and with the current truffleruby-head version on Github.
Since it looks like some issue that's already fixed, this commit allows truffleruby to fail on github for now.
@larskanis larskanis merged commit d7d3e8b into ged:master Oct 1, 2021
@ged
Copy link
Owner

ged commented Oct 1, 2021

Wow what an awesome improvement! I learned a ton about Ruby 3 just from reading this PR.

We are now way past due for a release; should I put one together this weekend?

@ioquatix
Copy link
Contributor

ioquatix commented Oct 1, 2021

This is amazing well done!

end
oopts[:hostaddr] = hostaddrs.join(",") if hostaddrs.any?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this is moving hostname lookups out of libpq which the docs say "may cause PQconnectPoll to block for a significant amount of time."

If so, it might be good to document that reset() does no lookups when it reconnects.

iopts = URI.decode_www_form(uri_match['query']).to_h.transform_keys(&:to_sym)
end
# extract "host1,host2" from "host1:5432,host2:5432"
iopts[:host] = uri_match['hostports'].split(",").map { |hp| hp.split(":", 2)[0] }.join(",")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this do the right thing when the URI has an IPv6 address that contains :?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for checking this! You're right, just splitting on ":" was too opportunistic. #402 should fix this.

larskanis added a commit that referenced this pull request Oct 4, 2021
PQflush() changes behaviour depending on PQsetnonblocking(), so we should change it accordingly.
This removes the need for the private method wait_for_flush, which did essentially the same as PQflush in blocking mode.

This is a leftover of #397.
@larskanis
Copy link
Collaborator Author

We are now way past due for a release; should I put one together this weekend?

I would like to add #401 before and we should fix #404 and #398 first.

@ged
Copy link
Owner

ged commented Oct 4, 2021

Okay, sounds good.

rb_warning( "Failed to set the default_internal encoding to %s: '%s'",
encname, PQerrorMessage(conn) );
pgconn_set_internal_encoding_index( self );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@larskanis Sorry for resurrecting this ancient PR. But was removing pgconn_set_internal_encoding_index here intentional?

So this has the side effect of internal_encoding not being set correctly in the case the set_client_encoding call fails with an error for any reason.

This results in UTF-8 strings being interpreted as SQL_ASCII e.g

™™™™™™
becomes
\xE2\x84\xA2\xE2\x84\xA2\xE2\x84\xA2\xE2\x84\xA2\xE2\x84\xA2\xE2\x84\xA2

This not the case in version 1.2.3

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make pg more Ruby-3.0 fiber scheduler friendly
5 participants