WIP: Add full compatibility to Fiber.scheduler of Ruby-3.0 #397

larskanis · 2021-08-25T14:52:37Z

The aim of this PR is to make PG::Connection fully Fiber.scheduler compatible. That means that every method that could possibly block now blocks through the scheduler's io_wait callback. Internally we switched to async methods in pg-1.1 for all exec methods. That was good preparation and about half of the work for this PR.

This PR follows the strategy of pg-1.1 and adds more async_* methods, renames the blocking methods to sync_* and uses PG.async_api to switch between them. async_* versions will be the default in the future.
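The switching strategy described above can be illustrated with a minimal sketch. `DemoConnection` and its string-returning methods are hypothetical stand-ins, not the pg implementation; the sketch only shows how a class-level switch could alias a public method name to either its `sync_*` or its `async_*` variant.

```ruby
# Hypothetical stand-in for a connection class; not part of the pg gem.
class DemoConnection
  # Blocking variant (placeholder body for illustration).
  def sync_exec(sql)
    "sync: #{sql}"
  end

  # Scheduler-friendly variant (placeholder body for illustration).
  def async_exec(sql)
    "async: #{sql}"
  end

  # Global switch: point the public name at one of the two variants.
  def self.async_api=(enable)
    prefix = enable ? "async_" : "sync_"
    %i[exec].each do |name|
      alias_method name, :"#{prefix}#{name}"
    end
  end

  # Default to the async variants.
  self.async_api = true
end

conn = DemoConnection.new
puts conn.exec("SELECT 1")        # uses async_exec
DemoConnection.async_api = false  # debugging/comparison only
puts conn.exec("SELECT 1")        # uses sync_exec
```

Because the alias is installed on the class, the switch affects every instance at once, which fits a global debugging toggle rather than a per-connection option.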

Ruby < 3.0 remains supported by the rest of the gem, but Fiber.scheduler and Async gem compatibility is not expected there. This is planned to be merged for pg-1.3.

Fixes #342

/cc @ioquatix

ioquatix · 2021-08-25T21:48:55Z

lib/pg/connection.rb

+			if Fiber.respond_to?(:scheduler) && Fiber.scheduler
+				# If a scheduler is set use it directly.
+				# This is necessary since IO.select isn't passed to the scheduler.
+				events = Fiber.scheduler.io_wait(socket_io, IO::READABLE | IO::WRITABLE, nil)


Is it possible to use socket_io.wait here? You shouldn't need to multiplex (i.e. Fiber.respond_to?(:scheduler)) here if you use the general method.

I tried IO#wait, but although the documentation says that it returns the event mask, it returns the IO object instead. Since I have to differentiate between writable and readable, this interface is bad. I also noticed some differences between Ruby versions that I didn't investigate further.

Hmm I think it should return the event mask, I’ll need to check.
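For reference, the multiplexing discussed in this thread can be sketched as a small helper. `wait_for_events` is a hypothetical name, and the `IO.select` fallback is an assumption about how the non-scheduler path could build the same event mask; it is not taken from the PR. The `IO::READABLE` and `IO::WRITABLE` constants require Ruby 3.0 or later.

```ruby
require "socket"

# Hypothetical helper (not part of pg): wait until +io+ is readable and/or
# writable and return an event mask of IO::READABLE / IO::WRITABLE bits.
def wait_for_events(io, timeout = nil)
  if Fiber.respond_to?(:scheduler) && Fiber.scheduler
    # A scheduler is set: let it wait and report the event mask itself.
    Fiber.scheduler.io_wait(io, IO::READABLE | IO::WRITABLE, timeout)
  else
    # No scheduler: fall back to IO.select and build the mask by hand.
    readable, writable = IO.select([io], [io], nil, timeout)
    return 0 unless readable # IO.select returned nil, i.e. timeout

    mask = 0
    mask |= IO::READABLE unless readable.empty?
    mask |= IO::WRITABLE unless writable.empty?
    mask
  end
end

left, right = UNIXSocket.pair
events = wait_for_events(left, 0)
puts "writable" if (events & IO::WRITABLE) != 0
puts "readable" if (events & IO::READABLE) != 0
right.close
left.close
```

Returning a bit mask keeps the caller able to distinguish the readable and writable cases, which is the distinction the comment above says is needed.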

lib/pg/connection.rb

ioquatix · 2021-08-25T21:52:26Z

spec/helpers/scheduler.rb

+end
+
+module Helpers
+class Scheduler


While copying the implementation is okay, I'd suggest just using a gem like async here instead.

I like neither of them. Copying is bad, since bugs like this are distributed and fixed several times in different ways. On the other hand, Async is a big gem with a whole ecosystem and ruby-2.x compatibility, whereas I just need a simple scheduler of around 250 lines that I'm able to understand in case of failures.

IMHO it's a faulty design to add a scheduler callback API without adding a simple reference implementation to ruby. Don't get me wrong - it's a great design to have a simple callback API which allows varying implementations, and Async is a great production-ready foundation. But there should be a simple scheduler class in the stdlib of ruby. I added my thoughts to: https://bugs.ruby-lang.org/issues/18004#note-4

Matz wanted to incorporate Async into Ruby as a default gem but I said no. I want everyone to have equal opportunity to implement the scheduler. Maybe it was a bad choice but I want to encourage other engineers (including yourself) to create their own schedulers.

We could try to make a reference design but I think it would create soft-engineering challenges, including encouraging people to use it in a way it would not be suitable, also ossifying a particular implementation into Ruby core.

ioquatix · 2021-08-25T21:53:13Z

spec/helpers/tcp_gate_scheduler.rb

+#                     ----------------------------------------
+
+module Helpers
+class TcpGateScheduler < Scheduler


It's an interesting idea.

And it works pretty well. It showed for instance that IO#readpartial on Windows doesn't trigger the scheduler here in this failing test case. It's a cool feature that the scheduler callback interface allows "abuse" like this.

In fact, not having a reliable, timing-insensitive test method that shows me potential blocking issues was a blocker for me in implementing this PR. How do you test for blocking issues?

I guess you cannot know for certain that some code does or does not block. But we do have one "metric" for blocking, rb_nogvl. But this isn't reliable, even if it can be useful.

ruby/ruby#4779

If we are comfortable with assuming "blocking" = "release GVL", then this will catch those cases.

spec/pg/connection_spec.rb

ioquatix

I think there is some room for improvement but it looks good so far.

One high level design issue for your consideration. The goal of async is to make asynchronous usage transparent to the user. i.e. there should not need to be a toggle or flag to control whether you use async or sync implementations. Essentially, when the fiber scheduler is not active, the operations naturally block, but when running in an asynchronous context, the operations become non-blocking. I see that you have some kind of flag to control this as well as separate implementations of the methods.

I do understand there can be performance trade-offs, but I believe it would be better if you have a single design which can handle both ways. It is almost certainly a bug to use the asynchronous interface in a non-scheduler context, and equally it's almost certainly a bug to use the synchronous interface in an asynchronous context. Even in that case, you want to avoid calling into a FFI which performs blocking operations without releasing the GVL - so even in synchronous operations, you are better off using things like IO#wait before invoking the FFI since it will allow the Ruby VM to do a better job of scheduling threads.

larskanis · 2021-08-26T09:18:46Z

@ioquatix Thank you for reviewing my PR!

> I see that you have some kind of flag to control this as well as separate implementations of the methods.

I think this is a misunderstanding. There is only one implementation that is intended to be active - the fiber compatible one. This flag is only for debugging and comparison (performance and semantics) of nonblocking vs. blocking implementations. It is not intended for any production use (that's why it is a big global switch). I'll add some documentation to make that clear.

This design of async/sync method pairs is well proven and tested; it was first released 3 years ago. It also helped to track down some issues with the async implementation and to get it in line with the sync behavior. This PR completes the work for all potentially blocking methods beyond just exec and co.

ioquatix · 2021-08-29T21:07:22Z

spec/helpers/tcp_gate_scheduler.rb

@@ -107,15 +105,19 @@ def write(amount=5)
 							read_str = @internal_io.read_nonblock(len)
 							print_data("write fd:#{@internal_io.fileno}->#{@external_io.fileno}", read_str)
 							@external_io.write(read_str)
+							if until_writeable
+								res = IO.select(nil, [until_writeable], nil, 0)


IO.select is not compatible with the fiber scheduler FYI.

Thanks, but the timeout is zero here, so it's always nonblocking. The mechanism is that, in the write direction, only as much data is transferred as is necessary to make the observed IO writable again. This is the best way to trigger any blocking issues while writing.

You could just use until_writeable.wait_writable(0)?
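The suggestion above can be sketched as a zero-timeout poll. `writable_now?` is a hypothetical helper name; the sketch assumes only that `IO#wait_writable` returns a truthy value when the IO is writable and a falsy value on timeout.

```ruby
require "io/wait"

# Hypothetical helper: true if +io+ can accept more data right now.
# A timeout of 0 means the call only polls and never blocks.
def writable_now?(io)
  io.wait_writable(0) ? true : false
end

reader, writer = IO.pipe
puts writable_now?(writer) # a fresh pipe has free buffer space

# Fill the pipe buffer without blocking until the kernel refuses more data.
chunk = "x" * 65_536
nil until writer.write_nonblock(chunk, exception: false) == :wait_writable
puts writable_now?(writer) # the buffer is full now

reader.close
writer.close
```

Unlike `IO.select`, `wait_writable` goes through the scheduler's `io_wait` hook when a scheduler is set, so the same line works in both contexts.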

ioquatix · 2021-09-12T20:14:21Z

lib/pg/connection.rb

+				# Use pure Ruby address resolver to avoid blocking of the scheduler.
+				# `IPSocket.getaddress` isn't fiber aware before ruby-3.1.
+				require "resolv"
+				Resolv.getaddress(host)


This is interesting for compatibility.

As happened on Windows:

  1) with a Fiber scheduler can send lots of data per put_copy_data
     Failure/Error: super

     ThreadError:
       deadlock; recursive locking
     ./spec/helpers/tcp_gate_scheduler.rb:209:in `write'
     ./spec/helpers/tcp_gate_scheduler.rb:209:in `puts'
     ./spec/helpers/tcp_gate_scheduler.rb:209:in `puts'
     ./spec/helpers/tcp_gate_scheduler.rb:209:in `puts'
     ./spec/helpers/tcp_gate_scheduler.rb:213:in `io_wait'
     ./spec/helpers/tcp_gate_scheduler.rb:209:in `write'
     ./spec/helpers/tcp_gate_scheduler.rb:209:in `puts'
     ./spec/helpers/tcp_gate_scheduler.rb:209:in `puts'
     ./spec/helpers/tcp_gate_scheduler.rb:209:in `puts'
     ./spec/helpers/tcp_gate_scheduler.rb:213:in `io_wait'
     ./lib/pg/connection.rb:405:in `wait_for_flush'
     ./lib/pg/connection.rb:405:in `async_put_copy_data'
     ./spec/pg/scheduler_spec.rb:185:in `block (5 levels) in <top (required)>'
     ./spec/pg/scheduler_spec.rb:184:in `times'
     ./spec/pg/scheduler_spec.rb:184:in `block (4 levels) in <top (required)>'
     ./lib/pg/connection.rb:242:in `copy_data'
     ./spec/pg/scheduler_spec.rb:181:in `block (3 levels) in <top (required)>'
     ./spec/pg/scheduler_spec.rb:52:in `block (2 levels) in run_with_scheduler'

TcpGateScheduler: Gather events on fds that are not yet accepted, and respect them after accept. It can happen that a connection is established and data is sent before the TcpGateScheduler accepts the internal_io. In this case the events got lost and data wasn't transferred.

It is questionable how valuable Fiber.scheduler is on Windows, given that most IO doesn't go through the scheduler. But this way we satisfy our test suite at least.

This is a workaround for truffleruby < 21.3.0. The proposed upstream fix is here: oracle/truffleruby#2444 Although it works with any of the Socket classes, revert to BasicSocket, since this is the common base class of TCPSocket and UNIXSocket.

It retrieves the password algorithm in a scheduler-compatible way if it isn't passed as the third parameter.

Option --disable-gvl-unlock is more understandable. Benchmarks didn't show a measurable difference between unlocking enabled/disabled, so keeping it enabled is safer. Should there still be any blocking function calls, GVL unlocking allows Ruby threads to run in parallel.

It happens on Windows sometimes when writing to stdout via puts. In this case `io` isn't a socket, and treating it as such results in EBADF:

  Errno::EBADF:
    Bad file descriptor - not a socket file descriptor
  ./spec/helpers/tcp_gate_scheduler.rb:223:in `for_fd'
  ./spec/helpers/tcp_gate_scheduler.rb:223:in `io_wait'
  ./spec/helpers/tcp_gate_scheduler.rb:214:in `write'
  ./spec/helpers/tcp_gate_scheduler.rb:214:in `puts'

The resolv library resolves differently than the system resolver. In general it prefers IPv4 over IPv6. This leads to the situation that the server socket of the TcpGateScheduler is bound to IPv6, but PG.connect tries to use IPv4 instead. This happened on Windows-10 and Windows-Server 2016 with Ruby-3.0 and with no entries in the /etc/hosts file. I tried to align the results of resolv and the system library, but couldn't find out how the system resolver decides between IPv4 and IPv6. So the workaround is to use a second thread and the system resolver on Ruby-3.0.
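The workaround described in this commit message can be sketched roughly as follows. `resolve_host` and its `in_thread` flag are hypothetical names, and the version check is an assumption based on the note that `IPSocket.getaddress` isn't fiber aware before ruby-3.1; the actual code in the PR may differ.

```ruby
require "socket"

# Assumption: IPSocket.getaddress is not scheduler-aware before Ruby 3.1.
NEEDS_RESOLVER_THREAD = Gem::Version.new(RUBY_VERSION) < Gem::Version.new("3.1")

# Hypothetical helper: resolve +host+ with the system resolver. On older
# Rubies the lookup runs in a second thread, so only that thread blocks
# while the calling fiber waits for the thread's result.
def resolve_host(host, in_thread: NEEDS_RESOLVER_THREAD)
  if in_thread
    Thread.new { IPSocket.getaddress(host) }.value
  else
    IPSocket.getaddress(host)
  end
end

puts resolve_host("127.0.0.1")
```

Using the system resolver in both branches keeps the IPv4/IPv6 preference identical to what other system tools see, which is the mismatch the message above attributes to the pure Ruby resolver.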

Truffleruby-21.1.0 currently fails on Github Actions like here: https://github.com/larskanis/ruby-pg/runs/3766520041?check_suite_focus=true However it works with the same version on my local laptop, on travis-ci and with the current truffleruby-head version on Github. Since it looks like some issue that's already fixed, this commit allows truffleruby to fail on github for now.

ged · 2021-10-01T20:35:49Z

Wow what an awesome improvement! I learned a ton about Ruby 3 just from reading this PR.

We are now way past due for a release; should I put one together this weekend?

ioquatix · 2021-10-01T21:23:14Z

This is amazing well done!

cbandy · 2021-10-01T23:54:25Z

lib/pg/connection.rb

 			end
+			oopts[:hostaddr] = hostaddrs.join(",") if hostaddrs.any?


It looks like this is moving hostname lookups out of libpq which the docs say "may cause PQconnectPoll to block for a significant amount of time."

If so, it might be good to document that reset() does no lookups when it reconnects.

cbandy · 2021-10-02T00:07:45Z

lib/pg/connection.rb

+					iopts = URI.decode_www_form(uri_match['query']).to_h.transform_keys(&:to_sym)
+				end
+				# extract "host1,host2" from "host1:5432,host2:5432"
+				iopts[:host] = uri_match['hostports'].split(",").map { |hp| hp.split(":", 2)[0] }.join(",")


Does this do the right thing when the URI has an IPv6 address that contains :?

Thank you for checking this! You're right, just splitting on ":" was too opportunistic. #402 should fix this.
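To illustrate the problem raised here: splitting on the first ":" breaks bracketed IPv6 literals. The following is a minimal sketch of one way to handle them, assuming IPv6 hosts are written in brackets as in "[::1]:5433". `split_hostports` is a hypothetical helper and is not the fix from #402.

```ruby
# Hypothetical helper: turn "host1:5432,[::1]:5433" into [host, port] pairs.
# Bracketed IPv6 literals are matched first so their colons are not
# mistaken for the host/port separator.
def split_hostports(hostports)
  hostports.split(",").map do |hostport|
    if (m = hostport.match(/\A\[(?<host>[^\]]+)\](?::(?<port>\d+))?\z/))
      [m[:host], m[:port]]
    else
      host, port = hostport.split(":", 2)
      [host, port]
    end
  end
end

# The naive split keeps only "[" for an IPv6 host:
p "[::1]:5433".split(":", 2)[0]
# The bracket-aware version keeps host and port apart:
p split_hostports("host1:5432,[::1]:5433,host3")
```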

PQflush() changes behaviour depending on PQsetnonblocking(), so we should change it accordingly. This removes the need for the private method wait_for_flush, which did essentially the same as PQflush in blocking mode. This is a leftover of #397.

larskanis · 2021-10-04T15:54:32Z

> We are now way past due for a release; should I put one together this weekend?

I would like to add #401 before that, and we should fix #404 and #398 first.

ged · 2021-10-04T19:44:28Z

Okay, sounds good.

drdrsh · 2023-08-26T01:38:02Z

ext/pg_connection.c

 			rb_warning( "Failed to set the default_internal encoding to %s: '%s'",
 			         encname, PQerrorMessage(conn) );
-		pgconn_set_internal_encoding_index( self );


@larskanis Sorry for resurrecting this ancient PR. But was removing pgconn_set_internal_encoding_index here intentional?

This has the side effect that internal_encoding is not set correctly when the set_client_encoding call fails with an error for any reason.

This results in UTF-8 strings being interpreted as SQL_ASCII, e.g.

™™™™™™ becomes \xE2\x84\xA2\xE2\x84\xA2\xE2\x84\xA2\xE2\x84\xA2\xE2\x84\xA2\xE2\x84\xA2

This is not the case in version 1.2.3.
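The symptom can be reproduced in plain Ruby without a database. This sketch assumes that the affected strings end up tagged as binary (ASCII-8BIT) instead of UTF-8; the bytes are identical, only the encoding label differs.

```ruby
# Two trademark signs (U+2122), written as escapes so the example does not
# depend on the source file encoding.
utf8 = "\u2122" * 2

# Same bytes, but labelled as binary instead of UTF-8.
binary = utf8.dup.force_encoding(Encoding::BINARY)

puts utf8.bytes == binary.bytes # identical bytes in both strings
puts binary.inspect             # each character shows up as \xE2\x84\xA2
puts utf8.length                # counted in characters
puts binary.length              # counted in bytes
```

Relabelling the binary string with `force_encoding(Encoding::UTF_8)` restores the original characters, which shows that no data is lost and only the encoding tag is wrong.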

ioquatix reviewed Aug 25, 2021

ioquatix reviewed Aug 29, 2021

larskanis force-pushed the scheduler branch 3 times, most recently from ea99287 to c634707 Compare September 7, 2021 07:46

larskanis mentioned this pull request Sep 7, 2021

Errno::ENOTSOCK when using conn.socket_io on Windows #398

Closed

larskanis force-pushed the scheduler branch 2 times, most recently from 055f158 to ca1bc72 Compare September 10, 2021 19:14

ioquatix reviewed Sep 12, 2021

larskanis force-pushed the scheduler branch 2 times, most recently from c07b753 to ef65121 Compare September 13, 2021 14:35

larskanis mentioned this pull request Sep 13, 2021

Multiple hosts are not supported because it's not a valid URI #387

Closed

larskanis force-pushed the scheduler branch 12 times, most recently from 908c8e3 to f7cc2f1 Compare September 20, 2021 09:50

larskanis and others added 9 commits September 20, 2021 21:56

TcpGateScheduler: some refine on debug prints

1e40851

Compat with Macos

fc44071

TcpGateScheduler: Gather events on fds, that are not yet accepted

581c780

Windows: Add a workaround for nonworking nonblocking-IO on Windows

7bdf4a4

Implement async_encrypt_password

72d5b32

Add scheduler aware Connection.ping version

7fe6dac

larskanis force-pushed the scheduler branch from 73042e7 to 6b4cd3f Compare September 20, 2021 19:58

larskanis and others added 3 commits September 20, 2021 22:05

larskanis merged commit d7d3e8b into ged:master Oct 1, 2021

larskanis mentioned this pull request Oct 1, 2021

Improve Ruby 3.0.0 Fiber scheduler support #364

Closed

cbandy reviewed Oct 2, 2021

larskanis mentioned this pull request Oct 2, 2021

Fix IPv6 address parsing in connection URI #402

Merged

jsaak mentioned this pull request Oct 4, 2021

Yet another async feature request brianmario/mysql2#1212

Open

larskanis mentioned this pull request Jan 21, 2022

1.3.0 version slow on Windows #416

Closed

ollym mentioned this pull request Jan 24, 2022

Consider new Ruby 3.0 Fiber Scheduler bensheldon/good_job#494

Open

larskanis deleted the scheduler branch February 15, 2022 10:45

zhongxiao37 mentioned this pull request Mar 15, 2022

NameError: uninitialized constant ApplicationRecord if having two subtask for DB query socketry/async#157

Closed

jdelStrother mentioned this pull request May 3, 2022

RMagick never releases the Global VM Lock rmagick/rmagick#243

Closed

drdrsh reviewed Aug 26, 2023

drdrsh mentioned this pull request Aug 30, 2023

Call pgconn_set_internal_encoding_index in all branches of pgconn_set_default_encoding #541

Merged
