
Create new worker processes in separate memory spaces from master process #2407

Closed
wants to merge 8 commits into from

Conversation

cjlarose
Member

@cjlarose cjlarose commented Oct 2, 2020

Description

Fixes #2018

A fairly common and reasonable deployment strategy has been broken since the introduction of nio4r in puma 4.0. Imagine that a user is using puma in cluster mode. In order to deploy a new version of an application, a user's deployment tool does this (a rough sketch follows the list):

  1. Add new version of the application's source to the server in a new directory
  2. Update the "current release" symlink to point to the new release directory (the symlink is configured as puma's directory in the config file)
  3. Issue a phased restart
  4. After verifying that the release was successful, delete the old version of the application from disk
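For concreteness, a minimal sketch of what that setup might look like in the puma config (paths and worker counts are illustrative, not taken from the PR):

```ruby
# config/puma.rb -- "current" is the symlink the deploy tool flips on each release
directory "/srv/myapp/current"   # illustrative path
workers 4

# After flipping the symlink, the deploy tool triggers a phased restart with
# `pumactl phased-restart` (or by sending SIGUSR1 to the master), then deletes
# the old release directory -- including its installed gems -- from disk.
```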

During the next deployment, workers will fail to start. They'll encounter the error Could not find nio4r-2.5.1 in any of the sources (Bundler::GemNotFound). This is, in part, due to a quirk in Bundler where Bundler will raise this error if you're trying to activate a gem whose native extensions are no longer present on disk (even if the gem is already loaded into memory). However, this exception reveals a deeper problem related to the intimate relationship between the gems used by the puma master process and those used by the puma worker processes.

Users can, in part, separate the gems used in workers from those used in the puma master by turning on prune_bundler. That option will "eject" the puma master process from the full Bundler context, re-activating only the small number of gems used by the puma master process, including puma, nio4r, and gems added to extra_runtime_dependencies. The GemNotFound issue, however, reveals that even prune_bundler isn't always sufficient to sever the relationship between the gems used by the master and worker processes. In addition to being unable to delete gems from old releases, the existing prune_bundler-based solution doesn't allow a user to upgrade the version of nio4r or any gem in extra_runtime_dependencies with a phased restart.
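As a rough illustration of the pruned-master setup described above (the extra dependency listed is just an example):

```ruby
# config/puma.rb
prune_bundler   # re-exec the master outside the application's full Bundler context
# Gems the pruned master itself still needs must be listed explicitly:
extra_runtime_dependencies ["puma_worker_killer"]
```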

The solution proposed in this PR is to place worker processes in a completely separate memory space. The proposed change uses spawn to spin up new workers (this is essentially a fork followed by an exec on Unix-like systems). Since workers use a separate memory space, the gems they load can be totally different than those loaded in the puma master process. The drawback, of course, is that this spawn-based strategy doesn't allow one to take advantage of COW memory between the puma master and the puma workers. For this reason, we still use the old fork method in two cases (a rough sketch of both approaches follows the list):

  1. If preload_app is enabled, since phased restarts aren't supported with preload_app anyway. This option behaves as it did before: it forgoes the ability to perform phased restarts in exchange for minimizing the amount of memory used by fully leveraging the benefit of COW memory.
  2. When workers fork from other workers (when fork_worker is enabled). Similar to preload_app, the main benefit of this option is to reduce memory by leveraging COW memory. The new spawn-based strategy is used only when creating worker 0, the process from which other workers fork. Subsequent workers are forked in the traditional way from worker 0.
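A minimal, plain-Ruby sketch of the difference between the two strategies (this is not Puma's actual code; run_worker, the Gemfile path, and the worker command are placeholders):

```ruby
require "rbconfig"

# fork: the child starts as a copy of the master, so every gem the master has
# already activated (nio4r, etc.) is loaded in the worker from the old release.
pid = fork { run_worker }   # placeholder worker entry point

# spawn (fork + exec on Unix-like systems): the child is a fresh process image
# and activates its own gems, so it can use the current release's bundle.
pid = Process.spawn(
  { "BUNDLE_GEMFILE" => "/srv/myapp/current/Gemfile" },  # illustrative env
  RbConfig.ruby, "/srv/myapp/current/bin/worker"         # illustrative command
)
Process.detach(pid)
```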

What does this mean for prune_bundler? Before this change, it was necessary to use prune_bundler if you wanted phased restarts and the ability to change gem versions between different releases of your application. After this change, prune_bundler won't be necessary in this case. prune_bundler will essentially just be a memory optimization for the puma master process since it'll shed some gems from the puma master.

Your checklist for this pull request

  • I have reviewed the guidelines for contributing to this repository.
  • I have added an entry to History.md if this PR fixes a bug or adds a feature. If it doesn't need an entry to HISTORY.md, I have added [changelog skip] or [ci skip] to the pull request title.
  • I have added appropriate tests if this PR fixes a bug or adds a feature.
  • My pull request is 100 lines added/removed or less so that it can be easily reviewed.
  • If this PR doesn't need tests (docs change), I added [ci skip] to the title of the PR.
  • If this closes any issues, I have added "Closes #issue" to the PR description or my commit messages.
  • I have updated the documentation accordingly.
  • All new and existing tests passed, including Rubocop.

@MSP-Greg
Member

MSP-Greg commented Oct 3, 2020

Re the last CI run, the only MRI failure (ubuntu-18.04, Ruby 2.6) was due to incorrect code that had never been noticed before. See b17579010ab. Due to timing & luck, it had passed quite a bit previously...

@cjlarose
Member Author

cjlarose commented Oct 3, 2020

@MSP-Greg Thanks for that! I rebased to include your commit.

...in clusters that support phased restarts.

If worker processes are created from the puma master process with just a
`fork`, then any gems loaded by the puma master are already "activated"
in the workers. This is mitigated in part through the use of the
`prune_bundler` configuration option, which ejects the puma master from
the application bundler context on startup, unloading all gems except
those used by the puma master process (typically, that's nio4r as well
as anything specified in `extra_runtime_dependencies`).

A fairly typical release strategy demonstrates the incompleteness of
`prune_bundler`. Imagine that in order to release a new version of an
application, users add the new version to the server, change a release
symlink to point to the new version, initiate a phased restart on the
cluster, and then delete any old version of the application from disk
(including the gems). When a worker starts up and executes
`Bundler.setup`, it'll try to verify that each gem with native
extensions has been compiled by inspecting the gem directory on disk.
Since workers necessarily share the gems activated by the puma master,
Bundler will look for the gems on disk associated with the release that
was active at the time the puma master process started. If those gems
have since been deleted, Bundler raises `GemNotFound` and the workers
fail to boot.

This happens most frequently with the nio4r gem since it 1) is loaded
by the puma master process (even when `prune_bundler` is enabled) and
2) has native extensions. But the same can happen to any gem that
matches those two conditions, such as the `ffi` gem if the puma master
is configured to use `puma_worker_killer`.

The fix introduced in this commit creates workers using `spawn`
instead of `fork`. `spawn` creates a new subprocess with a separate
memory space (`spawn` uses `fork` then `exec` on Unix-like
systems). The result is that workers share no gems with the puma master.
Workers can therefore start up successfully even if the master's gems
have been removed from disk. It's also possible to use different
versions of gems in the workers than what's loaded into the master.

The cost is some additional memory usage since workers that fork from
the master process can't leverage COW memory optimization. However, the
old `fork`-based method is still used in two cases:

1) If `preload_app` is enabled, since phased restarts aren't supported
   with `preload_app` anyway. This option behaves as it
   did before: it forgoes the ability to perform phased restarts but
   maximizes the opportunity to use COW memory.
2) When workers fork from other workers (when `fork_worker` is enabled).
   Similar to `preload_app`, the main benefit of this option is to
   reduce memory by leveraging COW memory. The new spawn-based strategy
   is used only when creating `worker 0`, the process from which other
   workers fork.

Workers created by `fork`ing from the puma master process inherit all of
the signal handlers established in `Launcher#setup_signals`. But
since workers can now be created via `spawn`, those workers do not
automatically inherit those signal handlers, so we set up SIGINFO here
explicitly. Other signal handlers are established already by
`Worker#run`.
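Roughly, re-establishing that handler in a spawned worker looks like the following sketch (the logging helper is a placeholder; the guard is needed because SIGINFO only exists on BSD-derived platforms such as macOS):

```ruby
# Spawned workers don't inherit the master's Signal.trap handlers, so SIGINFO
# has to be registered again inside the worker process itself.
if Signal.list.key?("INFO")
  Signal.trap("INFO") do
    Thread.list.each { |t| log_thread_backtrace(t) }  # placeholder helper
  end
end
```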

Because workers are created in a new process image via `spawn`, they
can't easily read an in-memory Configuration object in the puma master
process.

Instead, the workers need to read the configuration from the
command-line args or from a config file.

The original motivation for `extra_runtime_dependencies` was to
activate gems in the puma master process after "pruning" the master
process with `prune_bundler`. This is useful for activating gems that
need to be loaded in the master process, such as `puma_worker_killer`.

The change to use `spawn` to create new worker processes does _not_
break this: `extra_runtime_dependencies` still works in that the gem
will still be activated as expected in the puma master process. However,
a side-effect of `extra_runtime_dependencies` was that those
dependencies would be loaded in puma worker processes as well (since
workers were just forked from the master).

The integration test was testing that `extra_runtime_dependencies` were
loaded in the worker processes instead of in the master process. This
change modifies the test to make assertions on the master processes'
gems directly.

It's no longer necessary to support phased restarts.

@nateberkopec
Member

prune_bundler will essentially just be a memory optimization for the puma master process since it'll shed some gems from the puma master.

Is that even worth it? The amount of code/headache from that feature is considerable. We could eject this feature.

@nateberkopec
Member

What effects will this have on how signal handling currently works in Puma? I'm particularly thinking around sending signals to the master process.

@cjlarose
Member Author

cjlarose commented Oct 7, 2020

prune_bundler will essentially just be a memory optimization for the puma master process since it'll shed some gems from the puma master.

Is that even worth it? The amount of code/headache from that feature is considerable. We could eject this feature.

Yeah, I think the removal of prune_bundler is worth considering. We can do some more testing and measuring separately on whether or not it really is much of a memory optimization, but my guess is that the cost of maintaining it would no longer be worth the benefit it provides.

Importantly, though, I think this PR doesn't introduce any breaking changes as far as semver goes. Removing prune_bundler would, so I think we'd have to wait at least until 6.0 to do it.

@cjlarose
Member Author

cjlarose commented Oct 7, 2020

What effects will this have on how signal handling currently works in Puma? I'm particularly thinking around sending signals to the master process.

I'll look more into this.

@schneems
Contributor

schneems commented Oct 7, 2020

It feels like this is a downstream fix to an upstream problem. Is there not some bundler or Ruby API that we can use to tell our process to do the right thing?

This is, in part, due to a quirk in Bundler where Bundler will raise this error if you're trying to activate a gem whose native extensions are no longer present on disk (even if the gem is already loaded into memory)

Could you reproduce this behavior outside of puma using only bundler? If so then we could report it upstream to bundler, and maybe we can also brainstorm on some workarounds here.

My initial gut reaction is: I want shared memory whenever possible. Processes are expensive (from a memory standpoint) so I want to benefit from CoW whenever possible.

@cjlarose
Member Author

cjlarose commented Oct 7, 2020

Could you reproduce this behavior outside of puma using only bundler? If so then we could report it upstream to bundler, and maybe we can also brainstorm on some workarounds here.

Yeah I can create a small reproducible script for this and report it to Bundler.
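For reference, such a reproduction might look roughly like the following (an untested sketch; the script actually filed upstream may differ):

```ruby
# Gemfile contains: gem "nio4r"
require "bundler/setup"   # activates the bundle, including nio4r
require "nio"             # nio4r (and its native extension) is now loaded in memory
require "fileutils"

# Simulate a deploy tool deleting the old release's gems from disk
FileUtils.rm_rf(Gem.loaded_specs["nio4r"].full_gem_path)

# Re-run Bundler's setup the way a freshly booted worker would
Bundler.reset!
Bundler.setup   # expected to raise Bundler::GemNotFound, per the report above
```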

My initial gut reaction is: I want shared memory whenever possible. Processes are expensive (from a memory standpoint) so I want to benefit from CoW whenever possible.

I think you might be right. This change is essentially a trade-off that'll cost puma more memory, but gives us the following:

  • The ability for users to upgrade gems in worker processes like nio4r without downtime (using a phased restart on cluster-mode puma)
  • Related: prevent some kinds of issues related to the application using different versions of gems from the puma master (especially the json gem, see Regression in 4.3.3 with require json #2206)
  • The potential to remove code related to prune_bundler and relieve ourselves of the maintenance burden of supporting it. This can simplify configuration for users, especially those that rely on extra_runtime_dependencies

Fixing the bug in Bundler (or finding an appropriate workaround in puma) wouldn't give us these other benefits, but it would help make it possible for folks who are stuck on 3.x to upgrade. I'm totally cool with accepting that the proposed change might not be worth the cost. Also, I should mention that it's also totally possible to fix the original problem described in #2018 by just not loading the nio4r gem in the master at all (workers need the gem, but the master never needs it). I proposed this in #2125 and it was initially rejected, but I think it's probably worth reconsidering since it'll help real people migrate their apps to puma 4.x or 5.x basically immediately.

@cjlarose
Member Author

cjlarose commented Oct 8, 2020

Opened rubygems/rubygems#4004 for the Bundler bug (bundler shares a repository with rubygems these days, who knew?). We can see what the maintainers say there and decide how to move forward based on that.

@nateberkopec
Member

Just in general, removing fork from Puma's base behavior is going to have to clear a very high bar to make me not nervous. It's up there with "we parse HTTP w/the old mongrel extension".

@nateberkopec
Member

@cjlarose I think you're right about re-opening #2125. What do you think - close this, fix the bundler bug (or see where that goes), and reopen/merge 2125?

@cjlarose
Member Author

That sounds good. Let's close this and I'll take another stab at #2125 (I'll open a separate PR I think and fix the tests and rebase and everything). If the bundler bug gets fixed, too, that's great, but we don't need to make that our blocker.


@rmacklin
Contributor

Let's close this and I'll take another stab at #2125 (I'll open a separate PR I think and fix the tests and rebase and everything).

For reference, that newer PR is #2427

Labels
feature, waiting-for-review (waiting on review from anyone)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

phased-restart: Could not find nio4r-2.5.1 in any of the sources (Bundler::GemNotFound)
6 participants