
Experiment: Fork from existing worker process to optimize copy-on-write performance #2099

Merged
merged 3 commits into puma:master on May 11, 2020

Conversation

wjordan
Contributor

@wjordan wjordan commented Jan 4, 2020

This is still rather experimental, but I thought I'd share this work in progress, along with some preliminary results, to spark discussion on whether this optimization seems interesting enough to anyone else to pursue further.

Description

This PR adds a config option (fork_worker) to spawn new worker processes by forking an existing worker process instead of the master process. It also adds a refork command (currently implemented via a URG signal) which re-forks all processes in the cluster from this original worker.

Because the existing worker has been serving live requests, forking new workers from this fully-loaded process can potentially reduce memory usage by further optimizing copy-on-write performance across processes, as a step beyond what the preload_app! directive accomplishes at initialization.

Demonstration

The ideal use-case for this technique involves a Rack application that loads a lot of persistent (non-garbage-collected) content into memory at runtime, instead of at initialization-time where preload_app! is able to load it before the initial fork. (Think of dynamically generated data or code that loads only after particular user inputs, or an in-memory object cache, etc..)

Here's a minimal Rack application which demonstrates the optimization:

# config.ru
run lambda { |_env|
  $big_object ||= Array.new(10_000_000) {Object.new}
  [200, {}, ["Hello world"]].tap{3.times{GC.start}}
}

# config/puma.rb
workers 8
state_path 'puma.state'
activate_control_app
daemonize
fork_worker

To test and measure 'real' memory usage, compare the total pss across all workers with a good amount of request traffic, before and after re-forking cluster workers using the added refork command.

# Total pss of all workers
function pss() {
  bundle exec pumactl stats | ruby -rjson -e '
    puts JSON.parse(ARGF.readlines.last)["worker_status"].map {|w| w["pid"]}.
      sum{|p| File.read("/proc/#{p}/smaps_rollup").match(/Pss:\s+(\d+)/)[1].to_i}
  '
}

$ bundle exec puma
$ wrk http://localhost:9292 -d 20 -H "Connection: close"
$ pss
> 4933788
$ bundle exec pumactl refork
$ wrk http://localhost:9292 -d 20 -H "Connection: close"
$ pss
> 1053686

Memory usage in this (obviously very contrived) example is reduced from ~5GB to ~1GB, or an ~80% reduction in total memory usage across 8 workers.

Obviously most real-world use-cases won't see such a dramatic difference, but this might still be a useful optimization in some situations. I look forward to testing on some large Rack applications to see how useful it is in practice.

Production demo

I ran a rough comparison test on a production Rails app running 40 workers on a single instance, measuring memory usage after serving traffic for 15 minutes after a restart (for 'Baseline'), then restarting workers and serving traffic for another 15 minutes (for 'Optimized'):

Baseline

Per-worker Usage (MB): 2170 Rss, 1530 Pss, 655 Shared

Optimized

Per-worker Usage (MB): 2168 Rss, 1177 Pss (-23%), 1041 Shared (+59%)

This demo shows a ~20% improvement in memory usage on this particular application. (Note this app is running Ruby 2.5; the compacting GC in Ruby 2.7 may enable even greater CoW efficiency.)

@@ -337,8 +351,7 @@ def stop
def stop_blocked
@status = :stop if @status == :run
wakeup!
@control.stop(true) if @control
Process.waitall
sleep 0.2 until @status == :halt
Member

Purpose?

Contributor Author

Process.waitall no longer completes here, because during a re-fork one of the parent's children is the promoted process which doesn't exit but instead becomes the new parent.

Instead, the halt status is set after the promotion process is complete and all other worker-children have terminated.
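
A minimal sketch of the wait logic described above (not Puma's actual implementation; wait_for_old_workers is a hypothetical helper name):

# Reap every terminating worker child, but skip the promoted child,
# which never exits and instead becomes the new parent.
def wait_for_old_workers(worker_pids, promoted_pid)
  worker_pids.each do |pid|
    next if pid == promoted_pid  # keeps running as the new parent
    Process.wait(pid)            # reap the exiting worker
  end
end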

@nateberkopec
Member

Why did I never know about SIGURG? :headdesk:

I'm a little nervous about the way this restart is implemented; there are some random changes in here that I think will mess up restart state. That can be fixed with some test coverage ofc.

This introduces a full hard-restart, though, right? Who is going to be willing to trade memory savings for 60-120 seconds of hard downtime?

@@ -64,30 +63,19 @@ def start_control
control.min_threads = 0
control.max_threads = 1

case uri.scheme
Member

Wait, this is possible? Can we do this in a separate PR?

Contributor Author

Extracted this change to #2111.

@@ -184,7 +184,6 @@ def run
when :exit
# nothing
end
@binder.close_unix_paths
Member

:|

Contributor Author

@wjordan wjordan Feb 7, 2020

This reflects a bit of refactoring: this line caused the old parent to delete the Unix socket when shutting down during a promote operation, which in this PR needs to be skipped. The call here is made unnecessary by this change to Cluster#run, which includes a call to close_binder_listeners (which also cleans up any Unix sockets) but is skipped during promote.

Some of this refactoring could be extracted to a separate PR without the promote logic.

Contributor Author

See #2112 for the extracted listener-closing refactoring part of this PR, which can be considered separately from the rest of the promote logic.

@wjordan
Contributor Author

wjordan commented Feb 7, 2020

This introduces a full hard-restart, though, right? Who is going to be willing to trade memory savings for 60-120 seconds of hard downtime?

No, it just forks a new set of worker processes from one of the already-running workers, which is nearly instant; the app doesn't need to be reloaded from scratch at all.
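
A rough conceptual sketch in plain Ruby (not Puma's internals) of why this is nearly instant: the replacement workers are forked from a process whose memory is already loaded, so they start out sharing its pages copy-on-write rather than booting the app again.

# Not Puma's code - a warm process forks its own replacements.
warm_worker = fork do
  $big_object = Array.new(1_000_000) { Object.new }  # state loaded at runtime
  3.times { fork { puts $big_object.size; exit! } }  # "refork": children share pages CoW
  Process.waitall                                    # reap the forked children
end
Process.wait(warm_worker)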

@wjordan wjordan mentioned this pull request Feb 8, 2020
@wjordan wjordan force-pushed the promote branch 2 times, most recently from 81624bd to 8872e88 on February 11, 2020 22:48
@wjordan
Contributor Author

wjordan commented Feb 11, 2020

I'm a little nervous about the way this restart is implemented; there are some random changes in here that I think will mess up restart state. That can be fixed with some test coverage ofc.

I completed a pass through to minimize the changes. It's now based off two smaller PRs (#2111, #2112) that have some related fixes extracted out, leaving the promote-specific logic in a single commit (8872e88) that's more concise. I've also added a simple integration-test ensuring the promote control command works properly on a cluster.

Member

@evanphx evanphx left a comment

I think the biggest problem with the approach is how someone would actually use it. I can't figure out how a production system would know when to issue the promote call. If we were going to do this, it would need to be automatic instead.


def promote
Process.kill "URG", @pid
Process.detach @pid
Member

Process.detach should never be used. Its implementation is extremely bad as it spawns a Thread just to perform a wait in a loop.
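
For context, Process.detach is roughly equivalent to the following simplified sketch (detach_sketch is just an illustrative name), which shows the per-child thread being objected to:

# Simplified approximation of what Process.detach does: spawn a thread
# whose only job is to reap the child so it never becomes a zombie.
def detach_sketch(pid)
  Thread.new { Process.wait(pid) }
end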

@nateberkopec
Member

I was thinking about testing this on CodeTriage with promotion after a few thousand requests

@wjordan wjordan force-pushed the promote branch 2 times, most recently from 52f40e9 to 7c978e8 on February 13, 2020 23:00
@wjordan
Contributor Author

wjordan commented Feb 13, 2020

I did another pass on the implementation to simplify further and make the diff smaller. Instead of a new 'promote' signal, in addition to the new refork command there's now a configuration option fork_worker that, when enabled, causes new workers to be forked from worker 0 instead of from the master; they join the existing cluster as normal.

This approach leverages the existing phased-restart logic to cycle workers one at a time for copy-on-write improvements with minimal impact to a live site.

@wjordan wjordan changed the title Experiment: Promote worker process to optimize copy-on-write performance Experiment: Fork from existing worker process to optimize copy-on-write performance Feb 13, 2020
@nateberkopec nateberkopec added the waiting-for-review label Feb 21, 2020
@nateberkopec
Member

@wjordan Are those 4 new commits you added to this branch required to make the original promote feature pass the tests?

If not, could you break all 4 of those into separate PRs?

@nateberkopec
Member

So, we're also going to need a new "worker recycle" option to go with this feature. Essentially, after serving X requests, restart/refork all workers gradually.

This option should only be possible with reforking, and it should only happen once. This is so that people don't use it to create their own puma_worker_killer.

e.g. refork_after_request_count = 2000. I'll come up with a sensible default with some production testing on CodeTriage.

@nateberkopec nateberkopec added the waiting-for-changes label and removed the waiting-for-review label Feb 25, 2020
@wjordan
Contributor Author

wjordan commented Feb 25, 2020

@wjordan Are those 4 new commits you added to this branch required to make the original promote feature pass the tests?
If not, could you break all 4 of those into separate PRs?

Those extra commits were from #2123, which I rebased this PR on; I thought some of that work might have fixed some of the test failures in this PR. I've rebased this work off master again and it looks like the tests are passing, so it might have been unrelated.

@nateberkopec nateberkopec added this to the 5.0.0 milestone Feb 27, 2020
@nateberkopec
Member

Deploying to CodeTriage production to give it a shot.

As for automatic, I think what we should provide is probably an option to do this after a certain number of requests have been processed (I'm not sure what # is required, but we'll see after testing in prod).

What I'm curious about is if production traffic might expose that these gains are temporary. CoW in Ruby is not great right now. I'm wondering if benchmarking against the same route over and over exaggerates the possible gain because the amount of object allocation is predictable and probably very low. Prod loads will let us know the answer.

@nateberkopec nateberkopec added the waiting-for-review label and removed the waiting-for-changes label Mar 20, 2020
@nateberkopec
Member

nateberkopec commented Mar 20, 2020

So I tried it and worker 0 reforked? I think that's the opposite of what's supposed to happen?

@nateberkopec
Member

CodeTriage is also maybe not a great app to test this on because it has such low memory usage to begin with.

@nateberkopec
Member

I think this is interesting enough that we can still include it as an "experiment" in 5.0, with the changes required as I described.

@nateberkopec nateberkopec added the waiting-for-changes label and removed the waiting-for-review label Mar 20, 2020
lib/puma/dsl.rb Outdated
# `preload_app` is not required when this option is enabled.
#
# @note Cluster mode only.
def fork_worker(answer=true)
Member

Should be false by default

@nateberkopec
Member

Probably should also provide some guidance in docs about forking from a dirty state - connection pools, state, etc.
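
For example, a sketch of the kind of guidance meant here (assuming a Rails app using ActiveRecord), using Puma's existing on_worker_boot hook so that per-process resources are re-established after any fork, reforks included:

# config/puma.rb
on_worker_boot do
  # Re-establish per-process state after any fork (connection pools are
  # the classic example; adapt to whatever state the app holds).
  ActiveRecord::Base.establish_connection if defined?(ActiveRecord::Base)
end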

@wjordan
Contributor Author

wjordan commented Apr 2, 2020

Thanks @nateberkopec for spending time testing out this branch! I'll try and find time soon to do another pass on this PR based on the feedback.

So I tried it and worker 0 reforked? I think that's the opposite of what's supposed to happen?

Yeah, worker 0 shouldn't reload itself when sent the refork command (URG signal), only when sent a normal phased_restart command (USR1 signal).
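
Equivalently, sending the signals directly from Ruby (assuming master_pid holds the Puma master's process id):

# Assumes master_pid is the Puma master's pid.
Process.kill("URG",  master_pid)  # refork: reloads all workers except worker 0, forking from worker 0
Process.kill("USR1", master_pid)  # phased restart: reloads every worker, including worker 0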

Here's the output I get when starting a test app and running pumactl refork in another terminal. This is what I'd expect to happen: a 'phased restart' reloading all workers except worker 0:

Puma log
[10074] Puma starting in cluster mode...
[10074] * Version 4.3.1 (ruby 2.6.0-p0), codename: Mysterious Traveller
[10074] * Min threads: 0, max threads: 16
[10074] * Environment: development
[10074] * Process workers: 8
[10074] * Phased restart available
[10074] * Listening on tcp://0.0.0.0:9292
[10074] Use Ctrl-C to stop
[10074] * Starting control server on unix:///tmp/puma-status-1585851969025-10074
[10074] - Worker 0 (pid: 10123) booted, phase: 0
[10074] - Worker 1 (pid: 10133) booted, phase: 0
[10074] - Worker 2 (pid: 10138) booted, phase: 0
[10074] - Worker 3 (pid: 10143) booted, phase: 0
[10074] - Worker 4 (pid: 10149) booted, phase: 0
[10074] - Worker 5 (pid: 10155) booted, phase: 0
[10074] - Worker 6 (pid: 10159) booted, phase: 0
[10074] - Worker 7 (pid: 10168) booted, phase: 0
[10074] - Starting phased worker restart, phase: 1
[10074] + Changing to [DIR]
[10074] - Stopping 10133 for phased upgrade...
[10074] - TERM sent to 10133...
[10074] - Worker 1 (pid: 10588) booted, phase: 1
[10074] - Stopping 10138 for phased upgrade...
[10074] - TERM sent to 10138...
[10074] - Worker 2 (pid: 10607) booted, phase: 1
[10074] - Stopping 10143 for phased upgrade...
[10074] - TERM sent to 10143...
[10074] - Worker 3 (pid: 10627) booted, phase: 1
[10074] - Stopping 10149 for phased upgrade...
[10074] - TERM sent to 10149...
[10074] - Worker 4 (pid: 10645) booted, phase: 1
[10074] - Stopping 10155 for phased upgrade...
[10074] - TERM sent to 10155...
[10074] - Worker 5 (pid: 10690) booted, phase: 1
[10074] - Stopping 10159 for phased upgrade...
[10074] - TERM sent to 10159...
[10074] - Worker 6 (pid: 10708) booted, phase: 1
[10074] - Stopping 10168 for phased upgrade...
[10074] - TERM sent to 10168...
[10074] - Worker 7 (pid: 10724) booted, phase: 1

What is the unexpected output you saw?

@nateberkopec
Member

NVM my comment - I got the output you got.

@wjordan
Contributor Author

wjordan commented May 2, 2020

Finished another pass on this feature. Changes:

  • Added some documentation for the new experimental feature
  • Added an argument to fork_worker to auto-refork once after worker-0 serves a specified number of requests (default 1000 for now); see the config sketch after this list.
  • Attempt to address issues with forking worker-0 from a dirty state. Instead of just forking as-is (which could leak open file descriptors and cause undefined behavior if user application-code is running mid-fork), worker-0 will first shut down its running Server object before forking, and start it up again only after the full (re)start is completed.
    • This has the slight (but acceptable for now?) drawback of n-1 worker availability during re-fork, and n-2 during a normal phased-restart (where it's normally n-1). I added a paragraph in the documentation covering this limitation.
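
A minimal config sketch of the auto-refork argument described in the first bullet above (the worker count is an arbitrary example value; 1000 is the default mentioned in this comment):

# config/puma.rb
workers 4
fork_worker 1000  # fork new workers from worker 0, and auto-refork once
                  # after worker 0 has served 1000 requests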

Trigger refork for improved copy-on-write performance.
wjordan and others added 2 commits May 1, 2020 20:05
@nateberkopec nateberkopec added the waiting-for-review label and removed the waiting-for-changes label May 7, 2020
@nateberkopec nateberkopec merged commit 0e7d6fa into puma:master May 11, 2020
@nateberkopec
Member

Thanks for your hard work here @wjordan - I'm really excited to test out this feature in the 5.0 beta.
