Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #1842 #1945

Closed
wants to merge 6 commits into from
Closed

Fix #1842 #1945

wants to merge 6 commits into from

Conversation

nateberkopec
Copy link
Member

@nateberkopec nateberkopec commented Sep 2, 2019

@nateberkopec
Copy link
Member Author

This is almost there for me locally. It's now only failing on the phased restart test, because something appears to be closing the listener unnecessarily (no idea what that is yet but shouldn't be too hard to figure out.)

@@ -193,12 +193,12 @@ def test_phased_restart_via_pumactl
done = true
end
end
# Stop

assert File.exist? @bind_path
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Fudoshiki Why did you originally insert this assertion after the stop command is issued? Shouldn't stopping the server remove the binder?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need this assert now

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we? The original issue was that this file was removed after a phased restart was begun.

Copy link
Contributor

@Fudoshiki Fudoshiki Sep 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My wrong, we should keep this

  • If we call phased-restart - the file must exist, connections from old workers should close
  • If we stopping the server, the file must be deleted

Maybe we need add same assert in until block above too
And refute File.exist? @bind_path after

ccli = Puma::ControlCLI.new ["-S", @state_path, "stop"], sout

@nateberkopec
Copy link
Member Author

Possible clue in the output

[94299] - TERM sent to 94301...
[94299] - TERM sent to 94301...
[94299] - KILL sent to 94301...

This is intentional, as the test is trying to trigger a long response (sleep 40 seconds) and then have the phased restart mechanism timeout the worker trying to restart it, but it's possible that this triggers binding.close somewhere.

@nateberkopec nateberkopec added this to the 4.1.1 milestone Sep 3, 2019
@nateberkopec nateberkopec mentioned this pull request Sep 3, 2019
@nateberkopec
Copy link
Member Author

Okay, I think I've got it?

Basically we're calling binder.close all over the place in server.rb, always have been. However, when phased restarting, each worker will eventually call binder.close. That's fine for the first worker but the second will fail.

Previously the @own_binder logic in #1685 prevented phased restarting worker processes from calling binding.close, because it wasn't their "own" binder. Makes sense.

So, we just have to figure out how to add that back in, in a way that also closes the socket when we are shutting down in a non-phased-restart scenario.

@nateberkopec
Copy link
Member Author

Things I know:

  • The tests are correct now and seem to be describing the correct behavior.
  • Do not accept new requests on shutdown #1685 made a big mistakey by removing the own_binder logic.
  • Just randomly removing code and hoping the tests pass is a Bad Idea without a complete understanding of the situation with Puma.
  • Binder closing is surprisingly complicated, as it needs to work in tons of different scenarios - phased restart, regular restart, shutdown in a cluster, shutdown in single mode, etc.

What I'm not sure of:

  • The code path for a phased restart/shutdown is different when using signals than when using control-cli over a socket, is this bug only present on one codepath or the other?
  • own_binder logic isnt really working because now no-one is closing the socket correctly, as every cluster runner inherits its listener from the launcher.

@MSP-Greg
Copy link
Member

MSP-Greg commented Sep 3, 2019

Similar to what you're thinking, in the past I've wondered why Binder has @ios and @unix_paths, with @listeners containing very similar information.

Kind of goes hand in hand with the similarity between Binder#close and Launcher#close_binder_listeners...

@nateberkopec nateberkopec removed this from the 4.1.1 milestone Sep 5, 2019
@nateberkopec nateberkopec deleted the socket-removed-after branch March 14, 2020 21:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants