Notify client close #310

Merged: 1 commit merged into Pylons:master on Oct 31, 2020

Conversation

viktordick

Implementation of the idea discussed in #308. The tests don't pass yet and I should also implement additional tests for this, but hopefully this helps with the discussion.

@viktordick
Author

@bertjwregeer Is there a chance that this PR will be accepted if I fix the tests and add new ones for the new feature? Or do you think our request is too exotic and the added complexity should be avoided?

@digitalresistor (Member) left a comment

I am not opposed to adding the feature, but not in this current form.

  1. Adding an extra lock for something most applications won't use adds unnecessary overhead.
  2. I'd want this to be controlled by a flag set in the configuration (i.e. a new adjustment), so that app owners have to opt in to this behaviour; I do not like the idea of this being the default.
  3. I would be okay with a new key being added to the environment that is a callable the application can use to check whether a channel is now closed because of a client disconnection. This is the suggestion that @mmerickel made.

The task already has access to the channel, and since WSGITask creates the environ, adding a function like this:

def check_alive(self):
    return self.channel.client_disconnected

would be simple.

And in handle_read() we set client_disconnected if the client has disconnected (i.e. the read returns 0). Then in Task we add a check in finish() to see if self.channel.client_disconnected is true, to raise ClientDisconnected so that the logic in HTTPChannel.service() does the right thing to tear everything down.
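
Sketched out, the finish() side of that could look something like this (illustrative only, assuming the client_disconnected flag and the existing ClientDisconnected exception):

def finish(self):
    # bail out if handle_read() flagged a disconnect, so that
    # HTTPChannel.service() tears the channel down
    if self.channel.client_disconnected:
        raise ClientDisconnected
    ...  # existing finish() logic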

Regarding number 2, I don't mind the extra knobs being available in the WSGI environment by default, although adjustments is available inside of Task as well, so it could be added to the WSGI environment conditionally.

src/waitress/channel.py: inline review comment (outdated, resolved)
@viktordick
Author

@bertjwregeer Thanks for the review.

Regarding 3:
I guess there are three possibilities:

  A) The application inserts a callable into the environment that waitress then calls (in the main thread, not in the worker thread!).
  B) Waitress inserts a flag into the environment once the connection is closed and the application checks for it.
  C) Waitress inserts a callable into the environment that the application can use to check for client disconnection.

I was thinking that A) would be easiest to deal with on the application side, especially if we are in a situation where we spawned another process or are currently waiting for an SQL database query to be performed. When triggered, we could then terminate the spawned process or the database worker process.

However, this has the disadvantage mentioned above that the handler would run in (and block) waitress's main thread. And maybe it also adds unnecessary complexity (perhaps the additional lock was only necessary due to this approach and is not required with C, but I am not sure yet). I will try to think about this a bit more.

Regarding 1:
I think the additional lock is necessary because handle_read runs in the main thread while service runs in the task thread, and before this change they were never executed simultaneously for the same channel (while a request was being processed, readable always returned False). I will think about this a little more - maybe with strategy C it is not necessary. But I do not think it depends on the strategy: it is simply because we now have one thread that may create new requests (the main thread) and one that processes them, and they may now run simultaneously, which they did not do before.

Regarding 2:
I was not sure how to implement a flag that can be set via PasteDeploy, so in this draft I used the global variable max_requests_per_channel, which restores the default behaviour when set to 0 (the name should probably be changed to something like max_lookahead_requests). If I understand correctly, I can expose this as one of the parameters to waitress.serve and give it a default value of 0 there, which keeps the default behaviour unless explicitly enabled. I will try it.
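
Something like this, if I understand the adjustments machinery correctly (a sketch only, using the draft parameter name from this PR, which may still change):

from waitress import serve

def app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello"]

# 0 (the default) keeps the current behaviour; a positive value enables the lookahead
serve(app, listen="127.0.0.1:8080", max_requests_per_channel=0)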

@viktordick
Author

Addendum: Regarding the difference between strategies B and C, I guess with C we do not need to make sure to update the environment of the currently running task as well as of any queued tasks. However, I guess we do not want to start any more tasks anyway if the client disconnected, do we? In any case, C seems to be superior to B. (C was also the strategy that @mmerickel suggested.)

@viktordick
Author

@bertjwregeer I tried to implement the requested changes. There is still a lot to be done regarding tests and documentation, but is the general idea now something that might be accepted?

Now that I looked over it again, the additional lock I had in the original attempt was due to race conditions with self.current_task, which was needed in order to implement strategy A as mentioned above. It is no longer needed with C (the remaining race condition is between received() adding requests to be processed and service() removing them, but self.requests.extend(requests) is already protected by the GIL).

@viktordick
Author

@bertjwregeer I changed the implementation so that service() only processes one request and re-inserts itself using add_task if there are more requests. This requires a lock to cleanly communicate whether the main thread or the task thread should call add_task, and to prevent race conditions. However, I am pretty sure I can re-use the task_lock for this. It is no longer necessary to guard the complete service method with this lock: by only re-inserting the next task at the very end of the function, it is ensured that no two requests of the same channel are processed in the wrong order or in parallel.
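
Schematically, the new flow in service() is something like this (heavily simplified, not the actual diff):

def service(self):
    # process exactly one request
    request = self.requests[0]
    task = self.task_class(self, request)
    task.service()
    with self.task_lock:
        # drop the finished request; if more are already queued,
        # re-add the channel so the next one is dispatched to a worker
        self.requests.pop(0)
        if self.requests:
            self.server.add_task(self)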

However, this makes the diff against master rather large, since most of the body of service() changed its indentation level.

One change I am not sure about is the handling of the output buffers. Due to the change, they are flushed after each request separately. Before, if the client sent multiple short requests in a pipeline such that they would be picked up with a single read(), the output buffers might only be flushed once all of them were processed (I think).

I made some obvious adjustments to the tests. I also dropped test_service_no_requests since it no longer works (service() now assumes that there is a request to be processed); in any case it did not cover a real-life situation, because add_task is only called if there is a request to be processed.

There are two more failing tests that I will check next. And, of course, I need to add tests for the new functionality.

src/waitress/channel.py: 3 inline review comments (outdated, resolved)
src/waitress/task.py: inline review comment (outdated, resolved)
@digitalresistor
Member

@viktordick thanks for updating it, I haven't taken a look yet but I saw that @mmerickel provided some feedback. I've been busy moving; hopefully once things calm down a little bit I'll get back to this and help figure out the best path forward.

@viktordick marked this pull request as ready for review on September 13, 2020, 07:38
@viktordick
Author

@bertjwregeer, @mmerickel So I tried to read through the code in order to get a feeling of when each variable is used and came up with the following:

task.close_on_finish

  • set via ClientDisconnected while trying to write part of the response when the socket is closed, or when the task fails to write a header.
  • sets close_when_flushed and closes all pending requests after the current task has been processed.

close_when_flushed

  • makes the channel writable and not readable
  • in handle_write, if there are no more buffers to be flushed or if there is an error, it sets will_close

will_close

  • makes the channel writable and not readable
  • makes the final step in handle_write a call to handle_close, which cleans up buffers (these might even be on disk)
  • handle_close sets connected = False after everything is cleaned up

connected

  • Skips the execution of handle_write if False
  • If found to be False inside functions that may be called from the task thread, ClientDisconnected is raised
  • Is never set to True inside channel, only in wasyncore.dispatcher when the channel is recycled for a new connection

In summary: connected is the signal that everything is closed and finished and nothing should be done on the channel until the dispatcher reuses it for a new connection. Using connected directly from the main thread on client disconnection might fail to clean up output buffers and rely on the garbage collector to do this, which might miss buffers that have overflowed into files. Worse, it might contaminate the next connection that reuses the same Channel instance, although I have not dug deeply enough into the code to verify this.

Even though the tests were successful once I merged client_disconnected into not connected, I guess it is better to restore the separation? Maybe under another name and with additional comments explaining the difference.

Curiously, the tests were successful after I pushed 41155fc, but after I changed one comment that was not yet up to date, a test failed. There seems to be a race condition that I have been unable to reproduce so far. I am not sure if it was already there before or if the problem is related to the handling of connected.

Could any of you please confirm my conclusion or show me where I am mistaken?

@mmerickel
Member

Thanks @viktordick for going through this, I think I agree with most of your analysis. A channel is never reused so there is no concern about the buffer cleanup or connected getting reset to True.

From what I'm seeing, connected is the right hook to use here to indicate that the connection is dead and the client is gone. This is what your current code is using. It sounds like this is a concern for you but I'm not clear why or what you think another variable will help capture better? Is there a lifecycle that you want to try to catch before it is closed?

I noticed that there's a potential race condition in channel.received where we are using _flush_some to write out a response. The underlying assumption in that block of code is that if we're reading, then no requests are active, but of course this PR breaks that assumption.

@viktordick
Author

> Thanks @viktordick for going through this, I think I agree with most of your analysis. A channel is never reused so there is no concern about the buffer cleanup or connected getting reset to True.

But why do we then need close_when_flushed and will_close? ClientDisconnected is raised if the task tries to write part of the response into the output buffers and the client has disconnected. But why does service then go through all the steps instead of simply setting connected to False, emptying self.requests and returning? Even if handle_close should be called from the main thread instead of the task thread (which I guess is the case), this only explains the existence of will_close, not of close_when_flushed - I think.

I could imagine some sort of graceful shutdown which might use close_when_flushed to process all currently running tasks, but not accept any more. However, I do not see any code that might implement something like this.

> I noticed that there's a potential race condition in channel.received where we are using _flush_some to write out a response. The underlying assumption in that block of code is that if we're reading, then no requests are active, but of course this PR breaks that assumption.

Thanks, I will check this out.

@mmerickel
Member

mmerickel commented Sep 25, 2020

> But why do we then need close_when_flushed and will_close?

So there's a bunch of reasons to close that are triggered via client, wsgi app, or process:

  • client not using keep-alive (close_when_flushed = True)
  • unhandled wsgi app exception (close_when_flushed = True)
  • client unexpected hangup (connected = False)
  • graceful server-triggered shutdown (not well supported but triggered via Channel.cancel / will_close = True)

> I could imagine some sort of graceful shutdown which might use close_when_flushed to process all currently running tasks, but not accept any more. However, I do not see any code that might implement something like this.

This is sort of supported right now via the Channel.cancel method which is invoked from the threadpool (TaskDispatcher). However it might not be as graceful as you'd expect. Right now it stops ASAP versus after draining the pending requests. I think as we get further into supporting a real graceful shutdown that behavior would change.

@mmerickel
Member

Also yes, will_close is being used to ensure we are invoking handle_close from within the event loop and not from the channel's thread. Like I said above I think that its meaning would change if we added a better graceful shutdown. Likely to something closer to "close after flushing the current list of requests plus one we might be reading right now, and do not start reading any new ones".

@digitalresistor (Member) left a comment

I like where this is going. More graceful shutdown would be fantastic, and it's something that I would like to add, so keeping that in mind while continuing down this path would be great.

src/waitress/channel.py: inline review comment (outdated, resolved)
tests/test_task.py: inline review comment (outdated, resolved)
@viktordick force-pushed the notify-client-close branch 2 times, most recently from 3859888 to f3176a8 on October 2, 2020, 18:07
@viktordick
Author

I protected the writes into the outbufs in received() with the outbuf_lock, which seems to work well.

I'd say this PR is ready for a re-review. One check seems to be stuck - I had this problem with another PR on GitHub once, and it seemed to be necessary to close and re-open the PR to get the tests to run again. Should I do this here as well?

@digitalresistor
Member

Maintainers have the ability to hit the "re-run all jobs" button, so there is no need to close/re-open this PR.

@viktordick
Author

> So this may work for the second or third request in a pipelined setup, but we should endeavour to send a reply back as soon as possible.

Yes, that is what the code does now. If there are no other requests currently being executed or queued for execution, the 100-continue is sent as soon as the request that expects it has finished submitting its headers (as before). Only if the request is pipelined behind other requests that are still being executed is the 100-continue delayed until all tasks before it have finished, which is necessary to keep the order of responses correct.

So the 100-continue is always written as soon as it is possible to write it without messing up the order of the responses.

src/waitress/channel.py: 2 inline review comments (outdated, resolved)
@mmerickel
Member

@viktordick this is awesome, in my mind the only thing missing is a changelog entry!

This inserts a callable `waitress.client_disconnected` into the
environment that allows the task to check, at strategic points during
execution, whether the client disconnected while waiting for the
response, allowing it to cancel the operation.

It requires setting the new adjustment `channel_request_lookahead` to a
value larger than 0, which continues to read requests from a channel up
to the given count even if a request is already being processed on that
channel, since a client disconnect is detected by reading from a
readable socket and receiving an empty result.
@viktordick
Author

@mmerickel Thanks! I wrote a changelog entry and squashed the commits.

@bertjwregeer Are you also happy with the result? There is still a change request from you recorded on the PR, but I think everything mentioned there has been addressed in the meantime.

@digitalresistor
Member

Thanks for being patient and working through this. I want to read this critically one last time and make sure that I've convinced myself that there are no locking issues/race conditions.

Thanks for all your hard work!

@digitalresistor
Member

Thanks so much @viktordick for your patience and working through this. This looks great, and I have adequately convinced myself that this won't break!

@digitalresistor merged commit 31d7498 into Pylons:master on Oct 31, 2020
@viktordick
Author

Thanks a lot for the help and patience!

Is a release already planned sometime soon?

@digitalresistor
Member

@viktordick https://pypi.org/project/waitress/2.0.0b0/

I need testers to validate this doesn't break anything, but there is now a released version of waitress you can try in your environment.

@viktordick
Author

Thanks.
We are actually already using it - we simply put

git+https://github.com/perfact/waitress@2.0.0dev0.perfact.2

in our requirements.txt, which is essentially the current state of upstream master plus a commit that adds a custom tag. The number of systems that actually use it already is still small, though (I think no production system yet).

@wondie

wondie commented Oct 27, 2021

I couldn't use it with waitress serve in version 2.0. I get a stack trace ValueError: Unknown adjustment 'client_disconnected' when using it as serve(MyFlask.app, host="111.111.111", port=os.environ["PORT"], channel_request_lookahead=5, client_disconnected=run_on_exit). Is it implemented yet?

@mmerickel
Member

That's not how this feature works. You can't inject your own callable into the environ.

@wondie

wondie commented Oct 27, 2021

Then how is it used? I couldn't find any example of it.

@mmerickel
Member

When using waitress, the WSGI environ provides an extra key with a callable that can be invoked to see whether the client is still connected. It is not a callback; you have to poll it in your request handling code.
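
For example, a minimal sketch (the sleep loop just stands in for long-running work; the key only reflects a real disconnect when channel_request_lookahead is greater than 0):

import time
from waitress import serve

def app(environ, start_response):
    # waitress puts this callable into the environ; poll it at strategic points
    client_disconnected = environ.get("waitress.client_disconnected", lambda: False)

    body = []
    for i in range(10):
        if client_disconnected():
            # the client hung up: stop doing expensive work
            break
        time.sleep(1)
        body.append(b"chunk %d\n" % i)

    start_response("200 OK", [("Content-Type", "text/plain")])
    return body

if __name__ == "__main__":
    serve(app, host="127.0.0.1", port=8080, channel_request_lookahead=5)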
