Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't check StreamingCallSink is alive at close #211

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

illicitonion
Copy link
Contributor

There is nothing in place which guarantees that sending the last
message on the Sink, and closing the Sink, happen atomically.

There is currently a race condition where if the server responds
incredibly quickly, its response may be received between the last
streaming message being sent, and close being called.

It isn't obvious to me what value this check_alive call is bringing; if
there is one, we should try to guarantee its success by either:

  1. Implementing a custom send_all implementation such that the last
    send and close are called atomically, getting rid of the window
    where the race can be lost.
  2. Establishing a happens-before relationship between
    StreamingCallSink.close and whatever code receives the server
    response (I think Call.start_recv_message?). I think this would
    either need a pretty coarse-grained lock, or a pretty complex
    restructuring of code.

But as the check doesn't appear to provide any useful guarantees (and
indeed mostly appears to make losing this race condition problematic),
I figured I'd try just removing it...

There is nothing in place which guarantees that sending the last
message on the Sink, and closing the Sink, happen atomically.

There is currently a race condition where if the server responds
incredibly quickly, its response may be received between the last
streaming message being sent, and close being called.

It isn't obvious to me what value this check_alive call is bringing; if
there is one, we should try to guarantee its success by either:
 1. Implementing a custom send_all implementation such that the last
    send and close are called atomically, getting rid of the window
    where the race can be lost.
 2. Establishing a happens-before relationship between
    StreamingCallSink.close and whatever code receives the server
    response (I think Call.start_recv_message?). I think this would
    either need a pretty coarse-grained lock, or a pretty complex
    restructuring of code.

But as the check doesn't appear to provide any useful guarantees (and
indeed mostly appears to make losing this race condition problematic),
I figured I'd try just removing it...
@BusyJay
Copy link
Member

BusyJay commented Aug 8, 2018

Implementing a custom send_all implementation such that the last
send and close are called atomically, getting rid of the window
where the race can be lost.

SendAll itself will close the sink after sending and flushing the last messages in futures 0.1.x.

...a race condition where if the server responds
incredibly quickly, its response may be received between the last
streaming message being sent, and close being called.

This should be an expected behavior. Sink::close only half close the connection, the other peer can still send any messages freely. Although server is also free to reject receiving any further messages too.

I don't think I understand you completely. It would be helpful if you can explain more details about your problem.

@illicitonion
Copy link
Contributor Author

My assumption is that once the server has sent a message, further sends from the client cease to be allowed.

What I believe we're seeing is that the server is responding so quickly that the receiving half of the channel is closed before we start trying to close the sending half, and so we get an RpcFinished error when we try to close the sending half.

(I can reproduce this semi-deterministically - about 20% of attempts with the right server behaviour set up on a localhost server, and can deterministically not reproduce it after this PR is applied)

illicitonion added a commit to twitter/pants that referenced this pull request Aug 13, 2018
This includes tikv/grpc-rs#211 which stops us
seeing RpcFinished errors when the server responds very quickly to
requests.

Unfortunately, this will slow down building pants, as you'll need to
checkout the git repository, but that should be a once-per-machine cost.
illicitonion added a commit to pantsbuild/pants that referenced this pull request Aug 14, 2018
This includes tikv/grpc-rs#211 which stops us
seeing RpcFinished errors when the server responds very quickly to
requests.

Unfortunately, this will slow down building pants, as you'll need to
checkout the git repository, but that should be a once-per-machine cost.
Eric-Arellano pushed a commit to Eric-Arellano/pants that referenced this pull request Aug 18, 2018
commit f239399
Author: Eric Arellano <earellano@foursquare.com>
Date:   Thu Aug 16 22:34:25 2018 -0700

    Remove stray pdb import

commit 3babfb9
Author: Eric Arellano <earellano@foursquare.com>
Date:   Thu Aug 16 22:23:19 2018 -0700

    Add back Py3 tests that now work

commit a46b827
Author: Eric Arellano <earellano@foursquare.com>
Date:   Thu Aug 16 22:22:52 2018 -0700

    Fix backend python issues

commit 88fa18a
Author: Eric Arellano <earellano@foursquare.com>
Date:   Thu Aug 16 20:08:42 2018 -0700

    Fix backend project_info issues

commit f4d4a2d
Author: Eric Arellano <earellano@foursquare.com>
Date:   Thu Aug 16 19:54:39 2018 -0700

    Actually fix console separator issues

commit d0c83aa
Author: Eric Arellano <earellano@foursquare.com>
Date:   Thu Aug 16 19:38:39 2018 -0700

    Fix issues with backend native

commit 9a1b3b7
Author: Eric Arellano <earellano@foursquare.com>
Date:   Thu Aug 16 19:33:48 2018 -0700

    Fix most issues with backend jvm

commit 62bc2a4
Author: Eric Arellano <earellano@foursquare.com>
Date:   Thu Aug 16 14:47:42 2018 -0700

    Fix logging issue with passing object to format()

commit a4bf759
Author: Eric Arellano <earellano@foursquare.com>
Date:   Thu Aug 16 08:05:07 2018 -0700

    Decode test's execute_task() output

commit b81cd24
Author: Eric Arellano <earellano@foursquare.com>
Date:   Thu Aug 16 08:03:57 2018 -0700

    Fix issues with backend graph_info

commit 5d28b9a
Author: Eric Arellano <earellano@foursquare.com>
Date:   Sun Aug 12 23:13:06 2018 -0700

    Fix console_task's console_separator unicode issues

commit a8674f9
Author: Eric Arellano <earellano@foursquare.com>
Date:   Sun Aug 12 22:48:42 2018 -0700

    Fix backend codegen unicode vs bytes issues

commit c893f8c
Author: Eric Arellano <earellano@foursquare.com>
Date:   Sun Aug 12 22:47:04 2018 -0700

    Return unicode from stdout in pants integration test

commit f4ab07c
Author: Benjy Weinberger <benjyw@gmail.com>
Date:   Thu Aug 16 19:43:41 2018 -0600

    Make requirements on codegen products optional. (pantsbuild#6357)

    We declare product requirements in various places to force
    codegen to run before, e.g., dependency resolution.

    However currently these requirements are mandatory, so
    deregistering all codegen backends will cause the round engine
    to find no producers of these products, and fail.

    This commit changes these requirements to optional, so
    that if there are no codegen backends, we don't attempt
    to depend on their products.

    Also, minor unrelated documentation nits that weren't worth a
    separate pull request.

commit b0a2469
Author: Eric Arellano <ericarellano@me.com>
Date:   Thu Aug 16 12:08:29 2018 -0700

    Python 3 - fixes to get green contrib (pantsbuild#6340)

    All contrib folders should now be green on Python 2 and Python 3, excluding `googlejavaformat` and `python`.

    This fixes multiple issues causing Py3 failures + adds Py3 to Travis.

commit 070a5c5
Author: Daniel Wagner-Hall <dawagner@gmail.com>
Date:   Wed Aug 15 19:30:15 2018 -0700

    Run clippy with nightly rust on CI (pantsbuild#6347)

    This enables all default lints, and a selection of pedantic ones.

    It also fixes all code to be clean with respect to those lints.

    Each commit is independently reviewable.

    It's a shame that the lint selecting can't be shared across creates in some way.

commit 8ce6297
Author: Daniel Wagner-Hall <dawagner@gmail.com>
Date:   Tue Aug 14 20:04:02 2018 -0700

    Use --entry-point not -c when building pex (pantsbuild#6349)

    I deterministically get this error:

    RuntimeError: Ambiguous script specification pants matches multiple entry points:pants = pants.bin.pants_loader:main pants = pants.bin.pants_loader:main

    when building with -c.

    pantsbuild#6267 changed this behaviour,
    but seems broken - this commit allows me to successfully build pexes.

commit 30878d8
Author: Daniel Wagner-Hall <dawagner@gmail.com>
Date:   Tue Aug 14 19:11:19 2018 -0700

    Fix formatting of store.rs (pantsbuild#6350)

    It looks like pantsbuild#6336 merged with bad formatting.

commit 5f2ad63
Author: Daniel Wagner-Hall <dawagner@gmail.com>
Date:   Tue Aug 14 18:40:32 2018 -0700

    Allow pex download path to be overridden (pantsbuild#6348)

commit b7c607f
Author: Stu Hood <stuhood@twitter.com>
Date:   Tue Aug 14 17:40:34 2018 -0700

    Recover from cancelled remote execution RPCs (pantsbuild#6188)

    ### Problem

    It's possible for an individual RPC to be cancelled while an entire Operation continues, but this is not currently handled. See the problem description on pantsbuild#6100.

    ### Solution

    Recover from `RpcStatus::Cancelled` by retrying the same operation.

    ### Result

    Without the Fixes pantsbuild#6100.

commit a96001e
Author: Ity Kaul <ity@users.noreply.github.com>
Date:   Tue Aug 14 16:19:31 2018 -0700

    Download Directory recursively from remote CAS  (pantsbuild#6336)

    ### Problem

    We dont currently have the ability to download a `Directory`, and its contents, recursively.

    ### Solution

    Add `ensure_local_has_recursive_directory()` to be able to download  a `Directory`, and its contents, recursively.

commit e391cd1
Author: Daniel Wagner-Hall <dawagner@gmail.com>
Date:   Tue Aug 14 16:11:22 2018 -0700

    Process execution: Create symlink to JDK on demand (pantsbuild#6346)

    This creates a symlink `.jdk` pointing at the passed JDK, if one was
    passed.

    This is a hack to allow for a remote execution platform to provide a
    pointer at its pre-installed JDK without needing to pollute absolute
    paths into the action definition.

    Long-term, we want the JDK to be part of the input directory for the
    action, but that's a lot of work, and we want to experiment with
    pre-installed JDKs in the mean time.

commit d79b7c3
Author: Daniel Wagner-Hall <dawagner@gmail.com>
Date:   Mon Aug 13 21:26:08 2018 -0700

    Simplify ExecuteProcessRequest construction (pantsbuild#6345)

    Remove create_from_snapshot and create_with_empty_snapshot - they don't
    pull their weight.

    Allow default values for all defaultable values.

commit 33674ce
Author: Daniel Wagner-Hall <dawagner@gmail.com>
Date:   Mon Aug 13 18:57:20 2018 -0700

    Use forked version of grpcio (pantsbuild#6344)

    This includes tikv/grpc-rs#211 which stops us
    seeing RpcFinished errors when the server responds very quickly to
    requests.

    Unfortunately, this will slow down building pants, as you'll need to
    checkout the git repository, but that should be a once-per-machine cost.

commit ccdf523
Author: Daniel Wagner-Hall <dawagner@gmail.com>
Date:   Mon Aug 13 18:36:28 2018 -0700

    ci.sh uses positive rather than negative flags (pantsbuild#6342)

    This makes it much easier to reason about what's happening, and keeps
    .travis.yml more concise.

commit 68f4b59
Author: Daniel Wagner-Hall <dawagner@gmail.com>
Date:   Mon Aug 13 16:12:44 2018 -0700

    Merge directories with identical files (pantsbuild#6343)

    This is useful for merging Snapshots for Targets if multiple targets own
    the same file. This should be rare, but does happen (e.g. multiple
    overlapping resource targets).

commit e189acd
Author: Daniel Wagner-Hall <dawagner@gmail.com>
Date:   Mon Aug 13 09:50:23 2018 -0700

    Set chunk size in process_executor (pantsbuild#6337)

commit 20d267f
Author: Paul <yaup@cs.washington.edu>
Date:   Sun Aug 12 14:41:26 2018 -0700

    added fullpath to fix path concat issue with files when not in git root (pantsbuild#6331)

    ### Problem
    As described in pantsbuild#6301 , when the git root and the build root is not the same, `changed` functionalities can be messed up because `git` returns path relative to the working directory, which produces bad paths when it is [concatenated with the git root](https://github.com/pantsbuild/pants/blob/c9f3af460a7504e082931a4b5a668b66f0cd4ed0/src/python/pants/scm/git.py#L161).

    ### Solution
    By adding the `--full-name` tag to [`git ls-files`](https://github.com/pantsbuild/pants/blob/c9f3af460a7504e082931a4b5a668b66f0cd4ed0/src/python/pants/scm/git.py#L178), the file concatenation is done properly

    ### Result
    Doing `./pants list --changed...` will concatenate file names correctly when it is not run from the git root.
CMLivingston pushed a commit to CMLivingston/pants that referenced this pull request Aug 27, 2018
This includes tikv/grpc-rs#211 which stops us
seeing RpcFinished errors when the server responds very quickly to
requests.

Unfortunately, this will slow down building pants, as you'll need to
checkout the git repository, but that should be a once-per-machine cost.
@BusyJay
Copy link
Member

BusyJay commented Feb 20, 2019

Can you add a test case for it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants