Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Undefined behavior for using WAMP features not agreed during feature negotiation #1605

Open
tomelliff opened this issue Dec 12, 2022 · 9 comments

Comments

@tomelliff
Copy link

tomelliff commented Dec 12, 2022

I have a client talking to a WAMP Basic Profile server (ran by a third party). This doesn't support call cancellation so I think a call made like this:

# ...
    async def call_endpoint(self, endpoint):
        try:
            resp = await self.call(endpoint, timeout=ENDPOINT_CALL_TIMEOUT_SECONDS * 1000)
        except Exception as e:  # TODO: Tighten this exception
            print(f"Error calling {endpoint}: {e}")
            raise

would ignore the timeout option and carry on running past the timeout?

I didn't actually notice the timeout option on the call method on my first pass so just wrapped it in asyncio.wait_for with a timeout like this:

# ...
    async def call_endpoint(self, endpoint):
        try:
            resp = await asyncio.wait_for(self.call(endpoint), timeout=ENDPOINT_CALL_TIMEOUT_SECONDS)
        except (asyncio.TimeoutError) as e:
            print(f"Timeout calling {endpoint}: {e}")
        except Exception as e:  # TODO: Tighten this exception
            print(f"Error calling {endpoint}: {e}")
            raise

But that's throwing the occasional bad_protocol error. I didn't know what the exact issue is but our third party gave me access to the server logs which showed a 49 error code relating to the cancel call and it does always align with being thrown on the call after a 20s timed out call.

What's the best way to handle timeouts in this situation? I don't want to catch the bad_protocol exception as is in case the client or server changes in a way that this hides another issue. Can I at least see why the bad_protocol exception is being thrown in the exception somehow? I also don't want to allow the client to block indefinitely on the server because we previously saw consistent failures where the client just hangs forever waiting for a response from the server that never arrives.

Should Autobahn be detecting this lack of support for call cancellation and just silently dropping the task if cancelled in this way? Or throw another exception in some way that can be handled?

I noticed #1127 but while I think there's maybe some similarity in the reply in that issue I don't really see the similarity in the opening post.

@oberstet
Copy link
Contributor

What's the best way to handle timeouts in this situation?

Best would be to switch to a WAMP router that support CANCEL https://wamp-proto.org/wamp_latest_ietf.html#name-call-canceling

Should Autobahn be detecting this lack of support for call cancellation

Autobahn must announce all features it supports

https://wamp-proto.org/wamp_latest_ietf.html#section-4.1.1-10
https://wamp-proto.org/wamp_latest_ietf.html#section-4.1.2-7
https://wamp-proto.org/wamp_latest_ietf.html#name-feature-announcement

and the router must reply with a subset of the client announced feature, a subset further limited by the features the router supports or wants to support for that client

once features are negotiated, any use of a feature that wasn't agreed ... yeah, well: either should be a protocol error or should be ignored ... but which? tbh, I'm not sure we have appropriate spec text, and I'm not sure what Autobahn does in specific cases (like call canceling) :(

marking this a "bug" for further investigation ... even though it is unclear what exactly is and should happen ...

@oberstet oberstet changed the title Cancellation via asyncio.wait_for throws bad protocol error when server does not support call cancellation Undefined behavior for using WAMP features not agreed during feature negotiation Dec 12, 2022
@tomelliff
Copy link
Author

tomelliff commented Dec 12, 2022

Thanks for the quick response.

I don't have control of the server side but I can pass that information back to the third party as a suggestion. AFAIK this is written in C++ on minimal hardware so that might be why they're running a more minimal implementation and not handling things like call cancellation. Possibly also a concern around code footprint as well because I think we're the first external users of this API.

Could you clarify what happens when an Autobahn client has a timeout via the call normally (eg the first block in my first post) but the server doesn't support it? Does the client just ignore it and carry on waiting? Or does it throw a protocol error like is happening with the asyncio.wait_for call?

And does the client receive enough information that it can say what the protocol error was on? Right now I just get a bad protocol exception and no further information as to what caused it. That exception also gets thrown on the following call rather than getting thrown ahead of, or as part of, the TimeoutError which is more what I'd expect and would give me something I could catch in the same way.

@oberstet
Copy link
Contributor

oberstet commented Dec 12, 2022

by deault, autobahn will announce

# set client role features supported and announced

support for call cancelling

'caller': RoleCallerFeatures(

but then use the feature without qualification (regardless of weather it was indeed negotiated as active)

def canceller(d):


the correct behavior for autobahn might be: if a feature wasn't actually negotiated as active during negotiation with the router, then throw a client-side exception without sending anything to the server when the feature is used

in this case it would mean: you get an exception when calling "cancel" on the client side.

should we want to implement that behavior, we need to take care: since canceling a call cannot be done at the router, the initiation of a cancel should fail for the client, but what about the still on-going remote call? if we fail the client side of the call when the cancel fails, then what about the result we'll receive later? we could throw it away client side (but this needs more code, since we check for results for no-longer active calls and bail out). we could also not fail the call but silently ignore the request to initiate canceling remotely ...

@KSDaemon
Copy link

Well... The CALL may be canceled on different sides. But in the context of the original question — i think that the peer should just check that dealer supports the required feature before use.
E.g.:

  • Do not allow to call of any RPC if the router provides only broker role
  • Do not allow sending payload in PPT mode, if the router doesn't announce payload_passthru_mode feature

In this case — the client side immediately receives the error, the wamp connection remains intact.

@oberstet
Copy link
Contributor

i think that the peer should just check that dealer supports the required feature before use.

the problem is: the use of the specific subfeature CANCEL happens when the original CALL feature already is in use

IOW: the client cancel is a function on the deferred/future for the already issued call.

In this case — the client side immediately receives the error, the wamp connection remains intact.

ok! that means:

if we fail the client side of the call when the cancel fails, then what about the result we'll receive later?

  1. we could throw it away client side (but this needs more code, since we check for results for no-longer active calls and bail out).
  2. we could also not fail the call but silently ignore the request to initiate canceling remotely ...

so you argue for 1., right?

@oberstet
Copy link
Contributor

oberstet commented Dec 12, 2022

The CALL may be canceled on different sides.

autobahn(python) currently only implements router-side call canceling - not (pure or additional) client-side call canceling. callee side call canceling (the 3rd of the 3 possible sides) is callee (and router) business. (and that should be possible in autobahn

invoked = self._invocations[msg.request]
)

"cancel" is done in autobahn via

def canceller(d):

which uses

https://github.com/crossbario/txaio/blob/e00e3f079139dc32805fc595044138e1c0114f40/txaio/tx.py#L345

which (for twisted) points to

https://docs.twistedmatrix.com/en/stable/api/twisted.internet.defer.Deferred.html#cancel

which doesn't take arguments.

so for an autobahn implementation, this would mean expanding txaio to allow to eg do d.cancel(timeout=...), or to correctly process the option already present for the initial call in

class CallOptions(object):

neither is the case currently.

@KSDaemon
Copy link

KSDaemon commented Dec 13, 2022

O, thanks for description how it works under the hood

the problem is: the use of the specific subfeature CANCEL happens when the original CALL feature already is in use

IOW: the client cancel is a function on the deferred/future for the already issued call.

if we fail the client side of the call when the cancel fails, then what about the result we'll receive later?

  1. we could throw it away client side (but this needs more code, since we check for results for no-longer active calls and bail out).
  2. we could also not fail the call but silently ignore the request to initiate canceling remotely ...

so you argue for 1., right?

Well, I think both points are legitimate and just complement each other.

The client wants to cancel the call. It just issues cancel(). Maybe just wrapping it in try-catch. But what mean if cancel fails for client code? I think not much. After issuing cancel() client just doesn't care about results anymore. And that should work transparently for the client independently of what router is in use, do other peers supports call canceling or not. That's the nature of WAMP: one peer doesn't know much and actually doesn't care much about other peers: where they are running and how.
And call canceling is one of those features that can be implemented in a progressive degradation manner.
So even if other peers do not support it we should handle that on our side. So for client, the logic remains the same for feature-rich router or basic one.

What if the client issues call canceling, and we return an error as a feature not supported — okay. What's then? Client still need to wait for the results? Or we will receive results sometime in the future and fail the whole connection because there is no related call? I think that's not so convenient for end user and there is no reason to fail the whole connection.

@oberstet
Copy link
Contributor

oberstet commented Dec 13, 2022

Thanks for your thoughts - makes sense.

That's the nature of WAMP: one peer doesn't know much and actually doesn't care much about other peers: where they are running and how.

Yes, I agree, that what always what we aimed for in WAMP: make network and deployment details disappear from the application layer. It's just a glue between distributed app code.

What if the client issues call canceling, and we return an error as a feature not supported — okay. What's then? Client still need to wait for the results? Or we will receive results sometime in the future and fail the whole connection because there is no related call? I think that's not so convenient for end user and there is no reason to fail the whole connection.

I agree, raising "not supported" and "still need to wait" or "fail the whole connection" when the client explicitly triggered "cancel" and clearly wanted to go on with different tasks seems unhelpful. From app dev point of view.

We should aim to do what is most helpful for the app developer.


Trying to summarize the desired behavior:

  1. the app developer connecting the client is announcing "enable/support feature X", eg X == "remote call canceling"
  2. assume the peer answers with "X is enabled/supported" indicating feature support (A)
  3. later, when app dev uses X, it turns out the other peer doesn't actually support X (in general, or at this time)
  4. in case of X == "remote call canceling", the canceled call (with the cancel issued that didn't work) is failed client side nevertheless .. it was actively canceled after all. the client side code for the call hence fails with an appropriate exception (B) as it would if the remote cancel would have worked
  5. anything the peer might still return in the future (C) is simply ignored and thrown away client side

I have marked the 3 remaining details to discuss for nailing everything IMO:

(A): feature announcement in WAMP is hence to be understood purely in an "indicative", not "binding" manner. the peer is just announcing it is not actively opposing/denying a feature X. when a peer announces feature X, that doesn't necessarily mean it is supported all the time => if so, I guess we should add some text to the spec?

(B): autobahn will raise an app code exception from a call that was canceled where the remote-call-cancel worked technically. should the exception raised when remote-call-cancel didn't work technically (but still is canceled client side) be eg a subclass of the former current one? => we'll need that for the code in autobahn

(C): autobahn currently rigorously matches incoming messages with "expected return slots" like for an issued but still ongoing call. if no slot is there, a message received for that slots ID is a protocol violation. I definitely want to preserve that. the problem then is: a canceled call is removed from the slots, and when the peer then sends an reply, this will run into "protocol error" ... if the slots is kept open, for how long? otherwise it's a memory leak ... mmh

@tomelliff
Copy link
Author

I don't know enough about WAMP/Autobahn to really weigh in here on your design discussion but I was curious about why I receive the error on the follow up call and not when the previous call is cancelled?

I'd expect to have something like this:

sequenceDiagram
    Client->>+Server: Call RPC A
    Note over Server: Server blocks
    Note over Client: Client times out the call of RPC A
    Client->>+Server: Cancel RPC A
    Server-->>+Client: Return bad protocol for unsupported cancellation
    Note over Client: Client throws bad protocol error

but it feels more like this:

sequenceDiagram
    Client->>+Server: Call RPC A
    Note over Server: Server blocks
    Note over Client: Client times out the call of RPC A
    Client->>+Server: Cancel RPC A
    Client->>+Server: Call RPC B
    Server-->>+Client: Return bad protocol for unsupported cancellation
    Note over Client: Client throws bad protocol error

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants