diff --git a/A46-xds-nack-semantics-improvement.md b/A46-xds-nack-semantics-improvement.md new file mode 100644 index 000000000..4e98336d1 --- /dev/null +++ b/A46-xds-nack-semantics-improvement.md @@ -0,0 +1,124 @@ +A46: xDS NACK Semantics Improvement +---- +* Author(s): Mark D. Roth (markdroth) +* Approver: ejona86, dfawley +* Status: Implemented +* Implemented in: C-core +* Last updated: 2021-09-07 +* Discussion at: https://groups.google.com/g/grpc-io/c/gFYDcWIu9B8 + +## Abstract + +This proposal clarifies xDS NACK semantics used in gRPC. + +## Background + +The [xDS +spec](https://www.envoyproxy.io/docs/envoy/latest/api-docs/xds_protocol) +says that, at the wire protocol level, a NACK is sent on a per-response +basis, not a per-resource basis. Although not explicitly stated +in the spec, this implies that the client must reject (i.e., not +actually use) even the valid resources from that response, since that +would allow the server to know the client's actual state (although see +[Rationale](#rationale) below). That is the behavior that gRPC currently +implements. + +However, this approach causes problems in a case where a single resource +is invalid and it causes the client to ignore *all* of the resources in +the update, especially for LDS and CDS, where the server must send all +resources in every response. For example, if there is a single invalid +Cluster resource in a CDS response, all of the Cluster resources will +be rejected. If this happens on the first CDS response after a client +starts up (i.e., when the client has not already accepted previous +versions of the CDS resources that it can continue to use), that will +cause the client to have no valid CDS resources at all, which means +that the problem will prevent *all* clusters from functioning instead of +affecting only the invalid resource. This is particularly problematic +now that gRPC shares its XdsClient between channels, because a single +invalid Cluster resource can basically cause all of the client's channels +to stop working all at once. + +This behavior makes it challenging for xDS servers to safely deploy +changes in environments in which the clients are not centrally controlled. +For example, older gRPC clients support only the `ROUND_ROBIN` +LB policy, as per [gRFC A27](A27-xds-global-load-balancing.md), +but newer clients now support the `RING_HASH` policy, as per [gRFC +A42](A42-xds-ring-hash-lb-policy.md). If a Cluster resource specifies +an unsupported LB policy, clients will consider the resource invalid, +which will cause them to NACK the response. So if an xDS server cannot +be sure that all of its clients have been upgraded to a version that +supports the `RING_HASH` policy, then it cannot safely send a Cluster +resource configuring that policy, because that change would cause older +clients to stop functioning. + +Note that this document is not attempting to solve the general problem +that an individual resource that sets a supported field to an unsupported +value will be considered invalid; that behavior is intentional, and there +is nothing we can do to prevent the need for clients to be upgraded to +use a configuration resource that requires features that they do not yet +support. However, this document *is* intending to solve the problem of +that one invalid resource causing other *valid* resources to be ignored. + +This document proposes a behavior change to address this problem. + +### Related Proposals: +* [gRFC A27: xDS-Based Global Load Balancing](A27-xds-global-load-balancing.md) +* [gRFC A42: xDS Ring Hash LB Policy](A42-xds-ring-hash-lb-policy.md) + +## Proposal + +We will change gRPC's behavior such that when a response is NACKed, gRPC +will still use all valid resources from the response; it will ignore +only the invalid resources. + +Note that the xDS wire protocol behavior is not changing at all; the +protocol currently still requires NACKs to be done on a per-response +basis instead of a per-resource basis (although the latter is something +that is expected to be added to the protocol in the future). The only +change is that the client will actually use the valid resources in the +response. + +### Temporary environment variable protection + +N/A + +## Rationale + +Note that gRPC's original behavior was intended to make sure that the +control plane had a clear picture of what configuration the client is +using. However, it turns out that Envoy currently has cases where it +will apply some valid resources from a response that it NACKs, which +means that the control plane *already* did not have that kind of clear +picture. To ensure that the control plane will have that kind of clear +picture, there will be subsequent work to extend the xDS protocol to +allow per-resource NACKing instead of per-response NACKing, although +that will be a broader project. Note that the CSDS service in the gRPC +xDS client (see [gRFC A40](A40-csds-support.md)) can be used to get the +current state of the resources in the client, which is likely to be more +accurate than the view from the xDS server. + +We considered the following alternatives: + +- We could have just waited for the xDS protocol changes that will allow + per-resource NACKing instead of per-response NACKing. However, that + is likely to take more work to design and implement in both clients + and control planes, and we wanted to do something quickly to alleviate + the problem described above. + +- For the case of unknown LB policies specifically, we could fall back + to `ROUND_ROBIN` instead of considering the resource invalid. + However, this is not semantically correct behavior; it could cause + unexpected load on servers and a large number of unnecessary connections + to backends. Also, while it would address the specific case that + triggered this design, it would not address the general case of seeing + an unsupported value in a supported field. + +## Implementation + +C-core implementation: https://github.com/grpc/grpc/pull/27276 + +TODO: Add info for other languages. + +## Open issues (if applicable) + +N/A