From ad6a154a023966b17e2e300864e223e05896ebaa Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Fri, 3 Sep 2021 12:11:09 -0700 Subject: [PATCH 1/5] A46: xDS NACK Semantics Improvement --- A46-xds-nack-semantics-improvement.md | 119 ++++++++++++++++++++++++++ 1 file changed, 119 insertions(+) create mode 100644 A46-xds-nack-semantics-improvement.md diff --git a/A46-xds-nack-semantics-improvement.md b/A46-xds-nack-semantics-improvement.md new file mode 100644 index 000000000..d1ad75f53 --- /dev/null +++ b/A46-xds-nack-semantics-improvement.md @@ -0,0 +1,119 @@ +A46: xDS NACK Semantics Improvement +---- +* Author(s): Mark D. Roth (markdroth) +* Approver: ejona86, dfawley +* Status: {Draft, In Review, Ready for Implementation, Implemented} +* Implemented in: +* Last updated: 2021-09-03 +* Discussion at: (filled after thread exists) + +## Abstract + +This proposal clarifies xDS NACK semantics used in gRPC. + +## Background + +The [xDS +spec](https://www.envoyproxy.io/docs/envoy/latest/api-docs/xds_protocol) +says that, at the wire protocol level, a NACK is sent on a per-response +basis, not a per-resource basis. Although not explicitly stated +in the spec, this implies that the client must reject (i.e., not +actually use) even the valid resources from that response, since that +would allow the server to know the client's actual state (although see +[Rationale](#rationale) below). That is the behavior that gRPC currently +implements. + +However, this approach causes problems in a case where a single resource +is invalid and it causes the client to ignore *all* of the resources in +the update, especially for LDS and CDS, where the server must send all +resources in every response. For example, if there is a single invalid +Cluster resource in a CDS response, all of the Cluster resources will +be rejected. If this happens on the first CDS response after a client +starts up (i.e., when the client has not already accepted previous +versions of the CDS resources that it can continue to use), that will +cause the client to have no valid CDS resources at all, which means +that the problem will prevent *all* clusters from functioning instead of +affecting only the invalid resource. This is particularly problematic +now that gRPC shares its XdsClient between channels, because a single +invalid Cluster resource can basically cause all of the client's channels +to stop working all at once. + +This behavior makes it challenging for xDS servers to safely deploy +changes in environments in which the clients are not centrally controlled. +For example, older gRPC clients support only the `ROUND_ROBIN` +LB policy, as per [gRFC A27](A27-xds-global-load-balancing.md), +but newer clients now support the `RING_HASH` policy, as per [gRFC +A42](A42-xds-ring-hash-lb-policy.md). If a Cluster resource specifies +an unsupported LB policy, clients will consider the resource invalid, +which will cause them to NACK the response. So if an xDS server cannot +be sure that all of its clients have been upgraded to a version that +supports the `RING_HASH` policy, then it cannot safely send a Cluster +resource configuring that policy, because that change would cause older +clients to stop functioning. + +Note that this document is not attempting to solve the general problem +that an individual resource that sets a supported field to an unsupported +value will be considered invalid; that behavior is intentional, and there +is nothing we can do to prevent the need for clients to be upgraded to +use a configuration resource that requires features that they do not yet +support. However, this document *is* intending to solve the problem of +that one invalid resource causing other *valid* resources to be ignored. + +This document proposes a behavior change to address this problem. + +### Related Proposals: +* [gRFC A27: xDS-Based Global Load Balancing](A27-xds-global-load-balancing.md) +* [gRFC A42: xDS Ring Hash LB Policy](A42-xds-ring-hash-lb-policy.md) + +## Proposal + +We will change gRPC's behavior such that when a response is NACKed, gRPC +will still use all valid resources from the response; it will ignore +only the invalid resources. + +Note that the xDS wire protocol behavior is not changing at all; the +protocol currently still requires NACKs to be done on a per-response +basis instead of a per-resource basis (although the latter is something +that is expected to be added to the protocol in the future). The only +change is that the client will actually use the valid resources in the +response. + +### Temporary environment variable protection + +N/A + +## Rationale + +Note that gRPC's original behavior was intended to make sure that the +control plane had a clear picture of what configuration the client is +using. However, it turns out that Envoy currently has cases where it +will apply some valid resources from a response that it NACKs, which +means that the control plane *already* did not have that kind of clear +picture. To ensure that the control plane will have that kind of clear +picture, there will be subsequent work to extend the xDS protocol to +allow per-resource NACKing instead of per-response NACKing, although +that will be a broader project. + +We considered the following alternatives: + +- We could have just waited for the xDS protocol changes that will allow + per-resource NACKing instead of per-response NACKing. However, that + is likely to take more work to design and implement in both clients + and control planes, and we wanted to do something quickly to alleviate + the problem described above. + +- For the case of unknown LB policies specifically, we could fall back + to `ROUND_ROBIN` instead of considering the resource invalid. + However, this is not semantically correct behavior; it could cause + unexpected load on servers and a large number of unnecessary connections + to backends. Also, while it would address the specific case that + triggered this design, it would not address the general case of seeing + an unsupported value in a supported field. + +## Implementation + +[A description of the steps in the implementation, who will do them, and when. If a particular language is going to get the implementation first, this section should list the proposed order.] + +## Open issues (if applicable) + +N/A From 1d5e4810f5af2ccec8f5713890c404dbe2aef6a1 Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Fri, 3 Sep 2021 14:23:20 -0700 Subject: [PATCH 2/5] add discussion link --- A46-xds-nack-semantics-improvement.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/A46-xds-nack-semantics-improvement.md b/A46-xds-nack-semantics-improvement.md index d1ad75f53..c39939fc3 100644 --- a/A46-xds-nack-semantics-improvement.md +++ b/A46-xds-nack-semantics-improvement.md @@ -5,7 +5,7 @@ A46: xDS NACK Semantics Improvement * Status: {Draft, In Review, Ready for Implementation, Implemented} * Implemented in: * Last updated: 2021-09-03 -* Discussion at: (filled after thread exists) +* Discussion at: https://groups.google.com/g/grpc-io/c/gFYDcWIu9B8 ## Abstract From c7ec4c99730af14dad2282df1547271d4e3316f5 Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Fri, 3 Sep 2021 14:26:01 -0700 Subject: [PATCH 3/5] add note about CSDS --- A46-xds-nack-semantics-improvement.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/A46-xds-nack-semantics-improvement.md b/A46-xds-nack-semantics-improvement.md index c39939fc3..c08f5f64e 100644 --- a/A46-xds-nack-semantics-improvement.md +++ b/A46-xds-nack-semantics-improvement.md @@ -92,7 +92,10 @@ means that the control plane *already* did not have that kind of clear picture. To ensure that the control plane will have that kind of clear picture, there will be subsequent work to extend the xDS protocol to allow per-resource NACKing instead of per-response NACKing, although -that will be a broader project. +that will be a broader project. Note that the CSDS service in the gRPC +xDS client (see [gRFC A40](A40-csds-support.md)) can be used to get the +current state of the resources in the client, which is likely to be more +accurate than the view fromthe xDS server. We considered the following alternatives: From f88fd306dd53c700813eea28ea057d1645d98e6b Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Tue, 7 Sep 2021 10:26:24 -0700 Subject: [PATCH 4/5] fix typo --- A46-xds-nack-semantics-improvement.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/A46-xds-nack-semantics-improvement.md b/A46-xds-nack-semantics-improvement.md index c08f5f64e..ac0294d45 100644 --- a/A46-xds-nack-semantics-improvement.md +++ b/A46-xds-nack-semantics-improvement.md @@ -95,7 +95,7 @@ allow per-resource NACKing instead of per-response NACKing, although that will be a broader project. Note that the CSDS service in the gRPC xDS client (see [gRFC A40](A40-csds-support.md)) can be used to get the current state of the resources in the client, which is likely to be more -accurate than the view fromthe xDS server. +accurate than the view from the xDS server. We considered the following alternatives: From be44a7a7706eeadad7fcbae35a5c8e4800735ebe Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Tue, 7 Sep 2021 16:00:48 -0700 Subject: [PATCH 5/5] add link to C-core implementation PR --- A46-xds-nack-semantics-improvement.md | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/A46-xds-nack-semantics-improvement.md b/A46-xds-nack-semantics-improvement.md index ac0294d45..4e98336d1 100644 --- a/A46-xds-nack-semantics-improvement.md +++ b/A46-xds-nack-semantics-improvement.md @@ -2,9 +2,9 @@ A46: xDS NACK Semantics Improvement ---- * Author(s): Mark D. Roth (markdroth) * Approver: ejona86, dfawley -* Status: {Draft, In Review, Ready for Implementation, Implemented} -* Implemented in: -* Last updated: 2021-09-03 +* Status: Implemented +* Implemented in: C-core +* Last updated: 2021-09-07 * Discussion at: https://groups.google.com/g/grpc-io/c/gFYDcWIu9B8 ## Abstract @@ -115,7 +115,9 @@ We considered the following alternatives: ## Implementation -[A description of the steps in the implementation, who will do them, and when. If a particular language is going to get the implementation first, this section should list the proposed order.] +C-core implementation: https://github.com/grpc/grpc/pull/27276 + +TODO: Add info for other languages. ## Open issues (if applicable)