A61: IPv4 and IPv6 Dualstack Backend Support

Abstract

gRPC clients currently support both IPv4 and IPv6. However, most implementations do not have support for individual backends that have both an IPv4 and IPv6 address. It is desirable to natively support such backends in a way that correctly interacts with load balancing.

Background

For background on the interaction between the resolver and LB policy in the gRPC client channel, see Load Balancing in gRPC.

In most gRPC implementations, the resolver returns a flat list of addresses, where each address is assumed to be a different endpoint, and the LB policy is expected to balance the load across those endpoints. The list of addresses can include both IPv4 and IPv6 addresses, but it has no way to represent the case where two addresses point to the same endpoint, so the LB policy will treat them as two different endpoints, sending each one its own share of the load. However, the actual desired behavior in this case is for the LB policy to use only one of the addresses for each endpoint at any given time. (Note that gRPC Java already supports this.)

Also, when connecting to an endpoint with multiple addresses, it is desirable to use the "Happy Eyeballs" algorithm described in RFC-8305 to minimize the time it takes to establish a working connection by parallelizing connection attempts in a reasonable way. Currently, all gRPC implementations perform connection attempts in a completely serial manner in the pick_first LB policy.

This work is being done in conjunction with an effort to add multiple addresses per endpoint in xDS. We will support the new xDS APIs being added for that effort as well. Note that this change has implications for session affinity behavior in xDS.

Related Proposals:

Proposal

This proposal includes several parts:

  • Allow resolvers to return multiple addresses per endpoint.
  • Implement Happy Eyeballs. This will be done in the pick_first LB policy, which will become the universal leaf policy. It will also need to support client-side health checking. In Java and Go, the pick_first logic will be moved out of the subchannel and into the pick_first policy itself.
  • In xDS, we will support the new fields in EDS to indicate multiple addresses per endpoint, and we will extend the stateful session affinity mechanism to support such endpoints.

Allow Resolvers to Return Multiple Addresses Per Endpoint

Instead of returning a flat list of addresses, the resolver will be able to return a list of endpoints, each of which can have multiple addresses.

Because DNS does not have a way to indicate which addresses are associated with the same endpoint, the DNS resolver will return each address as a separate endpoint.

Attributes Returned by the Resolver

All gRPC implementations have a mechanism for the resolver to return arbitrary attributes to be passed to the LB policies. Attributes can be set at the top level, which is used for things like passing the XdsClient instance from the resolver to the LB policies (as described in gRFC A27), or per-address, which is used for things like passing hierarchical address information down through the LB policy tree (as described in gRFC A56).

The exact semantics for these attributes currently vary across languages. This proposal does not attempt to define unified semantics for these attributes, although another proposal may attempt that in the future. For now, this proposal defines only the changes to this interface that are required to support multiple addresses per endpoint.

Specifically, the resolver API must provide a mechanism for passing attributes on a per-endpoint basis. Most of the attributes that are currently per-address will now be per-endpoint instead. Implementations may also support per-address attributes, but this is not required.

Happy Eyeballs in the pick_first LB Policy

The pick_first LB policy currently attempts to connect to each address serially, stopping at the first one that succeeds. We will change it to instead use the Happy Eyeballs algorithm on the initial pass through the address list. Specifically:

  • As per RFC-8305 section 5, the default Connection Attempt Delay value is 250ms. Implementations may provide a channel arg to control this value, but it must stay within the recommended lower bound of 100ms and upper bound of 2s: any value lower than 100ms should be treated as 100ms, and any value higher than 2s should be treated as 2s.
  • Whenever we start a connection attempt on a given address, if it is not the last address in the list, we start a timer for the Connection Attempt Delay.
  • If the timer fires before the connection attempt completes, we will start a connection attempt on the next address in the list. Note that we do not interrupt the previous connection attempt that is still in flight; at this point, we will have in-flight connection attempts to multiple addresses at once. Also note that, as per the previous bullet, we will once again start a timer if this new address is not the last address in the list.
  • The first time any connection attempt succeeds (i.e., the subchannel reports READY, which happens after all handshakes are complete), we choose that connection. If there is a timer running, we cancel the timer.
  • We will wait for at least one connection attempt on every address to fail before we consider the first pass to be complete. At that point, we will request re-resolution. As per gRFC A62, we will report TRANSIENT_FAILURE state and will continue trying to connect. We will stay in TRANSIENT_FAILURE until either (a) we become connected or (b) the LB policy is destroyed by the channel shutting down or going IDLE.
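
The following is a minimal, illustrative sketch of this first pass in Go. The subchannel interface, its Connect method, and the channel-based signaling are hypothetical stand-ins rather than any gRPC implementation's actual API; the sketch models only the staggered-attempt behavior described above (clamping of the Connection Attempt Delay, staggered attempts, first success wins, first pass complete once every address has failed at least once), not the subchannel state handling or subsequent-pass behavior described below.

package pickfirst

import "time"

// connectionAttemptDelay clamps a configured Connection Attempt Delay to
// the bounds recommended by RFC-8305 section 5, defaulting to 250ms.
func connectionAttemptDelay(configured time.Duration) time.Duration {
	const (
		defaultDelay = 250 * time.Millisecond
		minDelay     = 100 * time.Millisecond
		maxDelay     = 2 * time.Second
	)
	switch {
	case configured == 0:
		return defaultDelay
	case configured < minDelay:
		return minDelay
	case configured > maxDelay:
		return maxDelay
	}
	return configured
}

// subchannel is a hypothetical stand-in for a real subchannel handle.
type subchannel interface {
	// Connect performs a connection attempt and reports whether it succeeded.
	Connect() bool
}

// happyEyeballsFirstPass runs the initial pass over the flattened address
// list: it starts attempts staggered by the Connection Attempt Delay and
// returns the index of the first subchannel to become READY, or -1 once
// every address has had at least one failed attempt.
func happyEyeballsFirstPass(scs []subchannel, delay time.Duration) int {
	if len(scs) == 0 {
		return -1
	}
	ready := make(chan int, len(scs))
	failed := make(chan struct{}, len(scs))
	next := 0
	var timer <-chan time.Time
	start := func() {
		i := next
		next++
		// Start the timer only if this is not the last address in the list.
		if next < len(scs) {
			timer = time.After(delay)
		} else {
			timer = nil
		}
		go func() {
			if scs[i].Connect() {
				ready <- i
			} else {
				failed <- struct{}{}
			}
		}()
	}
	start()
	failures := 0
	for {
		select {
		case i := <-ready:
			return i // First attempt to succeed wins; any pending timer is dropped.
		case <-failed:
			failures++
			if failures == len(scs) {
				return -1 // First pass complete: report TRANSIENT_FAILURE, request re-resolution.
			}
		case <-timer:
			start() // Previous attempt still in flight; start the next address in parallel.
		}
	}
}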

If the first pass completes without a successful connection attempt, we will switch to a mode where we keep trying to connect to all addresses at all times, with no regard for the order of the addresses. Each individual subchannel will provide backoff behavior, reporting TRANSIENT_FAILURE while in backoff and then IDLE when backoff has finished. The pick_first policy will therefore automatically request a connection whenever a subchannel reports IDLE. We will count the number of connection failures, and when that number reaches the number of subchannels, we will request re-resolution; note that because the backoff state will differ across the subchannels, this may mean that we have seen multiple failures of a single subchannel and no failures from another subchannel, but this is a close enough approximation and very simple to implement.

Note that every time the LB policy receives a new address list, it will start an initial Happy Eyeballs pass over the new list, even if some of the subchannels are not actually new due to their addresses having been present on both the old and new lists. This means that on the initial pass through the address list for a subsequent address list update, when pick_first decides to start a connection attempt on a given subchannel (whether because it is the first subchannel in the list or because the timer fired before the previous address' connection attempt completed), that subchannel may not be in state IDLE, which is the only state in which a connection attempt may be requested. (Note: This same problem may occur in C-core even on the first address list update, due to subchannels being shared with other channels.) Therefore, when we are ready to start a connection attempt on a given subchannel:

  • If the subchannel is in state IDLE, we request a connection attempt immediately. If it is not the last subchannel in the list, we will start the timer; if it is the last subchannel in the list, we will wait for the attempt to complete.
  • If the subchannel is in state CONNECTING, we do not need to actually request a connection, but we will treat it as if we did. If it is not the last subchannel in the list, we will start the timer; if it is the last subchannel in the list, we will wait for the attempt to complete.
  • If the subchannel is in state TRANSIENT_FAILURE, then we know that it is in backoff due to a recent connection attempt failure, so we treat it as if we have already made a connection attempt on this subchannel, and we will immediately move on to the next subchannel.
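
To make the state handling above concrete, here is an illustrative helper in Go, with hypothetical state and action names, that maps a subchannel's current state to the action pick_first takes when the pass reaches it:

package pickfirst

// connectivityState mirrors the gRPC connectivity states relevant here.
type connectivityState int

const (
	stateIdle connectivityState = iota
	stateConnecting
	stateReady
	stateTransientFailure
)

// attemptAction describes what pick_first does when the Happy Eyeballs
// pass reaches a subchannel that may already be in a non-IDLE state.
type attemptAction int

const (
	// IDLE: request a connection, then start the timer unless this is the
	// last subchannel in the list.
	actionConnectAndArmTimer attemptAction = iota
	// CONNECTING: an attempt is already in flight (e.g. requested by
	// another channel sharing the subchannel); treat it as ours and start
	// the timer unless this is the last subchannel in the list.
	actionArmTimerOnly
	// TRANSIENT_FAILURE: the subchannel is in backoff after a recent
	// failure; count it as already attempted and move on immediately.
	actionSkipToNextAddress
	// READY: a subchannel that is already connected is chosen immediately
	// (see the subchannel-sharing notes below).
	actionChooseImmediately
)

func actionForState(state connectivityState) attemptAction {
	switch state {
	case stateIdle:
		return actionConnectAndArmTimer
	case stateConnecting:
		return actionArmTimerOnly
	case stateReady:
		return actionChooseImmediately
	default:
		return actionSkipToNextAddress
	}
}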

Note that because we do not report TRANSIENT_FAILURE until after the Happy Eyeballs pass has completed and we start a new Happy Eyeballs pass whenever we receive a new address list, there is a potential failure mode where we may never report TRANSIENT_FAILURE if we are receiving new address lists faster than we are completing Happy Eyeballs passes. This is a pre-existing problem, and each gRPC implementation currently deals with it in its own way. This design does not propose any changes to those existing approaches, although a future gRFC may attempt to achieve further convergence here.

Once a subchannel does become READY, pick_first will unref all other subchannels, thus cancelling any connection attempts that were already in flight. Note that the connection backoff state is stored in the subchannel, so this means that we will lose backoff state for those subchannels (but see note for C-core below). In general, this is expected to be okay, because once we see a READY subchannel, we generally expect to maintain that connection for a while, after which the backoff state for the other subchannels will no longer be relevant. However, there could be pathological cases where a connection does not last very long and we wind up making subsequent connection attempts to the other addresses sooner than we ideally should. This should be fairly rare, so we're willing to accept this; if it becomes a problem, we can find ways to address it at that point.

Implications of Subchannel Sharing in C-core

In C-core, there are some additional details to handle due to the existence of subchannel sharing between channels. Any given subchannel that pick_first is using may also be used by other channel(s), and any of those other channels may request a connection on the subchannel at any time. This means that pick_first needs to be prepared for the fact that any subchannel may report any connectivity state at any time (even at the moment that pick_first starts using the subchannel), even if it did not previously request a connection on the subchannel itself. This has a couple of implications:

  • pick_first needs to be prepared for any subchannel to report READY at any time, even if it did not previously request a connection on that subchannel. Currently (prior to this design), pick_first immediately chooses the first subchannel that reports READY. That behavior seems consistent with the intent of Happy Eyeballs, so we will retain it.
  • When we choose a subchannel that has become successfully connected, we will unref all of the other subchannels. For any subchannel on which we were the only channel holding a ref, this will cause any pending connection attempt to be cancelled, and the subchannel will be destroyed. However, if some other channel was holding a ref to the subchannel, the connection attempt will continue, even if the other channel did not want it. This is slightly sub-optimal, but it's not really a new problem; the same thing can occur today if there are two channels both using pick_first with overlapping sets of addresses. We can find ways to address this in the future if and when it becomes a problem.

Move pick_first Logic Out of Subchannel (Java/Go)

In Java and Go, the pick_first logic is currently implemented in the subchannel. We will pull this logic out of the subchannel and move it into the pick_first policy itself. This means that subchannels will have only one address, and that address does not change over the lifetime of the subchannel. It will also mean that connection backoff will be done on a per-address basis rather than a per-endpoint basis. This will move us closer to having uniform architecture across all of our implementations.

Use pick_first as the Universal Leaf Policy

There are two main types of LB policies in gRPC: leaf policies, which directly interact with subchannels, and parent policies, which delegate to other LB policies. Happy Eyeballs support is necessary only in leaf policies.

Because we do not want to implement Happy Eyeballs multiple times, we will implement it only in pick_first, and we will change all other leaf policies to delegate to pick_first instead of directly interacting with subchannels. This set of policies, which we will refer to as "petiole" policies, includes the following:

  • round_robin
  • weighted_round_robin (gRFC A58)
  • least_request (gRFC A48)
  • ring_hash (gRFC A42)

The petiole policies will receive a list of endpoints, each of which may contain multiple addresses. They will create a pick_first child policy for each endpoint, to which they will pass a list containing a single endpoint with all of its addresses. (See below for more details on individual petiole policies.)
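
As an illustration of this delegation (the types and method names below are hypothetical stand-ins, not any implementation's actual LB policy API), a petiole policy's handling of a resolver update might look roughly like this:

package petiole

// endpoint is a hypothetical stand-in for the resolver's endpoint type:
// an ordered list of addresses plus per-endpoint attributes.
type endpoint struct {
	addresses  []string
	attributes map[string]any
}

// childPolicy is a hypothetical handle to a pick_first child policy.
type childPolicy interface {
	// updateAddresses passes a new endpoint list to the child.
	updateAddresses(endpoints []endpoint)
}

// newPickFirst is assumed to create a pick_first child policy (with the
// internal-only option enabling the generic health watch described below).
var newPickFirst func() childPolicy

// handleResolverUpdate creates one pick_first child per endpoint and gives
// each child a list containing just that endpoint, so the child can run
// Happy Eyeballs across that endpoint's addresses.
func handleResolverUpdate(endpoints []endpoint) []childPolicy {
	children := make([]childPolicy, 0, len(endpoints))
	for _, ep := range endpoints {
		child := newPickFirst()
		child.updateAddresses([]endpoint{ep})
		children = append(children, child)
	}
	return children
}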

Note that implementations should be careful to ensure that this change does not make error messages less useful when a pick fails. For example, today, when round_robin has all of its subchannels in state TRANSIENT_FAILURE, it can return a picker that fails RPCs with the error message reported by one of the subchannels (e.g., "failed to connect to all addresses; last error: ipv4:127.0.0.1:443: Failed to connect to remote host: Connection refused"), which tends to be more useful than just saying something like "all subchannels failed". With this change, round_robin will be delegating to pick_first instead of directly interacting with subchannels, and the LB policy API in many gRPC implementations does not have a mechanism to report an error message along with the connectivity state. In those implementations, it may be necessary for round_robin to return a picker that delegates to one of the pick_first children's pickers, possibly modifying the error message from the child picker before returning it to the channel.

Address List Handling in pick_first

As mentioned above, we are changing the LB policy API to take an address list that contains a list of endpoints, each of which can contain one or more addresses. However, the Happy Eyeballs algorithm assumes a flat list of addresses, not this two-dimensional list. To address that, we need to define how pick_first will flatten the list. We also need to define how that flattening interacts with both the sorting described in RFC-8305 section 4 and with the optional shuffling described in gRFC A62.

There are three cases to consider here:

A. If pick_first is used under a petiole policy, it will see a single endpoint with one or more addresses.

B. If pick_first is used as the top-level policy in the channel with the DNS resolver, it will see one or more endpoints, each of which has exactly one address. It should be noted that the DNS resolver does not actually know which addresses might or might not be associated with the same endpoint, so it assumes that each address is a separate endpoint.

C. If pick_first is used as the top-level policy in the channel with a custom resolver implementation, it may see more than one endpoint, each of which has one or more addresses.

RFC-8305 section 4 says to perform RFC-6724 sorting first. In gRPC, that sorting happens in the DNS resolver before the address list is passed to the LB policy, so it will already be done by the time pick_first sees the address list.

When the pick_first policy sees an address list, it will perform these steps in the following order:

  1. Perform the optional shuffling described in gRFC A62. The shuffling will change the order of the endpoints but will not touch the order of the addresses within each endpoint. This means that the shuffling will work for cases B and C above, but it will not work for case A; this is expected to be the right behavior, because we do not have or anticipate any use cases where a petiole policy will need to enable shuffling.

  2. Flatten the list by concatenating the ordered list of addresses for each of the endpoints, in order.

  3. In the flattened list, interleave addresses from the two address families, as per RFC-8305 section 4. Doing this on the flattened address list ensures the best behavior if only one of the two address families is working.
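
A sketch of steps 2 and 3 in Go (the address-family check is simplified and all names here are illustrative, not an implementation's actual API):

package pickfirst

import (
	"net"
	"strings"
)

// flatten concatenates each endpoint's ordered address list, in endpoint order.
func flatten(endpoints [][]string) []string {
	var out []string
	for _, ep := range endpoints {
		out = append(out, ep...)
	}
	return out
}

// isIPv6 is a simplified address-family check for "host:port" strings.
func isIPv6(addr string) bool {
	host, _, err := net.SplitHostPort(addr)
	if err != nil {
		host = addr
	}
	return strings.Contains(host, ":")
}

// interleave reorders the flattened list so that the two address families
// alternate, starting with the family of the first address, as described
// in RFC-8305 section 4.  Relative order within each family is preserved,
// and leftover addresses are appended once one family runs out.
func interleave(addrs []string) []string {
	var v4, v6 []string
	for _, a := range addrs {
		if isIPv6(a) {
			v6 = append(v6, a)
		} else {
			v4 = append(v4, a)
		}
	}
	first, second := v4, v6
	if len(addrs) > 0 && isIPv6(addrs[0]) {
		first, second = v6, v4
	}
	out := make([]string, 0, len(addrs))
	for len(first) > 0 || len(second) > 0 {
		if len(first) > 0 {
			out = append(out, first[0])
			first = first[1:]
		}
		if len(second) > 0 {
			out = append(out, second[0])
			second = second[1:]
		}
	}
	return out
}

For example, flattening the endpoints {[::1]:443, 127.0.0.1:443} and {[::2]:443} and then interleaving yields the order [::1]:443, 127.0.0.1:443, [::2]:443.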

Generic Health Reporting Mechanism

gRPC currently supports two mechanisms that provide a health signal for a connection: client-side health checking, as described in gRFC A17, and outlier detection, as described in gRFC A50. Currently, both mechanisms signal unhealthiness by essentially causing the subchannel to report TRANSIENT_FAILURE to the leaf LB policy. However, that approach will no longer work with this design, as explained in the Reasons for Generic Health Reporting section below.

Instead, we need to make these health signals visible to the petiole policies without affecting the underlying connectivity management of the pick_first policy. However, both of these mechanisms work on individual subchannels rather than on endpoints with multiple subchannels, so this functionality is best implemented in pick_first itself, since that is where we know which subchannel was actually chosen. Therefore, pick_first will have an option to support these health signals, and that option will be used only when pick_first is used as a child policy underneath a petiole policy.

Note that we do not want either of these mechanisms to actually work when pick_first is used as an LB policy by itself, so we will implement this functionality in a way that it can be triggered by a parent policy such as round_robin but cannot be triggered by an external application. (For example, in C-core, this will be triggered via an internal-only channel arg that will be set by the petiole policies.)

When this option is enabled in pick_first, it will be necessary for pick_first to see both the "raw" connectivity state of each subchannel and the state reflected by health checking. The connection management behavior will continue to use the "raw" connectivity state, just as it does today. Only once pick_first chooses a subchannel will it start the health watch, and the connectivity state reported by that watch is the state that pick_first will report to its parent.

Although we need pick_first to be aware of the chosen subchannel's health, we do not want it to have to be specifically aware of individual health-reporting mechanisms like client-side health checking or outlier detection (or any other such mechanism that we might add in the future). As a result, we will structure this as a general-purpose health-reporting watch that will be started by pick_first without regard to whether any individual health-reporting mechanism is actually configured. If no health-reporting mechanisms are actually configured, the watch will report the subchannel's raw connectivity state, so it will effectively be a no-op.
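
A hypothetical shape for such a watch API, sketched in Go (these interface and method names are illustrative only, not any implementation's actual subchannel API):

package health

// connectivityState mirrors the gRPC connectivity states.
type connectivityState int

const (
	idle connectivityState = iota
	connecting
	ready
	transientFailure
)

// stateWatcher receives state updates for a single subchannel.
type stateWatcher interface {
	onStateChange(state connectivityState, errorDescription string)
}

// subchannel is a hypothetical subchannel handle.  pick_first always uses
// watchConnectivityState for connection management.  When the
// petiole-only option is enabled, it also calls watchHealthState on the
// chosen subchannel and reports that watcher's state to its parent.
type subchannel interface {
	// watchConnectivityState reports the raw connectivity state.
	watchConnectivityState(w stateWatcher)
	// watchHealthState reports the state as seen through whatever
	// health-reporting mechanisms (client-side health checking, outlier
	// detection) happen to be layered on top; if none are configured, it
	// simply mirrors the raw connectivity state.
	watchHealthState(w stateWatcher)
}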

Address List Updates in Petiole Policies

The algorithm used by petiole policies to handle address list updates will need to be updated to reflect the new two-level nature of address lists.

Currently, there are differences between C-core and Java/Go in terms of how address list updates are handled, so we need to specify how each approach works and how it is going to be changed.

Address List Updates in C-core

In C-core, the channel provides a subchannel pool, which means that if an LB policy creates multiple subchannels with the same address and channel args, both of the returned subchannel objects will actually be refs to the same underlying real subchannel.

As a result, the normal way to handle an address list update today is to create a whole new list of subchannels, ignoring the fact that some of them may be duplicates of subchannels in the previous list; for those duplicates, the new list will just wind up getting a new ref to the existing subchannel, so there will not be any connection churn. Also, to avoid adding unnecessary latency to RPCs being sent on the channel, we wait to actually start using the new list until we have seen the initial connectivity state update on all of those subchannels and they have been given the chance to get connected, if necessary.

With the changes described in this proposal, we will continue to take the same basic approach, except that for each endpoint, we will create a pick_first child policy instead of creating a subchannel. Note that the subchannel pool will still be used by all pick_first child policies, so creating a new pick_first child in the new list for the same address that is already in use by a pick_first child in the old list will wind up reusing the existing connection.

Address List Updates in Java/Go

In Java and Go, there is no subchannel pool, so when an LB policy gets an updated address list, it needs to explicitly check whether any of those addresses were already present on its previous list. It effectively does a set comparison: for any address on the new list that is not on the old list, it will create a new subchannel; for any address that was on the old list but is not on the new list, it will remove the subchannel; and for any address on both lists, it will retain the existing subchannel.

This algorithm will continue to be used, with the difference that each entry in the list will now be a set of one or more addresses rather than a single address. Note that the order of the addresses will not matter when determining whether an endpoint is present on the list; if the old list had an endpoint with address list [A, B] and the new list has an endpoint with address list [B, A], that endpoint will be considered to be present on both lists. However, because the order of the addresses will matter to the pick_first child when establishing a new connection, the petiole policy will need to send an updated address list to the pick_first child to ensure that it has the updated order.

Note that in this algorithm, the unordered set of addresses must be the same on both the old and new list for an endpoint to be considered the same. This means that if an address is added or removed from an existing endpoint, it will be considered a completely new endpoint, which may cause some unnecessary connection churn. For this design, we are accepting this limitation, but we may consider optimizing this in the future if it becomes a problem.
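
One simple way to compare endpoints by their unordered address sets (illustrative only) is to derive a canonical key by sorting the addresses:

package petiole

import (
	"sort"
	"strings"
)

// endpointKey identifies an endpoint by its unordered set of addresses,
// so that an endpoint with addresses [A, B] and one with [B, A] map to
// the same key.  The caller's address order is left untouched, since that
// order still matters to the pick_first child.
func endpointKey(addresses []string) string {
	sorted := append([]string(nil), addresses...)
	sort.Strings(sorted)
	return strings.Join(sorted, ",")
}

The same kind of key can be reused wherever this design calls for a map keyed by an endpoint's unordered address set, such as the WRR weight map, the least-request counters, ring_hash's endpoint map, and outlier detection's endpoint map described below.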

Except for the cases noted below (Ring Hash and Outlier Detection), it is up to the implementation whether a given LB policy takes resolver attributes into account when comparing endpoints from the old list and the new list.

Weighted Round Robin

In the weighted_round_robin policy described in gRFC A58, some additional state is needed to track the weight of each endpoint.

WRR in C-core

In C-core, WRR currently has a map of address weights, keyed by the associated address. The weight objects are ref-counted and remove themselves from the map when their ref-count reaches zero. When a subchannel is created for a given address, it takes a new ref to the weight object for its address. This structure allows the weight information to be retained when we create a new subchannel list in response to an updated address list.

With the changes in this proposal, this map will instead be keyed by the unordered set of addresses for each endpoint. This will use the same semantics as address list updates in Java/Go, described above: an endpoint on the old list with addresses [A, B] will be considered identical to an endpoint on the new list with addresses [B, A].

Note that in order to start the ORCA OOB watcher for backend metrics on the subchannel (see gRFC A51), WRR will need to intercept subchannel creation via the helper that it passes down into the pick_first policy. It will unconditionally start the watch for each subchannel as it is created, all of which will update the same subchannel weight. However, once pick_first chooses a subchannel, it will unref the other subchannels, so only one OOB watcher will remain in steady state.

WRR in Java/Go

In Java and Go, WRR stores the subchannel weight in the individual subchannel. We will continue to use this same structure, except that instead of using a map from a single address to a subchannel, we will store a map from an unordered set of addresses to a pick_first child, and the endpoint weight will be stored alongside that pick_first child.

Just like in C-core, in order to start the ORCA OOB watcher for backend metrics on the subchannel, WRR will need to intercept subchannel creation via the helper that it passes down into the pick_first policy. However, unlike C-core, Java and Go will need to wrap the subchannels and store them, so that they can start or stop the ORCA OOB watcher as needed by a subsequent config change.

Least Request

The least-request LB policy (Java and Go only, described in gRFC A48) will work essentially the same way as WRR. The only difference is that the data it is storing on a per-endpoint basis is outstanding request counts rather than weights.

Ring Hash

Currently, as described in gRFC A42, each entry in the ring is a single address, positioned based on the hash of that address. With this design, that will change such that each entry in the ring is an endpoint, positioned based on the hash of the endpoint's first address. However, once an entry in the ring is selected, we may wind up connecting to the endpoint on a different address than the one that we hashed to.
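
A simplified sketch of ring construction under this change, in Go (the hash function and single-entry-per-endpoint layout are illustrative only; real implementations use the hashing and weight-scaled replication described in gRFC A42):

package ringhash

import (
	"hash/fnv"
	"sort"
)

// ringEntry is one position on the ring.  It now refers to a whole
// endpoint (backed by a pick_first child) rather than a single address.
type ringEntry struct {
	hash     uint64
	endpoint int // index into the endpoint list
}

// buildRing positions each endpoint on the ring using the hash of its
// first address.  Once an entry is selected, the pick_first child may end
// up connecting to the endpoint on a different address than the one that
// was hashed.
func buildRing(endpoints [][]string) []ringEntry {
	ring := make([]ringEntry, 0, len(endpoints))
	for i, ep := range endpoints {
		if len(ep) == 0 {
			continue
		}
		h := fnv.New64a()
		h.Write([]byte(ep[0]))
		ring = append(ring, ringEntry{hash: h.Sum64(), endpoint: i})
	}
	sort.Slice(ring, func(a, b int) bool { return ring[a].hash < ring[b].hash })
	return ring
}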

Note that this means that if the order of the addresses for a given endpoint changes, that will change the position of the endpoint in the ring. This is considered acceptable, since ring_hash is already subject to churn in the ring whenever the address list changes.

Because ring_hash establishes connections lazily, but pick_first will attempt to connect as soon as it receives its initial address list, the ring_hash policy will lazily create the pick_first child when it wants to connect.

Note that as of gRFC A62, pick_first has sticky-TF behavior in all languages: when a connection attempt fails, it continues retrying indefinitely with appropriate backoff, staying in TRANSIENT_FAILURE state until either it establishes a connection or the pick_first policy is destroyed. This means that the ring_hash picker no longer needs to explicitly trigger connection attempts on subchannels in state TRANSIENT_FAILURE, which makes the logic much simpler. The picker pseudo-code now becomes:

first_index = ring.FindIndexForHash(request.hash);
for (i = 0; i < ring.size(); ++i) {
  index = (first_index + i) % ring.size();
  if (ring[index].state == READY) {
    return ring[index].picker->Pick(...);
  }
  if (ring[index].state == IDLE) {
    ring[index].endpoint.TriggerConnectionAttemptInControlPlane();
    return PICK_QUEUE;
  }
  if (ring[index].state == CONNECTING) {
    return PICK_QUEUE;
  }
}

As per gRFC A42, the ring_hash policy normally requires pick requests to trigger subchannel connection attempts, but if it is being used as a child of the priority policy, it will not be getting any picks once it reports TRANSIENT_FAILURE. To work around this, it currently makes sure that it is attempting to connect (after applicable backoff period) to at least one subchannel at any given time. After a given subchannel fails a connection attempt, it moves on to the next subchannel in the ring. This approach allows the policy to recover if any one endpoint becomes reachable, while also minimizing the number of endpoints it is trying to connect to simultaneously, so that it does not wind up with a lot of unnecessary connections when connectivity is restored. However, with the sticky-TF behavior, it will not be possible to attempt to connect to only one endpoint at a time, because when a given pick_first child reports TRANSIENT_FAILURE, it will automatically try reconnecting after the backoff period without waiting for a connection to be requested. Proposed pseudo-code for this logic is:

if (in_transient_failure && endpoint_entered_transient_failure) {
  first_idle_index = -1;
  for (i = 0; i < endpoints.size(); ++i) {
    if (endpoints[i].connectivity_state() == CONNECTING) {
      first_idle_index = -1;
      break;
    }
    if (first_idle_index == -1 && endpoints[i].connectivity_state() == IDLE) {
      first_idle_index = i;
    }
  }
  if (first_idle_index != -1) {
    endpoints[first_idle_index].RequestConnection();
  }
}

Note that this means that after an extended connectivity outage, ring_hash will now often wind up with many unnecessary connections. However, this situation is also possible via the picker if ring_hash is the last child under the priority policy, so we are willing to live with this behavior for now. If it becomes a problem in the future, we can consider ways to ameliorate it at that time.

Note that in C-core, the normal approach for handling address list updates described above won't work, because if we are creating the pick_first children lazily, then we will wind up not creating the children in the new endpoint list and thus never swapping over to it. As a result, ring_hash in C-core will use an approach more like that of Java and Go: it will maintain a map of endpoints keyed by their unordered sets of addresses, and it will update that map in place when it receives an updated address list.

Because ring_hash chooses which endpoint to use via a hash function based solely on the first address of the endpoint, it does not make sense to have multiple endpoints with the same address that are differentiated only by the resolver attributes. Thus, resolver attributes are ignored when de-duping endpoints.

Outlier Detection

The goal of the outlier detection policy is to temporarily stop sending traffic to servers that are returning an unusually large error rate. The kinds of problems that it is intended to catch are primarily things that are independent of which address is used to connect to the server; a problem with the reachability of a particular address is more likely to cause connectivity problems than individual RPC failures, and problems that cause RPC failures are generally just as likely to occur on any address. Therefore, this design changes the outlier detection policy to make ejection decisions on a per-endpoint basis, instead of on a per-address basis as it does today. RPCs made to any address associated with an endpoint will count as activity on that endpoint, and ejection or unejection decisions for an endpoint will affect subchannels for all addresses of an endpoint.

As described in gRFC A50, the outlier detection policy currently maintains a map keyed by individual address. The map values contain both the set of currently existing subchannels for a given address as well as the ejection state for that address. This map will be split into two maps: a map of currently existing subchannels, keyed by individual address, and a map of ejection state, keyed by the unordered set of addresses on the endpoint.

The entry in the subchannel map will hold a ref to the corresponding entry in the endpoint map. This ref will be updated when the LB policy receives an updated address list in which the endpoint's set of addresses has changed. It will be used to update the successful and failed call counts as each RPC finishes. Note that appropriate synchronization is required for those two different accesses.

The entry in the endpoint map may hold a pointer to the entries in the subchannel map for the addresses associated with the endpoint, or the implementation may simply look up each of the endpoint's addresses in the subchannel map separately. These accesses from the endpoint map to the subchannel map will be performed by the LB policy when ejecting or unejecting the endpoint, to send health state notifications to the corresponding subchannels. Note that if the ejection timer runs in the same synchronization context as the rest of the activity in the LB policy, no additional synchronization should be needed here.
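
A rough sketch of the resulting data structures in Go (all names here are illustrative, not the actual implementation):

package outlierdetection

// endpointState holds the per-endpoint ejection state and call counters.
// The map holding these entries is keyed by the endpoint's unordered set
// of addresses (e.g. a sorted, comma-joined key).
type endpointState struct {
	successCount uint64
	failureCount uint64
	ejected      bool
	// addresses of the endpoint, used to find the corresponding
	// subchannel entries when ejecting or unejecting.
	addresses []string
}

// wrappedSubchannel stands in for the policy's subchannel wrapper, which
// signals ejection through the generic health reporting mechanism rather
// than through the raw connectivity state.
type wrappedSubchannel struct{}

// subchannelState is keyed by a single address.  It records the currently
// existing subchannels for that address and points at the endpoint entry
// whose call counters are updated as RPCs on those subchannels finish.
type subchannelState struct {
	subchannels []*wrappedSubchannel
	endpoint    *endpointState // re-pointed if the endpoint's address set changes
}

// outlierDetection holds the two maps described above.
type outlierDetection struct {
	subchannelsByAddress map[string]*subchannelState
	endpointsByAddrSet   map[string]*endpointState
}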

The set of entries in both maps will continue to be set based on the address list that the outlier detection policy receives from its parent. And the map keys will continue to use only the addresses, not taking resolver attributes into account.

Currently, the outlier detection policy wraps the subchannels and ejects them by reporting their connectivity state as TRANSIENT_FAILURE. As described above, we will change the outlier detection policy to instead eject endpoints by wrapping the subchannel's generic health reporting mechanism.

Support Multiple Addresses Per Endpoint in xDS

The EDS resource has been updated to support multiple addresses per endpoint in envoyproxy/envoy#27881. Specifically, that PR adds a new AdditionalAddress message, which contains a single address field, and it adds a repeated additional_addresses field of that type to the Endpoint proto.

When validating the EDS resource, while processing each Endpoint proto, we validate each entry of additional_addresses as follows:

  • If the address field is unset, we reject the resource.
  • If the address field is set, then we validate it exactly the same way that we already validate the Endpoint.address field.
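
A sketch of that validation logic in Go, using simplified stand-in types rather than the generated xDS protos:

package xdsresource

import "errors"

// address, additionalAddress, and endpointProto are simplified stand-ins
// for the corresponding EDS proto messages.
type address struct {
	host string
	port uint32
}

type additionalAddress struct {
	address *address
}

type endpointProto struct {
	address             *address
	additionalAddresses []additionalAddress
}

// validateAddress stands in for the validation already applied to the
// Endpoint.address field (details elided).
func validateAddress(a *address) error {
	if a == nil {
		return errors.New("address not set")
	}
	// ... same checks as for Endpoint.address ...
	return nil
}

// validateEndpoint applies the same validation to each entry of
// additional_addresses, rejecting the resource if any entry's address
// field is unset or invalid.
func validateEndpoint(ep *endpointProto) error {
	if err := validateAddress(ep.address); err != nil {
		return err
	}
	for _, aa := range ep.additionalAddresses {
		if aa.address == nil {
			return errors.New("additional_addresses entry: address not set")
		}
		if err := validateAddress(aa.address); err != nil {
			return err
		}
	}
	return nil
}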

Changes to Stateful Session Affinity

We need to support endpoints with multiple addresses in stateful session affinity (see gRFCs A55 and A60). We want to add one additional property here, which is that we do not want affinity to break if an endpoint has multiple addresses and then one of those addresses is removed in an EDS update. This will require some changes to the original design.

First, the session cookie, which currently contains a single endpoint address, will be changed to contain a list of endpoint addresses. As per gRFC A60, the cookie's format is a base64-encoded string of the form <address>;<cluster>. This design changes that format such that the address part will be a comma-delimited list of addresses. The StatefulSession filter currently sets a call attribute that communicates the address from the cookie to the xds_override_host LB policy; that call attribute will now contain the list of addresses from the cookie.
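
For illustration, here is a sketch of encoding and decoding the new cookie format in Go (the helper names and example values are hypothetical; the format itself, a base64-encoded comma-delimited address list followed by ";" and the cluster name, is as described above):

package statefulsession

import (
	"encoding/base64"
	"errors"
	"strings"
)

// encodeCookieValue builds the cookie payload: a comma-delimited list of
// the endpoint's addresses (the address actually in use must come first),
// then ";" and the cluster name, base64-encoded.
func encodeCookieValue(addresses []string, cluster string) string {
	raw := strings.Join(addresses, ",") + ";" + cluster
	return base64.StdEncoding.EncodeToString([]byte(raw))
}

// decodeCookieValue parses a cookie back into the address list and cluster.
func decodeCookieValue(value string) (addresses []string, cluster string, err error) {
	raw, err := base64.StdEncoding.DecodeString(value)
	if err != nil {
		return nil, "", err
	}
	parts := strings.SplitN(string(raw), ";", 2)
	if len(parts) != 2 {
		return nil, "", errors.New("malformed session cookie")
	}
	return strings.Split(parts[0], ","), parts[1], nil
}

For example, encodeCookieValue([]string{"[2001:db8::1]:443", "10.0.0.1:443"}, "cluster_a") encodes the string "[2001:db8::1]:443,10.0.0.1:443;cluster_a".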

Next, the entries in the address map in the xds_override_host LB policy need to contain the actual address list to be used in the cookie when a given address is picked. Note that the original design already described how we would represent endpoints with multiple addresses in this map, since that was already possible in Java (see the description in A55 of handling EquivalentAddressGroups when constructing the map). However, the original design envisioned that we would store a list of addresses that would be looked up as keys in the map when finding alternative addresses to use, which we no longer need now that we will be encoding the list of addresses in the cookie itself. Instead, what we need from the map entry is the information necessary to construct the list of addresses to be encoded in the cookie when the address for that map entry is picked. Implementations will likely want to store this as a single string instead of a list, since that will avoid the need to construct the string on a per-RPC basis.

As per the original design, when returning the server's initial metadata to the application, the StatefulSession filter may need to set a cookie indicating which endpoint was chosen for the RPC. However, now that the cookie needs to include all of the endpoint's addresses and not just the specific one that is used, we need to communicate that information from the xds_override_host LB policy back to the StatefulSession filter. This will be done via the same call attribute that the StatefulSession filter creates to communicate the list of addresses from the cookie to the xds_override_host policy. That attribute will be given a new method to allow the xds_override_host policy to set the list of addresses to be encoded in the cookie, based on the address chosen by the picker. The StatefulSession filter will then update the cookie if the address list in the cookie does not match the address list reported by the xds_override_host policy. Note that when encoding the cookie, the address that is actually used must be the first address in the list.

In accordance with those changes, the picker logic will now look like this:

def Pick(pick_args):
  override_host_attribute = pick_args.call_attributes.get(attribute_key)
  if override_host_attribute is not None:
    idle_subchannel = None
    found_connecting = False
    for address in override_host_attribute.cookie_address_list:
      entry = lb_policy.address_map[address]
      if entry found:
        if (entry.subchannel is set AND
            entry.health_status is in policy_config.override_host_status):
          if entry.subchannel.connectivity_state == READY:
            override_host_attribute.set_actual_address_list(entry.address_list)
            return entry.subchannel as pick result
          elif entry.subchannel.connectivity_state == IDLE:
            if idle_subchannel is None:
              idle_subchannel = entry.subchannel
          elif entry.subchannel.connectivity_state == CONNECTING:
            found_connecting = True
    # No READY subchannel found.  If we found an IDLE subchannel,
    # trigger a connection attempt and queue the pick until that attempt
    # completes.
    if idle_subchannel is not None:
      hop into control plane to trigger connection attempt for idle_subchannel
      return queue as pick result
    # No READY or IDLE subchannels.  If we found a CONNECTING
    # subchannel, queue the pick and wait for the connection attempt
    # to complete.
    if found_connecting:
      return queue as pick result
  # override_host_attribute not set or did not find a usable matching subchannel,
  # so delegate to the child picker.
  result = child_picker.Pick(pick_args)
  if result.type == PICK_COMPLETE:
    entry = lb_policy.address_map[result.subchannel.address()]
    if entry found:
      override_host_attribute.set_actual_address_list(entry.address_list)
  return result

Temporary environment variable protection

The code that reads the new EDS fields will be initially guarded by an environment variable called GRPC_EXPERIMENTAL_XDS_DUALSTACK_ENDPOINTS. This environment variable guard will be removed once this feature has proven stable.

Note that we will not use this environment variable to guard the Happy Eyeballs functionality, because that functionality will be on by default, not something that is enabled via external input.

Rationale

Happy Eyeballs Functionality

Note that we will not support all parts of "Happy Eyeballs" as described in RFC-8305. For example, because our resolver API does not provide a way to return some addresses without others, we will not start trying to connect before all of the DNS queries have returned.

Java and Go Pick First Restructuring

In Java and Go, pick_first is currently implemented inside the subchannel rather than at the LB policy layer. In those implementations, it might work to implement Happy Eyeballs inside the subchannel, which would avoid the need to make pick_first the universal leaf policy, and in Go, it would avoid the need to move the health-checking code out of the subchannel. However, that approach won't work for C-core, and we would like to take this opportunity to move toward a more uniform cross-language architecture. Also, moving pick_first up to the LB policy layer in Java and Go will have the nice effect of making their backoff work per-address instead of across all addresses, which is what C-core does and what the (poorly specified) connection backoff spec seems to have originally envisioned.

Reasons for Generic Health Reporting

Currently, client-side health checking and outlier detection signal unhealthiness by essentially causing the subchannel to report TRANSIENT_FAILURE to the leaf LB policy. This existing approach works reasonably when petiole policies directly create and manage subchannels, but it will not work when pick_first is the universal leaf policy. When pick_first sees its chosen subchannel transition from READY to TRANSIENT_FAILURE, it will interpret that as the connection failing, so it will unref the subchannel and report IDLE to its parent. This causes two problems.

The first problem is that we don't want unhealthiness to trigger connection churn, but pick_first would react in this case by dropping the existing connection unnecessarily. Note that, as described in gRFC A17, the client-side health checking mechanism does not work with pick_first, for this exact reason. In hindsight, we should have imposed the same restriction for outlier detection, but that was not explicitly stated in gRFC A50. However, that gRFC does say that outlier detection will ignore subchannels with multiple addresses, which is the case in Java and Go. In C-core, it should have worked with pick_first, although it turns out that there was a bug that prevented it from working, which means we know that no users were actually relying on this behavior. This means that we can retroactively say that outlier detection should never have worked with pick_first with minimal risk of affecting users that might have been counting on this use case. (It might affect Java/Go channels that use pick_first and happen to have only one address, and it might have been used in Node.)

The second problem is that this would cause pick_first to report IDLE instead of TRANSIENT_FAILURE up to the petiole policy. This could affect the aggregated connectivity state that the petiole policy reports to its parent. And parent policies like the priority policy (see gRFC A56) may then make the wrong routing decision based on that incorrect state.

These problems are solved via the introduction of the Generic Health Reporting Mechanism.

Implementation

C-core

  • move client-side health checking out of subchannel so that it can be controlled by pick_first (grpc/grpc#32709)
  • assume LB policies start in CONNECTING state (grpc/grpc#33009)
  • prep for outlier detection ejecting via health watch (grpc/grpc#33340)
  • move pick_first off of the subchannel_list library that it previously shared with petiole policies, and add generic health watch support (grpc/grpc#34218)
  • change petiole policies to use generic health watch, and change outlier detection to eject via health state instead of raw connectivity state (grpc/grpc#34222)
  • change ring_hash to delegate to pick_first (grpc/grpc#34244)
  • add endpoint_list library for petiole policies, and use it to change round_robin to delegate to pick_first (grpc/grpc#34337)
  • change WRR to delegate to pick_first (grpc/grpc#34245)
  • implement happy eyeballs in pick_first (grpc/grpc#34426 and grpc/grpc#34717)
  • implement address interleaving for happy eyeballs (grpc/grpc#34615 and grpc/grpc#34804)
  • change resolver and LB policy APIs to support multiple addresses per endpoint, and update most LB policies (grpc/grpc#33567)
  • support new xDS fields (grpc/grpc#34506)
  • change outlier detection to handle multiple addresses per endpoint (grpc/grpc#34526)
  • change stateful session affinity to handle multiple addresses per endpoint (grpc/grpc#34472)

Java

  • move pick_first logic out of subchannel and into pick_first policy
  • make pick_first the universal leaf policy, including client-side health checking support
  • implement happy eyeballs in pick_first
  • fix ring_hash to support endpoints with multiple addresses
  • support new xDS fields

Go

  • change subchannel connectivity state API (maybe)
  • move pick_first logic out of subchannel and into pick_first policy
  • make pick_first the universal leaf policy, including client-side health checking support (includes moving health checking logic out of the subchannel)
  • change address list to support multiple addresses per endpoint and change LB policies to handle this (including ring_hash)
  • implement happy eyeballs in pick_first
  • support new xDS fields

Open issues (if applicable)

N/A