AWS endpoint with varying IP address #1091

Open
dwickstrom opened this issue May 8, 2019 · 25 comments

dwickstrom commented May 8, 2019

Please use the following questions as a guideline to help me answer
your issue/question without further inquiry. Thank you.

Which version of Elastic are you using?

[x] elastic.v6 (for Elasticsearch 6.x)

Please describe the expected behavior

Hello 👋 We're trying to use this library with an AWS cluster of 3 nodes, specifying the endpoint hostname from AWS as a single entry under the hosts key in the library config file. Ideally, the client would detect when the IP address changes, re-resolve the hostname, and retry the request, so that no requests are dropped during the re-provisioning phase.

Please describe the actual behavior

Requests fail during the provisioning phase; then, in our case after about 15 minutes, the client heals itself and requests stop failing.

Because AWS does not expose the node IPs on the /_nodes endpoint, these are my thoughts so far:

With sniffing disabled, we see that the single node connection won't stay marked as dead, due to:

elastic/client.go

Lines 1204 to 1209 in 60d62e5

if !c.snifferEnabled {
    c.errorf("elastic: all %d nodes marked as dead; resurrecting them to prevent deadlock", len(c.conns))
    for _, conn := range c.conns {
        conn.MarkAsAlive()
    }
}

With sniffing enabled it's not going to work either, because sniffing can't be done when AWS only exposes the load balancer IP; the client won't be able to detect any other nodes:

elastic/client.go

Lines 964 to 978 in 60d62e5

if err := json.NewDecoder(res.Body).Decode(&info); err == nil {
    if len(info.Nodes) > 0 {
        for nodeID, node := range info.Nodes {
            if c.snifferCallback(node) {
                if node.HTTP != nil && len(node.HTTP.PublishAddress) > 0 {
                    url := c.extractHostname(c.scheme, node.HTTP.PublishAddress)
                    if url != "" {
                        nodes = append(nodes, newConn(nodeID, url))
                    }
                }
            }
        }
    }
}
return nodes

Any steps to reproduce the behavior?

  1. instantiate a new client, setting the AWS endpoint as a single host entry in the config
  2. Trigger cluster re-provisioning in AWS, described here: https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-managedomains.html#es-managedomains-configuration-changes

olivere commented May 8, 2019

Hmm... if I understand it correctly, with AWS you should simply disable sniffing and health checks, as AWS does the load-balancing for you, and you should simply use the hostname provided by AWS as a single endpoint.
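
In code, that boils down to something like this minimal sketch (the endpoint URL is a placeholder for the one shown in the AWS console, and the import path varies with the major version you use):

package main

import (
    "log"

    "github.com/olivere/elastic"
)

func main() {
    client, err := elastic.NewClient(
        // Single endpoint provided by AWS (placeholder URL).
        elastic.SetURL("https://my-domain.eu-central-1.es.amazonaws.com"),
        elastic.SetSniff(false),       // don't ask the cluster for node IPs
        elastic.SetHealthcheck(false), // don't probe individual nodes
    )
    if err != nil {
        log.Fatal(err)
    }
    _ = client // use the client as usual
}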

What I don't understand is why the *http.Client won't find the new IP address when it changes. It should simply use the hostname, and the resolver should return the new IP. Unless there's some caching going on, that should simply work... as long as the hostname is the same.

I'm sorry if I misunderstood; I'm not an AWS customer.

@dwickstrom (Author)

Thank you for responding so quickly. Your suggestion to disable both sniffing and health checks sounds good, and I'll try it in a moment.

What I don't understand is why the *http.Client won't find the new IP address when it changes. It should simply use the hostname, and the resolver should return the new IP. Unless there's some caching going on, that should simply work... as long as the hostname is the same.

Yes, and even with healthchecks & sniffing disabled I'm guessing that this problem will appear. I'll post my findings here soon.

@dwickstrom (Author)

Alright, thank you! That seems to have solved the problem 🎈 Not sure why though 🤔

I was hoping to learn something from this, so I figure I should give some more context as to what I was going through.
The way we are set up is that we have a cluster of 3 master nodes that we connect to through a single endpoint: as you probably know, the one provided through the AWS console. Initially I thought of that AWS endpoint address as likely pointing to a load balancer, but after a while I realised that this isn't the case. Instead it cycles randomly, resolving to the IP of any of the nodes.

And so the problem that I have been trying to resolve happens when the cluster is re-provisioned. This is what I think happens during that phase:

  1. The number of nodes is doubled
  2. The data from the first set of nodes is migrated over to the new nodes
  3. Once data is migrated, the endpoint starts resolving to the addresses of the new set of cluster nodes
  4. When data migration is complete, the old nodes are shut down one by one

Here's what I don't understand: after step 3, in the case where health checks are enabled, why would the health-check requests start failing, whereas with health checks disabled, the normal requests don't fail?


olivere commented May 10, 2019

Hmm... let's see.

First of all, the whole idea of sniffing and health checks is only necessary because, in the early days of ES, load-balancing was done client-side. If you have a server-side solution, which I think is the right solution, you shouldn't need to do any of those things. Just let the server do the right thing and keep the client dumb.

Now, sniffing is the process of initially and periodically finding the list of nodes in the connected cluster. Let's say you initially have a 1-node cluster and use elastic to connect to that cluster with a URL. elastic will then use the URL to find all nodes in the cluster (1 node only) via the Cluster State API. It will then throw away the initial URL and use the IPs/hostnames reported by the cluster API. Once in a while, this process is re-executed to find new nodes in the cluster that were eventually added by the admin. So, eventually, elastic will have a full list of IPs/hostnames to connect to and will use them via round-robin. Notice there are a few edge cases, like re-running this process if we end up with an empty list of nodes for some reason. But let's try to keep it simple.

Health checks serve another purpose. They periodically check the list of nodes and manage the individual state of those nodes. E.g. if elastic tried to send a request to a node that didn't respond, that node is marked as dead and no longer used. However, that could just be a blip in the network, so the health check runs periodically to eventually mark those nodes as alive again. Again, there are some edge cases.
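
For completeness, both background processes are controlled through client options; a minimal sketch (the intervals shown are illustrative, not recommendations):

opts := []elastic.ClientOptionFunc{
    elastic.SetSniff(true),                           // enable periodic node discovery
    elastic.SetSnifferInterval(15 * time.Minute),     // how often the node list is refreshed
    elastic.SetHealthcheck(true),                     // enable dead/alive probing
    elastic.SetHealthcheckInterval(60 * time.Second), // how often nodes are probed
}
client, err := elastic.NewClient(opts...)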

I currently don't see why one would disable sniffing but keep health checks enabled. So maybe they should be disabled as well, automatically, when sniffing is disabled.

@dwickstrom (Author)

I currently don't see why one would disable sniffing but keep health checks enabled. So maybe they should be disabled as well, automatically, when sniffing is disabled.

That sounds reasonable to me.

In any case it might be a good idea to put it into the AWS section of the wiki, advising not to use either sniffing or health checks.


olivere commented May 12, 2019

I changed the docs in the Wiki and advised to disable both sniffing and health checks for AWS Elasticsearch Service.

olivere closed this as completed May 12, 2019
@dwickstrom (Author)

Great, thanks for helping out 🥇


iandees commented Nov 6, 2019

I'm running into the same problem as David, even with healthcheck and sniff turned off. @dwickstrom do you remember if you changed anything on the underlying HTTP client instance maybe?


dwickstrom commented Nov 7, 2019

Hi @iandees, no, I didn't change anything on the HTTP client. Lately, however, there have been some issues with this again. Back in May, the way I tested it was by toggling some parameter in the cluster settings to trigger a cluster "rollover". Recently, however, when AWS themselves triggered an Elasticsearch upgrade on their side, that "rollover" did not go well: clients were not able to connect without intervention, just like the incidents I had ~6 months ago.


olivere commented Nov 7, 2019

Maybe there's still a problem. Reopening.

olivere reopened this Nov 7, 2019

olivere commented Nov 7, 2019

There was a change quite recently that addresses an issue on AWS ES with nodes changing IPs in particular. I don't know if this has anything to do with it. #1125

@g-wilson

Hi all, resurrecting this thread to share some more info. We're seeing this issue as well. After some pretty thorough testing I can replicate the issue, and I don't think the issue is with this library.

AWS ES uses DNS-based load-balancing to resolve the hostname to the ES nodes; it's not an EC2-style load balancer.

If an HTTP client is used which uses keep-alive connections (http.DefaultClient does by default), and your volume of requests is high enough that the idle timeout is never reached, the connection will not be re-established.

This means that when AWS rotates the nodes and changes the DNS records, an application using this library is none the wiser; it won't do another DNS lookup until the connections are left idle and then terminated.
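
A blunt way to confirm (or work around) this is to disable keep-alives on the transport you hand to the client, so that every request dials a new connection (and therefore resolves DNS again), at the cost of a new TCP/TLS handshake per request. A rough sketch, with a placeholder URL:

transport := &http.Transport{
    DisableKeepAlives: true, // force a fresh connection (and DNS lookup) per request
}
httpClient := &http.Client{Transport: transport}

client, err := elastic.NewClient(
    elastic.SetURL("https://my-domain.eu-central-1.es.amazonaws.com"),
    elastic.SetHttpClient(httpClient),
    elastic.SetSniff(false),
    elastic.SetHealthcheck(false),
)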

Eventually this library does recognise that requests are failing and resets everything, but this does cause a fairly significant interruption of service.

This issue is described well here golang/go#23427


olivere commented May 23, 2020

Thanks for reporting your findings, @g-wilson.

@Sovietaced

AWS ES uses DNS based load-balancing to resolve the hostname to the ES nodes, it's not an EC2-style load balancer.

In this case it seems like clients would benefit from sniffing.


olivere commented May 18, 2021

@Sovietaced I'm not sure that's correct. Sniffing is a process by which the client library asks the ES cluster (not DNS) for the IP addresses of the nodes, then uses those and watches for changes; that's effectively client-side LB. In the case of DNS-based LB, the ES cluster usually doesn't know about, nor update, its internal IP addresses. Hence, I think, disabling sniffing and health checks is the right way to use Elastic on AWS. Again, I'm not an active user of Elastic on AWS ES.

The problem, though, is that Go keeps connections (and hence the resolved IP addresses) around for a while and doesn't resolve DNS for each and every request, hence the reference to golang/go#23427.


Sovietaced commented May 18, 2021

@olivere The Java library has the same problem. It resolves an IP address from the ES cluster domain name and caches that data node's IP address indefinitely. What we notice is that if the node whose IP we have cached is no longer a data node, our applications are essentially broken (receiving 503s) until we restart them and they get a new IP address from the AWS ES cluster DNS.

This is obviously a pretty terrible user experience that seems ripe for the use of sniffing.

@Sovietaced

This is obviously a pretty terrible user experience that seems ripe for the use of sniffing.

I ended up testing sniffing with the AWS ES cluster, and it appears that the /_nodes/http?pretty=true API does not even include HTTP info about the nodes, so sniffing doesn't work.


olivere commented May 19, 2021

Interesting. Maybe we should accommodate that and, at least, log a warning.

I will have to test this out on AWS ES.


wingsofovnia commented May 19, 2021

Here is a comparison of how AWS ES response differs from a normal ES deployment: elastic/elasticsearch-js#1178 (comment)

One way to mitigate this might be a custom sniffer that does an nslookup instead of GET /_nodes/.

root@shell:/# nslookup aws-elasticsearch-domain.eu-central-1a.es.amazonaws.com

Server: a.a.a.a
Address: b.b.b.b#...

Non-authoritative answer:
Name: aws-elasticsearch-domain.eu-central-1a.es.amazonaws.com
Address: x.x.x.x # Node IP 1

Name: aws-elasticsearch-domain.eu-central-1a.es.amazonaws.com
Address: y.y.y.y # Node IP 2


olivere commented May 19, 2021

Thanks for the links. Very helpful.


Sovietaced commented May 20, 2021

For what it's worth, we ended up writing our own custom sniffer and it appears to work well. I forced a blue/green deployment of an AWS ES cluster and watched the IP addresses flip with no downtime.

I realize this is a Go library, but folks may find this generally useful. This is the basic logic for a periodic task that runs in the background. Note: this approach depends on having a DNS cache TTL set.

The following code is in Kotlin:

val addresses: List<InetAddress>

try {
    // host.hostName is the cluster domain name provided by AWS
    addresses = InetAddress.getAllByName(host.hostName).asList()
} catch (e: UnknownHostException) {
    throw AwsSnifferException("Failed to resolve addresses for ${host.hostName}", e)
}

logger.debug("Sniffed addresses: $addresses")

if (addresses.isEmpty()) {
    logger.warn("No nodes to set")
} else {
    val nodes = addresses.stream()
        // Generate new hosts with the address swapped. Retain port/scheme
        .map { HttpHost(it.hostAddress, host.port, host.schemeName) }
        .map { Node(it) }
        .toList()

    logger.debug("Calculated nodes: $nodes")

    restClient.setNodes(nodes)
}
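
For Go users, the DNS-resolution half of that idea could look roughly like the sketch below; the hostname is a placeholder, and how the resolved IPs get fed back into your client depends on the client in use (with this library you would most likely have to rebuild the client):

// Resolve the AWS ES domain name to the current set of node IPs.
ips, err := net.LookupHost("my-domain.eu-central-1.es.amazonaws.com")
if err != nil {
    log.Printf("lookup failed: %v", err)
    return
}
urls := make([]string, 0, len(ips))
for _, ip := range ips {
    urls = append(urls, "https://"+ip+":443")
}
log.Printf("resolved node URLs: %v", urls)

Note that talking to the nodes by IP over HTTPS will generally fail certificate verification unless the TLS config keeps ServerName set to the original domain.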


chrisharrisonkiwi commented May 31, 2021

I'm also running into the exact same issue with AWS.
Any Elasticsearch modification or automated action that results in the nodes being reassigned seems to cause the issue for around 15 minutes (with both sniffing and health checks turned off).

Is there an easy way with this library to force a reconnection to the cluster, maybe?
It might be nice to have a client.Reconnect() option in the event that no nodes are available.
I guess I could run client.Stop() and then get a new connection using elastic.NewClient() and see if the new connection has correctly mapped nodes etc.

-- edit
I tried the new-client idea and it seemed to work, but it's a bit of a sledgehammer approach.
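
For reference, that sledgehammer looks roughly like the sketch below (placeholder URL and variable names). The important detail is building the new client on a fresh *http.Transport: if the new client keeps sharing the old transport (e.g. http.DefaultTransport), the old keep-alive connections, and hence the old IPs, are still reused.

// Stop the old client's background workers, then build a new client on a
// fresh transport so old keep-alive connections are not carried over.
oldClient.Stop()

transport := &http.Transport{}
newClient, err := elastic.NewClient(
    elastic.SetURL("https://my-domain.eu-central-1.es.amazonaws.com"),
    elastic.SetHttpClient(&http.Client{Transport: transport}),
    elastic.SetSniff(false),
    elastic.SetHealthcheck(false),
)
if err != nil {
    log.Fatal(err)
}
_ = newClient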


g-wilson commented Jun 7, 2021

I'm also running into the exact same issue with AWS.

Instead of doing a full reconnect / new client, you can call the CloseIdleConnections method on the *http.Transport that you pass to the client itself.

I'm not proud of this, but we do that on a 15 second interval and it works a treat 🤦‍♂️
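
In code, that workaround looks roughly like this (placeholder URL; 15 seconds is just the interval mentioned above):

// Keep a handle on the transport so its idle connections can be closed later.
transport := &http.Transport{}
httpClient := &http.Client{Transport: transport}

client, err := elastic.NewClient(
    elastic.SetURL("https://my-domain.eu-central-1.es.amazonaws.com"),
    elastic.SetHttpClient(httpClient),
    elastic.SetSniff(false),
    elastic.SetHealthcheck(false),
)
if err != nil {
    log.Fatal(err)
}
_ = client

// Periodically drop idle keep-alive connections so the next request dials
// again and picks up any new IPs from DNS.
go func() {
    for range time.Tick(15 * time.Second) {
        transport.CloseIdleConnections()
    }
}()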

@chrisharrisonkiwi

I'm also running into the exact same issue with AWS.

Instead of doing a full reconnect / new client, you can call the CloseIdleConnections method on the *http.Transport that you pass to the client itself.

I'm not proud of this, but we do that on a 15 second interval and it works a treat 🤦‍♂️

Yup this works also. A little bit cleaner than the fresh client approach I guess.

olivere added a commit that referenced this issue Jul 8, 2021
This commit adds a configuration option `SetCloseIdleConnections` to a
client. The effect of enabling it is that whenever the Client finds a
dead node, it will call `CloseIdleConnections` on the underlying HTTP
transport.

This is useful for e.g. AWS Elasticsearch Service. When AWS ES
reconfigures the cluster, it may change the underlying IP addresses
while keeping the DNS entry stable. If the Client did _not_ close idle
connections, the underlying HTTP client would re-use existing HTTP
connections and keep using the old IP addresses. See #1091 for a
discussion of this problem.

The commit also illustrates how to connect to an AWS ES cluster in the
recipes in
[`recipes/aws-mapping-v4`](https://github.com/olivere/elastic/tree/release-branch.v7/recipes/aws-mapping-v4)
and
[`recipes/aws-es-client`](https://github.com/olivere/elastic/tree/release-branch.v7/recipes/aws-es-client).
See the `ConnectToAWS` method for a blueprint of how to connect to an
AWS ES cluster.

See #1091

olivere commented Jul 8, 2021

I've been looking into this and am experimenting with an additional elastic.SetCloseIdleConnections(true|false) configuration option for elastic.NewClient. When enabled, the PerformRequest method will automatically close idle connections in the underlying HTTP transport whenever it finds a dead node. This should make sure that the client picks up the new IP address whenever the AWS ES cluster reconfigures in any of the specified configuration changes.
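
If it lands as described, usage would presumably look something like this (placeholder URL):

client, err := elastic.NewClient(
    elastic.SetURL("https://my-domain.eu-central-1.es.amazonaws.com"),
    elastic.SetSniff(false),
    elastic.SetHealthcheck(false),
    elastic.SetCloseIdleConnections(true), // close idle connections whenever a dead node is found
)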

If some of you could look into this and give it a thumbs up, #1507 might land in one of the next releases.
