
HAProxy/pfSense client ephemeral port exhaustion #275

Closed
1 of 7 tasks
jefflill opened this issue Jul 23, 2018 · 1 comment

jefflill commented Jul 23, 2018

The current neonHIVE configuration can run into Linux SNAT/DNAT port exhaustion issues under medium to high network load. This problem can surface due to the Docker ingress/mesh network DNAT iptables rules, but it can also happen in other places, such as the pfSense DMZ load balancer rules that direct external traffic to cluster nodes.

There appear to be two somewhat related problems:

  1. At high load, traffic proxied by a load balancer or transformed by DNAT will share the same source IP, so only the source port can be varied when establishing a connection to the backend server. When the backend connection is closed, the source port enters the TIME_WAIT state for 2 minutes (on Linux) and cannot be reused during this time. neonHIVE currently configures the kernel to allocate ephemeral ports in the range 9000-65535 (56,536 ports), so assuming each backend connection is closed immediately and its source port goes into TIME_WAIT, the maximum rate is 56536/120 ≈ 471 connections/sec per hive host (see the sketch after this list).

  2. There also appears to be a Linux kernel race condition that can cause two inbound connections to be assigned the same DNAT source port, resulting in dropped SYN packets and then retransmission delays. This is discussed in detail in the second link below. Note that this is not a Docker-specific issue; it happens in Kubernetes too.
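
For reference, here's a quick way to check the relevant kernel setting on a hive host and redo the arithmetic for problem 1 (a sketch only; the port range shown assumes the current neonHIVE configuration):

    # Show the ephemeral (local) port range the kernel uses for outbound connections.
    sysctl net.ipv4.ip_local_port_range        # e.g. "9000 65535" under the current config

    # Rough upper bound on new backend connections/sec from a single source IP, assuming
    # every connection is closed immediately and its port sits in TIME_WAIT for 120s:
    #   (65535 - 9000 + 1) / 120 ≈ 471 connections/sec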

There are some possible mitigations:

  • Have neon-proxy-manager configure backend HTTP connections to remain alive wherever possible. I believe this is the default, but I should verify that I'm not disabling it (perhaps making this a load balancer rule option). This should go a long way towards preventing source port exhaustion in the neon-proxy-public and neon-proxy-private containers (see the HAProxy sketch after this list).

  • I am not currently configuring the source port range in the public or private proxy containers (I assumed this would be picked up from the hive host); I now doubt that this is actually true. In any case, I should modify the neon-proxy container to use sysctl to set net.ipv4.ip_local_port_range = 1024 65535, which is probably the largest usable range.

  • NOTE: I tried setting net.ipv4.ip_local_port_range in the HAProxy Dockerfiles and also live within a running container. This doesn't work for Docker containers by design, since the namespaced container network stack is managed by the Docker engine. It looks like it's possible to pass --sysctl options to docker run, but this option is not available for services. So it looks like roughly 32K ports is all we can get.
  • Have neon-proxy-manager keep inbound HTTP connections alive too. I believe I'm currently closing these connections, which will result in possible port exhaustion at the pfSense load balancer as well as potentially poor latency due to having to establish new connections.

  • It appears that it will be possible (in the future) to mitigate issue 2 above by having Docker specify the NF_NAT_RANGE_PROTO_RANDOM_FULLY flag when generating its DOCKER-INGRESS DNAT rules. The latest version of iptables supports the --random-fully option, but that version of iptables isn't available on the current hosts and Docker isn't currently generating this option anyway. One possible hack, if this becomes unbearable, might be to munge the iptables DNAT module so that it always sets this flag and deploy that to the hive hosts and perhaps even the pfSense boxes (see the illustrative rule after this list).

  • We can also look into having HAProxy route traffic to each backend via multiple network interfaces. neon-proxy-manager could generate these backend definitions automatically, but I wonder whether it's possible to have Docker assign more than one interface to a container. We might also do the same thing with pfSense by assigning multiple network interfaces and munging the HAProxy backends (perhaps by hand).

  • If we're not able to assign multiple IPs to an HAProxy container, we can also simply deploy more of these containers to accomplish the same thing (at the cost of additional backend health checks).

  • There's something called iproute2 that looks like it can be used to mitigate port starvation. I don't understand this yet, but it appears that you can assign additional IP addresses to an interface. I wonder whether this would work in a container.
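
As a concrete reference for the keep-alive bullets above, here's a minimal sketch of the HAProxy directives involved. The backend/server names, address, and timeout are placeholders, not actual neon-proxy-manager output:

    defaults
        mode http
        option http-keep-alive            # keep client-side connections open between requests
        timeout http-keep-alive 30s       # illustrative idle timeout

    backend placeholder_backend
        http-reuse safe                   # let requests share idle server-side connections (HAProxy 1.6+)
        server web1 10.0.0.11:80 check    # placeholder backend address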

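And for the --random-fully idea, this is roughly what a masquerade rule with fully randomized source port selection looks like (illustrative only; it needs iptables 1.6.2+ plus kernel support for NF_NAT_RANGE_PROTO_RANDOM_FULLY, and the subnet is just an example of an ingress network range):

    # Example only: fully randomize SNAT source ports for traffic leaving the ingress network.
    iptables -t nat -A POSTROUTING -s 10.255.0.0/16 -j MASQUERADE --random-fully
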
Here are the links I found while researching this:

https://stackoverflow.com/questions/10085705/load-balancer-scalability-and-max-tcp-ports
https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02
https://github.com/tsenart/vegeta (the vegeta load generator)

moby/moby#35082
http://archive.linuxvirtualserver.org/html/lvs-devel/2015-10/msg00067.html
https://medium.freecodecamp.org/how-we-fine-tuned-haproxy-to-achieve-2-000-000-concurrent-ssl-connections-d017e61a4d27
https://www.linangran.com/?p=547

The first two links really describe the problem. The third link is to the vegeta load generator project, which looks like a better tool than the Apache load generator we've been using.
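
For anyone repeating the load tests, typical vegeta usage looks something like this (the target URL and rate are placeholders):

    # Drive 500 requests/sec at a test endpoint for 30 seconds and print a latency report.
    echo "GET http://test-host/" | vegeta attack -rate=500 -duration=30s | vegeta report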


jefflill commented Jul 23, 2018

I have confirmed that the neon-proxy-based containers do not inherit the host machine's net.ipv4.ip_local_port_range setting. We'll need to try setting this in the Dockerfile by modifying /etc/sysctl.conf (or perhaps /etc/sysctl.d/00-alpine.conf).

EDIT: You can set kernel parameters for containers using docker run --sysctl, and there's a way to do this in a Docker stack, but there is no implementation for straight services. Here are the tracking issues:

moby/moby#25303 <-- EPIC
moby/moby#25209 <-- REQUEST
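
For reference, the working docker run form and the compose-style stanza mentioned above look roughly like this (the image name and values are placeholders; as noted, there's no equivalent for plain services yet):

    # Works for standalone containers:
    docker run -d --sysctl net.ipv4.ip_local_port_range="1024 65535" my-proxy-image

    # Compose/stack file equivalent (per service):
    #   sysctls:
    #     net.ipv4.ip_local_port_range: "1024 65535"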
