
[Feature]: Allow user to specify listening ip address for milvus components #26538

Open · 1 task done
LoveEachDay opened this issue Aug 22, 2023 · 19 comments
Labels: kind/feature Issues related to feature request from users

@LoveEachDay
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Is your feature request related to a problem? Please describe.

By default, Milvus components use funcutil.GetLocalIP() to initialize the gRPC listening IP address, which may not work when the pod/container has multiple network interfaces, such as under Docker Swarm; see also: #25032

func (p *grpcConfig) init(domain string, base *BaseTable) {
	p.Domain = domain
	p.IP = funcutil.GetLocalIP()

	p.Port = ParamItem{
		Key:          p.Domain + ".port",
		Version:      "2.0.0",
		DefaultValue: strconv.FormatInt(ProxyExternalPort, 10),
		Export:       true,
	}
	p.Port.Init(base.mgr)
	// ...
}

// GetLocalIP returns the local ip address
func GetLocalIP() string {
	addrs, err := net.InterfaceAddrs()
	if err == nil {
		for _, addr := range addrs {
			ipaddr, ok := addr.(*net.IPNet)
			if ok && ipaddr.IP.IsGlobalUnicast() && ipaddr.IP.To4() != nil {
				return ipaddr.IP.String()
			}
		}
	}
	return "127.0.0.1"
}
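
For illustration, here is a standalone snippet (not from the Milvus codebase) that prints every address the loop above would accept; on a multi-homed container, such as a Swarm task with eth0 and eth1, several candidates qualify and GetLocalIP silently returns whichever one the OS lists first:

package main

import (
	"fmt"
	"net"
)

func main() {
	addrs, err := net.InterfaceAddrs()
	if err != nil {
		fmt.Println("would fall back to 127.0.0.1:", err)
		return
	}
	for _, addr := range addrs {
		ipnet, ok := addr.(*net.IPNet)
		// Same filter as GetLocalIP: IPv4 global unicast addresses.
		if ok && ipnet.IP.IsGlobalUnicast() && ipnet.IP.To4() != nil {
			fmt.Println("candidate:", ipnet.IP.String())
		}
	}
}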

Describe the solution you'd like.

In milvus.yaml we can add an address config for each component, allowing the user to specify a routable address like this:

# configs/milvus.yaml
queryCoord:
   address: <ip address>
   port: 19531

Describe an alternate solution.

No response

Anything else? (Additional Context)

No response

@LoveEachDay LoveEachDay added the kind/feature Issues related to feature request from users label Aug 22, 2023
@LoveEachDay
Contributor Author

/assign @jiaoew1991

@xiaofan-luan
Copy link
Contributor

/assign @smellthemoon
Can you help with it?

@congqixia
Contributor

@LoveEachDay will "0.0.0.0" do for the docker swarm case?

@LoveEachDay
Contributor Author

@congqixia 0.0.0.0 may not work.
Each Milvus component stores its serving IP address in etcd so that other components can reach it. If a component listens on all network interfaces, which IP address should we store in the session?

We have two problems here:

  1. which IP address to listen on,
  2. the IP address must be routable.

Binding to 0.0.0.0 may solve (1), but does not solve (2).
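
A minimal sketch of why the two concerns are separate (hypothetical, self-contained Go; the 10.0.1.186 address is a placeholder): binding can use the wildcard address, but the session stored in etcd needs one concrete, routable IP for peers to dial.

package main

import (
	"fmt"
	"log"
	"net"
)

func main() {
	// (1) Bind address: 0.0.0.0 accepts connections on every interface.
	lis, err := net.Listen("tcp", "0.0.0.0:19531")
	if err != nil {
		log.Fatal(err)
	}
	defer lis.Close()

	// (2) Advertise address: peers reading the session from etcd need a
	// concrete, routable IP; 0.0.0.0 gives them nothing to dial.
	advertiseIP := "10.0.1.186" // placeholder: must be reachable by peers
	fmt.Printf("listening on %v, advertising %s:19531\n", lis.Addr(), advertiseIP)
}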

@smellthemoon
Contributor

working on it

@ieugen

ieugen commented Sep 20, 2023

@LoveEachDay : I believe in other distributed systems there is an --advertise-address option where you can set what address to advertise to clients.

--advertise-address string
  | The IP address on which to advertise the apiserver to members of the cluster. This address must be reachable by the rest of the cluster. If blank, the --bind-address will be used. If --bind-address is unspecified, the host's default interface will be used.
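
As a hedged sketch, that fallback chain could be expressed like this in Go (resolveAdvertiseAddr is a hypothetical helper; GetLocalIP is the Milvus function quoted earlier):

// resolveAdvertiseAddr mirrors the kube-apiserver rule quoted above: use the
// advertise address if set, else the bind address (unless it is the
// wildcard), else fall back to the host's default interface.
func resolveAdvertiseAddr(advertise, bind string) string {
	if advertise != "" {
		return advertise
	}
	if bind != "" && bind != "0.0.0.0" {
		return bind
	}
	return GetLocalIP() // host's default interface
}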

@ieugen

ieugen commented Sep 20, 2023

In the case of Docker and Kubernetes, IP addresses are dynamically allocated.

It might make sense to be able to specify a CIDR network range to filter the IP, or a network interface name as well.
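
A hedged sketch of the interface-name variant (illustrative only, not existing Milvus code):

import (
	"fmt"
	"net"
)

// ipByInterfaceName returns the first IPv4 address bound to the named
// interface (e.g. "eth0"), or an error if none is found.
func ipByInterfaceName(name string) (string, error) {
	iface, err := net.InterfaceByName(name)
	if err != nil {
		return "", err
	}
	addrs, err := iface.Addrs()
	if err != nil {
		return "", err
	}
	for _, addr := range addrs {
		if ipnet, ok := addr.(*net.IPNet); ok && ipnet.IP.To4() != nil {
			return ipnet.IP.String(), nil
		}
	}
	return "", fmt.Errorf("no IPv4 address on interface %s", name)
}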

@LoveEachDay
Contributor Author

@LoveEachDay : I believe in other distributed systems there is an --advertise-address option where you can set what address to advertise to clients.

--advertise-address string
  | The IP address on which to advertise the apiserver to members of the cluster. This address must be reachable by the rest of the cluster. If blank, the --bind-address will be used. If --bind-address is unspecified, the host's default interface will be used.

@ieugen We use that approach to set the advertise address. In Kubernetes only one pod IP address is assigned, so you can leave the address field empty and the default IP address will be used.

Passing a CIDR may not be feasible. It is the user's responsibility to make sure the IP address is routable and can be statically assigned.

@ieugen

ieugen commented Oct 9, 2023

@LoveEachDay

I think some filtering could be implemented so that the code would work like this:
(See the original code below my pseudocode attempt; I don't know Go.)

func GetIP(ip string, cidr string) string {
	if len(ip) == 0 {
		return GetLocalIP(cidr)
	}
	return ip
}

// GetLocalIP returns the first local IP address that passes the
// filtering options (here: an optional CIDR), or 127.0.0.1 as a fallback.
func GetLocalIP(cidr string) string {
	addrs, err := net.InterfaceAddrs()
	if err == nil {
		for _, addr := range addrs {
			ipaddr, ok := addr.(*net.IPNet)
			if ok && matchIPOptions(ipaddr, cidr) {
				return ipaddr.IP.String()
			}
		}
	}
	return "127.0.0.1"
}

// matchIPOptions keeps the original checks (IsGlobalUnicast, IPv4) and
// additionally requires the IP to fall inside the user-supplied CIDR,
// if one was configured.
// https://pkg.go.dev/net#ParseCIDR parses the CIDR into a *net.IPNet;
// https://pkg.go.dev/net#IPNet.Contains checks whether the IP is inside it.
func matchIPOptions(ipaddr *net.IPNet, cidr string) bool {
	if !ipaddr.IP.IsGlobalUnicast() || ipaddr.IP.To4() == nil {
		return false
	}
	if cidr == "" {
		return true // no filter configured: keep the legacy behavior
	}
	_, subnet, err := net.ParseCIDR(cidr)
	if err != nil {
		return false // an invalid CIDR matches nothing
	}
	return subnet.Contains(ipaddr.IP)
}

Original code here
https://github.com/milvus-io/milvus/blob/eca79d149cc2d5522f43a1f2c5f2a5bfcc71c8ca/pkg/util/funcutil/func.go#L62C13-L62C13

func GetIP(ip string) string {
	if len(ip) == 0 {
		return GetLocalIP()
	}
	return ip
}

// GetLocalIP return the local ip address
func GetLocalIP() string {
	addrs, err := net.InterfaceAddrs()
	if err == nil {
		for _, addr := range addrs {
			ipaddr, ok := addr.(*net.IPNet)
			if ok && ipaddr.IP.IsGlobalUnicast() && ipaddr.IP.To4() != nil {
				return ipaddr.IP.String()
			}
		}
	}
	return "127.0.0.1"
}

@ieugen

ieugen commented Oct 26, 2023

@LoveEachDay : I have added more information connected to this issue #25032 (comment).

In Docker Swarm, adding healthchecks to Milvus makes the service run eventually: it restarts until it binds to the proper interface / IP address. I guess this is due to how the IP is selected.

Perhaps the different components each go through the IP selection process and reach different results => failures?
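
For reference, the workaround looks roughly like this in a compose/stack file (a sketch; the port-9091 /healthz endpoint matches Milvus's published compose files, but verify it against your version):

# docker-compose / swarm stack sketch: restart Milvus until it comes up healthy
services:
  standalone:
    image: milvusdb/milvus:v2.3.2
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9091/healthz"]
      interval: 30s
      timeout: 20s
      retries: 3
    deploy:
      restart_policy:
        condition: on-failure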

@xiaofan-luan
Contributor

This should be supported in Milvus 2.3.2.

@ieugen

ieugen commented Oct 27, 2023

Thank you @xiaofan-luan. I am curious which parts. For Docker Swarm, since we can't specify the container IP, we would need a CIDR-based filter (we specify the CIDR the app should use to filter interfaces; when it finds an IP in that CIDR range, it should bind to it).

There is an old issue with swarm and IP addresses for containers: moby/moby#24170 .

To give an idea, this is what ip addr shows inside a container.
Container IP addresses are randomly assigned: one is the IP inside the docker_gwbridge network and the other is the container's IP for its associated network.

root@dev_db-1-dev1:/# ip addr 
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
678: eth0@if679: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc noqueue state UP group default 
    link/ether REDACTED brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.0.1.186/24 brd 10.0.1.255 scope global eth0
       valid_lft forever preferred_lft forever
680: eth1@if681: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether REDACTED brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet 172.18.0.3/16 brd 172.18.255.255 scope global eth1
       valid_lft forever preferred_lft forever

@ieugen

ieugen commented Feb 26, 2024

On a server with an IPv6 address, the service restarts continuously.
So my workaround for the IP binding bug does not work.
I had to set up a special IPv4-only server for Milvus.
If it were up to me I would ban Milvus, but I only manage the apps, so I have to put up with it.

@xiaofan-luan
Contributor

How do you deploy any database if you can specify which port to use? For example, pg will use 5432.

@congqixia 0.0.0.0 may not work. Each Milvus component stores its serving IP address in etcd so that other components can reach it. If a component listens on all network interfaces, which IP address should we store in the session?

We have two problems here:

  1. which IP address to listen on,
  2. the IP address must be routable.

Binding to 0.0.0.0 may solve (1), but does not solve (2).

The internal design of Milvus works fine: it stores both the endpoint and the port.

I think the problem is how the SDK connects to Milvus.

@xiaofan-luan
Contributor


On a server with an IPv6 address, the service restarts continuously, so my workaround for the IP binding bug does not work. I had to set up a special IPv4-only server for Milvus. If it were up to me I would ban Milvus, but I only manage the apps, so I have to put up with it.

None of us is a Docker Swarm expert, and Milvus is designed primarily for K8s.
Feel free to open a PR on how to deploy Milvus on Docker Swarm; this could definitely help more users.

@ieugen

ieugen commented Feb 27, 2024

hi @xiaofan-luan ,

Thanks for replying.
It's not Docker Swarm specific.
The issue is how Milvus selects the interface to bind to / connect to other services.
Right now it's completely non-deterministic and non-configurable.
So on any system with more than one network interface it is likely to fail.

It's more prevalent in Docker Swarm because Swarm has one network interface per network (two plus localhost, minimum) and I can't do anything about that, unfortunately (because of Swarm limitations, not Milvus).
This is unfortunate for me because I have to deal with it.

Yesterday I felt very helpless when I had to "fix" this after a 12-hour weekend upgrade of our servers.
I was tired, but everything was working OK - except Milvus.
It managed to stir up a lot of frustration.

The logic for selecting the IP is exemplified above.
As you can see, it's a for loop that takes the first network interface it finds.
If the OS list is random, it might work or it might not.
If the OS list is sorted and the first valid interface is not the right one, it will always fail.

#26538 (comment)

I considered fixing this myself, but I don't know Go and I don't work with Milvus directly.
I just deploy and maintain it - and have to deal with the outages, etc.

For some reason it started working after a few hours and a lot of config tweaking.
I am afraid to restart it and migrate it, but I do have to do server maintenance, so there is no avoiding that.
The only solution IMO would be a fix where users can select the network interface either by name or, better IMO, by matching the interface IP against a CIDR subnet supplied by the user.

The user supplies a 10.11.0.0/24 CIDR (or an IPv6 one) and Milvus checks whether a network interface has an IP in that range.
If no such interface is found, an error is thrown (since the user requested a specific network interface).

The code IMO should look something like this:

package main

import (
	"fmt"
	"net"
)

func main() {
	network := "192.168.5.0/24"
	clientips := []string{
		"192.168.5.1",
		"192.168.6.0",
	}
	// Parse the user-supplied CIDR into a subnet.
	_, subnet, _ := net.ParseCIDR(network)
	for _, clientip := range clientips {
		ip := net.ParseIP(clientip)
		// Keep only the IPs that fall inside the subnet.
		if subnet.Contains(ip) {
			fmt.Println("IP in subnet", clientip)
		}
	}
}

@xiaofan-luan
Contributor

The user supplies a 10.11.0.0/24 CIDR (or an IPv6 one) and Milvus checks whether a network interface has an IP in that range. If no such interface is found, an error is thrown (since the user requested a specific network interface).

This could be an option:
add a useCIDR config and let the admin manage this.
It seems extremely difficult for a distributed database to work with Docker Swarm. Sorry about the trouble. Maybe using standalone can solve the problem somehow.

@LoveEachDay
What do you think about adding this CIDR check in addition to the IsGlobalUnicast check?

@LoveEachDay
Contributor Author

@xiaofan-luan @ieugen Adding a CIDR check would be a good option to cover the scenario where a Milvus component has multiple IPs. I would suggest adding a config option in milvus.yaml like this:

# configs/milvus.yaml
rootCoord:
  ip: ""
  cidr: ""
  port: 53100

And in pkg/util/paramtable/grpc_param.go, we extend the config initialization logic like this:

func (p *grpcConfig) init(domain string, base *BaseTable) {
	p.Domain = domain
	ipItem := ParamItem{
		Key:          p.Domain + ".ip",
		Version:      "2.3.3",
		DefaultValue: "",
		Export:       true,
	}
	ipItem.Init(base.mgr)

	// Add the cidr initialization the same way.
	cidrItem := ParamItem{
		Key:          p.Domain + ".cidr",
		Version:      "2.3.3",
		DefaultValue: "",
		Export:       true,
	}
	cidrItem.Init(base.mgr)

	// Override the GetIP util helper to take the cidr into account.
	p.IP = funcutil.GetIP(ipItem.GetValue(), cidrItem.GetValue())
	// ...
}


@ieugen

ieugen commented Mar 14, 2024

Any plans to get this feature into a release?
