Specifying network interface #260

Open
YangZhou1997 opened this issue Jan 23, 2023 · 7 comments
@YangZhou1997

YangZhou1997 commented Jan 23, 2023

Hi Xgboost_ray authors,

I was wondering whether it is possible to specify the network interface used by xgboost_ray/xgboost. I am currently running xgboost_benchmark.py on a shared testbed (https://www.cloudlab.us/) where each machine has one public network interface and one internal network interface. However, xgboost_ray/xgboost automatically chooses the public interface, which has much lower bandwidth than the internal one.

This is the ifconfig output on my machine. I would like to use interface ens1f1 instead of eno49. Is there any way to achieve that? Thanks in advance!

eno49: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 128.110.218.133  netmask 255.255.248.0  broadcast 128.110.223.255
        inet6 fe80::9af2:b3ff:fecc:350  prefixlen 64  scopeid 0x20<link>
        ether 98:f2:b3:cc:03:50  txqueuelen 1000  (Ethernet)
        RX packets 466641549  bytes 669837883826 (669.8 GB)
        RX errors 0  dropped 1  overruns 0  frame 0
        TX packets 460255655  bytes 662032874073 (662.0 GB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ens1f1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.6.1  netmask 255.255.255.0  broadcast 192.168.6.255
        inet6 fe80::9edc:71ff:fe49:a8c1  prefixlen 64  scopeid 0x20<link>
        ether 9c:dc:71:49:a8:c1  txqueuelen 1000  (Ethernet)
        RX packets 668893189  bytes 967152831349 (967.1 GB)
        RX errors 0  dropped 8357  overruns 0  frame 0
        TX packets 578906674  bytes 820686873410 (820.6 GB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 76349340  bytes 1636796444574 (1.6 TB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 76349340  bytes 1636796444574 (1.6 TB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Best,
Yang

@Yard1
Member

Yard1 commented Jan 24, 2023

If I understand correctly, you are creating a cluster manually (by calling ray start on each node). Is that right?

@YangZhou1997
Author

Yes. I use ray start to spawn ray processes on multiple machines:

# on head node (e.g., node-0 on cloudlab)
ulimit -n 65536; ray start --head --port=6379 --node-ip-address=192.168.6.1

# on worker nodes (e.g., node-1 on cloudlab)
ulimit -n 65536; ray start --address='192.168.6.1:6379' --node-ip-address=192.168.6.2

I also changed Ray's get_node_ip_address() function to always return the IP address of the internal network interface, so that Ray's actor, task, and object store communication goes through that interface.

But it seems that xgboost_ray/xgboost uses its own collective communication framework, which automatically chooses the public network interface.

Best,
Yang
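
The patched function is not shown in the thread. As a minimal sketch of one way to select the internal address, the snippet below filters a list of local addresses by subnet. The helper name pick_internal_ip is hypothetical, and the subnet 192.168.6.0/24 is an assumption inferred from the ens1f1 address and netmask in the ifconfig output above.

```python
import ipaddress


def pick_internal_ip(candidates, internal_subnet):
    """Return the first candidate address that lies inside internal_subnet.

    Hypothetical helper for illustration; not part of Ray or xgboost_ray.
    """
    network = ipaddress.ip_network(internal_subnet)
    for addr in candidates:
        if ipaddress.ip_address(addr) in network:
            return addr
    raise ValueError(f"no address in {internal_subnet} among {candidates}")


# Addresses copied from the ifconfig output earlier in this thread.
local_addresses = ["128.110.218.133", "192.168.6.1", "127.0.0.1"]

# Assumed internal subnet (from inet 192.168.6.1, netmask 255.255.255.0).
print(pick_internal_ip(local_addresses, "192.168.6.0/24"))
```

Selecting by subnet rather than by interface name keeps the same logic usable on every node, since each node has a different address but shares the internal network.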

@Yard1
Member

Yard1 commented Jan 25, 2023

Hmm, xgboost_ray should use the IP returned by get_node_ip_address. Can you run python -c "import ray; import ray.util; ray.init(); print(ray.util.get_node_ip_address())" and see what is returned on the nodes?

@YangZhou1997
Author

It returns "192.168.6.1" on node-0, and "192.168.6.2" on node-1, etc.

@Yard1
Member

Yard1 commented Jan 25, 2023

How do you detect that xgboost chooses the wrong interface? One place where extra logging can be added is xgboost_ray.main._start_rabit_tracker (print out the host IP).

@YangZhou1997
Author

I watched the per-interface traffic statistics in ifconfig and found significant traffic (e.g., ~0.5 GB/s) going through interface eno49. I also got a warning from Cloudlab that I was using too much public-facing network bandwidth.
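
The same measurement can be scripted instead of watching ifconfig by hand. The following is a rough sketch, assuming a Linux host where the kernel exposes per-interface byte counters in /proc/net/dev. The function names are hypothetical and chosen for illustration only.

```python
import time


def parse_proc_net_dev(text):
    """Map interface name -> (rx_bytes, tx_bytes) from /proc/net/dev content."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        name, fields = line.split(":", 1)
        values = fields.split()
        # Field 0 is received bytes, field 8 is transmitted bytes.
        counters[name.strip()] = (int(values[0]), int(values[8]))
    return counters


def throughput(before, after, seconds):
    """Per-interface (rx, tx) rates in bytes per second between two samples."""
    return {
        name: ((after[name][0] - rx) / seconds, (after[name][1] - tx) / seconds)
        for name, (rx, tx) in before.items()
        if name in after
    }


def sample_rates(interval=1.0, path="/proc/net/dev"):
    """Read the counters twice, interval seconds apart (Linux only)."""
    with open(path) as f:
        before = parse_proc_net_dev(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_proc_net_dev(f.read())
    return throughput(before, after, interval)
```

Calling sample_rates(1.0) on a node while training is running returns a dictionary keyed by interface name, so the rates for eno49 and ens1f1 can be compared directly.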

@Yard1
Member

Yard1 commented Jan 25, 2023

I am not an expert in Rabit, but it looks like you may need to set up OS-level routing for it to work, or disable the other network interface. I'd also consider opening an issue in the xgboost repository.
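
One way to check what the OS routing table currently does is to ask the kernel which local address it would use to reach a given peer. The sketch below uses only the standard socket module, and the helper name source_ip_for is hypothetical. It shows only the OS route selection. Whether Rabit follows that route or binds to a specific address is not established in this thread.

```python
import socket


def source_ip_for(peer_ip, port=9):
    """Return the local IPv4 address the OS would use to reach peer_ip."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        # connect() on a UDP socket sends no packets; it only asks the
        # kernel to choose a route and a local address for that route.
        s.connect((peer_ip, port))
        return s.getsockname()[0]
```

On node-0, source_ip_for("192.168.6.2") would be expected to return 192.168.6.1 if traffic to the worker is routed through ens1f1. A result of 128.110.218.133 would suggest the route goes through eno49.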
