Demo of federated learning using NVFlare #7879
Conversation
Something weird happened when I was trying to install nvflare:
$ pip install nvflare
Collecting nvflare
Using cached nvflare-2.0.16-py3-none-any.whl (797 kB)
Requirement already satisfied: numpy in /home/jiaming/.anaconda/envs/xgboost_dev/lib/python3.10/site-packages (from nvflare) (1.21.6)
Collecting grpcio
Using cached grpcio-1.46.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.4 MB)
Collecting nvflare
Using cached nvflare-2.0.15-py3-none-any.whl (797 kB)
Using cached nvflare-2.0.14-py3-none-any.whl (814 kB)
Using cached nvflare-2.0.13-py3-none-any.whl (801 kB)
Using cached nvflare-2.0.12-py3-none-any.whl (799 kB)
Using cached nvflare-2.0.11-py3-none-any.whl (788 kB)
Using cached nvflare-2.0.10-py3-none-any.whl (781 kB)
Using cached nvflare-2.0.9-py3-none-any.whl (781 kB)
Using cached nvflare-2.0.8-py3-none-any.whl (776 kB)
Using cached nvflare-2.0.7-py3-none-any.whl (776 kB)
Using cached nvflare-2.0.6-py3-none-any.whl (776 kB)
Using cached nvflare-2.0.5-py3-none-any.whl (771 kB)
Using cached nvflare-2.0.4-py3-none-any.whl (767 kB)
Using cached nvflare-2.0.3-py3-none-any.whl (762 kB)
Using cached nvflare-2.0.2-py3-none-any.whl (753 kB)
Using cached nvflare-2.0.1-py3-none-any.whl (418 kB)
Using cached nvflare-2.0.0-py3-none-any.whl (418 kB)
Using cached nvflare-1.0.2-py3-none-any.whl (510 kB)
Downloading nvflare-1.0.1-py3-none-any.whl (510 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸━━ 481.3/510.5 KB 551.6 kB/s eta 0:00:01
And then when I try to install a specific version:
$ pip install nvflare==2.0.15
Collecting nvflare==2.0.15
Using cached nvflare-2.0.15-py3-none-any.whl (797 kB)
ERROR: Could not find a version that satisfies the requirement tenseal==0.3.0 (from nvflare) (from versions: none)
ERROR: No matching distribution found for tenseal==0.3.0
So, I haven't tried the demo yet and don't know how it actually works. The code looks good to me.
What is the benefit of using nvflare here vs. the previous example?
Just trying to wrap my head around what each component is doing.
def start_controller(self, fl_ctx: FLContext):
    self._server = multiprocessing.Process(
        target=xgboost.federated.run_federated_server,
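For context, the controller launches the federated server in a separate process so it does not block the controller itself. A minimal sketch of that pattern, using a stand-in target function since `xgboost.federated.run_federated_server` requires a federated build of XGBoost (the class and port below are placeholders):

```python
import multiprocessing


def run_stub_server(port):
    # Stand-in for xgboost.federated.run_federated_server; a real
    # server would block here serving gRPC requests on this port.
    print(f"serving on port {port}")


class StubController:
    def start_controller(self):
        # Run the server in its own process so the controller stays
        # responsive while the server blocks.
        self._server = multiprocessing.Process(
            target=run_stub_server, args=(9091,))
        self._server.start()

    def stop_controller(self):
        # Terminate and reap the server process on shutdown.
        self._server.terminate()
        self._server.join()


if __name__ == "__main__":
    controller = StubController()
    controller.start_controller()
    controller.stop_controller()
```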
Is it possible to create a gRPC server at the Python layer?
Yes, we could write the gRPC server in Python, but it might have some limitations when it comes to threading. We are still talking with the NVFlare team to figure out the details, so this could change in the future.
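For background on the threading concern: a Python-layer server typically dispatches each request to a thread pool, so CPU-bound handlers contend on the GIL. The dispatch pattern can be illustrated with the standard library's threaded TCP server (used here instead of gRPC so the sketch stays dependency-free):

```python
import socket
import socketserver
import threading


class EchoHandler(socketserver.BaseRequestHandler):
    # Each connection is served on its own thread, mirroring how a
    # Python-layer RPC server hands requests to a thread pool.
    def handle(self):
        data = self.request.recv(1024)
        self.request.sendall(data)


def serve_once():
    # Port 0 lets the OS pick a free port.
    with socketserver.ThreadingTCPServer(("127.0.0.1", 0), EchoHandler) as srv:
        port = srv.server_address[1]
        t = threading.Thread(target=srv.serve_forever, daemon=True)
        t.start()
        with socket.create_connection(("127.0.0.1", port)) as conn:
            conn.sendall(b"ping")
            reply = conn.recv(1024)
        srv.shutdown()
        return reply


print(serve_once())  # b'ping'
```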
demo/nvflare/custom/trainer.py (Outdated)

    xgb.rabit.init([e.encode() for e in rabit_env])

    # Load file, file will not be sharded in federated mode.
    dtrain = xgb.DMatrix('agaricus.txt.train')
So this is not using the split data?
In prepare_data.sh we copy each split into the site-specific directory, so each site only sees its own shard. This probably looks more like a real federated environment.
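The staging step can be sketched in Python (file names, the two-site layout, and the `stage_site_data` helper are illustrative, not the actual contents of prepare_data.sh):

```python
import pathlib
import tempfile


def stage_site_data(data_dir, num_sites=2):
    # Copy each pre-split shard into its own site directory, so every
    # federated client only sees its local portion of the data.
    data_dir = pathlib.Path(data_dir)
    for i in range(num_sites):
        site = data_dir / f"site-{i + 1}"
        site.mkdir(exist_ok=True)
        shard = data_dir / f"train.part{i}"
        (site / "agaricus.txt.train").write_text(shard.read_text())


# Demo with throwaway shards in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    (root / "train.part0").write_text("row-a\nrow-b\n")
    (root / "train.part1").write_text("row-c\n")
    stage_site_data(root)
    print(sorted(p.name for p in root.iterdir()))
```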
@trivialfis NVFlare seems to have some issue with Python 3.9/3.10. I had to specify 3.8 for it to work. Added a note to the readme.
@RAMitchell for now this is a pretty "shallow" integration, but NVFlare can still provide support for managing the federated environment. We are working with the NVFlare team to figure out how to get tighter integration for better privacy etc.
demo/nvflare/custom/trainer.py (Outdated)

        f'federated_client_key={self._client_key_path}',
        f'federated_client_cert={self._client_cert_path}'
    ]
    xgb.rabit.init([e.encode() for e in rabit_env])
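For readers following along: the trainer assembles a list of `key=value` strings and encodes each to bytes before handing the list to `xgb.rabit.init`. A sketch of that assembly step — the server address, rank/world-size keys, and certificate paths below are placeholder values, not taken from the actual diff beyond the two client-certificate entries shown above:

```python
# Placeholder values standing in for the trainer's real configuration.
server_address = "localhost:9091"
client_key_path = "client-key.pem"
client_cert_path = "client-cert.pem"

rabit_env = [
    f"federated_server_address={server_address}",
    f"federated_client_key={client_key_path}",
    f"federated_client_cert={client_cert_path}",
]
# rabit expects bytes, so encode every entry before init.
encoded = [e.encode() for e in rabit_env]
print(encoded[0])  # b'federated_server_address=localhost:9091'
```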
Can we move the RabitContext from the dask module to the rabit module and reuse it here?
Done. But this changes the class name in dask. Is that what we want? Maybe we can keep the same name, RabitContext, in the dask module?
Sounds good. I made the change to your PR.
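For anyone unfamiliar with the helper being moved: RabitContext is a context manager that pairs collective init with finalize so cleanup happens even if training raises. A stand-in sketch of the pattern, with dummy state flips in place of the real `xgb.rabit.init` / `xgb.rabit.finalize` calls (which need an XGBoost install):

```python
class RabitContext:
    """Context manager pairing collective init with finalize,
    mirroring the helper moved from the dask module to rabit."""

    def __init__(self, args):
        self.args = args
        self.active = False

    def __enter__(self):
        # The real implementation calls xgb.rabit.init(self.args) here.
        self.active = True
        return self

    def __exit__(self, exc_type, exc, tb):
        # The real implementation calls xgb.rabit.finalize(); doing it
        # in __exit__ guarantees cleanup even when training raises.
        self.active = False
        return False


with RabitContext([b"federated_rank=0"]) as ctx:
    print(ctx.active)  # True
print(ctx.active)  # False
```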
A simple demo of federated learning using NVFlare.
Part of #7778