Addressing NCCL issue with binary classification for distributed training #384
Issue #, if available:
dmlc/xgboost#7982 (comment), dmlc/xgboost#8257
XGBoost 1.7 introduced a breaking change that causes an NCCL error during distributed training with binary classification. To address it, we need to set the NCCL_SOCKET_IFNAME environment variable.
Description of changes:
Set the NCCL_SOCKET_IFNAME environment variable.
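For readers unfamiliar with the variable, the following is a minimal sketch of how it could be set from Python before distributed training starts. It is not the code in this pull request: the helper name `configure_nccl_interface` is hypothetical, and the default interface name `eth0` is an assumption that depends on the host's network configuration.

```python
import os


def configure_nccl_interface(ifname: str = "eth0") -> str:
    """Set NCCL_SOCKET_IFNAME unless it is already defined.

    Hypothetical helper for illustration only. The default "eth0" is an
    assumption; use the interface that actually exists on the training host.
    Returns the value that is in effect after the call.
    """
    # setdefault keeps a value the caller exported and only fills in a
    # default when the variable is missing.
    return os.environ.setdefault("NCCL_SOCKET_IFNAME", ifname)


if __name__ == "__main__":
    # Call this before any NCCL communicator is created, i.e. before
    # distributed GPU training is initialized.
    print("NCCL_SOCKET_IFNAME =", configure_nccl_interface())
```

Using `setdefault` rather than a plain assignment is a design choice in this sketch: it leaves an operator-supplied value untouched, which can matter on hosts whose primary interface has a different name.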
Testing:
Created a test image: 900597767885.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:nikhil-test
Tested the image; it no longer throws an error for binary classification (arn:aws:sagemaker:us-west-2:900597767885:training-job/sagemaker-xgboost-2023-03-17-14-36-52-062).
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.