
Engine: kwargs #1156

Merged · 10 commits merged into master from engine-kwargs, Apr 9, 2021
Conversation

@ditwoo (Contributor) commented Apr 3, 2021

Before submitting (checklist)

  • Was this discussed/approved via a GitHub issue? (not needed for typos and docs improvements)
  • Did you read the contribution guide?
  • Did you check the code style? catalyst-make-codestyle && catalyst-check-codestyle (pip install -U catalyst-codestyle).
  • Did you make sure to update the docs? We use Google format for all the methods and classes.
  • Did you check the docs with make check-docs?
  • Did you write any new necessary tests?
  • Did you check that your code passes the unit tests (pytest .)?
  • Did you add your new functionality to the docs?
  • Did you update the CHANGELOG?
  • Did you run colab minimal CI/CD with latest and minimal requirements?

Description

Related Issue

Type of Change

  • Examples / docs / tutorials / contributors update
  • Bug fix (non-breaking change which fixes an issue)
  • Improvement (non-breaking change which improves an existing feature)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

PS

  • I know that I could join Slack for pull request discussion.

model = ApexDistributedDataParallel(model, delay_allreduce=self.delay_all_reduce)
model, optimizer = amp.initialize(model, optimizer, **self.apex_kwargs)
# TODO: kwargs for Apex DDP ?
model = ApexDistributedDataParallel(model) # , delay_allreduce=self.delay_all_reduce)
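For context, the hunk above forwards the stored apex_kwargs to apex.amp.initialize while the Apex DDP wrapper is still called without kwargs (hence the TODO). A minimal sketch of that behaviour as a standalone helper (the helper name initialize_apex is an assumption for illustration, not Catalyst API):

from apex import amp
from apex.parallel import DistributedDataParallel as ApexDistributedDataParallel


def initialize_apex(model, optimizer, apex_kwargs: dict):
    """Sketch of the hunk above: amp receives the stored kwargs
    (e.g. {"opt_level": "O1"}), while the Apex DDP wrapper gets none yet."""
    model, optimizer = amp.initialize(model, optimizer, **apex_kwargs)
    model = ApexDistributedDataParallel(model)
    return model, optimizer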
Member:
could we also add ddp_kwargs and pass them here?

Member:
in this case, we have to remove ** from the init and make apex_kwargs and ddp_kwargs dict storages
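A minimal sketch of what such dict storages could look like in the engine's __init__ (the class name DistributedEngineSketch and its defaults are assumptions for illustration, not the final Catalyst code):

import copy


class DistributedEngineSketch:
    """Illustrative only: kwargs are stored as plain dicts, not unpacked with **."""

    def __init__(self, apex_kwargs: dict = None, ddp_kwargs: dict = None):
        # deepcopy so that later defaults (e.g. device_ids) do not mutate the caller's dicts
        self.apex_kwargs = copy.deepcopy(apex_kwargs) if apex_kwargs is not None else {}
        self.ddp_kwargs = copy.deepcopy(ddp_kwargs) if ddp_kwargs is not None else {}

A caller could then pass, for example, apex_kwargs={"opt_level": "O1"} and ddp_kwargs={"delay_allreduce": True} explicitly.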

@Scitator (Member) left a comment:

the PR looks amazing; nevertheless, could we please make a few extra changes:

  • rename ddp_kwargs to dist_kwargs or process_kwargs, since they are used for torch.distributed.init_process_group
  • add true ddp_kwargs and use them for the ApexDistributedDataParallel and DistributedDataParallel wrappers
  • add an extra exception for such cases - I mean, we should raise an error if we could not wrap the model correctly (a sketch of this split follows after this comment)

Huge thanks in advance!
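A rough sketch of the requested split, assuming hypothetical names process_group_kwargs and ddp_kwargs and a plain helper function rather than the actual engine method:

import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel


def wrap_model(model, process_group_kwargs: dict, ddp_kwargs: dict):
    """Illustrative split: process-group kwargs go to init_process_group,
    DDP kwargs go to the wrapper; raise instead of silently returning an unwrapped model."""
    dist.init_process_group(**process_group_kwargs)
    try:
        return DistributedDataParallel(model, **ddp_kwargs)
    except Exception as exc:
        raise RuntimeError(f"Could not wrap the model with DistributedDataParallel: {exc}") from exc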

Comment on lines 334 to 335
if "device_ids" not in self.ddp_kwargs:
self.ddp_kwargs["device_ids"] = [self.device]
Member:

we should move this to the end of def setup_process(self, rank: int = -1, world_size: int = 1):,
because at __init__ time self.device is still None (a combined sketch follows the two suggested changes below).

os.environ["MASTER_ADDR"] = str(self.address)
os.environ["MASTER_PORT"] = str(self.port)
dist.init_process_group(self.backend, rank=self.rank, world_size=self.world_size)
dist.init_process_group(**self.process_group_kwargs)
torch.cuda.set_device(int(self._rank))
self.device = f"cuda:{int(self._rank)}"
Member:
Suggested change
- self.device = f"cuda:{int(self._rank)}"
+ self.device = f"cuda:{int(self._rank)}"
+ if "device_ids" not in self.ddp_kwargs:
+     self.ddp_kwargs["device_ids"] = [self.device]

Comment on lines 333 to 335
self.ddp_kwargs = copy.deepcopy(ddp_kwargs)
if "device_ids" not in self.ddp_kwargs:
    self.ddp_kwargs["device_ids"] = [self.device]
Member:
Suggested change
- self.ddp_kwargs = copy.deepcopy(ddp_kwargs)
- if "device_ids" not in self.ddp_kwargs:
-     self.ddp_kwargs["device_ids"] = [self.device]
+ self.ddp_kwargs = copy.deepcopy(ddp_kwargs)
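Putting the two suggested changes together, the net effect could look roughly like this sketch (the surrounding attributes such as address and port are assumptions for illustration; only the placement of the device_ids default reflects the review):

import copy
import os

import torch
import torch.distributed as dist


class DistributedEngineSketch:
    """Continues the illustrative engine sketched above; names remain assumptions."""

    def __init__(self, address="localhost", port=12345,
                 process_group_kwargs: dict = None, ddp_kwargs: dict = None):
        self.address, self.port = address, port
        self.process_group_kwargs = copy.deepcopy(process_group_kwargs or {})
        self.ddp_kwargs = copy.deepcopy(ddp_kwargs or {})
        self.device = None  # not known yet, so no device_ids default here
        self._rank, self._world_size = -1, 1

    def setup_process(self, rank: int = -1, world_size: int = 1):
        self._rank, self._world_size = rank, world_size
        os.environ["MASTER_ADDR"] = str(self.address)
        os.environ["MASTER_PORT"] = str(self.port)
        dist.init_process_group(**self.process_group_kwargs)
        torch.cuda.set_device(int(self._rank))
        self.device = f"cuda:{int(self._rank)}"
        # moved here from __init__: self.device is finally known
        if "device_ids" not in self.ddp_kwargs:
            self.ddp_kwargs["device_ids"] = [self.device]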

@Scitator merged commit 8940ef0 into master on Apr 9, 2021.
The mergify bot deleted the engine-kwargs branch on April 9, 2021 at 18:52.