
[RFC] Future of gpus/ipus/tpu_cores with respect to devices #10410

Closed
SeanNaren opened this issue Nov 8, 2021 · 25 comments · Fixed by #11040
Labels: deprecation, design, refactor

@SeanNaren (Contributor) commented Nov 8, 2021

Proposed refactoring or deprecation

Currently we have two methods of specifying devices. Let's take GPUs as an example:

  1. The standard case that we've all grown used to and are mostly aware of:
trainer = Trainer(gpus=2)
  2. Introduced in 1.5, this tries to make the device count agnostic of the accelerator: if you specify accelerator='tpu' together with devices=2, we automatically know to use 2 TPU cores.
trainer = Trainer(devices=2, accelerator='gpu')

Recently, it has come up in #10404 (comment) that we may want to deprecate and prevent further device-specific names from appearing in the Trainer (such as hpus).

Related conversation #9053 (comment)

I see two options:

🚀 We keep both the device-specific arguments (gpus, tpu_cores, ipus for the Trainer) and devices
👀 We drop gpus, tpu_cores, ipus in the future and fully rely on devices. (This would likely be done in Lightning 2.0, instead of after 2 minor releases.)
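
For concreteness, here is a minimal sketch of what the two options mean for each accelerator type, assuming the devices/accelerator flags keep their 1.5 semantics:

from pytorch_lightning import Trainer

# Option 1: device-specific flags (status quo)
Trainer(gpus=2)
Trainer(tpu_cores=8)
Trainer(ipus=4)

# Option 2: accelerator-agnostic flags introduced in 1.5
Trainer(devices=2, accelerator="gpu")
Trainer(devices=8, accelerator="tpu")
Trainer(devices=4, accelerator="ipu")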

cc @kaushikb11 @justusschock @ananthsub @awaelchli

@kaushikb11 kaushikb11 changed the title [RFC] Future of gpus/ipus/tpus with respect to devices [RFC] Future of gpus/ipus/tpu_cores with respect to devices Nov 8, 2021
@kaushikb11 kaushikb11 added design Includes a design discussion deprecation Includes a deprecation labels Nov 8, 2021
@ananthsub (Contributor) commented Nov 8, 2021

IMO we should follow the contributing guidelines: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/.github/CONTRIBUTING.md#main-core-value-one-less-thing-to-remember

Having multiple options in the public API to do the same thing is really confusing.

I'm in favor of devices=X, accelerator=Y since it's clearer how extensible this can be.

@four4fish (Contributor) commented Nov 8, 2021

+1, totally agree.

The current device-related flags are confusing: multiple flags partially overlap and interfere with each other, and when several are passed in we have to prioritize some and ignore others.

For example:
gpus=2, devices=3: devices will be ignored.
gpus=2, cpu=2, accelerator='cpu': what will happen? I think CPU with num_processes=2?

I prefer option 2: drop gpus, tpu_cores, ipus in the future and fully rely on devices.
Also, could devices accept only an int, rather than also 'auto'?

With this option, the accelerator flag covers the device type and devices (probably rename to devices_num?) covers the device count. It also scales to new device types like hpus.
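
To illustrate the kind of priority rule this forces, a hypothetical helper (not Lightning's actual accelerator_connector logic) might look like:

def resolve_device_flags(gpus=None, devices=None, accelerator=None):
    # Hypothetical sketch: when both gpus and devices are given, one flag has
    # to win; per the example above, devices is silently ignored today.
    if gpus is not None:
        devices, accelerator = gpus, "gpu"
    return devices, accelerator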

@williamFalcon (Contributor) commented Nov 9, 2021

I think going from this:

Trainer(gpus=2)

to

Trainer(devices=2, accelerator='gpu')

is a major step backwards in usability. Now users have to dig into the docs to understand how to use things. It definitely violates the "one-less-thing-to-remember" part of the API.

I guess I'm just wondering why we're exploring this? I thought we were already pretty stable on the device API.

@justusschock (Member)

@williamFalcon The more kinds of accelerators we get, the more flags we will also have. Switching from Trainer(gpus=8) to Trainer(tpu_cores=8) also requires users to dig through the docs. Actually, I find it easier to have Trainer(devices=2, accelerator='gpu'/'tpu') as the flags stay the same; it is easier to remember and also scales better. So personally this would be the "one-less-thing-to-remember" option for me.

Also, I suspect we would then have the accelerator defaulting to 'auto', which means that Trainer(devices=8) would run on GPU if available, on TPU if available, and fall back to CPU if no special accelerator is present.
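
A minimal sketch of what such an 'auto' accelerator choice could boil down to; the preference order and the TPU check are assumptions, not the actual implementation:

import torch

def auto_select_accelerator() -> str:
    # Assumed preference order: TPU > GPU > CPU.
    try:
        import torch_xla.core.xla_model  # noqa: F401  (only importable on TPU hosts)
        return "tpu"
    except ImportError:
        return "gpu" if torch.cuda.is_available() else "cpu"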

@tchaton (Contributor) commented Nov 9, 2021

@williamFalcon As @justusschock shared, the previous approach doesn't scale well and makes discoverability harder.

Furthermore, the new API provides an 'auto' value as follows:

Trainer(devices="auto", accelerator="auto")

which would make the code runnable on any hardware without changes, something that isn't possible with the previous API.

And we could even support num_nodes discovery too.

Trainer(devices="auto", accelerator="auto", num_nodes="auto")
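
For illustration, resolving devices="auto" to a concrete count could look roughly like this, under the assumption that "auto" simply means "all visible devices of the chosen accelerator":

import os
import torch

def resolve_auto_devices(accelerator: str) -> int:
    # Assumption: "auto" means "use everything visible on this machine".
    if accelerator == "gpu":
        return torch.cuda.device_count()
    if accelerator == "cpu":
        return os.cpu_count() or 1
    # TPU/IPU counts would come from their respective runtimes.
    raise NotImplementedError(f"device discovery for {accelerator} not sketched here")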

@SeanNaren (Contributor, Author)

To address the discoverability issue, isn't it common to import the Trainer and see what parameters are available? Isn't this more common than going to the docs to find the parameter?

I opened the issue because I felt it was important that, as a community, we come to an agreement, since the idea was floating around a few PRs (with inconsistent agreement). It's important to have a single direction here (especially as we introduce other accelerators). I strongly disagree with removing gpus/tpu_cores/ipus/hpus/cpus from the Trainer, primarily for ease/discoverability reasons.

I think it would be beneficial to try to get community votes on this, so maybe a post on our General Slack channel is warranted?

@ananthsub (Contributor) commented Nov 12, 2021

Even something like gpus as Lightning defines it today is ambiguous. PyTorch also supports AMD GPUs: https://pytorch.org/blog/pytorch-for-amd-rocm-platform-now-available-as-python-package/

But this isn't supported at all in Lightning, because when gpus is specified on the Trainer constructor, the Trainer assumes NVIDIA & CUDA are being used. On the other hand, PyTorch's device allows for more backends.

In my head, the accelerator in Lightning maps to the torch device being used. By using the same semantics PyTorch offers, Lightning can keep parity more easily and smooth the transition for users coming from vanilla PyTorch.
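
For reference, this is the parity being pointed at: PyTorch's ROCm builds reuse the "cuda" device type, so code written against torch.device works on both backends, while a Lightning-level gpus flag bakes in the NVIDIA assumption. A small illustration:

import torch

device = torch.device("cuda", 0)          # same spelling on CUDA and ROCm builds
x = torch.randn(8, 8, device=device)
is_rocm = torch.version.hip is not None   # None on CUDA builds, set on ROCm builds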

@tchaton (Contributor) commented Nov 15, 2021

Adding to the conversation and to @ananthsub's comment: in this issue, a user is requesting MIG support for A100-like machines: #10529.

This is another example of how arguments like gpus and tpu_cores can grow out of control, and of the need for a single devices / accelerator pair of arguments.

@daniellepintz (Contributor)

@tchaton do we have a consensus to move forward with this issue?

@zippeurfou (Contributor)

Maybe a thought that is a bit different here.
Going back to @williamFalcon's argument that:
Trainer(gpus=2) to Trainer(devices=2, accelerator='gpu') is more work for the user.
My question here is that, as far as I understand (correct me if I am wrong), you have CPU and then either GPU/TPU/HPU/...
That is why you can have something as follows:
Trainer(devices=2, accelerator='gpu') where accelerator is only one combination.
I also assume that our users "don't" want to care whether it is GPU/TPU/HPU, as their code will remain the same as long as it is not CPU; and even if it is CPU, PL helps make it seamless.
Finally, we can automatically detect what kind of accelerator is available today with auto.
That being said, what if we had a "wrapper" around anything that is non-CPU, so that we can keep the same structure while making it "easy" for the users?
i.e. Trainer(cpus=2, xpus=2) would automatically find whether x is gpu/tpu/hpu.
Then we allow the default to be auto, i.e. Trainer(cpus=None, xpus=None), or we could use -1 for example.
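
To make the proposal concrete, a hypothetical sketch (xpus is an invented flag, not an existing Lightning argument; the detection could reuse whatever 'auto' logic the accelerator flag already has):

def translate_xpus(cpus=None, xpus=None):
    # Hypothetical translation of the proposed flags onto the 1.5-style API.
    if xpus is not None:
        # "x" is whatever non-CPU accelerator this machine has available.
        return {"devices": xpus, "accelerator": "auto"}
    return {"devices": cpus, "accelerator": "cpu"}

# Trainer(**translate_xpus(xpus=2))  ->  e.g. Trainer(devices=2, accelerator="auto")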

@t-vi commented Nov 18, 2021

To take the pro-Accelerator argument to the extreme (also with the "fractional" devices), how about not splitting devices= and accelerator=?

If instantiating Accelerator all the time is too much of a hassle for @williamFalcon's taste (I never liked the configuration part of TF sessions either, and there is a good reason why PyTorch doesn't force you to do device = torch.device("cuda") all over the place but will just take "cuda"), how about:

Trainer(devices=2)   # I want two of whatever is available (so GPUs > CPUs in preference, but only of the same kind)

Occasions where "casual users" will have TPU, GPU, and IPU in the same box will be rare enough...
This is breaking because it would make "GPU if available" the default :( (though I never understood why it is not already).

For more elaborate configs, one could have

Trainer(devices=Accelerator("cuda", 2))
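
Such a spec object does not exist in Lightning today (there is an internal Accelerator class, but not in this form); as a purely hypothetical sketch it might be as small as:

from dataclasses import dataclass

@dataclass
class AcceleratorSpec:
    # Hypothetical single source of truth for "what the training runs on".
    kind: str        # "cuda", "tpu", "cpu", ...
    count: int = 1   # how many devices of that kind

# Trainer(devices=AcceleratorSpec("cuda", 2))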

My apologies for adding another color of shed, but to my mind, there are these cases we want to cater to:

  • The easy one! Needing to instantiate Accelerator is a bit more API for people to remember than just gpus=.... Personally, I have to concentrate really hard to know how many c's and l's to put in there, too.
  • The turbopropower-user: Would it not be more consistent and flexible to have Accelerator as the single source of truth about what their thing trains on? I certainly like to consider all my clusters of 512 DGXes for training BERT in 30 seconds a single device...
  • The unknown future. I think we'll see a lot more blur in the "thing the training runs on = n devices of type a" picture than the proposed API of devices=2, accelerator=... suggests.

Best regards

Thomas

@tchaton (Contributor) commented Nov 18, 2021

To add to @t-vi's comment,

I believe the accelerator could be set to 'auto' by default, as it is quite unlikely that a machine has both GPUs and TPUs available.

So the hardware is totally abstracted, and this provides an experience closer to JAX with its automatic platform detection.

Trainer(gpus=8) or Trainer(tpu_cores=8)  or Trainer(cpu=8)  or Trainer(hpus=8)  or Trainer(ipus=8) ...

would be replaced directly with:

Trainer(devices=8)

If a user has a machine with a GPU and wants to debug on CPU, they would simply pass the accelerator explicitly to force the decision.

Trainer(devices=2, accelerator="cpu")

But the most critical point is:

I think we'll see a lot more blur in the "thing the training runs on = n devices of type a" picture than the proposed API of devices=2, accelerator=... suggests.

I believe this API would need to provide a smarter hardware detection mechanism for MIG hardware.

@dlangerm

Coming from both the high-performance computing and embedded spaces, I'll weigh in here with some general thoughts.

  1. Often with large clusters, we have models and/or datasets which can't fit on a single node. If the API says gpus=2 or tpus=2, what control do I have over where those devices are or which devices get used for which parts of the model? Should PTL support this type of deployment at all?

  2. There are certain accelerators I might want to use which are only useful for inference but not training. FPGAs, for example, are really great for low-latency inference, but with the above API, do I have to instantiate a "Trainer" to use a device for inference? This makes little sense to me as a user. Is this something that PTL wants to support? If so, a rework is in order.

  3. There is research being done on heterogeneous architectures which have GPUs, DSPs, etc. available on a single node. The distribution of work and communication between these devices is non-trivial and a scheduling nightmare, but it's not too far off. Virtualized communication technology like CXL and composable infrastructure like Liquid will enable these types of nodes to "exist" in a cloud or on-prem cluster. I think PTL should be forward thinking and have these types of setups in mind, especially if it is to be adopted by the research community as a usable tool.

  4. A "device" as we think of it today (a GPU, a CPU) will likely be upended when in-memory processors come to the mainstream. (What is a "memory" device? How many "cores" does it have? It quickly loses any meaning). What about the Xilinx Versal architecture? It has many compute cores in a dynamic software-defined network fabric connected to an FPGA. It's one "device", but it's also many.

To the above points, I have a couple of suggestions:

  • The trainer should be agnostic to what it is executing on. It should be the object facilitating and orchestrating the training session (it is a trainer after all), but it shouldn't care what device is on the other end. If it does have knowledge of device-specifics, then as many of the users above pointed out, the API and argument count/complexity will explode if even just a few accelerators become mainstream and anything but basic training strategies are to be supported.

  • We should have an Accelerator API describing a device, its location, and its features (a rough, hypothetical sketch of such a descriptor follows this list). The average user shouldn't have to use this API at all, or even know it exists, and sane defaults should be set. However, it should be flexible enough to be used for cutting-edge device research. Where this accelerator API fits in the ecosystem is going to need to be decided by the community, but it shouldn't be passed to the trainer, because if I have a device which is only for inference acceleration, then it makes no sense to create a trainer.
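
A minimal, hypothetical sketch of such a descriptor; the name and fields are illustrative only, not an existing Lightning API:

from dataclasses import dataclass, field

@dataclass
class DeviceDescriptor:
    # Hypothetical descriptor in the spirit of the Accelerator API proposed above.
    kind: str                                      # "gpu", "fpga", "dsp", ...
    location: str = "localhost"                    # which node/host the device lives on
    features: dict = field(default_factory=dict)   # e.g. {"trainable": False, "low_latency": True}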

I am very interested to see where this discussion goes, and I apologize for the ramble.

@carmocca (Contributor) commented Nov 18, 2021

This discussion has extended to other related points, but to give my opinion on the original question, I fully agree with @tchaton's API vision here: #10410 (comment).

Where the original gpus, tpus, ... are deprecated and removed.

I don't think adding new options like xpus=... or devices=Accelerator("cuda", 2) should be in the cards anymore, as the new devices=2 format was just introduced in 1.5, and we would once again be deprecating newly introduced functionality for something different. There's no clear winner here and we just need to choose one approach.

do I have to instantiate a "Trainer" to use a device for inference

it shouldn't be passed to the trainer because if I have a device which is only for inference acceleration, then it makes no sense to create a trainer.

Keep in mind that the Trainer has that name because it's been the core part of Lightning since the beginning, but it's way more than a "trainer" and could be thought of as an engine; for example, we have validate, test, and predict, which are separate from the training procedure.

@williamFalcon (Contributor) commented Dec 3, 2021

A lot of great inputs! Let me start off by summarizing:

The current API was built when only GPUs were supported. Then TPUs were added. And now a few years later, we live in a world where more alternatives are starting to emerge. This is the current API.

Trainer(gpus=2)
Trainer(tpus=2)

But now, we live in a world where more than GPU|TPU devices are coming out (HPU, etc...). In this case, the proposal is to modify the API like so:

Trainer(devices=2, accelerator='tpu')

Well... we also introduced the 'auto' flag, so the actual default call would look like this:

Trainer(devices=2)

# because Trainer(devices=2, accelerator='auto') is the default

@t-vi also brought up the alternative that there could be a class in the event that configs get unwieldy

Trainer(accelerator=Accelerator("cuda", 2))

@dlangerm also brought up some more complex scenarios:

  1. Multinode training (which we've supported from day 1 @dlangerm; you specify the num_nodes argument). Today we already support selecting many configurations here, so I'm not sure what a relevant use case is.
  2. Yes, PL already supports inference. We can think about configurations during inference needing a "Trainer" (@tchaton @carmocca maybe "Trainer" needs to be renamed in 2.0)
  3. Heterogeneous hardware is an awesome upcoming research challenge that we'll be excited to tackle next year. But today, it's premature until the research matures a bit more.
  4. In-memory processors also sound promising. If you know of a real use case, happy to collaborate on working out how to do something like that.
  5. @dlangerm we do have an accelerator API (it's been there since 1.0)... it's just used internally and not exposed to the user.

Decision

So, with all that said, if there's broader community support for moving from:

Trainer(gpus=2)

to:

Trainer(devices=2, accelerator='tpu')

# default is auto
Trainer(devices=2)

Then I'm happy to back this option as it is more scalable and my only concern is "having to only remember one thing"...
So, I'd love to hear more from the community about the effect on usability.

If there are no major qualms about this and everyone's excited, let's roll it out for 2.0.
cc @tchaton @carmocca @ananthsub @daniellepintz

@daniellepintz (Contributor)

I have a question about this: if we want to roll this out for 2.0, when can we start working on it? Could we start now, for example?

@four4fish (Contributor) commented Dec 8, 2021

@tchaton @awaelchli @ananthsub What are your thoughts on when the right time for 2.0 will be? Should the Accelerator refactor and the stable Accelerator API be part of 2.0?
I think it's better to make big changes at once. I would prefer having the stable Accelerator version and the flag changes addressed in the same release. It's easier to communicate to users and reduces future confusion.

@tchaton (Contributor) commented Dec 8, 2021

Hey @daniellepintz @four4fish.

Yes, I agree with you both. I don't believe this change requires a Lightning 2.0 as this is a natural evolution of Lightning becoming hardware-agnostic directly at the Trainer level.

IMO, I would like to action this change for v1.6. @ananthsub @awaelchli @carmocca @kaushikb11 @justusschock Any thoughts on this? If we are all positive about making this change, I will mark this issue as "let's do it".

@kaushikb11 (Contributor)

Agreed. We could go ahead with this change for v1.6, along with the major Accelerator refactor.

@kaushikb11 kaushikb11 self-assigned this Dec 8, 2021
@dlangerm commented Dec 8, 2021

Given the above decisions, is there a consensus on renaming Trainer to something more appropriate for the "Brain" or "Engine" that it has become?

If these changes are towards a hardware-agnostic API that can be used for either training or inference, Trainer will become very confusing. Even today, creating a Trainer instance to perform Trainer.predict is fairly unintuitive.

@justusschock (Member)

@dlangerm this is not really related to this issue, so I won't go into much detail here. Feel free to open a new issue for this discussion.

From my POV, we shouldn't rename the Trainer before 2.0.
Renaming flags is one thing (and we will need a pretty long deprecation cycle for those), but renaming major components such as the Trainer or LightningModule would be too much of a breaking change, since it could also break the API in many other places.

@daniellepintz (Contributor)

Hey @kaushikb11, I saw you assigned yourself to this issue. I was planning on working on the accelerator_connector refactor (#10422), which was blocked by this issue. Am I okay to proceed with the accelerator_connector refactor, or is that something you were planning on doing?

@daniellepintz (Contributor)

I am working on this in #11040 - do we also want to deprecate num_processes and num_nodes?

@justusschock (Member)

I think num_processes yes, but num_nodes we might still need in case of multi-node training
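
Assuming the replacement follows the same pattern as the other flags discussed above, the mapping would presumably look like:

# Before (to be deprecated)
Trainer(num_processes=2)

# After
Trainer(devices=2, accelerator="cpu")

# num_nodes stays as-is for multi-node runs
Trainer(devices=8, accelerator="gpu", num_nodes=4)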

@daniellepintz (Contributor)

Got it, thanks!

Status: Accepted