
Use Accelerate in from_pretrained for big model inference #17341

Merged
merged 16 commits into main from from_pretrained_big_model on May 23, 2022

Conversation

@sgugger (Collaborator) commented May 18, 2022

What does this PR do?

This PR is a first draft for using the newly released big model inference APIs from Accelerate inside from_pretrained. For now it does this with the option low_cpu_mem_usage=True and:

  • instantiates the model inside the context manager that initializes empty weights (faster and less memory-intensive)
  • has the same behavior as before if no device_map is passed
  • otherwise puts each model weight on the specified device as the loading is done and properly sets the hooks so that the model can still be used normally. As with Accelerate, device_map="auto" will auto-infer a proper device map from the available GPU RAM and CPU RAM.

This PR is just a first step; there is a bit more cleanup to do, namely:

  • move the utils flagged as belonging in Accelerate there and, once a new release of Accelerate is out, use them from there
  • clean up some old code (like move_model_to_meta_device)

Example of use:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0pp", revision="sharded", device_map="auto")

tokenizer = AutoTokenizer.from_pretrained("bigscience/T0pp")
inputs = tokenizer("Task: copy but say the opposite. PSG won its match against Barca.", return_tensors="pt")
inputs = inputs.to(0)

output = model.generate(inputs["input_ids"])
tokenizer.decode(output[0].tolist())

Still missing:

  • integration test
  • doc
  • add the "block" attribute to more model classes

setattr(submodule, param_name, new_val)
for param_name, param in state_dict.items():
# First part of the test is always true as loaded_state_dict_keys always contains state_dict keys.
if param_name not in loaded_state_dict_keys or param_name not in expected_keys:
sgugger (Collaborator Author)

The first part of the test is left the same as before but, as said in the comment, it shouldn't be necessary since:

  • loaded_state_dict_keys = state_dict.keys() when the checkpoint is one file
  • loaded_state_dict_keys contains state_dict.keys() when the checkpoint is sharded

raise ValueError(f"{param_name} doesn't have any device set.")
param_device = device_map[module_name]

set_module_tensor_to_device(model, param_name, param_device, value=param)
sgugger (Collaborator Author)

This single line does the same thing as before, but using Accelerate. What's above is just:

  • using the right dtype
  • finding the right device

What's below deals with disk offload or temp offload of the CPU state dict.
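
For reference, a minimal sketch of what such a helper does under the hood (a simplification, not Accelerate's actual implementation): walk the dotted parameter name down to the owning submodule and re-register the tensor on the target device.

import torch
import torch.nn as nn

def set_tensor_to_device(model: nn.Module, param_name: str, device, value: torch.Tensor):
    # Simplified sketch: register `value` as `param_name` of `model`, placed on `device`.
    module = model
    *parents, leaf = param_name.split(".")
    for name in parents:
        module = getattr(module, name)
    old = getattr(module, leaf)
    new_value = value.to(device)
    if isinstance(old, nn.Parameter):
        new_value = nn.Parameter(new_value, requires_grad=old.requires_grad)
    setattr(module, leaf, new_value)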

@@ -870,6 +947,7 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMix
base_model_prefix = ""
main_input_name = "input_ids"
_auto_class = None
_no_split_modules = None
sgugger (Collaborator Author)

New attribute to set on all models (for now GPT-J and T5 are given as examples) that specifies the blocks that should not be split across devices.
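
For example, on a model class this looks roughly as follows (block names taken from the T5 and GPT-J examples mentioned above; treat the exact values as illustrative):

from transformers import PreTrainedModel

class T5PreTrainedModel(PreTrainedModel):
    # A T5Block carries the residual connections, so it must stay on a single device.
    _no_split_modules = ["T5Block"]

class GPTJPreTrainedModel(PreTrainedModel):
    _no_split_modules = ["GPTJBlock"]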

Contributor

Sorry, just to understand: why should certain blocks not be split across devices?

sgugger (Collaborator Author)

If you split a GPTBlock across devices, the residual connection (initial input of the block) added at the end will create a device mismatch.
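
A toy illustration of the failure mode (hypothetical block, not the actual GPT-J code): if the two halves of a block land on different GPUs, the residual addition at the end mixes tensors from two devices.

import torch.nn as nn

class ToyBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.Linear(dim, dim)  # imagine the device_map puts this on cuda:0
        self.mlp = nn.Linear(dim, dim)   # ...and this on cuda:1

    def forward(self, hidden_states):
        residual = hidden_states                                            # stays on cuda:0
        hidden_states = self.attn(hidden_states)                            # cuda:0
        hidden_states = self.mlp(hidden_states.to(self.mlp.weight.device))  # cuda:1
        # cuda:0 + cuda:1 -> RuntimeError: expected all tensors to be on the same device
        return residual + hidden_states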

elif low_cpu_mem_usage:
init_contexts.append(init_empty_weights())

with ContextManagers(init_contexts):
sgugger (Collaborator Author)

Same as before + the no_init_weights context manager, but cleaner (IMO).
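
In case ContextManagers is unfamiliar, it just enters every context manager in the list, much like contextlib.ExitStack. A rough hand-rolled equivalent (the instantiation arguments below are illustrative):

from contextlib import ExitStack

# Equivalent of `with ContextManagers(init_contexts):`
with ExitStack() as stack:
    for ctx in init_contexts:
        stack.enter_context(ctx)
    model = cls(config, *model_args, **model_kwargs)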

Comment on lines 2140 to 2142
if device_map == "auto":
no_split_modules = [] if model._no_split_modules is None else model._no_split_modules
device_map = infer_auto_device_map(model, no_split_module_classes=no_split_modules, dtype=torch_dtype)
sgugger (Collaborator Author)

This is where the auto device map is built.
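
For illustration, the same call can also be made by hand with an explicit memory budget per device (infer_auto_device_map and max_memory are part of Accelerate's big model inference API; the numbers below are made up):

import torch
from accelerate import infer_auto_device_map

# Cap GPU 0 at 10GiB and the CPU at 30GiB; whatever doesn't fit gets mapped to "disk".
device_map = infer_auto_device_map(
    model,
    max_memory={0: "10GiB", "cpu": "30GiB"},
    no_split_module_classes=["T5Block"],
    dtype=torch.float16,
)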

Comment on lines +2349 to +2355
offload_index = {} if device_map is not None and "disk" in device_map.values() else None
if offload_state_dict:
state_dict_folder = tempfile.mkdtemp()
state_dict_index = {}
else:
state_dict_folder = None
state_dict_index = None
sgugger (Collaborator Author)

For offloaded weights (either on disk or the temporary offload of the CPU state dict), this index contains the map param_name -> metadata (shape and dtype).
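
Concretely, the index is a plain dict along these lines (a sketch of the metadata layout described above; parameter names and shapes are made up):

offload_index = {
    "encoder.block.0.layer.0.SelfAttention.q.weight": {"dtype": "float16", "shape": [4096, 4096]},
    "encoder.block.0.layer.0.SelfAttention.k.weight": {"dtype": "float16", "shape": [4096, 4096]},
    # ... one entry per offloaded parameter
}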

else:
error_msgs += _load_state_dict_into_model(model_to_load, state_dict, start_prefix)

# force memory release
del state_dict
gc.collect()

save_offload_index(offload_index, offload_folder)
sgugger (Collaborator Author)

Save the index for disk offload if necessary.

Comment on lines +2395 to +2398
if offload_state_dict:
# Load back temporarily offloaded state dict
load_offloaded_weights(model, state_dict_index, state_dict_folder)
shutil.rmtree(state_dict_folder)
sgugger (Collaborator Author)

Reload the temp offloaded CPU state dict now that RAM is free.

@HuggingFaceDocBuilderDev commented May 18, 2022

The documentation is not available anymore as the PR was closed or merged.

@@ -2013,18 +2124,22 @@ def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.P
config.name_or_path = pretrained_model_name_or_path

# Instantiate model.
init_contexts = [no_init_weights(_enable=_fast_init)]
Contributor

If low_cpu_mem_usage=True, then no_init_weights is not needed, no? As far as I understand, when low_cpu_mem_usage=True all weights will be either meta or pretrained weights and no init will happen anyway. But I guess it also doesn't hurt.

sgugger (Collaborator Author)

It doesn't really hurt, but it shouldn't be needed, yes.

@patrickvonplaten (Contributor)

Not necessarily linked to this PR, but in general the following code fails:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", low_cpu_mem_usage=True)

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
inputs = tokenizer("Task: copy but say the opposite. PSG won its match against Barca.", return_tensors="pt")
#inputs = inputs.to(0)

output = model(inputs["input_ids"])

with:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, meta and cpu!

Should we maybe throw a nice warning in from_pretrained(...) that certain parameters are on meta and need to be manually initialized?

@stas00 (Contributor) commented May 18, 2022

Should we maybe throw a nice warning in from_pretrained(...) that certain parameters are on meta and need to be manually initialized?

Warning, no, but assert yes - it's abnormal if a model is returned with weights that are on meta. The whole meta device thing is a behind-the-scenes hack and it shouldn't bleed out into user-land, IMHO.

if device_map is None:
param_device = "cpu"
else:
while len(module_name) > 0 and module_name not in device_map:
Contributor

A comment would be super nice here to understand a bit what is happening.
Maybe something like:
# find next higher level module that is defined in device_map: bert.lm_head.weight -> bert.lm_head -> bert -> ''
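
For example, with a made-up device_map the lookup resolves like this:

device_map = {"bert.embeddings": 0, "bert.encoder": 0, "cls": "cpu"}  # illustrative

module_name = "bert.encoder.layer.0.attention.self.query.weight"
while len(module_name) > 0 and module_name not in device_map:
    # ...query.weight -> ...query -> ... -> bert.encoder.layer.0 -> bert.encoder
    module_name = ".".join(module_name.split(".")[:-1])
param_device = device_map[module_name]  # device_map["bert.encoder"] -> 0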

@patrickvonplaten (Contributor) left a comment

Cool! Also tried it out on OPT-30b and it works well.

@stas00 (Contributor) left a comment

Thank you for working on this, Sylvain - curious to see where it'd lead.

I ran most of the DeepSpeed tests - nothing is broken.

Added a few nits.

Also please check out a related interesting new development at NVIDIA, GPUDirect Storage (https://docs.nvidia.com/gpudirect-storage/configuration-guide/index.html), which would allow allocating tensors on disk.

Tunji is working on this feature in DeepSpeed; it would allow tensor.to(nvme) and then using it as a normal tensor.

Additionally, Tunji and I are working on a universal checkpoint for huge models which doesn't contain any topology data and can shrink/expand on the fly. This is based on my earlier proposal for a checkpoint format where each tensor is a separate file.

The problem with all other current approaches is that they require TBs of CPU memory for models like 176B if you have to manipulate optim_states, etc.

And the next step will be to load a checkpoint using 0 CPU memory, going directly from disk to the target GPU.

else:
while len(module_name) > 0 and module_name not in device_map:
module_name = ".".join(module_name.split(".")[:-1])
if module_name == "" and "" not in device_map:
Contributor

what would device_map[""] signify?

sgugger (Collaborator Author)

If the whole model goes on the same device, the device_map is {"": device} when it's auto-inferred.
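
For example (illustrative module names):

# Whole model fits on GPU 0 -> the auto-inferred map collapses to the root module "":
device_map = {"": 0}

# Model split across two GPUs -> per-module entries instead:
device_map = {"transformer.wte": 0, "transformer.h.0": 0, "transformer.h.1": 1, "lm_head": 1}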

Contributor

Thank you, Sylvain - perhaps adding that in a comment would make it easier to follow the code.


To have Accelerate compute the most optimized `device_map` automatically, set `device_map="auto"`.
offload_folder (`str` or `os.PathLike`, *optional*):
If the `device_map` contains any value `"disk"`, the folder where we will offload weights.
@stas00 (Contributor) May 18, 2022

I wasn't able to parse this last sentence.

If the `device_map` contains any value `"disk"`, the folder where we will offload weights.
offload_state_dict (`bool`, *optional*, defaults to `False`):
If `True`, will temporarily offload the CPU state dict on the hard drive to avoid getting out of CPU
RAM if the weight of the CPU state dict + the biggest shard does not fit.
Contributor

biggest shard of what?

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
@sgugger (Collaborator Author) commented May 19, 2022

Thanks a lot for your reviews @patrickvonplaten and @stas00!
Here are a few answers to your general comments.

Should we maybe throw a nice warning in from_pretrained(...) that certain parameters are on meta and need to be manually initialized?

The model should be fully initialized outside of the meta device. I haven't checked yet models with randomly initialized heads (as the primary goal is inference) but will make sure this is fixed before merging.

Also please check out a related interesting new development at NVIDIA, GPUDirect Storage (https://docs.nvidia.com/gpudirect-storage/configuration-guide/index.html), which would allow allocating tensors on disk.

Tunji is working on this feature in DeepSpeed; it would allow tensor.to(nvme) and then using it as a normal tensor.

Once it's landed I'd be very interested in using it when DeepSpeed is available. Do you also know if they have plans to make their API for prefetching weights offloaded on the CPU/disk somewhat available?

Additionally, Tunji and I are working on a universal checkpoint for huge models which doesn't contain any topology data and can shrink/expand on the fly. This is based on my earlier proposal for a checkpoint format where each tensor is a separate file.

The problem with all other current approaches is that they require TBs of CPU memory for models like 176B if you have to manipulate optim_states, etc.

Note that in this instance passing a device_map only works for model inference (not training). The best way to train large models is still to use DeepSpeed directly.

@stas00 (Contributor) commented May 19, 2022

Tunji is working on this feature in DeepSpeed; it would allow tensor.to(nvme) and then using it as a normal tensor.
Once it's landed I'd be very interested in using it when DeepSpeed is available. Do you also know if they have plans to make their API for prefetching weights offloaded on the CPU/disk somewhat available?

@tjruwase, just a heads up - as you work on these new features, could you please consider making the offload/prefetch API public so that the HF Trainers and the core could make direct use of them? Thank you!

Though I understand that it's deeply tied into the tracing mechanism, which is currently inseparable from the prefetch mechanism - the tracing figures out which params to prefetch and when. But perhaps we can discuss with Sylvain how he envisions using it.

sgugger and others added 2 commits May 19, 2022 11:47
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
@LysandreJik (Member) left a comment

Very clean! Looking forward to the tests :)

@sgugger sgugger merged commit 56f5059 into main May 23, 2022
@sgugger sgugger deleted the from_pretrained_big_model branch May 23, 2022 18:32
elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022
…ace#17341)

* Initial work

* More or less finished with first draft

* Update src/transformers/modeling_utils.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Update src/transformers/modeling_utils.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Fix randomly initialized weights

* Update src/transformers/modeling_utils.py

Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>

* Address review comments

* Rename DeepSpeed folder to temporarily fix the test issue?

* Revert to try if Accelerate fix works

* Use latest Accelerate release

* Quality and fixes

* Style

* Quality

* Add doc

* Test + fix

* More blocks

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>