
Use Accelerate in from_pretrained for big model inference #17341

Merged
merged 16 commits into main from from_pretrained_big_model on May 23, 2022

Conversation

@sgugger (Collaborator) commented May 18, 2022

What does this PR do?

This PR is a first draft for using the newly released big model inference APIs from Accelerate inside from_pretrained. For now it does this with the option low_cpu_mem_usage=True and:

  • instantiates the model inside the context manager that initializes empty weights (faster and less memory-intensive)
  • has the same behavior as before if no device_map is passed
  • otherwise puts each model weight on the specified device as the loading is done and properly sets the hooks so that the model can still be used normally. As with Accelerate, device_map="auto" will auto-infer a proper device map from the available GPU RAM and CPU RAM.

This PR is just a first step; there is a bit more cleanup to do, namely:

  • move the utils flagged as belonging in Accelerate there and, once a new release of Accelerate is out, use them from there
  • clean up some old code (like move_model_to_meta_device)

Example of use:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0pp", revision="sharded", device_map="auto")

tokenizer = AutoTokenizer.from_pretrained("bigscience/T0pp")
inputs = tokenizer("Task: copy but say the opposite. PSG won its match against Barca.", return_tensors="pt")
inputs = inputs.to(0)

output = model.generate(inputs["input_ids"])
tokenizer.decode(output[0].tolist())

Still missing:

  • integration test
  • doc
  • add the "block" attribute to more model classes

setattr(submodule, param_name, new_val)
for param_name, param in state_dict.items():
# First part of the test is always true as loaded_state_dict_keys always contains state_dict keys.
if param_name not in loaded_state_dict_keys or param_name not in expected_keys:
sgugger (Collaborator Author)

The first part of the test is left the same as before but, as said in the comment, it shouldn't be necessary since:

  • loaded_state_dict_keys = state_dict.keys() when the checkpoint is one file
  • loaded_state_dict_keys contains state_dict.keys() when the checkpoint is sharded

raise ValueError(f"{param_name} doesn't have any device set.")
param_device = device_map[module_name]

set_module_tensor_to_device(model, param_name, param_device, value=param)
sgugger (Collaborator Author)

This single line does the same thing as before, but using Accelerate. What's above is just:

  • using the right dtype
  • finding the right device

What's below deals with disk offload or temp offload of the CPU state dict.
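
For reference, a minimal sketch of what such a helper does under the hood (a simplification, not Accelerate's actual implementation): walk the dotted parameter name down to the owning submodule and re-register the tensor on the target device.

import torch
import torch.nn as nn

def set_tensor_to_device(model: nn.Module, param_name: str, device, value: torch.Tensor):
    # Simplified sketch: register `value` as `param_name` of `model`, placed on `device`.
    module = model
    *parents, leaf = param_name.split(".")
    for name in parents:
        module = getattr(module, name)
    old = getattr(module, leaf)
    new_value = value.to(device)
    if isinstance(old, nn.Parameter):
        new_value = nn.Parameter(new_value, requires_grad=old.requires_grad)
    setattr(module, leaf, new_value)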

@@ -870,6 +947,7 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMix
base_model_prefix = ""
main_input_name = "input_ids"
_auto_class = None
_no_split_modules = None
sgugger (Collaborator Author)

New attribute to set on all models (for now GPT-J and T5 are given as examples) that specifies the blocks that should not be split across devices.
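
For example, on a model class this looks roughly as follows (block names taken from the T5 and GPT-J examples mentioned above; treat the exact values as illustrative):

from transformers import PreTrainedModel

class T5PreTrainedModel(PreTrainedModel):
    # A T5Block carries the residual connections, so it must stay on a single device.
    _no_split_modules = ["T5Block"]

class GPTJPreTrainedModel(PreTrainedModel):
    _no_split_modules = ["GPTJBlock"]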

Contributor

Sorry, just to understand: why should certain blocks not be split across devices?

sgugger (Collaborator Author)

If you split a GPTBlock across devices, the residual connection (initial input of the block) added at the end will create a device mismatch.
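
A toy illustration of the failure mode (hypothetical block, not the actual GPT-J code): if the two halves of a block land on different GPUs, the residual addition at the end mixes tensors from two devices.

import torch.nn as nn

class ToyBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.Linear(dim, dim)  # imagine the device_map puts this on cuda:0
        self.mlp = nn.Linear(dim, dim)   # ...and this on cuda:1

    def forward(self, hidden_states):
        residual = hidden_states                                            # stays on cuda:0
        hidden_states = self.attn(hidden_states)                            # cuda:0
        hidden_states = self.mlp(hidden_states.to(self.mlp.weight.device))  # cuda:1
        # cuda:0 + cuda:1 -> RuntimeError: expected all tensors to be on the same device
        return residual + hidden_states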

elif low_cpu_mem_usage:
init_contexts.append(init_empty_weights())

with ContextManagers(init_contexts):
sgugger (Collaborator Author)

Same as before + the no_init_weights context manager, but cleaner (IMO).
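
In case ContextManagers is unfamiliar, it just enters every context manager in the list, much like contextlib.ExitStack. A rough hand-rolled equivalent (the instantiation arguments below are illustrative):

from contextlib import ExitStack

# Equivalent of `with ContextManagers(init_contexts):`
with ExitStack() as stack:
    for ctx in init_contexts:
        stack.enter_context(ctx)
    model = cls(config, *model_args, **model_kwargs)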

Comment on lines 2140 to 2142
if device_map == "auto":
no_split_modules = [] if model._no_split_modules is None else model._no_split_modules
device_map = infer_auto_device_map(model, no_split_module_classes=no_split_modules, dtype=torch_dtype)
sgugger (Collaborator Author)

This is where the auto device map is built.
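
For illustration, the same call can also be made by hand with an explicit memory budget per device (infer_auto_device_map and max_memory are part of Accelerate's big model inference API; the numbers below are made up):

import torch
from accelerate import infer_auto_device_map

# Cap GPU 0 at 10GiB and the CPU at 30GiB; whatever doesn't fit gets mapped to "disk".
device_map = infer_auto_device_map(
    model,
    max_memory={0: "10GiB", "cpu": "30GiB"},
    no_split_module_classes=["T5Block"],
    dtype=torch.float16,
)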

Comment on lines +2349 to +2355
offload_index = {} if device_map is not None and "disk" in device_map.values() else None
if offload_state_dict:
state_dict_folder = tempfile.mkdtemp()
state_dict_index = {}
else:
state_dict_folder = None
state_dict_index = None
sgugger (Collaborator Author)

For offloaded weights (either on disk or the temporary offload of the CPU state dict), this index contains the map param_name -> metadata (shape and dtype).
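
Concretely, the index is a plain dict along these lines (a sketch of the metadata layout described above; parameter names and shapes are made up):

offload_index = {
    "encoder.block.0.layer.0.SelfAttention.q.weight": {"dtype": "float16", "shape": [4096, 4096]},
    "encoder.block.0.layer.0.SelfAttention.k.weight": {"dtype": "float16", "shape": [4096, 4096]},
    # ... one entry per offloaded parameter
}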

else:
error_msgs += _load_state_dict_into_model(model_to_load, state_dict, start_prefix)

# force memory release
del state_dict
gc.collect()

save_offload_index(offload_index, offload_folder)
sgugger (Collaborator Author)

Save the index for disk offload if necessary.

Comment on lines +2395 to +2398
if offload_state_dict:
# Load back temporarily offloaded state dict
load_offloaded_weights(model, state_dict_index, state_dict_folder)
shutil.rmtree(state_dict_folder)
sgugger (Collaborator Author)

Reload the temp offloaded CPU state dict now that RAM is free.

@HuggingFaceDocBuilderDev commented May 18, 2022

The documentation is not available anymore as the PR was closed or merged.

@@ -2013,18 +2124,22 @@ def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.P
config.name_or_path = pretrained_model_name_or_path

# Instantiate model.
init_contexts = [no_init_weights(_enable=_fast_init)]
Contributor

If low_cpu_mem_usage=True, then no_init_weights is not needed, no? As far as I understand, when low_cpu_mem_usage=True all weights will be either meta or pretrained weights and no init will happen anyway. But I guess it also doesn't hurt.

sgugger (Collaborator Author)

It doesn't really hurt, but it shouldn't be needed, yes.

@patrickvonplaten (Contributor)

Not necessarily linked to this PR, but in general the following code fails:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", low_cpu_mem_usage=True)

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
inputs = tokenizer("Task: copy but say the opposite. PSG won its match against Barca.", return_tensors="pt")
#inputs = inputs.to(0)

output = model(inputs["input_ids"])

with:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, meta and cpu!

Should we maybe throw a nice warning in from_pretrained(...) that certain parameters are on meta and need to be manually initialized?

@stas00 (Contributor) commented May 18, 2022

Should we maybe throw a nice warning in from_pretrained(...) that certain parameters are on meta and need to be manually initialized?

Warning, no, but assert yes - it's abnormal if a model is returned with weights that are on meta. The whole meta device thing is a behind-the-scenes hack and it shouldn't bleed out into user-land, IMHO.

if device_map is None:
param_device = "cpu"
else:
while len(module_name) > 0 and module_name not in device_map:
Contributor

A comment would be super nice here to understand a bit what is happening.
Maybe something like:
# find next higher level module that is defined in device_map: bert.lm_head.weight -> bert.lm_head -> bert -> ''
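
For example, with a made-up device_map the lookup resolves like this:

device_map = {"bert.embeddings": 0, "bert.encoder": 0, "cls": "cpu"}  # illustrative

module_name = "bert.encoder.layer.0.attention.self.query.weight"
while len(module_name) > 0 and module_name not in device_map:
    # ...query.weight -> ...query -> ... -> bert.encoder.layer.0 -> bert.encoder
    module_name = ".".join(module_name.split(".")[:-1])
param_device = device_map[module_name]  # device_map["bert.encoder"] -> 0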

@patrickvonplaten (Contributor) left a comment

Cool! Also tried it out on OPT-30b and it works well.

@stas00 (Contributor) left a comment

Thank you for working on this, Sylvain - curious to see where it'd lead.

I ran most of the DeepSpeed tests - nothing is broken.

Added a few nits.

Also please check out a related interesting new development at NVIDIA, GPUDirect Storage (https://docs.nvidia.com/gpudirect-storage/configuration-guide/index.html), which would allow allocating tensors on disk.

Tunji is working on this feature in DeepSpeed; it would allow tensor.to(nvme) and then using it as a normal tensor.

Additionally, Tunji and I are working on a universal checkpoint for huge models which doesn't contain any topology data and can shrink/expand on the fly. This is based on my earlier proposal for a checkpoint format where each tensor is a separate file.

The problem with all other current approaches is that they require TBs of CPU memory for models like 176B if you have to manipulate optim_states, etc.

And the next step will be to load a checkpoint using 0 CPU memory, going directly from disk to the target GPU.

else:
while len(module_name) > 0 and module_name not in device_map:
module_name = ".".join(module_name.split(".")[:-1])
if module_name == "" and "" not in device_map:
Contributor

what would device_map[""] signify?

sgugger (Collaborator Author)

If the whole model goes on the same device, the device_map is {"": device} when it's auto-inferred.
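
For example (illustrative module names):

# Whole model fits on GPU 0 -> the auto-inferred map collapses to the root module "":
device_map = {"": 0}

# Model split across two GPUs -> per-module entries instead:
device_map = {"transformer.wte": 0, "transformer.h.0": 0, "transformer.h.1": 1, "lm_head": 1}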

Contributor

Thank you, Sylvain - perhaps adding that in a comment would make it easier to follow the code.


To have Accelerate compute the most optimized `device_map` automatically, set `device_map="auto"`.
offload_folder (`str` or `os.PathLike`, *optional*):
If the `device_map` contains any value `"disk"`, the folder where we will offload weights.
@stas00 (Contributor) May 18, 2022

I wasn't able to parse this last sentence.

If the `device_map` contains any value `"disk"`, the folder where we will offload weights.
offload_state_dict (`bool`, *optional*, defaults to `False`):
If `True`, will temporarily offload the CPU state dict on the hard drive to avoid getting out of CPU
RAM if the weight of the CPU state dict + the biggest shard does not fit.
Contributor

biggest shard of what?

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
@sgugger (Collaborator Author) commented May 19, 2022

Thanks a lot for your reviews @patrickvonplaten and @stas00!
Here are a few answers to your general comments.

Should we maybe throw a nice warning in from_pretrained(...) that certain parameters are on meta and need to be manually initialized?

The model should be fully initialized outside of the meta device. I haven't checked yet models with randomly initialized heads (as the primary goal is inference) but will make sure this is fixed before merging.

Also please check out a related interesting new development at NVIDIA, GPUDirect Storage (https://docs.nvidia.com/gpudirect-storage/configuration-guide/index.html), which would allow allocating tensors on disk.

Tunji is working on this feature in DeepSpeed; it would allow tensor.to(nvme) and then using it as a normal tensor.

Once it's landed I'd be very interested in using it when DeepSpeed is available. Do you also know if they have plans to make their API for prefetching weights offloaded on the CPU/disk somewhat available?

Additionally, Tunji and I are working on a universal checkpoint for huge models which doesn't contain any topology data and can shrink/expand on the fly. This is based on my earlier proposal for a checkpoint format where each tensor is a separate file.

The problem with all other current approaches is that they require TBs of CPU memory for models like 176B if you have to manipulate optim_states, etc.

Note that in this instance passing a device_map only works for model inference (not training). The best way to train large models is still to use DeepSpeed directly.

@stas00 (Contributor) commented May 19, 2022

Tunji is working on this feature in DeepSpeed; it would allow tensor.to(nvme) and then using it as a normal tensor.
Once it's landed I'd be very interested in using it when DeepSpeed is available. Do you also know if they have plans to make their API for prefetching weights offloaded on the CPU/disk somewhat available?

@tjruwase, just a heads up - as you work on these new features, could you please consider making the offload/prefetch API public so that the HF Trainers and the core could make direct use of them? Thank you!

Though I understand that it's deeply tied into the tracing mechanism, which is currently inseparable from the prefetch mechanism - the tracing figures out which params to prefetch and when. But perhaps we can discuss with Sylvain how he envisions using it.

sgugger and others added 2 commits May 19, 2022 11:47
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
@LysandreJik (Member) left a comment

Very clean! Looking forward to the tests :)

@sgugger sgugger merged commit 56f5059 into main May 23, 2022
@sgugger sgugger deleted the from_pretrained_big_model branch May 23, 2022 18:32
elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022
…ace#17341)

* Initial work

* More or less finished with first draft

* Update src/transformers/modeling_utils.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Update src/transformers/modeling_utils.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Fix randomly initialized weights

* Update src/transformers/modeling_utils.py

Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>

* Address review comments

* Rename DeepSpeed folder to temporarily fix the test issue?

* Revert to try if Accelerate fix works

* Use latest Accelerate release

* Quality and fixes

* Style

* Quality

* Add doc

* Test + fix

* More blocks

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>