Introduce OptimizerArgs and add support for GaLore #1192

Open · wants to merge 28 commits into main
Conversation


rasbt commented Mar 25, 2024

The current implementation adds GaLore to the full finetuning script.

Example

# regular
litgpt finetune full \
  --checkpoint_dir checkpoints/EleutherAI/pythia-160m \
  --data Alpaca2k \
  --train.max_steps 5 

# Training time: 14.13s
# Memory used: 3.44 GB



# with galore
litgpt finetune full \
  --checkpoint_dir checkpoints/EleutherAI/pythia-160m \
  --data Alpaca2k \
  --train.max_steps 5  \
  --galore.use_galore true

# Training time: 23.59s
# Memory used: 3.44 GB



# with 8bit galore
litgpt finetune full \
  --checkpoint_dir checkpoints/EleutherAI/pythia-160m \
  --data Alpaca2k \
  --train.max_steps 5  \
  --galore.use_galore true \
  --galore.galore_8bit

# Training time: 17.96s
# Memory used: 2.47 GB

Discuss

We could also add it to LoRA:

  • this would require a check that GaLore is only used when QLoRA is disabled
  • we can actually use it with some bnb precision settings (according to the GaLore authors, this is supported via GaloreAdamW8Bit)

I specified the GaLore args similarly to what we do with LoRA. But since this is more of an add-on to existing methods like full and lora, should we maybe make it part of TrainArgs?
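
For reference, here is a minimal sketch of what such a dedicated args dataclass could look like, in the style of litgpt's existing TrainArgs/EvalArgs dataclasses. The field names mirror the CLI flags above plus typical GaLore hyperparameters and are illustrative only, not the final API:

from dataclasses import dataclass

@dataclass
class GaloreArgs:
    """GaLore-related arguments (illustrative sketch, not the final API)."""

    use_galore: bool = False
    """Whether to enable the GaLore optimizer."""
    galore_8bit: bool = False
    """Whether to use the 8-bit GaLore variant."""
    rank: int = 128
    """Rank of the low-rank gradient projection."""
    update_proj_gap: int = 200
    """Number of steps between updates of the projection matrices."""
    scale: float = 0.25
    """Scaling factor applied to the projected gradient update."""
    proj_type: str = "std"
    """Projection type."""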

We can also think about adding a dedicated subcommand in the future, like the planned one for QLoRA. I.e.,

litgpt finetune full --config ... 

litgpt finetune lora --config ... 

litgpt finetune qlora --config ... [in progress]

litgpt finetune galore --config ... [maybe in future]

Todos

  • Add GaLore for full finetuning
  • Check if the default args are good
  • Add docstrings
  • Discuss whether we use TrainArgs (see above)
  • Add GaLore for LoRA finetuning (investigate NotImplementedError: Cannot merge the pretrained weights of type torch.float16 and LoRA weights of type torch.float32)
  • Throw an error if GaLore and QLoRA are used at the same time, unless QLoRA uses 8-bit quantization (see the sketch after this list)
  • Should we also allow 8-bit GaLore without QLoRA? I'd say yes. How? Via galore_8bit = True?
  • Update the full and lora config files
  • Add GaLore for pretraining
  • Consider adding it for adapter and adapter_v2
  • Add tests
  • Restrict to single-GPU training
  • Add the GaLore package to the acknowledgements section
  • Add documentation
  • Add config YAMLs and benchmarks
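
As a rough illustration of the GaLore/QLoRA compatibility check from the list above (a sketch only; the argument names use_galore, galore_8bit, and quantize are assumptions based on the flags discussed in this PR and litgpt's existing --quantize option):

from typing import Optional

def validate_galore_args(use_galore: bool, galore_8bit: bool, quantize: Optional[str]) -> None:
    # GaLore is assumed to be compatible only with bitsandbytes 8-bit quantization,
    # so reject 4-bit (QLoRA-style) modes when GaLore is enabled.
    if use_galore and quantize is not None and quantize != "bnb.int8":
        raise ValueError(
            f"GaLore can only be combined with 8-bit quantization (bnb.int8), but got quantize={quantize!r}."
        )
    if galore_8bit and not use_galore:
        raise ValueError("galore_8bit=True requires use_galore=True.")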

Fixes #1075

@rasbt rasbt mentioned this pull request Mar 25, 2024
@rasbt rasbt marked this pull request as draft March 25, 2024 22:00

rasbt commented Mar 26, 2024

After our discussion today, I think we should only enable vanilla GaLore for now and not worry about LoRA support. We can look into LoRA support later if there is high demand. I am getting some precision-related errors when trying to use it with LoRA, which likely has something to do with the precision used by the GaLore optimizer under the hood. I expect the GaLore package to evolve in the upcoming weeks and months, and we can then revisit whether LoRA works without us having to make additional tweaks to the GaLore optimizer.

@Andrei-Aksionov

@rasbt How much of an improvement in VRAM consumption did you see with LoRA + GaLore?
With any PEFT algorithm, the number of parameters to optimize shouldn't be that significant.


rasbt commented Mar 27, 2024

The combination of LoRA + GaLore doesn't really work yet due to precision mismatches when merging the LoRA weights at the end, so it never reached the code line that prints the memory usage. I could comment out the merging and try again, but I think we should just focus on GaLore for full finetuning first. Like you said, I don't expect a big improvement when combined with LoRA.


rasbt commented May 3, 2024

I changed GaloreArgs to OptimizerArgs, and here are some results for phi-2. What's puzzling is the pretraining performance. I couldn't find the issue and may need to investigate more. I also need to update the config files once we've settled on the API.

Full

AdamW

litgpt finetune full \
  --checkpoint_dir checkpoints/microsoft/phi-2/ \
  --train.max_steps 5

# Training time: 32.76s
# Memory used: 55.84 GB

GaLore

litgpt finetune full \
  --checkpoint_dir checkpoints/microsoft/phi-2/ \
  --train.max_steps 5 \
  --optim.optimizer "galore_adamw"

# Training time: 128.55s
# Memory used: 36.14 GB

GaLore 8-bit

litgpt finetune full \
  --checkpoint_dir checkpoints/microsoft/phi-2/ \
  --train.max_steps 5 \
  --optim.optimizer "galore_adamw_8bit"

# Training time: 128.68s
# Memory used: 33.81 GB

LoRA

AdamW

litgpt finetune lora \
  --checkpoint_dir checkpoints/microsoft/phi-2/ \
  --train.max_steps 5

# Training time: 36.43s
# Memory used: 18.56 GB

GaLore

litgpt finetune lora \
  --checkpoint_dir checkpoints/microsoft/phi-2/ \
  --train.max_steps 5 \
  --optim.optimizer "galore_adamw"

# Training time: 25.98s
# Memory used: 18.56 GB

GaLore 8-bit

litgpt finetune lora \
  --checkpoint_dir checkpoints/microsoft/phi-2/ \
  --train.max_steps 5 \
  --optim.optimizer "galore_adamw_8bit"

# Training time: 26.01s
# Memory used: 18.54 GB

Adapter

AdamW

litgpt finetune adapter \
  --checkpoint_dir checkpoints/microsoft/phi-2/ \
  --train.max_steps 5

# Training time: 31.16s
# Memory used: 17.94 GB

GaLore

litgpt finetune adapter \
  --checkpoint_dir checkpoints/microsoft/phi-2/ \
  --train.max_steps 5 \
  --optim.optimizer "galore_adamw"

# Training time: 24.81s
# Memory used: 17.94 GB

GaLore 8-bit

litgpt finetune adapter_v2 \
  --checkpoint_dir checkpoints/microsoft/phi-2/ \
  --train.max_steps 5 \
  --optim.optimizer "galore_adamw_8bit"

# Training time: 26.36s
# Memory used: 20.10 GB

Adapter v2

AdamW

litgpt finetune adapter_v2 \
  --checkpoint_dir checkpoints/microsoft/phi-2/ \
  --train.max_steps 5

# Training time: 26.35s
# Memory used: 20.11 GB

GaLore

litgpt finetune adapter_v2 \
  --checkpoint_dir checkpoints/microsoft/phi-2/ \
  --train.max_steps 5 \
  --optim.optimizer "galore_adamw"
# Training time: 26.31s
# Memory used: 20.11 GB

GaLore 8-bit

litgpt finetune adapter_v2 \
  --checkpoint_dir checkpoints/microsoft/phi-2/ \
  --train.max_steps 5 \
  --optim.optimizer "galore_adamw_8bit"
# Training time: 26.26s
# Memory used: 20.10 GB

Pretrain (Pythia 14M)

AdamW

litgpt pretrain \
  --model_name pythia-14m \
  --tokenizer_dir checkpoints/EleutherAI/pythia-14m/ \
  --data TextFiles \
  --data.train_data_path "custom_texts" \
  --train.max_tokens 100_000

# Training time: 34.07s
# Memory used: 1.44 GB

GaLore

litgpt pretrain \
  --model_name pythia-14m \
  --tokenizer_dir checkpoints/EleutherAI/pythia-14m/ \
  --data TextFiles \
  --data.train_data_path "custom_texts" \
  --train.max_tokens 100_000 \
  --optim.optimizer "galore_adamw"

# Training time: 25.31s
# Memory used: 1.44 GB

GaLore 8-bit

litgpt pretrain \
  --model_name pythia-14m \
  --tokenizer_dir checkpoints/EleutherAI/pythia-14m/ \
  --data TextFiles \
  --data.train_data_path "custom_texts" \
  --train.max_tokens 100_000 \
  --optim.optimizer "galore_adamw_8bit"
# Training time: 25.31s
# Memory used: 1.44 GB


rasbt commented May 6, 2024

I tried many things and even ended up replacing all instances of torch's AdamW with GaLore's to make sure it's actually used, but for some reason I cannot see any difference in memory usage when pretraining. Mind-boggling.

@rasbt rasbt marked this pull request as ready for review May 9, 2024 18:27

rasbt commented May 9, 2024

I changed the hardcoded GaLore arguments to generic extra_kwargs so they can be used with other optimizer options as well. This way, it adds less clutter to the CLI.

So, what's new is that we now have optimizer kwargs. E.g., this adds

# Optimizer-related arguments
optim: 
  # Which optimizer to use. Possible choices: "adamw", "galore_adamw", "galore_adamw_8bit". (type: Optional[str], default: "adamw")
  optimizer: "adamw"

  #   (type: float, default: 0.0003)
  learning_rate: 0.0002

  #   (type: float, default: 0.02)
  weight_decay: 0.0

  #   (type: float, default: 0.9)
  beta1: 0.9

  #   (type: float, default: 0.95)
  beta2: 0.95

  # Additional optimizer keyword arguments, for example, "rank=8,update_proj_gap=200" for GaLore. (type: Optional[str], default: None)
  extra_kwargs:
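
For context, a minimal sketch of how such an extra_kwargs string could be turned into keyword arguments for the optimizer. The helper name and parsing rules are assumptions for illustration, not the actual implementation:

import ast
from typing import Any, Dict, Optional

def parse_extra_kwargs(extra_kwargs: Optional[str]) -> Dict[str, Any]:
    """Turn a string like "rank=8,update_proj_gap=200" into {"rank": 8, "update_proj_gap": 200}."""
    kwargs: Dict[str, Any] = {}
    if not extra_kwargs:
        return kwargs
    for pair in extra_kwargs.split(","):
        key, value = pair.split("=", 1)
        try:
            # literal_eval converts "8" -> 8 and "0.25" -> 0.25 ...
            kwargs[key.strip()] = ast.literal_eval(value.strip())
        except (ValueError, SyntaxError):
            # ... while anything that isn't a Python literal (e.g. "std") stays a string
            kwargs[key.strip()] = value.strip()
    return kwargs

The resulting dict could then be expanded into the optimizer constructor, e.g. something along the lines of GaLoreAdamW(param_groups, lr=..., **parse_extra_kwargs(optim.extra_kwargs)).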

What do you think about this approach and interface @carmocca @lantiga @awaelchli ?

@rasbt rasbt changed the title Add support for GaLore Introduce OptimizerArgs and add support for GaLore May 9, 2024
@rasbt rasbt requested a review from williamFalcon as a code owner May 10, 2024 00:42
@carmocca

The jsonargparse-y way of doing this would be to instead specify which Optimizer class you want to select and let the parser pull out the arguments of said class. For example, that is exactly how the data is selected and parsed.


rasbt commented May 10, 2024

OMG I made it way more complicated than it needs to be 🤦‍♂️. Thanks for the hint. Now I know.


rasbt commented May 10, 2024

After trying this, I realize that this may not be cleanly possible because optimizers require params as a positional argument. So we would have to wrap the optimizer in our own optimizer class. The other problem is with the GaLore optimizer, which needs the params split into regular params and galore params before they are passed in. It gets ugly real quick.
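
For context, the splitting roughly follows the pattern from the GaLore README (a sketch; the layer-selection rule, the hyperparameter values, and the toy model are illustrative):

import torch
from galore_torch import GaLoreAdamW  # requires the galore-torch package

# toy stand-in for the actual model
model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 16))

# parameters that should receive the low-rank gradient projection
# (typically the 2D weight matrices); everything else stays in a regular group
galore_params = [p for p in model.parameters() if p.ndim == 2]
galore_ids = {id(p) for p in galore_params}
regular_params = [p for p in model.parameters() if id(p) not in galore_ids]

param_groups = [
    {"params": regular_params},
    {"params": galore_params, "rank": 8, "update_proj_gap": 200, "scale": 0.25, "proj_type": "std"},
]
optimizer = GaLoreAdamW(param_groups, lr=3e-4)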

We could probably use this jsonargparse approach for PyTorch-native optimizers, but I don't think it will be easy to support GaLore this way without it being hacky.

I can make a PR with just PyTorch optimizer support, and then we can decide which route we want to go: only supporting PyTorch optimizers, or revisiting this implementation here with our own extra_kwargs parsing.

@carmocca

Yes, we cannot have jsonargparse instantiate the class directly for that reason.

But you can still tell it to add all the arguments of a class (or classes) into a group of args, basically getting you OptimizerArgs automatically for that class. Then those args can be used to instantiate the real optimizer instance later in the script.

The PyTorch Lightning CLI implementation works that way: https://github.com/Lightning-AI/pytorch-lightning/blob/master/src/lightning/pytorch/cli.py#L154-L177


rasbt commented May 10, 2024

Arg, I am still struggling with this.

I.e.,

 litgpt finetune full --optimizer.help torch.optim.AdamW    

works without a problem, but even if I don't do anything else, jsonargparse already tries to instantiate it via

litgpt finetune full  ... --optimizer torch.optim.AdamW 

before I can pass it to anything else. Not sure how to avoid that.
I think I need to study jsonargparse a bit better because right now I feel like I am trying to hack things together somehow ...

@carmocca

You can start by understanding this minimal example:

import torch
import jsonargparse

parser = jsonargparse.ArgumentParser()
parser.add_subclass_arguments(torch.optim.Optimizer, "optimizer", instantiate=False, fail_untyped=False, skip={"params"})
args = parser.parse_args()
print(args)

$ python example.py --optimizer Adam
Namespace(optimizer=Namespace(class_path='torch.optim.Adam', init_args=Namespace(lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False, foreach=None, maximize=False, capturable=False, differentiable=False, fused=None)))

@carmocca

And here's how you would use the above to instantiate the optimizer:

import torch
from typing import Any, Tuple, Dict, Union

def instantiate_class(args: Union[Any, Tuple[Any, ...]], init: Dict[str, Any]) -> Any:
    """Instantiates a class with the given args and init.

    Args:
        args: Positional arguments required for instantiation.
        init: Dict of the form {"class_path":...,"init_args":...}.

    Returns:
        The instantiated class object.

    """
    kwargs = init.get("init_args", {})
    if not isinstance(args, tuple):
        args = (args,)
    class_module, class_name = init["class_path"].rsplit(".", 1)
    module = __import__(class_module, fromlist=[class_name])
    args_class = getattr(module, class_name)
    return args_class(*args, **kwargs)


# reuses the args namespace parsed by the parser example above
model = torch.nn.Linear(1, 1)
optimizer = instantiate_class(model.parameters(), init=args["optimizer"])
print(optimizer)

We define instantiate_class for the PyTorch Lightning CLI here: https://github.com/Lightning-AI/pytorch-lightning/blob/90d04b5b86f37994cdceccc6de32f0e93b1cc7f0/src/lightning/pytorch/cli.py#L752-L769

@rasbt rasbt mentioned this pull request May 10, 2024