[WIP] Add OLMo #927

Open · wants to merge 9 commits into main
Conversation

@rasbt (Collaborator) commented Feb 12, 2024

Adds the popular and fully open-source OLMo models by Allen AI.

  • Implement model download
  • Test tokenizer
  • Implement HF checkpoint conversion
  • clean up HF checkpoint conversion
  • Make sure to use the right layer normalization
  • Make sure generate.py produces reasonable outputs
  • Update download and finetuning docs
  • Test pretraining
  • Test finetuning
    • Full finetuning
    • LoRA
    • Adapter
  • Add tests
  • Update README

Fixes #925

lit_gpt/config.py (outdated review thread, resolved)
@rasbt (Collaborator, Author) commented Feb 13, 2024

I'm a bit stuck with the conversion and would appreciate your advice and ideas, @carmocca or @Andrei-Aksionov!

So, here are 3 special things about OLMo:

  1. They use weight tying like in GPT-2: they reuse the WTE weight as the output projection weight. In the tensors they saved on the Hub, though, they simply duplicated that tensor, so there shouldn't be any action required. When loading the model in HuggingFace, I checked that olmo.model.transformer.wte.weight and olmo.model.transformer.ff_out.weight contain the same tensor. That should be all good here.

  2. They use a non-parametric LayerNorm, i.e., their LayerNorm doesn't have the scale (weight) and shift (bias) parameters. To avoid code changes just for this model, my workaround is to fill these with ones and zeros so that they have no effect:

        state_dict[f"transformer.h.{l}.norm_1.weight"] = torch.ones(config.n_embd)
        state_dict[f"transformer.h.{l}.norm_2.weight"] = torch.ones(config.n_embd)
        state_dict[f"transformer.h.{l}.norm_1.bias"] = torch.zeros(config.n_embd)
        state_dict[f"transformer.h.{l}.norm_2.bias"] = torch.zeros(config.n_embd)
  3. The problem is that I'm missing weights ...

The HF version looks like this, which is confusing because, as far as I can tell, the ops are not applied in that "sequential" order:

OLMoForCausalLM(
  (model): Olmo(
    (transformer): ModuleDict(
      (wte): Embedding(50304, 2048)
      (emb_drop): Dropout(p=0.0, inplace=False)
      (ln_f): LayerNorm()
      (blocks): ModuleList(
        (0-15): 16 x OlmoSequentialBlock(
          (dropout): Dropout(p=0.0, inplace=False)
          (act): SwiGLU()
          (attn_out): Linear(in_features=2048, out_features=2048, bias=False)
          (ff_out): Linear(in_features=8192, out_features=2048, bias=False)
          (rotary_emb): RotaryEmbedding()
          (attn_norm): LayerNorm()
          (ff_norm): LayerNorm()
          (att_proj): Linear(in_features=2048, out_features=6144, bias=False)
          (ff_proj): Linear(in_features=2048, out_features=16384, bias=False)
        )
      )
      (ff_out): Embedding(50304, 2048)
    )
  )
)

Unless I'm wrong, I think what happens is that ff_proj is a placeholder for the MLP FC1 and FC2 layers, i.e., the first half is FC1 and the second half is FC2. It's kind of confusing, though.

What I'm thinking is that we have to split the fc weights, which would avoid having to write custom code in the GPT model class:

    weight_map = {
        "model.transformer.wte.weight": "transformer.wte.weight",
        "model.transformer.ff_out.weight": "lm_head.weight",
        "model.transformer.blocks.{}.attn_out.weight": "transformer.h.{}.attn.proj.weight",
        "model.transformer.blocks.{}.ff_proj.weight": "transformer.h.{}.mlp.fc_1.weight", # split into fc1 and fc2
        "model.transformer.blocks.{}.att_proj.weight": "transformer.h.{}.attn.attn.weight",
        "model.transformer.blocks.{}.ff_out.weight": "transformer.h.{}.mlp.proj.weight",
    }
...

    for l in range(config.n_layer):
        # ff_proj stacks both MLP input projections along dim 0, so split at the MLP hidden size
        state_dict[f"transformer.h.{l}.mlp.fc_2.weight"] = state_dict[f"transformer.h.{l}.mlp.fc_1.weight"][config.intermediate_size:]
        state_dict[f"transformer.h.{l}.mlp.fc_1.weight"] = state_dict[f"transformer.h.{l}.mlp.fc_1.weight"][:config.intermediate_size]

Is this somehow possible with the lit-gpt SaveProxyTensor?

@carmocca (Contributor) commented

Hey! All your suggestions make sense to me. You should be able to split the combined ff linear as you suggest, especially if load_param has been called already. We also manipulate the qkv linears for llama2 checkpoints in a similar way.

However, note that your workarounds will only work for inference. During training, wte and ff_out will not be tied and the layernorm parameters won't be frozen.
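
For reference, actual tying in Lit-GPT's GPT class would be a one-liner along these lines (a minimal sketch, not part of this PR; the config name is just a placeholder):

    from lit_gpt import GPT, Config

    config = Config.from_name("pythia-70m")  # placeholder config, not an OLMo one
    model = GPT(config)
    # Point lm_head at the embedding's Parameter so both stay identical during training
    model.lm_head.weight = model.transformer.wte.weight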

@Andrei-Aksionov (Collaborator) commented

Hello @rasbt

Looks like you are correct. I just wanted to add a couple of things that I noticed while reviewing their code. For posterity, so to speak.

I also don't like it when layers are initialized in a different order than they are executed. Lit-GPT does it too: first we create lm_head and only then the transformer layers 🙃.

So the order of execution should be as follows:

# Attention
1. attn_norm
2. att_proj (2048, 6144) <- combined QKV
3. rotary_emb
4. attn_out (2048, 2048)
5. dropout
# MLP
1. ff_norm
2. ff_proj (2048, 16384) <- combined [fc_2, fc_1] / [up, gate] in LLaMA notation
3. act
4. ff_out (8192, 2048)
5. dropout
  1. Yes, they use weight_tying. It's configurable and they decided to use it. And yes, it won't work during training, although it's not difficult to add if more models use it.
  2. Their LayerNorm class supports weight and bias parameters, but this is controlled by the config. It looks like they turned off .weight and .bias in the config.
  3. This deserves a bit more explanation.
    In the LLaMAMLP class we have fc_1, fc_2, and proj. During the forward pass we apply fc_1 and fc_2 to the input separately:
    https://github.com/Lightning-AI/lit-gpt/blob/f5d68065ff621fc2cc190c05dcc4ab2cda1d1f57/lit_gpt/model.py#L286-L290

OLMo has only two layers: ff_proj and ff_out. They decided to take an approach similar to a combined QKV matrix and created an ff_proj layer that does this matmul in one go. But the place where they split the result is, I would say, unexpected: in the activation function:

def forward(self, x: torch.Tensor) -> torch.Tensor:
    x, gate = x.chunk(2, dim=-1)
    return F.silu(gate) * x

and then they apply ff_out to it.

It's important to note the way they split and then apply the activation function to one chunk. That means that:
ff_proj == [fc_2, fc_1]
ff_out == proj
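
To make that mapping concrete, here is a small self-contained sketch (not from the PR; random weights with the 1B shapes) showing that OLMo's combined ff_proj plus the chunking SwiGLU matches the LLaMAMLP formulation when the first half of ff_proj plays the role of fc_2 and the second half the role of fc_1:

    import torch
    import torch.nn.functional as F

    n_embd, hidden = 2048, 8192
    ff_proj = torch.nn.Linear(n_embd, 2 * hidden, bias=False)
    x = torch.randn(1, n_embd)

    # OLMo style: one matmul, then chunk inside the activation
    up, gate = ff_proj(x).chunk(2, dim=-1)
    olmo_out = F.silu(gate) * up

    # Lit-GPT LLaMAMLP style: two separate linears built from the two halves
    fc_2_w, fc_1_w = ff_proj.weight.split(hidden, dim=0)  # ff_proj == [fc_2, fc_1]
    litgpt_out = F.silu(x @ fc_1_w.T) * (x @ fc_2_w.T)

    torch.testing.assert_close(olmo_out, litgpt_out)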

@rasbt (Collaborator, Author) commented Feb 14, 2024

Thanks so much for the feedback @carmocca and @Andrei-Aksionov, this was super helpful! After more tinkering, I went with a custom OLMoMLP (analogous to LLaMAMLP) because I thought this was easier than the other workarounds -- both from an implementation perspective and for code readability in the future.

The weights load ok now, but for some reason, the results are garbage. E.g., for

python generate/base.py --checkpoint_dir ./checkpoints/allenai/OLMo-1b/

What food do llamas eat?lerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslers

And

python generate/base.py --checkpoint_dir ./checkpoints/allenai/OLMo-7b/

What food do llamas eat? nic except ' Up Area has , climate * new area even county bun dressingall Bul Index millions Di withdrawal intent except / bun ID tonnes approve welcome St/ regimes health ng est African worse Multiple; p; ques Up ( IL'' Area / p

@rasbt (Collaborator, Author) commented Feb 14, 2024

> Yes, they use weight_tying. It's configurable and they decided to use it. And yes, it won't work during training, although it's not difficult to add if more models use it.

Actually, upon further inspection they only use weight tying for the 1B model (https://huggingface.co/allenai/OLMo-1B/blob/main/config.json#L42), not for the 7B model (https://huggingface.co/allenai/OLMo-7B/blob/main/config.json#L42). I adjusted the code accordingly. Still not working well, though.
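
For reference, a throwaway sketch to double-check that flag for both checkpoints (assuming the key in the linked config.json files is named weight_tying):

    import json
    import urllib.request

    for name in ("OLMo-1B", "OLMo-7B"):
        url = f"https://huggingface.co/allenai/{name}/raw/main/config.json"
        with urllib.request.urlopen(url) as f:
            cfg = json.load(f)
        print(name, "weight_tying:", cfg.get("weight_tying"))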

@carmocca (Contributor) commented

I would strongly prefer that we don't add this new MLP class.

To debug the output, you'll have to inspect the activations for both models layer by layer to see where they diverge.
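
For anyone following along, a rough sketch of what that layer-by-layer comparison could look like with forward hooks (illustrative only; lit_model and hf_model are placeholders for the already-loaded Lit-GPT and HF OLMo models):

    import torch

    def capture_activations(model, store):
        # Record every submodule's output under its qualified name
        def make_hook(name):
            def hook(module, inputs, output):
                store[name] = output[0] if isinstance(output, tuple) else output
            return hook
        for name, module in model.named_modules():
            module.register_forward_hook(make_hook(name))

    lit_acts, hf_acts = {}, {}
    capture_activations(lit_model, lit_acts)  # lit_model: placeholder
    capture_activations(hf_model, hf_acts)    # hf_model: placeholder

    idx = torch.tensor([[1, 2, 3, 4]])  # any prompt token ids
    with torch.no_grad():
        lit_model(idx)
        hf_model(idx)

    # Then diff corresponding entries (embedding output, first block's attention, ...),
    # walking down the model until the tensors stop matching.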

@rasbt (Collaborator, Author) commented Feb 14, 2024

> I would strongly prefer that we don't add this new MLP class.

Ok! Maybe let's leave it in there until we get it to work, and then we can refactor it into one of the existing classes somehow.

@rasbt (Collaborator, Author) commented Feb 14, 2024

Just to add a note about pinpointing the difference: with Carlos's help, we found that the difference is currently in how the QKV matrix is split into queries, keys, and values.

https://github.com/Lightning-AI/lit-gpt/blob/main/lit_gpt/model.py#L195-L202

and

https://github.com/allenai/OLMo/blob/main/olmo/model.py#L687
https://github.com/allenai/OLMo/blob/main/olmo/model.py#L559-L571

In Lit-GPT, Q, K, and V are interleaved (to also support MQA), whereas in OLMo they are not interleaved.

We could potentially accommodate OLMo in Lit-GPT if we apply the steps used for Llama in the conversion script, but in reverse: https://github.com/Lightning-AI/lit-gpt/blob/main/scripts/convert_hf_checkpoint.py#L182-L186
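
A sketch of what that reversed step could look like for OLMo's fused att_proj weight (illustrative only; the function and argument names are placeholders, and since OLMo uses plain multi-head attention here, each query group holds exactly one Q, K, and V head):

    import torch

    def interleave_qkv(att_proj_w: torch.Tensor, n_head: int, head_size: int) -> torch.Tensor:
        # OLMo stores the fused projection as [all Q | all K | all V] along dim 0
        q, k, v = att_proj_w.split(n_head * head_size, dim=0)
        # Lit-GPT's attn.attn weight expects per-head interleaving: [q_0, k_0, v_0, q_1, k_1, v_1, ...]
        qs = q.split(head_size, dim=0)
        ks = k.split(head_size, dim=0)
        vs = v.split(head_size, dim=0)
        return torch.cat([t for group in zip(qs, ks, vs) for t in group], dim=0)

    # e.g. for OLMo-1B: interleave_qkv(att_proj_w, n_head=16, head_size=128)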

Successfully merging this pull request may close these issues: Adding OLMo 1B and 7B

3 participants