[WIP] Add OLMo #927
base: main
Conversation
I'm a bit stuck with the conversion and would appreciate your advice and ideas @carmocca or @Andrei-Aksionov! So, here are 3 special things about Olmo:
The HF version is like this, which is confusing, because the ops are not applied in that "sequential" order as far as I can tell:
Unless I'm wrong, I think what we have to do is split the fc weights, which would avoid us having to write custom code in the GPT model class:
Is this somehow possible with the lit-gpt SaveProxyTensor?
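The weight-split idea above can be sketched as follows. This is a minimal illustration, not the actual conversion code: the `ff_proj` name, the toy shapes, and the chunk order are assumptions.

```python
# Hedged sketch: OLMo fuses both MLP input projections into one matrix,
# and the conversion script could chunk it into the two separate weights
# lit-gpt expects. "ff_proj" and the chunk order are assumptions here.
import torch

def split_fused_ff(ff_proj_weight: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    # The fused weight stacks both projections along the output dim (dim 0),
    # so chunking there recovers the two halves.
    w1, w2 = ff_proj_weight.chunk(2, dim=0)
    return w1, w2

fused = torch.randn(8, 4)  # toy shape: (2 * intermediate_size, n_embd)
w1, w2 = split_fused_ff(fused)
assert w1.shape == (4, 4) and w2.shape == (4, 4)
assert torch.equal(torch.cat([w1, w2], dim=0), fused)
```

Doing the split once in the conversion script would keep the model class untouched, at the cost of the checkpoint no longer matching the upstream layout.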
Hey! All your suggestions make sense to me. You should be able to split the combined ff linear as you suggest, especially if load_param has been called already. We also manipulate the qkv linears for llama2 checkpoints in a similar way. However, note that your workarounds will only work for inference. During training, wte and ff_out will not be tied and the layernorm parameters won't be frozen.
Hello @rasbt. Looks like you are correct. I just wanted to add a couple of things that I've noticed while reviewing their code, for posterity, so to speak. I also don't like it when the layers are initialized in a different order than they are executed; Lit-GPT also does this. So the order of execution should be as follows:
Olmo's MLP has only two layers:

```python
def forward(self, x: torch.Tensor) -> torch.Tensor:
    x, gate = x.chunk(2, dim=-1)
    return F.silu(gate) * x
```

and then they apply the activation as shown. It's important to note the way they split and then apply the activation function on a chunk. That means that:
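The chunk-order subtlety above can be demonstrated with toy tensors (a minimal illustration, not OLMo's actual code): OLMo gates with the *second* chunk, so a SwiGLU that gates with the first chunk only matches if the halves are swapped, e.g. by reordering the rows of the fused weight during conversion.

```python
# Minimal illustration of why the chunk order matters for conversion.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(2, 8)

a, gate = x.chunk(2, dim=-1)
olmo_out = F.silu(gate) * a    # OLMo: activation on the second chunk
other_out = F.silu(a) * gate   # common convention: first chunk gates

# The two conventions disagree on the same input...
assert not torch.allclose(olmo_out, other_out)

# ...but swapping the halves (as a weight-row reorder would) makes the
# common convention reproduce OLMo's output.
swapped = torch.cat([gate, a], dim=-1)
b, c = swapped.chunk(2, dim=-1)
assert torch.allclose(F.silu(b) * c, olmo_out)
```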
Thanks so much for the feedback @carmocca and @Andrei-Aksionov, this was super helpful! After more tinkering, I went with a custom OLMoMLP (analogous to LLaMAMLP) because I thought this was easier than the other workarounds -- both from an implementation perspective and for code readability in the future. The weights load OK now, but for some reason the results are garbage. E.g., for `python generate/base.py --checkpoint_dir ./checkpoints/allenai/OLMo-1b/`
and `python generate/base.py --checkpoint_dir ./checkpoints/allenai/OLMo-7b/`
Actually, upon further inspection, they only use weight tying for the 1B model (https://huggingface.co/allenai/OLMo-1B/blob/main/config.json#L42), not for the 7B model (https://huggingface.co/allenai/OLMo-7B/blob/main/config.json#L42). I adjusted the code accordingly. Still not working well, though.
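A per-checkpoint check like the one described could be sketched as below. The `weight_tying` key name is taken from the linked config.json lines but should be treated as an assumption here, not a verified schema.

```python
# Hedged sketch: decide per checkpoint whether wte and the LM head are
# tied, based on the HF config.json ("weight_tying" key is an assumption).
import json
import os
import tempfile

def uses_weight_tying(config_path: str) -> bool:
    with open(config_path) as f:
        return bool(json.load(f).get("weight_tying", False))

# Toy demonstration with a fabricated config file:
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "config.json")
    with open(path, "w") as f:
        json.dump({"weight_tying": True}, f)
    result = uses_weight_tying(path)
assert result is True
```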
I would strongly prefer that we don't add this new MLP class. To debug the output, you'll have to inspect the activations for both models layer by layer to see where they diverge.
Ok! Maybe let's leave it in there until we get it to work, and then we can refactor it into one of the existing classes somehow.
Just to add a note about pinpointing the difference. With Carlos's help, we found that the difference is in how the QKV matrix is split into queries, keys, and values: https://github.com/Lightning-AI/lit-gpt/blob/main/lit_gpt/model.py#L195-L202 vs. https://github.com/allenai/OLMo/blob/main/olmo/model.py#L687. In Lit-GPT, the Q, K, and V are interleaved (to also support MQA), whereas in OLMo, QKV are not interleaved. We could potentially accommodate OLMo in Lit-GPT if we apply the steps from Llama in the conversion script, but in reverse: https://github.com/Lightning-AI/lit-gpt/blob/main/scripts/convert_hf_checkpoint.py#L182-L186
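The reverse conversion described above could be sketched as follows. This is an illustration, not the actual conversion code: the toy shapes and the simplifying assumption that `n_query_groups == n_head` (no MQA) are mine.

```python
# Hedged sketch: reorder OLMo's contiguous fused QKV weight ([Q; K; V]
# stacked along dim 0) into the per-head interleaved layout lit-gpt
# expects ([q0, k0, v0, q1, k1, v1, ...]). No-MQA case only.
import torch

def olmo_qkv_to_interleaved(att_proj: torch.Tensor, n_head: int, head_size: int) -> torch.Tensor:
    n_embd = n_head * head_size
    q, k, v = att_proj.split(n_embd, dim=0)  # contiguous Q, K, V blocks
    q = q.view(n_head, head_size, -1)
    k = k.view(n_head, head_size, -1)
    v = v.view(n_head, head_size, -1)
    # Group each head's q, k, v together, then flatten back to 2D.
    return torch.stack([q, k, v], dim=1).reshape(3 * n_embd, -1)

att = torch.randn(12, 4)  # toy: n_head=2, head_size=2, n_embd=4
out = olmo_qkv_to_interleaved(att, n_head=2, head_size=2)
q, k, v = att.split(4, dim=0)
assert torch.equal(out[0:2], q[0:2])  # head 0 queries come first
assert torch.equal(out[2:4], k[0:2])  # then head 0 keys
assert torch.equal(out[4:6], v[0:2])  # then head 0 values
assert torch.equal(out[6:8], q[2:4])  # then head 1 queries, and so on
```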
Adds the popular and fully open-source OLMo models by Allen AI.
- `generate.py` produces reasonable outputs

Fixes #925