Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Drop interleave placement in QKV matrix #1013

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

Andrei-Aksionov
Copy link
Collaborator

@Andrei-Aksionov Andrei-Aksionov commented Mar 5, 2024

Hi there 👋

This PR changes placement of query, key and value weights in the combined QKV matrix.
Intuitively one might assume that weights in the QKV matrix are combined in a sequential order $[Q,Q,...,K,K,...,V,V,...]$, but in-fact they are currently placed in interleaved one $[Q, K, V, Q, K, V, ..., Q, K, V]$.
I believe this placement was introduced by models that used GPTNeoX MLP layers (Pythia and Falcon), but all the latest models doesn't use such interleaved placement.

That means that:

  1. For a novice ML engineer, who has just started learning LLMs and familiar only with LlaMA-2 based models, that placement might be confusing.
  2. Since it's not that popular amongst newer models, it might hurt "hackability".
  3. Might introduce errors, like we already had not so recently in Question about LoRA implementation when loading Llama2 checkpoint #891
  4. If the repo is planned to support loading/uploading quantized weights from/to HF hub, then someone has to deal with the changed shape of the quantized weights (due to “packing”) and such a conversion might be risky.

Benchmarks

Benchmarks were done on a 1xA10G with the code from main and this PR branches.
Train was done 5 times for each model/branch and the results were averaged. Args: 1 epoch, epoch size is 1k. Other args were by default.
Generate was done 100 times and Genereate with max_new_tokens=1950 - 10 times.

Models: two variants with 1 billion of parameters, simply to speed up benchmarks.
Pythia represents a model with multi-head attention, while TinyLlaMA - grouped-query attention. For multi-query attention there should not be any significant difference.

Mode Model Time $_{main}$ Time $_{PR}$ Token_sec $_{main}$ Token_sec $_{PR}$ VRAM, GB $_{main}$ VRAM, GB $_{PR}$
Train Pythia-1b 33.80 33.61 - - 15.39 15.39
TinyLlaMA-1.1b 44.71 43.43 - - 18.14 18.16
Generate Pythia-1b - - 106.31 108.89 2.04 2.04
TinyLlaMA-1.1b - - 57.12 58.03 2.22 2.21
Generate with max_new_tokens=1950 TinyLlaMA-1.1b - - 57.80 59.03 2.57 2.28

In training mode the PR version is slightly faster, but for TinyLlama VRAM consumption was slightly larger, oddly enough. I tried to do unsqueeze(...).expand(...).reshape(...) but the VRAM consumption stayed the same.

In inference mode the PR version is again faster, but the biggest difference is in VRAM consumption, since we don't need to cache KV after the expansion, like it's done in the current main.

@rasbt
Copy link
Collaborator

rasbt commented Mar 5, 2024

This might actually fix my OLMo PR in #927 😅

@carmocca
Copy link
Member

carmocca commented Mar 5, 2024

You might want to merge OLMo with an interleaving conversion step because this PR is very risky and a breaking change for all existing checkpoints

@Andrei-Aksionov
Copy link
Collaborator Author

@carmocca
Copy link
Member

carmocca commented Mar 6, 2024

@Andrei-Aksionov We need to evaluate if we want to make this change. Especially if there are any performance differences and whether the risk is worth it. But there are two important considerations:

  • This needs to provide conversion logic since all existing checkpoints (from HF, pretrained, or fine-tuned) will be broken. Extending what Adrian is doing in Checkpoint upgrade utility #1019.
  • This would need to finish in 1-2 weeks to be released together with the wip branch, where we are making also breaking changes.

@Andrei-Aksionov
Copy link
Collaborator Author

Andrei-Aksionov commented Mar 6, 2024

These are all good questions.

We need to evaluate if we want to make this change

  1. I believe all the latest models doesn't have an interleaved placement in QKV matrix, it's becoming uncommon, so it can hurt "hackability". Plus, if someone just started learning how LLMs work and uses this repo as a “gateway” to this wonderful world (full of unicorns and double rainbows), then this placement might be confusing.
  2. Interleaved placement might introduce unexpected problems, like with LoRA indices that we had a couple of weeks ago.
  3. If we decide to support loading/downloading GPTQ weights from HF (after the AutoGPTQ PR is merged), then interleaved placement might make the conversion step much-much harder. After quantization the shape is not as for original weights, and we will have to very carefully split and merge weights. This is what I call “risky business”.

Especially if there are any performance differences

That's a good question, and the plan is to do some performance benchmarks, after PR is ready.
One of the concerns is that the current code for grouped queries uses .expand method, which just creates another view and should be definitely faster and more memory efficient than .repeat_interleave. But, .expand makes the tensor non-contiguous and thus .reshape method will create a copy of the tensor.
https://github.com/Lightning-AI/lit-gpt/blob/f241d94df59d82b2017bfdcd3800ac8779eb45f5/lit_gpt/model.py#L211-L217

So, in overall, the performance should be in the same ballpark, but I'll verify it with benchmarks, to be on the safe side.

Additionally, the current code also caches keys and values after they are expanded, making it not very efficient.
The code in PR doesn't do it, but at the same time I have an error with KV-cache, so we shall see whether I can make it work or not.

This needs to provide conversion logic since all existing checkpoints (from HF, pretrained, or fine-tuned) will be broken. Extending what Adrian is doing in #1019.

Yep, agree.

This would need to finish in 1-2 weeks to be released together with the wip branch, where we are making also breaking changes.

I think I can make it by the end of Friday. 🤞

@Andrei-Aksionov Andrei-Aksionov marked this pull request as ready for review March 9, 2024 15:30
@Andrei-Aksionov
Copy link
Collaborator Author

One recommendation, if we want to merge it in some future.
Since we need to deal with legacy checkpoints and somehow determine if we need to reassemble it to "non-interleaved" variant, I think we can rename the linear layer from attn to qkv.
Plus, it will make the naming much cleaner. Right now layers for SelfAttention class are:

  • transformer.h.0.attn.attn
  • transformer.h.0.attn.proj

In my opinion, this variant looks better:

  • transformer.h.0.attn.qkv
  • transformer.h.0.attn.proj

Looks like a win-win situation.

@carmocca
Copy link
Member

carmocca commented May 6, 2024

The name is directly inherited from https://github.com/karpathy/nanoGPT/blob/master/model.py#L35. We took the liberty of dropping out the convolutional past "c_"

@Andrei-Aksionov
Copy link
Collaborator Author

I've just called Andrej, and he doesn't mind if we rename it to qkv.

@carmocca
Copy link
Member

carmocca commented May 6, 2024

Hope he approves the PR then

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants