
Blockwise quantization only supports 16/32-bit floats, but got torch.uint8 (bnb.nf4 quantisation is not working) #1325

Open
Anindyadeep opened this issue Apr 19, 2024 · 15 comments


@Anindyadeep
Contributor

Anindyadeep commented Apr 19, 2024

Hello, I am using the latest version of lit-gpt. First of all, it is much cleaner than before, so amazing work. However, I am facing a problem. After I convert a Hugging Face (Llama 2) model to the lit-gpt format, it runs as expected for:

  1. float32
  2. float16
  3. int8

But when it comes to int4, I get an unexpected error. Here are the logs.

Usage:

I am using the example shown in this tutorial (I only changed the model path):

litgpt generate base --quantize bnb.nf4 --checkpoint_dir /models/llama-2-7b-chat-litgpt/ --precision bf16-true

And I got this error:

Loading model '/models/llama-2-7b-chat-litgpt/lit_model.pth' with {'name': 'Llama-2-7b-chat-hf', 'hf_config': {'name': 'Llama-2-7b-chat-hf', 'org': 'meta-llama'}, 'scale_embeddings': False, 'block_size': 4096, 'vocab_size': 32000, 'padding_multiple': 64, 'padded_vocab_size': 32000, 'n_layer': 32, 'n_head': 32, 'head_size': 128, 'n_embd': 4096, 'rotary_percentage': 1.0, 'parallel_residual': False, 'bias': False, 'lm_head_bias': False, 'n_query_groups': 32, 'shared_attention_norm': False, 'norm_class_name': 'RMSNorm', 'norm_eps': 1e-05, 'mlp_class_name': 'LLaMAMLP', 'gelu_approximate': 'none', 'intermediate_size': 11008, 'rope_condense_ratio': 1, 'rope_base': 10000, 'n_expert': 0, 'n_expert_per_token': 0, 'rope_n_elem': 128}
Time to instantiate model: 0.26 seconds.
Traceback (most recent call last):
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/bin/litgpt", line 8, in <module>
    sys.exit(main())
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/litgpt/__main__.py", line 143, in main
    fn(**kwargs)
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/litgpt/generate/base.py", line 169, in main
    model = fabric.setup_module(model)
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 310, in setup_module
    module = self._move_model_to_device(model=module, optimizers=[])
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 997, in _move_model_to_device
    model = self.to_device(model)
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 528, in to_device
    self._strategy.module_to_device(obj)
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/lightning/fabric/strategies/single_device.py", line 59, in module_to_device
    module.to(self.root_device)
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1152, in to
    return self._apply(convert)
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 825, in _apply
    param_applied = fn(param)
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1150, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 324, in to
    return self._quantize(device)
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 289, in _quantize
    w_4bit, quant_state = bnb.functional.quantize_4bit(
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1234, in quantize_4bit
    raise ValueError(f"Blockwise quantization only supports 16/32-bit floats, but got {A.dtype}")
ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8

cc: @aniketmaurya @Andrei-Aksionov

@Andrei-Aksionov
Collaborator

Hey @Anindyadeep

Based on the error stack trace, it looks like you are trying to load and quantize an already quantized model.

Have you done anything to the weights, or are they just the weights that were downloaded and converted by LitGPT, nothing more?

@Anindyadeep
Contributor Author

Hey, thanks for the reply. The process was:

  1. Use a Hugging Face model.
  2. Then use the litgpt convert to_litgpt --checkpoint_dir command to convert it to the litgpt format.

@Andrei-Aksionov
Collaborator

OK, but what is the dtype of the Hugging Face model?
If it's already in a quantized form (torch.uint8), that might explain the error.
You can provide a link to the repo with the weights and I'll check it.

@Anindyadeep
Contributor Author

OK, but what is the dtype of the Hugging Face model? If it's already in a quantized form (torch.uint8), that might explain the error. You can provide a link to the repo with the weights and I'll check it.

Hi, I see. So here's the thing:

I initially converted the HF weights to int8 using the litgpt CLI, and then I tried to convert those same weights to int4 (which is not possible), and that is probably the reason. Which means that every time I need to start either with the base litgpt weights (in fp32) or with the raw HF weights, right?

Let me try that. If it works, I will let you know and then we can close this issue :)
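
As a side note, a quick way to check whether a converted checkpoint is still in a floating-point dtype before quantizing it is to inspect the dtypes stored in lit_model.pth; a torch.uint8 entry means the checkpoint was already quantized. A minimal sketch (the path is just the one from this thread):

import torch

# Inspect which dtypes the converted checkpoint actually contains.
state_dict = torch.load("models/llama-2-7b-chat-litgpt/lit_model.pth", map_location="cpu")
dtypes = {t.dtype for t in state_dict.values() if torch.is_tensor(t)}
print(dtypes)  # e.g. {torch.float16} can be quantized; torch.uint8 means it was already quantized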

@Andrei-Aksionov
Collaborator

Correct. In order to use quantization you just need weights in a standard precision (fp32, fp16, bf16).
When the model is loaded and quantization is specified (e.g. bnb.nf4), the weights are quantized on the fly.
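
For illustration (this is not LitGPT code), the sketch below shows roughly why a uint8 checkpoint fails: the 4-bit layers in bitsandbytes quantize their weights via bnb.functional.quantize_4bit when the module is moved to the GPU, and that function only accepts 16/32-bit float tensors. The sketch assumes bitsandbytes and a CUDA device are available:

import torch
import bitsandbytes as bnb

# Quantizing a float16 tensor to NF4 works as expected.
w_fp16 = torch.randn(256, 256, dtype=torch.float16, device="cuda")
w_nf4, quant_state = bnb.functional.quantize_4bit(w_fp16, quant_type="nf4")

# Feeding an already-quantized (uint8) tensor reproduces the error from the logs.
w_uint8 = torch.randint(0, 255, (256, 256), dtype=torch.uint8, device="cuda")
bnb.functional.quantize_4bit(w_uint8, quant_type="nf4")
# ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8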

@Anindyadeep
Contributor Author

I see, got it. Let me try this out and I will keep this thread posted. Thanks for the heads-up.

@Anindyadeep
Contributor Author

Anindyadeep commented Apr 23, 2024

Correct. In order to use quantization you just need weights in a standard precision (fp32, fp16, bf16). When the model is loaded and quantization is specified (e.g. bnb.nf4), the weights are quantized on the fly.

Hi, so I tried the whole process once again. Here is what my Llama 2 weights folder contained after I ran this command:

litgpt convert to_litgpt --checkpoint_dir ./models/Llama-2-7b-chat-hf/

The above run was successful, and this is what the ./models/Llama-2-7b-chat-hf/ folder contained:

models/Llama-2-7b-chat-hf/
├── LICENSE.txt
├── README.md
├── USE_POLICY.md
├── config.json
├── generation_config.json
├── lit_model.pth
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
├── model.safetensors.index.json
├── model_config.yaml
├── pytorch_model-00001-of-00002.bin
├── pytorch_model-00002-of-00002.bin
├── pytorch_model.bin.index.json
├── special_tokens_map.json
├── tokenizer.json
├── tokenizer.model
└── tokenizer_config.json

Now I typed this command:

litgpt generate base --quantize bnb.nf4 --checkpoint_dir models/Llama-2-7b-chat-hf --precision bf16-true --max_new_tokens 256

And got this error:

Loading model 'models/Llama-2-7b-chat-hf/lit_model.pth' with {'name': 'Llama-2-7b-chat-hf', 'hf_config': {'name': 'Llama-2-7b-chat-hf', 'org': 'meta-llama'}, 'scale_embeddings': False, 'block_size': 4096, 'vocab_size': 32000, 'padding_multiple': 64, 'padded_vocab_size': 32000, 'n_layer': 32, 'n_head': 32, 'head_size': 128, 'n_embd': 4096, 'rotary_percentage': 1.0, 'parallel_residual': False, 'bias': False, 'lm_head_bias': False, 'n_query_groups': 32, 'shared_attention_norm': False, 'norm_class_name': 'RMSNorm', 'norm_eps': 1e-05, 'mlp_class_name': 'LLaMAMLP', 'gelu_approximate': 'none', 'intermediate_size': 11008, 'rope_condense_ratio': 1, 'rope_base': 10000, 'n_expert': 0, 'n_expert_per_token': 0, 'rope_n_elem': 128}
Time to instantiate model: 0.24 seconds.
Traceback (most recent call last):
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/bin/litgpt", line 8, in <module>
    sys.exit(main())
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/litgpt/__main__.py", line 143, in main
    fn(**kwargs)
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/litgpt/generate/base.py", line 169, in main
    model = fabric.setup_module(model)
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 310, in setup_module
    module = self._move_model_to_device(model=module, optimizers=[])
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 997, in _move_model_to_device
    model = self.to_device(model)
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 528, in to_device
    self._strategy.module_to_device(obj)
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/lightning/fabric/strategies/single_device.py", line 59, in module_to_device
    module.to(self.root_device)
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1152, in to
    return self._apply(convert)
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 825, in _apply
    param_applied = fn(param)
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1150, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 324, in to
    return self._quantize(device)
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 289, in _quantize
    w_4bit, quant_state = bnb.functional.quantize_4bit(
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1234, in quantize_4bit
    raise ValueError(f"Blockwise quantization only supports 16/32-bit floats, but got {A.dtype}")
ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8

@Andrei-Aksionov
Collaborator

Andrei-Aksionov commented Apr 23, 2024

But have you checked the dtype of the original weights (.safetensors) in ./models/Llama-2-7b-chat-hf/?

@Anindyadeep
Contributor Author

Anindyadeep commented Apr 23, 2024

But have you checked the dtype of the weights in ./models/Llama-2-7b-chat-hf/?

Do you mean the weights for the litgpt model or the HF model? Also, as far as the HF model is concerned, those are the actual raw weights of Llama 2, so it is float16.
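
For reference, the storage dtype of the original HF shards can be confirmed directly from the checkpoint files; a minimal sketch, assuming the safetensors package is installed (the path is the one listed above):

from safetensors import safe_open

path = "models/Llama-2-7b-chat-hf/model-00001-of-00002.safetensors"
with safe_open(path, framework="pt", device="cpu") as f:
    name = next(iter(f.keys()))
    print(name, f.get_tensor(name).dtype)  # expected: torch.float16 for the raw Llama 2 weights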

@Anindyadeep
Contributor Author

Here is the HF config

{
  "_name_or_path": "meta-llama/Llama-2-7b-chat-hf",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 4096,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.32.0.dev0",
  "use_cache": true,
  "vocab_size": 32000
}

And here is the lit model_config.yaml file

bias: false
block_size: 4096
gelu_approximate: none
head_size: 128
hf_config:
  name: Llama-2-7b-chat-hf
  org: meta-llama
intermediate_size: 11008
lm_head_bias: false
mlp_class_name: LLaMAMLP
n_embd: 4096
n_expert: 0
n_expert_per_token: 0
n_head: 32
n_layer: 32
n_query_groups: 32
name: Llama-2-7b-chat-hf
norm_class_name: RMSNorm
norm_eps: 1.0e-05
padded_vocab_size: 32000
padding_multiple: 64
parallel_residual: false
rope_base: 10000
rope_condense_ratio: 1
rotary_percentage: 1.0
scale_embeddings: false
shared_attention_norm: false
vocab_size: 32000

@Anindyadeep
Contributor Author

The same thing is happening for Mistral too.

@Andrei-Aksionov
Collaborator

I still don't have access to either Llama 2 or some of the Mistral models.
But when I tried with Phi 2, everything worked fine.
Here is a code snippet (replace repo_id with the one you want to use):

export repo_id=microsoft/phi-2
litgpt download --repo_id $repo_id --convert_checkpoint false
litgpt convert to_litgpt --checkpoint_dir checkpoints/$repo_id
litgpt generate base --quantize bnb.nf4 --checkpoint_dir checkpoints/$repo_id --precision bf16-true --max_new_tokens 256

@Anindyadeep
Contributor Author

Okay, then let me try the same with Mistral, but this time using litgpt download.

@Anindyadeep
Contributor Author

I still don't have access to either Llama 2 or some of the Mistral models. But when I tried with Phi 2, everything worked fine. Here is a code snippet (replace repo_id with the one you want to use):

export repo_id=microsoft/phi-2
litgpt download --repo_id $repo_id --convert_checkpoint false
litgpt convert to_litgpt --checkpoint_dir checkpoints/$repo_id
litgpt generate base --quantize bnb.nf4 --checkpoint_dir checkpoints/$repo_id --precision bf16-true --max_new_tokens 256

I see, but I did the same thing for Mistral v0.1. Here is the set of commands:

export repo_id=mistralai/Mistral-7B-Instruct-v0.1
litgpt download --repo_id $repo_id --convert_checkpoint false --access_token hf_...
litgpt convert to_litgpt --checkpoint_dir checkpoints/$repo_id
litgpt generate base --quantize bnb.nf4 --checkpoint_dir checkpoints/$repo_id --precision bf16-true --max_new_tokens 256

And here are the logs:

(venv) anindya@prem-ai-a100-fin-01:~/workspace/benchmarks$ litgpt download --repo_id $repo_id --convert_checkpoint false --access_token hf_...
(venv) anindya@prem-ai-a100-fin-01:~/workspace/benchmarks$ export repo_id=mistralai/Mistral-7B-Instruct-v0.1
litgpt download --repo_id $repo_id --convert_checkpoint false --access_token hf_...
litgpt convert to_litgpt --checkpoint_dir checkpoints/$repo_id
litgpt generate base --quantize bnb.nf4 --checkpoint_dir checkpoints/$repo_id --precision bf16-true --max_new_tokens 256
Setting HF_HUB_ENABLE_HF_TRANSFER=1
config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 571/571 [00:00<00:00, 3.37MB/s]
generation_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 116/116 [00:00<00:00, 682kB/s]
pytorch_model-00001-of-00002.bin: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.94G/9.94G [01:01<00:00, 162MB/s]
pytorch_model-00002-of-00002.bin: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5.06G/5.06G [00:34<00:00, 148MB/s]
pytorch_model.bin.index.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 23.9k/23.9k [00:00<00:00, 72.8MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.80M/1.80M [00:00<00:00, 3.94MB/s]
tokenizer.model: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 493k/493k [00:00<00:00, 6.39MB/s]
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.47k/1.47k [00:00<00:00, 10.1MB/s]
Processing checkpoints/mistralai/Mistral-7B-Instruct-v0.1/pytorch_model-00001-of-00002.bin
Loading 'model.embed_tokens.weight' into RAM
Loading 'model.layers.0.self_attn.o_proj.weight' into RAM
Loading 'model.layers.0.mlp.gate_proj.weight' into RAM
Loading 'model.layers.0.mlp.up_proj.weight' into RAM
Loading 'model.layers.0.mlp.down_proj.weight' into RAM
Loading 'model.layers.0.input_layernorm.weight' into RAM
Loading 'model.layers.0.post_attention_layernorm.weight' into RAM
Loading 'model.layers.1.self_attn.o_proj.weight' into RAM
Loading 'model.layers.1.mlp.gate_proj.weight' into RAM
Loading 'model.layers.1.mlp.up_proj.weight' into RAM
Loading 'model.layers.1.mlp.down_proj.weight' into RAM
Loading 'model.layers.1.input_layernorm.weight' into RAM
Loading 'model.layers.1.post_attention_layernorm.weight' into RAM
Loading 'model.layers.2.self_attn.o_proj.weight' into RAM
Loading 'model.layers.2.mlp.gate_proj.weight' into RAM
Loading 'model.layers.2.mlp.up_proj.weight' into RAM
Loading 'model.layers.2.mlp.down_proj.weight' into RAM
Loading 'model.layers.2.input_layernorm.weight' into RAM
Loading 'model.layers.2.post_attention_layernorm.weight' into RAM
Loading 'model.layers.3.self_attn.o_proj.weight' into RAM
Loading 'model.layers.3.mlp.gate_proj.weight' into RAM
Loading 'model.layers.3.mlp.up_proj.weight' into RAM
Loading 'model.layers.3.mlp.down_proj.weight' into RAM
Loading 'model.layers.3.input_layernorm.weight' into RAM
Loading 'model.layers.3.post_attention_layernorm.weight' into RAM
Loading 'model.layers.4.self_attn.o_proj.weight' into RAM
Loading 'model.layers.4.mlp.gate_proj.weight' into RAM
Loading 'model.layers.4.mlp.up_proj.weight' into RAM
Loading 'model.layers.4.mlp.down_proj.weight' into RAM
Loading 'model.layers.4.input_layernorm.weight' into RAM
Loading 'model.layers.4.post_attention_layernorm.weight' into RAM
Loading 'model.layers.5.self_attn.o_proj.weight' into RAM
Loading 'model.layers.5.mlp.gate_proj.weight' into RAM
Loading 'model.layers.5.mlp.up_proj.weight' into RAM
Loading 'model.layers.5.mlp.down_proj.weight' into RAM
Loading 'model.layers.5.input_layernorm.weight' into RAM
Loading 'model.layers.5.post_attention_layernorm.weight' into RAM
Loading 'model.layers.6.self_attn.o_proj.weight' into RAM
Loading 'model.layers.6.mlp.gate_proj.weight' into RAM
Loading 'model.layers.6.mlp.up_proj.weight' into RAM
Loading 'model.layers.6.mlp.down_proj.weight' into RAM
Loading 'model.layers.6.input_layernorm.weight' into RAM
Loading 'model.layers.6.post_attention_layernorm.weight' into RAM
Loading 'model.layers.7.self_attn.o_proj.weight' into RAM
Loading 'model.layers.7.mlp.gate_proj.weight' into RAM
Loading 'model.layers.7.mlp.up_proj.weight' into RAM
Loading 'model.layers.7.mlp.down_proj.weight' into RAM
Loading 'model.layers.7.input_layernorm.weight' into RAM
Loading 'model.layers.7.post_attention_layernorm.weight' into RAM
Loading 'model.layers.8.self_attn.o_proj.weight' into RAM
Loading 'model.layers.8.mlp.gate_proj.weight' into RAM
Loading 'model.layers.8.mlp.up_proj.weight' into RAM
Loading 'model.layers.8.mlp.down_proj.weight' into RAM
Loading 'model.layers.8.input_layernorm.weight' into RAM
Loading 'model.layers.8.post_attention_layernorm.weight' into RAM
Loading 'model.layers.9.self_attn.o_proj.weight' into RAM
Loading 'model.layers.9.mlp.gate_proj.weight' into RAM
Loading 'model.layers.9.mlp.up_proj.weight' into RAM
Loading 'model.layers.9.mlp.down_proj.weight' into RAM
Loading 'model.layers.9.input_layernorm.weight' into RAM
Loading 'model.layers.9.post_attention_layernorm.weight' into RAM
Loading 'model.layers.10.self_attn.o_proj.weight' into RAM
Loading 'model.layers.10.mlp.gate_proj.weight' into RAM
Loading 'model.layers.10.mlp.up_proj.weight' into RAM
Loading 'model.layers.10.mlp.down_proj.weight' into RAM
Loading 'model.layers.10.input_layernorm.weight' into RAM
Loading 'model.layers.10.post_attention_layernorm.weight' into RAM
Loading 'model.layers.11.self_attn.o_proj.weight' into RAM
Loading 'model.layers.11.mlp.gate_proj.weight' into RAM
Loading 'model.layers.11.mlp.up_proj.weight' into RAM
Loading 'model.layers.11.mlp.down_proj.weight' into RAM
Loading 'model.layers.11.input_layernorm.weight' into RAM
Loading 'model.layers.11.post_attention_layernorm.weight' into RAM
Loading 'model.layers.12.self_attn.o_proj.weight' into RAM
Loading 'model.layers.12.mlp.gate_proj.weight' into RAM
Loading 'model.layers.12.mlp.up_proj.weight' into RAM
Loading 'model.layers.12.mlp.down_proj.weight' into RAM
Loading 'model.layers.12.input_layernorm.weight' into RAM
Loading 'model.layers.12.post_attention_layernorm.weight' into RAM
Loading 'model.layers.13.self_attn.o_proj.weight' into RAM
Loading 'model.layers.13.mlp.gate_proj.weight' into RAM
Loading 'model.layers.13.mlp.up_proj.weight' into RAM
Loading 'model.layers.13.mlp.down_proj.weight' into RAM
Loading 'model.layers.13.input_layernorm.weight' into RAM
Loading 'model.layers.13.post_attention_layernorm.weight' into RAM
Loading 'model.layers.14.self_attn.o_proj.weight' into RAM
Loading 'model.layers.14.mlp.gate_proj.weight' into RAM
Loading 'model.layers.14.mlp.up_proj.weight' into RAM
Loading 'model.layers.14.mlp.down_proj.weight' into RAM
Loading 'model.layers.14.input_layernorm.weight' into RAM
Loading 'model.layers.14.post_attention_layernorm.weight' into RAM
Loading 'model.layers.15.self_attn.o_proj.weight' into RAM
Loading 'model.layers.15.mlp.gate_proj.weight' into RAM
Loading 'model.layers.15.mlp.up_proj.weight' into RAM
Loading 'model.layers.15.mlp.down_proj.weight' into RAM
Loading 'model.layers.15.input_layernorm.weight' into RAM
Loading 'model.layers.15.post_attention_layernorm.weight' into RAM
Loading 'model.layers.16.self_attn.o_proj.weight' into RAM
Loading 'model.layers.16.mlp.gate_proj.weight' into RAM
Loading 'model.layers.16.mlp.up_proj.weight' into RAM
Loading 'model.layers.16.mlp.down_proj.weight' into RAM
Loading 'model.layers.16.input_layernorm.weight' into RAM
Loading 'model.layers.16.post_attention_layernorm.weight' into RAM
Loading 'model.layers.17.self_attn.o_proj.weight' into RAM
Loading 'model.layers.17.mlp.gate_proj.weight' into RAM
Loading 'model.layers.17.mlp.up_proj.weight' into RAM
Loading 'model.layers.17.mlp.down_proj.weight' into RAM
Loading 'model.layers.17.input_layernorm.weight' into RAM
Loading 'model.layers.17.post_attention_layernorm.weight' into RAM
Loading 'model.layers.18.self_attn.o_proj.weight' into RAM
Loading 'model.layers.18.mlp.gate_proj.weight' into RAM
Loading 'model.layers.18.mlp.up_proj.weight' into RAM
Loading 'model.layers.18.mlp.down_proj.weight' into RAM
Loading 'model.layers.18.input_layernorm.weight' into RAM
Loading 'model.layers.18.post_attention_layernorm.weight' into RAM
Loading 'model.layers.19.self_attn.o_proj.weight' into RAM
Loading 'model.layers.19.mlp.gate_proj.weight' into RAM
Loading 'model.layers.19.mlp.up_proj.weight' into RAM
Loading 'model.layers.19.mlp.down_proj.weight' into RAM
Loading 'model.layers.19.input_layernorm.weight' into RAM
Loading 'model.layers.19.post_attention_layernorm.weight' into RAM
Loading 'model.layers.20.self_attn.o_proj.weight' into RAM
Loading 'model.layers.20.mlp.gate_proj.weight' into RAM
Loading 'model.layers.20.mlp.up_proj.weight' into RAM
Loading 'model.layers.20.mlp.down_proj.weight' into RAM
Loading 'model.layers.20.input_layernorm.weight' into RAM
Loading 'model.layers.20.post_attention_layernorm.weight' into RAM
Loading 'model.layers.21.self_attn.o_proj.weight' into RAM
Loading 'model.layers.21.mlp.gate_proj.weight' into RAM
Loading 'model.layers.21.mlp.up_proj.weight' into RAM
Loading 'model.layers.21.mlp.down_proj.weight' into RAM
Loading 'model.layers.21.input_layernorm.weight' into RAM
Loading 'model.layers.21.post_attention_layernorm.weight' into RAM
Loading 'model.layers.22.self_attn.o_proj.weight' into RAM
Loading 'layer 0 q' into RAM
Loading 'layer 0 k' into RAM
Loading 'layer 0 v' into RAM
Loading 'layer 1 q' into RAM
Loading 'layer 1 k' into RAM
Loading 'layer 1 v' into RAM
Loading 'layer 2 q' into RAM
Loading 'layer 2 k' into RAM
Loading 'layer 2 v' into RAM
Loading 'layer 3 q' into RAM
Loading 'layer 3 k' into RAM
Loading 'layer 3 v' into RAM
Loading 'layer 4 q' into RAM
Loading 'layer 4 k' into RAM
Loading 'layer 4 v' into RAM
Loading 'layer 5 q' into RAM
Loading 'layer 5 k' into RAM
Loading 'layer 5 v' into RAM
Loading 'layer 6 q' into RAM
Loading 'layer 6 k' into RAM
Loading 'layer 6 v' into RAM
Loading 'layer 7 q' into RAM
Loading 'layer 7 k' into RAM
Loading 'layer 7 v' into RAM
Loading 'layer 8 q' into RAM
Loading 'layer 8 k' into RAM
Loading 'layer 8 v' into RAM
Loading 'layer 9 q' into RAM
Loading 'layer 9 k' into RAM
Loading 'layer 9 v' into RAM
Loading 'layer 10 q' into RAM
Loading 'layer 10 k' into RAM
Loading 'layer 10 v' into RAM
Loading 'layer 11 q' into RAM
Loading 'layer 11 k' into RAM
Loading 'layer 11 v' into RAM
Loading 'layer 12 q' into RAM
Loading 'layer 12 k' into RAM
Loading 'layer 12 v' into RAM
Loading 'layer 13 q' into RAM
Loading 'layer 13 k' into RAM
Loading 'layer 13 v' into RAM
Loading 'layer 14 q' into RAM
Loading 'layer 14 k' into RAM
Loading 'layer 14 v' into RAM
Loading 'layer 15 q' into RAM
Loading 'layer 15 k' into RAM
Loading 'layer 15 v' into RAM
Loading 'layer 16 q' into RAM
Loading 'layer 16 k' into RAM
Loading 'layer 16 v' into RAM
Loading 'layer 17 q' into RAM
Loading 'layer 17 k' into RAM
Loading 'layer 17 v' into RAM
Loading 'layer 18 q' into RAM
Loading 'layer 18 k' into RAM
Loading 'layer 18 v' into RAM
Loading 'layer 19 q' into RAM
Loading 'layer 19 k' into RAM
Loading 'layer 19 v' into RAM
Loading 'layer 20 q' into RAM
Loading 'layer 20 k' into RAM
Loading 'layer 20 v' into RAM
Loading 'layer 21 q' into RAM
Loading 'layer 21 k' into RAM
Loading 'layer 21 v' into RAM
Loading 'layer 22 q' into RAM
Loading 'layer 22 k' into RAM
Loading 'layer 22 v' into RAM
Processing checkpoints/mistralai/Mistral-7B-Instruct-v0.1/pytorch_model-00002-of-00002.bin
Loading 'model.layers.22.mlp.gate_proj.weight' into RAM
Loading 'model.layers.22.mlp.up_proj.weight' into RAM
Loading 'model.layers.22.mlp.down_proj.weight' into RAM
Loading 'model.layers.22.input_layernorm.weight' into RAM
Loading 'model.layers.22.post_attention_layernorm.weight' into RAM
Loading 'model.layers.23.self_attn.o_proj.weight' into RAM
Loading 'model.layers.23.mlp.gate_proj.weight' into RAM
Loading 'model.layers.23.mlp.up_proj.weight' into RAM
Loading 'model.layers.23.mlp.down_proj.weight' into RAM
Loading 'model.layers.23.input_layernorm.weight' into RAM
Loading 'model.layers.23.post_attention_layernorm.weight' into RAM
Loading 'model.layers.24.self_attn.o_proj.weight' into RAM
Loading 'model.layers.24.mlp.gate_proj.weight' into RAM
Loading 'model.layers.24.mlp.up_proj.weight' into RAM
Loading 'model.layers.24.mlp.down_proj.weight' into RAM
Loading 'model.layers.24.input_layernorm.weight' into RAM
Loading 'model.layers.24.post_attention_layernorm.weight' into RAM
Loading 'model.layers.25.self_attn.o_proj.weight' into RAM
Loading 'model.layers.25.mlp.gate_proj.weight' into RAM
Loading 'model.layers.25.mlp.up_proj.weight' into RAM
Loading 'model.layers.25.mlp.down_proj.weight' into RAM
Loading 'model.layers.25.input_layernorm.weight' into RAM
Loading 'model.layers.25.post_attention_layernorm.weight' into RAM
Loading 'model.layers.26.self_attn.o_proj.weight' into RAM
Loading 'model.layers.26.mlp.gate_proj.weight' into RAM
Loading 'model.layers.26.mlp.up_proj.weight' into RAM
Loading 'model.layers.26.mlp.down_proj.weight' into RAM
Loading 'model.layers.26.input_layernorm.weight' into RAM
Loading 'model.layers.26.post_attention_layernorm.weight' into RAM
Loading 'model.layers.27.self_attn.o_proj.weight' into RAM
Loading 'model.layers.27.mlp.gate_proj.weight' into RAM
Loading 'model.layers.27.mlp.up_proj.weight' into RAM
Loading 'model.layers.27.mlp.down_proj.weight' into RAM
Loading 'model.layers.27.input_layernorm.weight' into RAM
Loading 'model.layers.27.post_attention_layernorm.weight' into RAM
Loading 'model.layers.28.self_attn.o_proj.weight' into RAM
Loading 'model.layers.28.mlp.gate_proj.weight' into RAM
Loading 'model.layers.28.mlp.up_proj.weight' into RAM
Loading 'model.layers.28.mlp.down_proj.weight' into RAM
Loading 'model.layers.28.input_layernorm.weight' into RAM
Loading 'model.layers.28.post_attention_layernorm.weight' into RAM
Loading 'model.layers.29.self_attn.o_proj.weight' into RAM
Loading 'model.layers.29.mlp.gate_proj.weight' into RAM
Loading 'model.layers.29.mlp.up_proj.weight' into RAM
Loading 'model.layers.29.mlp.down_proj.weight' into RAM
Loading 'model.layers.29.input_layernorm.weight' into RAM
Loading 'model.layers.29.post_attention_layernorm.weight' into RAM
Loading 'model.layers.30.self_attn.o_proj.weight' into RAM
Loading 'model.layers.30.mlp.gate_proj.weight' into RAM
Loading 'model.layers.30.mlp.up_proj.weight' into RAM
Loading 'model.layers.30.mlp.down_proj.weight' into RAM
Loading 'model.layers.30.input_layernorm.weight' into RAM
Loading 'model.layers.30.post_attention_layernorm.weight' into RAM
Loading 'model.layers.31.self_attn.o_proj.weight' into RAM
Loading 'model.layers.31.mlp.gate_proj.weight' into RAM
Loading 'model.layers.31.mlp.up_proj.weight' into RAM
Loading 'model.layers.31.mlp.down_proj.weight' into RAM
Loading 'model.layers.31.input_layernorm.weight' into RAM
Loading 'model.layers.31.post_attention_layernorm.weight' into RAM
Loading 'model.norm.weight' into RAM
Loading 'lm_head.weight' into RAM
Loading 'layer 23 q' into RAM
Loading 'layer 23 k' into RAM
Loading 'layer 23 v' into RAM
Loading 'layer 24 q' into RAM
Loading 'layer 24 k' into RAM
Loading 'layer 24 v' into RAM
Loading 'layer 25 q' into RAM
Loading 'layer 25 k' into RAM
Loading 'layer 25 v' into RAM
Loading 'layer 26 q' into RAM
Loading 'layer 26 k' into RAM
Loading 'layer 26 v' into RAM
Loading 'layer 27 q' into RAM
Loading 'layer 27 k' into RAM
Loading 'layer 27 v' into RAM
Loading 'layer 28 q' into RAM
Loading 'layer 28 k' into RAM
Loading 'layer 28 v' into RAM
Loading 'layer 29 q' into RAM
Loading 'layer 29 k' into RAM
Loading 'layer 29 v' into RAM
Loading 'layer 30 q' into RAM
Loading 'layer 30 k' into RAM
Loading 'layer 30 v' into RAM
Loading 'layer 31 q' into RAM
Loading 'layer 31 k' into RAM
Loading 'layer 31 v' into RAM
Saving converted checkpoint to checkpoints/mistralai/Mistral-7B-Instruct-v0.1
Loading model 'checkpoints/mistralai/Mistral-7B-Instruct-v0.1/lit_model.pth' with {'name': 'Mistral-7B-Instruct-v0.1', 'hf_config': {'name': 'Mistral-7B-Instruct-v0.1', 'org': 'mistralai'}, 'scale_embeddings': False, 'block_size': 4096, 'vocab_size': 32000, 'padding_multiple': 512, 'padded_vocab_size': 32000, 'n_layer': 32, 'n_head': 32, 'head_size': 128, 'n_embd': 4096, 'rotary_percentage': 1.0, 'parallel_residual': False, 'bias': False, 'lm_head_bias': False, 'n_query_groups': 8, 'shared_attention_norm': False, 'norm_class_name': 'RMSNorm', 'norm_eps': 1e-05, 'mlp_class_name': 'LLaMAMLP', 'gelu_approximate': 'none', 'intermediate_size': 14336, 'rope_condense_ratio': 1, 'rope_base': 10000, 'n_expert': 0, 'n_expert_per_token': 0, 'rope_n_elem': 128}
Time to instantiate model: 0.26 seconds.
Traceback (most recent call last):
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/bin/litgpt", line 8, in <module>
    sys.exit(main())
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/litgpt/__main__.py", line 143, in main
    fn(**kwargs)
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/litgpt/generate/base.py", line 169, in main
    model = fabric.setup_module(model)
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 310, in setup_module
    module = self._move_model_to_device(model=module, optimizers=[])
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 997, in _move_model_to_device
    model = self.to_device(model)
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 528, in to_device
    self._strategy.module_to_device(obj)
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/lightning/fabric/strategies/single_device.py", line 59, in module_to_device
    module.to(self.root_device)
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1152, in to
    return self._apply(convert)
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 825, in _apply
    param_applied = fn(param)
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1150, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 324, in to
    return self._quantize(device)
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 289, in _quantize
    w_4bit, quant_state = bnb.functional.quantize_4bit(
  File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1234, in quantize_4bit
    raise ValueError(f"Blockwise quantization only supports 16/32-bit floats, but got {A.dtype}")
ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8

@Andrei-Aksionov
Collaborator

@carmocca Could you look into it? Since I don't have access to the model, I cannot even reproduce the issue.
