Add phi-3 checkpoint #1341
Conversation
rasbt commented on Apr 23, 2024 (edited)
- Verify Phi-3-mini-4k-instruct configs (see the config sketch after this list)
- Add prompt style
- Add other config files
- Add test_model.py
- Add to test_prompts.py
- Update 2 tables in README
- Update download_model_weights.md
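For reference, a minimal sketch of what the Phi-3-mini-4k-instruct entry could look like, assuming litgpt's usual config-dict convention; the values are taken from the model's HF `config.json` and should be verified against the checkpoint:

```python
# Hypothetical config entry following litgpt's config-dict convention.
# All values assumed from microsoft/Phi-3-mini-4k-instruct's config.json;
# double-check against the actual checkpoint.
phi3_mini_4k_instruct = dict(
    name="Phi-3-mini-4k-instruct",
    hf_config=dict(org="microsoft", name="Phi-3-mini-4k-instruct"),
    vocab_size=32000,          # base SentencePiece vocab
    padded_vocab_size=32064,   # 64 added special tokens
    block_size=4096,
    n_embd=3072,
    n_layer=32,
    n_head=32,
    n_query_groups=32,         # no grouped-query attention
    intermediate_size=8192,
    rope_base=10000,
)
```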
There is a modeling_*.py file.
Haha, I finally got the weights loaded, but of course it's never easy ... it's generating gibberish.
Let the easter egg hunt begin 😭
Some more tidbits via Daniel Han:
Ok, it's becoming more interesting.
```diff
@@ -298,6 +298,20 @@ def forward(self, x: torch.Tensor) -> torch.Tensor:
         return self.proj(x)


+class Phi3MLP(nn.Module):
```
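For context, Phi-3's MLP fuses the gate and up projections into a single `gate_up_proj` linear layer. A rough sketch of the behavior, with names and the split order assumed from HF's `modeling_phi3.py`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Phi3MLP(nn.Module):
    """Rough sketch of Phi-3's MLP (names assumed from HF's modeling_phi3.py):
    one fused gate_up_proj whose output is split in half and SiLU-gated."""

    def __init__(self, hidden_size: int, intermediate_size: int) -> None:
        super().__init__()
        self.gate_up_proj = nn.Linear(hidden_size, 2 * intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Split the fused projection into its gate and up halves.
        gate, up = self.gate_up_proj(x).chunk(2, dim=-1)
        return self.down_proj(F.silu(gate) * up)
```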
It should be possible to avoid this class entirely by reshaping the weights for LLaMAMLP during checkpoint conversion.
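Something along these lines could work in the conversion script (a minimal sketch; the litgpt key names `fc_1`/`fc_2` and the stacking order of the fused weight are assumptions to verify):

```python
import torch

def split_phi3_gate_up_proj(fused: torch.Tensor, intermediate_size: int):
    # Assumed layout: Phi-3 stacks [gate; up] along dim 0, giving a
    # (2 * intermediate_size, hidden_size) matrix. Splitting it yields the
    # two separate projections that litgpt's LLaMAMLP (fc_1, fc_2) expects,
    # so no dedicated Phi3MLP class is needed.
    gate_w, up_w = fused.split(intermediate_size, dim=0)
    return gate_w, up_w  # -> mlp.fc_1.weight, mlp.fc_2.weight
```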
100% agree. I was thinking the same thing. Similar to OLMo, I was hoping to get it working first and then simplify from there.
New models by Apple have if-else statements for this case: https://huggingface.co/apple/OpenELM-270M-Instruct/blob/main/modeling_openelm.py#L405-L462
For simplicity, we definitely shouldn't do the same.
Looks like the sliding window number was a typo: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct/commit/b043e05a86cfc77f8d53eb0edf6a33e39afbcb5e
The current code is in an ugly state, but at least the model produces the same output as the HF one. The missing piece is the tokenizer: it has a smaller vocab size (32k vs. 50k) that was extended with 64 special tokens.
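A quick way to inspect the mismatch (a sketch using transformers; the exact counts are assumptions to verify locally):

```python
from transformers import AutoTokenizer

# Compare the base SentencePiece vocab with the full vocab that
# includes the added special tokens.
tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
print(tok.vocab_size)         # base vocab (32,000)
print(len(tok))               # base vocab + added special tokens
print(tok.get_added_vocab())  # the extra special tokens
```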
Yeah, that sounds about right based on the Phi-3 paper.