Update DML version in LLM example (#1098)
## Describe your changes

## Checklist before requesting a review
- [ ] Add unit tests for this change.
- [ ] Make sure all tests can pass.
- [ ] Update documents if necessary.
- [ ] Lint and apply fixes to your code by running `lintrunner -a`
- [ ] Is this a user-facing change? If yes, give a description of this change to be included in the release notes.
- [ ] Is this PR including examples changes? If yes, please remember to update [example documentation](https://github.com/microsoft/Olive/blob/main/docs/source/examples.md) in a follow-up PR.

## (Optional) Issue link
PatriceVignola committed Apr 21, 2024
1 parent 04b4b2c commit 4e23c4c
Showing 3 changed files with 5 additions and 2 deletions.
**examples/directml/llm/README.md** (3 additions, 0 deletions)
````diff
@@ -21,6 +21,7 @@ pip install -e .
 ```
 cd Olive/examples/directml/llm
 pip install -r requirements.txt
+pip install ort-nightly-directml==1.18.0.dev20240419003 --extra-index-url=https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple/
 ```
 
 3. (Only for LLaMA 2) Request access to the LLaMA 2 weights at HuggingFace's [llama-2-7b-chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) or [llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b-hf) repositories.
````
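To confirm that the nightly DirectML build is the one being picked up, you can query ONNX Runtime from Python. This is a generic sanity check, not part of the example scripts:

```python
import onnxruntime as ort

# The nightly DirectML package should report a 1.18 dev version string
# and expose the DirectML execution provider.
print(ort.__version__)                # e.g. 1.18.0.dev20240419003
print(ort.get_available_providers())  # should include "DmlExecutionProvider"
```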
````diff
@@ -52,6 +53,8 @@ The first time this script is invoked can take some time since it will need to d
 
 Once the script successfully completes, the optimized ONNX pipeline will be stored under `models/optimized/<model_name>`.
 
+Note: When converting Mistral, you will see the following error: `failed in shape inference <class 'AssertionError'>`. It occurs because the `MultiHeadAttention` operator does not support Multi-Query Attention, but it is harmless here since the node is converted to `GroupQueryAttention` at the end of the optimization process. You can safely ignore it.
+
 If you only want to run the inference sample (possible once the model has been optimized), run the `run_llm_io_binding.py` helper script:
 
 ```
````
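As background on why the shape-inference error above is benign: Multi-Query and grouped-query attention keep fewer key/value heads than query heads, which a plain `MultiHeadAttention` node cannot express, while `GroupQueryAttention` covers standard MHA (KV heads equal to query heads), MQA (a single KV head), and everything in between. A minimal PyTorch sketch of the idea, illustrative only and unrelated to Olive's or ONNX Runtime's actual operator implementations:

```python
import torch

def grouped_query_attention(q, k, v):
    # q: (batch, num_q_heads, seq_len, head_dim)
    # k, v: (batch, num_kv_heads, seq_len, head_dim), num_kv_heads <= num_q_heads
    group_size = q.shape[1] // k.shape[1]
    # Each group of `group_size` query heads shares one KV head.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-1, -2) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v
```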
**examples/directml/llm/llm.py** (2 additions, 1 deletion)
```diff
@@ -17,7 +17,6 @@
 import config
 import torch
 import transformers
-from chat_app.app import launch_chat_app
 from huggingface_hub import hf_hub_download
 from model_type_mapping import (
     get_all_supported_models,
```
```diff
@@ -360,6 +359,8 @@ def main():
 
     if not args.optimize:
         if args.interactive:
+            from chat_app.app import launch_chat_app
+
             launch_chat_app(args.expose_locally)
         else:
             with warnings.catch_warnings():
```
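Moving the `chat_app` import into the interactive branch is the usual deferred-import pattern: the chat UI's dependencies are only loaded when the app is actually launched, so optimize-only runs neither pay for nor require them. A minimal standalone illustration of the pattern, with a hypothetical `chat_ui` module standing in for `chat_app`:

```python
import argparse

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--interactive", action="store_true")
    args = parser.parse_args()

    if args.interactive:
        # Deferred import: the module (and everything it pulls in)
        # is only loaded when this branch actually runs.
        from chat_ui import launch  # hypothetical module

        launch()
    else:
        print("chat_ui was never imported on this path")

if __name__ == "__main__":
    main()
```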
**examples/directml/llm/requirements.txt** (0 additions, 1 deletion)
```diff
@@ -2,7 +2,6 @@ huggingface-hub
 markdown
 mdtex2html
 neural-compressor
-onnxruntime-directml>=1.17.4
 optimum
 protobuf==3.20.3 # protobuf 4.x aborts with OOM when optimizing large models
 Pygments
```
