Update DML version in LLM example (#1098)
## Describe your changes

## Checklist before requesting a review
- [ ] Add unit tests for this change.
- [ ] Make sure all tests can pass.
- [ ] Update documents if necessary.
- [ ] Lint and apply fixes to your code by running `lintrunner -a`
- [ ] Is this a user-facing change? If yes, give a description of this change to be included in the release notes.
- [ ] Is this PR including examples changes? If yes, please remember to update [example documentation](https://github.com/microsoft/Olive/blob/main/docs/source/examples.md) in a follow-up PR.

## (Optional) Issue link
PatriceVignola committed Apr 21, 2024
1 parent 04b4b2c commit 4e23c4c
Showing 3 changed files with 5 additions and 2 deletions.
**examples/directml/llm/README.md** (3 additions, 0 deletions)
````diff
@@ -21,6 +21,7 @@ pip install -e .
 ```
 cd Olive/examples/directml/llm
 pip install -r requirements.txt
+pip install ort-nightly-directml==1.18.0.dev20240419003 --extra-index-url=https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple/
 ```
 
 3. (Only for LLaMA 2) Request access to the LLaMA 2 weights at HuggingFace's [llama-2-7b-chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) or [llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b-hf) repositories.
````
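To confirm that the nightly DirectML build is the one being picked up, you can query ONNX Runtime from Python. This is a generic sanity check, not part of the example scripts:

```python
import onnxruntime as ort

# The nightly DirectML package should report a 1.18 dev version string
# and expose the DirectML execution provider.
print(ort.__version__)                # e.g. 1.18.0.dev20240419003
print(ort.get_available_providers())  # should include "DmlExecutionProvider"
```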
````diff
@@ -52,6 +53,8 @@ The first time this script is invoked can take some time since it will need to d
 
 Once the script successfully completes, the optimized ONNX pipeline will be stored under `models/optimized/<model_name>`.
 
+Note: When converting Mistral, you will see the following error: `failed in shape inference <class 'AssertionError'>`. It occurs because the `MultiHeadAttention` operator does not support Multi-Query Attention, but it is harmless here since the node is converted to `GroupQueryAttention` at the end of the optimization process. You can safely ignore it.
+
 If you only want to run the inference sample (possible once the model has been optimized), run the `run_llm_io_binding.py` helper script:
 
 ```
````
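As background on why the shape-inference error above is benign: Multi-Query and grouped-query attention keep fewer key/value heads than query heads, which a plain `MultiHeadAttention` node cannot express, while `GroupQueryAttention` covers standard MHA (KV heads equal to query heads), MQA (a single KV head), and everything in between. A minimal PyTorch sketch of the idea, illustrative only and unrelated to Olive's or ONNX Runtime's actual operator implementations:

```python
import torch

def grouped_query_attention(q, k, v):
    # q: (batch, num_q_heads, seq_len, head_dim)
    # k, v: (batch, num_kv_heads, seq_len, head_dim), num_kv_heads <= num_q_heads
    group_size = q.shape[1] // k.shape[1]
    # Each group of `group_size` query heads shares one KV head.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-1, -2) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v
```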
**examples/directml/llm/llm.py** (2 additions, 1 deletion)
```diff
@@ -17,7 +17,6 @@
 import config
 import torch
 import transformers
-from chat_app.app import launch_chat_app
 from huggingface_hub import hf_hub_download
 from model_type_mapping import (
     get_all_supported_models,
```
```diff
@@ -360,6 +359,8 @@ def main():
 
     if not args.optimize:
         if args.interactive:
+            from chat_app.app import launch_chat_app
+
             launch_chat_app(args.expose_locally)
         else:
             with warnings.catch_warnings():
```
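Moving the `chat_app` import into the interactive branch is the usual deferred-import pattern: the chat UI's dependencies are only loaded when the app is actually launched, so optimize-only runs neither pay for nor require them. A minimal standalone illustration of the pattern, with a hypothetical `chat_ui` module standing in for `chat_app`:

```python
import argparse

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--interactive", action="store_true")
    args = parser.parse_args()

    if args.interactive:
        # Deferred import: the module (and everything it pulls in)
        # is only loaded when this branch actually runs.
        from chat_ui import launch  # hypothetical module

        launch()
    else:
        print("chat_ui was never imported on this path")

if __name__ == "__main__":
    main()
```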
**examples/directml/llm/requirements.txt** (0 additions, 1 deletion)
```diff
@@ -2,7 +2,6 @@ huggingface-hub
 markdown
 mdtex2html
 neural-compressor
-onnxruntime-directml>=1.17.4
 optimum
 protobuf==3.20.3 # protobuf 4.x aborts with OOM when optimizing large models
 Pygments
```
