Failed to run symbolic shape inference when doing LLM Optimization with DirectML #1093

Open
jojo1899 opened this issue Apr 18, 2024 · 8 comments

jojo1899 commented Apr 18, 2024

Describe the bug
I am trying to run the code in LLM Optimization with DirectML. The requirements.txt file says onnxruntime-directml>=1.17.4. Is there a typo in that? The latest version seems to be onnxruntime-directml 1.17.3. Executing pip install -r requirements.txt results in the following error.

ERROR: Could not find a version that satisfies the requirement onnxruntime-directml>=1.17.4 (from versions: 1.9.0, 1.10.0, 1.11.0, 1.11.1, 1.12.0, 1.12.1, 1.13.1, 1.14.0, 1.14.1, 1.15.0, 1.15.1, 1.16.0, 1.16.1, 1.16.2, 1.16.3, 1.17.0, 1.17.1, 1.17.3)
ERROR: No matching distribution found for onnxruntime-directml>=1.17.4

I continued running the code with onnxruntime-directml 1.17.3. However, the LLM Optimization with DirectML sample does not run as expected when the following is executed: python llm.py --model_type=mistral-7b-chat.
It fails to run symbolic shape inference and then fails to run Olive on gpu-dml. The traceback is pasted in the Olive logs below.

To Reproduce
python llm.py --model_type=mistral-7b-chat

Expected behavior
Expected the code to run without any errors

Olive config
Add Olive configurations here.

Olive logs

>python llm.py --model_type=mistral-7b-chat
Optimizing mistralai/Mistral-7B-Instruct-v0.1
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:04<00:00,  2.40s/it]
[2024-04-18 17:21:47,163] [INFO] [run.py:261:run] Loading Olive module configuration from: C:\Olive\olive\olive_config.json
[2024-04-18 17:21:47,322] [INFO] [accelerator.py:336:create_accelerators] Running workflow on accelerator specs: gpu-dml
[2024-04-18 17:21:47,322] [INFO] [engine.py:106:initialize] Using cache directory: cache
[2024-04-18 17:21:47,322] [INFO] [engine.py:262:run] Running Olive on accelerator: gpu-dml
[2024-04-18 17:21:47,343] [INFO] [engine.py:864:_run_pass] Running pass convert:OnnxConversion
[2024-04-18 17:28:25,784] [INFO] [engine.py:951:_run_pass] Pass convert:OnnxConversion finished in 398.437406 seconds
[2024-04-18 17:28:25,784] [INFO] [engine.py:864:_run_pass] Running pass optimize:OrtTransformersOptimization
failed in shape inference <class 'AssertionError'>
Failed to run symbolic shape inference. Please file an issue in https://github.com/microsoft/onnxruntime.
failed in shape inference <class 'AssertionError'>
[2024-04-18 17:44:22,031] [INFO] [transformer_optimization.py:420:_replace_mha_with_gqa] Replaced 32 MultiHeadAttention nodes with GroupQueryAttention
[2024-04-18 17:44:34,625] [INFO] [engine.py:951:_run_pass] Pass optimize:OrtTransformersOptimization finished in 968.824505 seconds
[2024-04-18 17:44:34,647] [INFO] [engine.py:842:_run_passes] Run model evaluation for the final model...
[2024-04-18 17:44:35,300] [WARNING] [engine.py:357:run_accelerator] Failed to run Olive on gpu-dml.
Traceback (most recent call last):
  File "C:\Olive\olive\engine\engine.py", line 346, in run_accelerator
    output_footprint = self.run_search(
  File "C:\Olive\olive\engine\engine.py", line 531, in run_search
    should_prune, signal, model_ids = self._run_passes(
  File "C:\Olive\olive\engine\engine.py", line 843, in _run_passes
    signal = self._evaluate_model(model_config, model_id, data_root, evaluator_config, accelerator_spec)
  File "C:\Olive\olive\engine\engine.py", line 1041, in _evaluate_model
    signal = self.target.evaluate_model(model_config, data_root, metrics, accelerator_spec)
  File "C:\Olive\olive\systems\local.py", line 46, in evaluate_model
    return evaluator.evaluate(model, data_root, metrics, device=device, execution_providers=execution_providers)
  File "C:\Olive\olive\evaluator\olive_evaluator.py", line 214, in evaluate
    metrics_res[metric.name] = self._evaluate_latency(
  File "C:\Olive\olive\evaluator\olive_evaluator.py", line 132, in _evaluate_latency
    latencies = self._evaluate_raw_latency(
  File "C:\Olive\olive\evaluator\olive_evaluator.py", line 767, in _evaluate_raw_latency
    return self._evaluate_onnx_latency(model, metric, dataloader, post_func, device, execution_providers)
  File "C:\Olive\olive\evaluator\olive_evaluator.py", line 540, in _evaluate_onnx_latency
    session, inference_settings = OnnxEvaluator.get_session_wrapper(
  File "C:\Olive\olive\evaluator\olive_evaluator.py", line 435, in get_session_wrapper
    session = model.prepare_session(
  File "C:\Olive\olive\model\handler\onnx.py", line 114, in prepare_session
    return get_ort_inference_session(
  File "C:\Olive\olive\common\ort_inference.py", line 118, in get_ort_inference_session
    session = ort.InferenceSession(
  File "C:\Anaconda\envs\myolive\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "C:\Anaconda\envs\myolive\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 483, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidGraph: [ONNXRuntimeError] : 10 : INVALID_GRAPH : This is an invalid model. Type Error: Type 'tensor(float)' of input parameter (InsertedPrecisionFreeCast_/model/layers.0/self_attn/rotary_embedding/Add_output_0) of operator (GroupQueryAttention) in node (GroupQueryAttention_0) is invalid.
[2024-04-18 17:44:35,380] [INFO] [engine.py:279:run] Run history for gpu-dml:
[2024-04-18 17:44:35,459] [INFO] [engine.py:567:dump_run_history] run history:
+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+-----------------------------+----------------+-----------+
| model_id                                                                           | parent_model_id                                                                    | from_pass                   |   duration_sec | metrics   |
+====================================================================================+====================================================================================+=============================+================+===========+
| ce39a7112b2825df5404fbb628c489ab                                                   |                                                                                    |                             |                |           |
+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+-----------------------------+----------------+-----------+
| 0_OnnxConversion-ce39a7112b2825df5404fbb628c489ab-dfaff1da61d127bb5e9dc2f31a708897 | ce39a7112b2825df5404fbb628c489ab                                                   | OnnxConversion              |        398.437 |           |
+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+-----------------------------+----------------+-----------+
| 1_OrtTransformersOptimization-0-d4c4ec660cc893c3eeab183690fc3aca-gpu-dml           | 0_OnnxConversion-ce39a7112b2825df5404fbb628c489ab-dfaff1da61d127bb5e9dc2f31a708897 | OrtTransformersOptimization |        968.825 |           |
+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+-----------------------------+----------------+-----------+
[2024-04-18 17:44:35,459] [INFO] [engine.py:294:run] No packaging config provided, skip packaging artifacts
Traceback (most recent call last):
  File "C:\Olive\examples\directml\llm\llm.py", line 390, in <module>
    main()
  File "C:\Olive\examples\directml\llm\llm.py", line 350, in main
    optimize(
  File "C:\Olive\examples\directml\llm\llm.py", line 237, in optimize
    with footprints_file_path.open("r") as footprint_file:
  File "C:\Anaconda\envs\myolive\lib\pathlib.py", line 1252, in open
    return io.open(self, mode, buffering, encoding, errors, newline,
  File "C:\Anaconda\envs\myolive\lib\pathlib.py", line 1120, in _opener
    return self._accessor.open(self, flags, mode)
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Olive\\examples\\directml\\llm\\footprints\\mistralai_Mistral-7B-Instruct-v0.1_gpu-dml_footprints.json'

Other information

  • OS: Windows 11
  • Olive version: olive-ai 0.6.0
  • ONNXRuntime package and version: onnxruntime-gpu 1.17.1

Additional context
Add any other context about the problem here.
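
For reference, here is a minimal diagnostic sketch (an editorial addition, not part of the original report; the model path is illustrative) that lists the element type of each input feeding the GroupQueryAttention nodes, to confirm the tensor(float) vs. float16 mismatch named in the INVALID_GRAPH error above:

import onnx

# Load the optimized model produced by the OrtTransformersOptimization pass
# (path is illustrative; external weights are not needed for this check).
model = onnx.load("cache/models/optimized/model.onnx", load_external_data=False)
graph = model.graph

# Map tensor names to their element types where the graph records them.
elem_types = {vi.name: vi.type.tensor_type.elem_type
              for vi in list(graph.value_info) + list(graph.input) + list(graph.output)}

for node in graph.node:
    if node.op_type == "GroupQueryAttention":
        for name in node.input:
            if name in elem_types:
                print(node.name, name,
                      onnx.TensorProto.DataType.Name(elem_types[name]))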

PatriceVignola (Contributor) commented

Hi @jojo1899,

This sample requires a future version of onnxruntime-directml (tentatively 1.17.4, as you've seen in the requirements). The new version should be out very soon, and at the very least you should be able to use a nightly build to run the sample in the meantime.

jojo1899 commented Apr 19, 2024

@PatriceVignola Thanks for the information.
I tried executing the code again, twice, with different nightly builds: ort-nightly-directml 1.18.0.dev20240117005 (Jan 17 build) and ort-nightly-directml 1.18.0.dev20240417007 (Apr 17 build). I get the same error as with onnxruntime-directml 1.17.3. Is that strange or as expected?

PatriceVignola (Contributor) commented

@jojo1899 Yes, this is expected. You can keep an eye on the following two PRs, which are required to run this sample:

microsoft/onnxruntime#20308
microsoft/onnxruntime#20327

Once they are merged in (which will 100% be today), it will take one or two days for the changes to make it into a nightly build. I expect the next nightly build to have them. I will update the requirements once that build has been generated.

@jambayk linked a pull request Apr 21, 2024 that will close this issue
@jambayk removed a link to a pull request Apr 21, 2024

PatriceVignola commented Apr 21, 2024

Hi @jojo1899, we just updated the LLM sample to add the correct version of onnxruntime-directml to use. You can simply run

pip install ort-nightly-directml==1.18.0.dev20240419003 --extra-index-url=https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple/

Note that when converting Mistral, you will still see the failed in shape inference <class 'AssertionError'> error, but it is a false positive (there's a full explanation in the README). The optimization process should still complete successfully unless you run out of memory.
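
As a quick sanity check after installing the nightly (a small editorial sketch, not part of the original comment), you can confirm which onnxruntime build is actually loaded and that the DirectML execution provider is available:

import onnxruntime as ort

print(ort.__version__)                # should report the 1.18.0 nightly version
print(ort.get_available_providers())  # should include "DmlExecutionProvider"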


jojo1899 commented Apr 22, 2024

I tried running the code again. I get the following error when quantizing the model using AWQ.

2024-04-22 11:11:21 [INFO] Quantize the model with default config.
Progress: [                    ] 0.78%Running model for sample 0
Running model for sample 1
2024-04-22 11:11:55 [ERROR] Unexpected exception Fail('[ONNXRuntimeError] : 1 : FAIL : C:\\a\\_work\\1\\s\\onnxruntime\\core\\providers\\dml\\DmlExecutionProvider\\src\\DmlCommandRecorder.cpp(371)\\onnxruntime_pybind11_state.pyd!00007FFE81F92BFE: (caller: 00007FFE81F79804) Exception(1) tid(3004c) 887A0006 The GPU will not respond to more commands, most likely because of an invalid command passed by the calling application.\r\n') happened during tuning.

Some details:
My GPU: NVIDIA 4070 Ti Super
Installed PyTorch using: pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

I haven't tried running the model without quantizing it, but I will do that in a while and give an update.

I have a question about the following warning in the log:
[WARNING] Backend `onnxrt_dml_ep` requires a NPU device. Reset device to 'npu'.
Isn't DirectML EP supposed to work with GPUs? Why does it require an NPU?

Here is the log

C:\Olive\examples\directml\llm>python llm.py --model_type=mistral-7b-chat --quant_strategy=awq

Optimizing mistralai/Mistral-7B-Instruct-v0.1
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00,  3.20s/it]
[2024-04-22 10:47:03,568] [INFO] [run.py:261:run] Loading Olive module configuration from: C:\Anaconda\envs\myolive\lib\site-packages\olive\olive_config.json
[2024-04-22 10:47:03,727] [INFO] [accelerator.py:336:create_accelerators] Running workflow on accelerator specs: gpu-dml
[2024-04-22 10:47:03,727] [INFO] [engine.py:106:initialize] Using cache directory: cache
[2024-04-22 10:47:03,727] [INFO] [engine.py:262:run] Running Olive on accelerator: gpu-dml
[2024-04-22 10:47:06,491] [INFO] [engine.py:864:_run_pass] Running pass convert:OnnxConversion
[2024-04-22 10:53:13,805] [INFO] [engine.py:951:_run_pass] Pass convert:OnnxConversion finished in 367.314774 seconds
[2024-04-22 10:53:13,817] [INFO] [engine.py:864:_run_pass] Running pass optimize:OrtTransformersOptimization
failed in shape inference <class 'AssertionError'>
Failed to run symbolic shape inference. Please file an issue in https://github.com/microsoft/onnxruntime.
failed in shape inference <class 'AssertionError'>
[2024-04-22 11:06:46,649] [INFO] [transformer_optimization.py:420:_replace_mha_with_gqa] Replaced 32 MultiHeadAttention nodes with GroupQueryAttention
[2024-04-22 11:06:58,912] [INFO] [engine.py:951:_run_pass] Pass optimize:OrtTransformersOptimization finished in 825.090703 seconds
[2024-04-22 11:06:58,928] [INFO] [engine.py:864:_run_pass] Running pass quantize:IncStaticQuantization
[2024-04-22 11:07:02,840] [WARNING] [inc_quantization.py:440:_set_tuning_config] 'metric' is not set for INC Quantization Pass. Intel® Neural Compressor will quantize model without accuracy aware tuning. Please set 'metric' if you want to use Intel® Neural Compressorquantization with accuracy aware tuning.
2024-04-22 11:11:08 [INFO] Start auto tuning.
2024-04-22 11:11:08 [INFO] Quantize model without tuning!
2024-04-22 11:11:08 [INFO] Quantize the model with default configuration without evaluating the model.                To perform the tuning process, please either provide an eval_func or provide an                    eval_dataloader an eval_metric.
2024-04-22 11:11:08 [INFO] Adaptor has 5 recipes.
2024-04-22 11:11:08 [INFO] 0 recipes specified by user.
2024-04-22 11:11:08 [INFO] 3 recipes require future tuning.
2024-04-22 11:11:08 [WARNING] Backend `onnxrt_dml_ep` requires a NPU device. Reset device to 'npu'.
2024-04-22 11:11:08 [INFO] *** Initialize auto tuning
Exception in thread Thread-4:
2024-04-22 11:11:08 [INFO] {
Traceback (most recent call last):
  File "C:\Anaconda\envs\myolive\lib\threading.py", line 980, in _bootstrap_inner
2024-04-22 11:11:08 [INFO]     'PostTrainingQuantConfig': {
2024-04-22 11:11:08 [INFO]         'AccuracyCriterion': {
2024-04-22 11:11:08 [INFO]             'criterion': 'relative',
2024-04-22 11:11:08 [INFO]             'higher_is_better': True,
2024-04-22 11:11:08 [INFO]             'tolerable_loss': 0.01,
2024-04-22 11:11:08 [INFO]             'absolute': None,
2024-04-22 11:11:08 [INFO]             'keys': <bound method AccuracyCriterion.keys of <neural_compressor.config.AccuracyCriterion object at 0x000002D685301C40>>,
2024-04-22 11:11:08 [INFO]             'relative': 0.01
2024-04-22 11:11:08 [INFO]         },
2024-04-22 11:11:08 [INFO]         'approach': 'post_training_weight_only',
2024-04-22 11:11:08 [INFO]         'backend': 'onnxrt_dml_ep',
2024-04-22 11:11:08 [INFO]         'calibration_sampling_size': [
2024-04-22 11:11:08 [INFO]             8
2024-04-22 11:11:08 [INFO]         ],
2024-04-22 11:11:08 [INFO]         'device': 'gpu',
2024-04-22 11:11:08 [INFO]         'diagnosis': False,
2024-04-22 11:11:08 [INFO]         'domain': 'auto',
2024-04-22 11:11:08 [INFO]         'example_inputs': 'Not printed here due to large size tensors...',
2024-04-22 11:11:08 [INFO]         'excluded_precisions': [
2024-04-22 11:11:08 [INFO]         ],
2024-04-22 11:11:08 [INFO]         'framework': 'onnxruntime',
2024-04-22 11:11:08 [INFO]         'inputs': [
2024-04-22 11:11:08 [INFO]         ],
2024-04-22 11:11:08 [INFO]         'model_name': '',
2024-04-22 11:11:08 [INFO]         'ni_workload_name': 'quantization',
2024-04-22 11:11:08 [INFO]         'op_name_dict': None,
2024-04-22 11:11:08 [INFO]         'op_type_dict': {
2024-04-22 11:11:08 [INFO]             '.*': {
2024-04-22 11:11:08 [INFO]                 'weight': {
2024-04-22 11:11:08 [INFO]                     'bits': [
2024-04-22 11:11:08 [INFO]                         4
2024-04-22 11:11:08 [INFO]                     ],
2024-04-22 11:11:08 [INFO]                     'group_size': [
2024-04-22 11:11:08 [INFO]                         32
2024-04-22 11:11:08 [INFO]                     ],
2024-04-22 11:11:08 [INFO]                     'scheme': [
2024-04-22 11:11:08 [INFO]                         'asym'
2024-04-22 11:11:08 [INFO]                     ],
2024-04-22 11:11:08 [INFO]                     'algorithm': [
2024-04-22 11:11:08 [INFO]                         'AWQ'
2024-04-22 11:11:08 [INFO]                     ]
2024-04-22 11:11:08 [INFO]                 }
2024-04-22 11:11:08 [INFO]             }
2024-04-22 11:11:08 [INFO]         },
2024-04-22 11:11:08 [INFO]         'outputs': [
2024-04-22 11:11:08 [INFO]         ],
2024-04-22 11:11:08 [INFO]         'quant_format': 'QOperator',
2024-04-22 11:11:08 [INFO]         'quant_level': 'auto',
2024-04-22 11:11:08 [INFO]         'recipes': {
2024-04-22 11:11:08 [INFO]             'smooth_quant': False,
2024-04-22 11:11:08 [INFO]             'smooth_quant_args': {
2024-04-22 11:11:08 [INFO]             },
2024-04-22 11:11:08 [INFO]             'layer_wise_quant': False,
2024-04-22 11:11:08 [INFO]             'layer_wise_quant_args': {
2024-04-22 11:11:08 [INFO]             },
2024-04-22 11:11:08 [INFO]             'fast_bias_correction': False,
2024-04-22 11:11:08 [INFO]             'weight_correction': False,
2024-04-22 11:11:08 [INFO]             'gemm_to_matmul': True,
2024-04-22 11:11:08 [INFO]             'graph_optimization_level': None,
2024-04-22 11:11:08 [INFO]             'first_conv_or_matmul_quantization': True,
2024-04-22 11:11:08 [INFO]             'last_conv_or_matmul_quantization': True,
2024-04-22 11:11:08 [INFO]             'pre_post_process_quantization': True,
2024-04-22 11:11:08 [INFO]             'add_qdq_pair_to_weight': False,
2024-04-22 11:11:08 [INFO]             'optypes_to_exclude_output_quant': [
2024-04-22 11:11:08 [INFO]             ],
2024-04-22 11:11:08 [INFO]             'dedicated_qdq_pair': False,
2024-04-22 11:11:08 [INFO]             'rtn_args': {
2024-04-22 11:11:08 [INFO]             },
2024-04-22 11:11:08 [INFO]             'awq_args': {
2024-04-22 11:11:08 [INFO]             },
2024-04-22 11:11:08 [INFO]             'gptq_args': {
2024-04-22 11:11:08 [INFO]             },
2024-04-22 11:11:08 [INFO]             'teq_args': {
2024-04-22 11:11:08 [INFO]             }
2024-04-22 11:11:08 [INFO]         },
2024-04-22 11:11:08 [INFO]         'reduce_range': False,
2024-04-22 11:11:08 [INFO]         'TuningCriterion': {
2024-04-22 11:11:08 [INFO]             'max_trials': 100,
2024-04-22 11:11:08 [INFO]             'objective': [
2024-04-22 11:11:08 [INFO]                 'performance'
2024-04-22 11:11:08 [INFO]             ],
2024-04-22 11:11:08 [INFO]             'strategy': 'basic',
2024-04-22 11:11:08 [INFO]             'strategy_kwargs': None,
2024-04-22 11:11:08 [INFO]             'timeout': 0
2024-04-22 11:11:08 [INFO]         },
2024-04-22 11:11:08 [INFO]         'use_bf16': True
2024-04-22 11:11:08 [INFO]     }
2024-04-22 11:11:08 [INFO] }
2024-04-22 11:11:08 [WARNING] [Strategy] Please install `mpi4py` correctly if using distributed tuning; otherwise, ignore this warning.
2024-04-22 11:11:08 [WARNING] The model is automatically detected as a non-NLP model. You can use 'domain' argument in 'PostTrainingQuantConfig' to overwrite it
2024-04-22 11:11:08 [WARNING] Graph optimization level is automatically set to ENABLE_BASIC. You can use 'recipe' argument in 'PostTrainingQuantConfig'to overwrite it
    self.run()
  File "C:\Anaconda\envs\myolive\lib\threading.py", line 1304, in run
    self.finished.wait(self.interval)
  File "C:\Anaconda\envs\myolive\lib\threading.py", line 581, in wait
    signaled = self._cond.wait(timeout)
  File "C:\Anaconda\envs\myolive\lib\threading.py", line 316, in wait
    gotit = waiter.acquire(True, timeout)
OverflowError: timeout value is too large
2024-04-22 11:11:21 [INFO] Do not evaluate the baseline and quantize the model with default configuration.
2024-04-22 11:11:21 [INFO] Quantize the model with default config.
Progress: [                    ] 0.78%Running model for sample 0
Running model for sample 1
2024-04-22 11:11:55 [ERROR] Unexpected exception Fail('[ONNXRuntimeError] : 1 : FAIL : C:\\a\\_work\\1\\s\\onnxruntime\\core\\providers\\dml\\DmlExecutionProvider\\src\\DmlCommandRecorder.cpp(371)\\onnxruntime_pybind11_state.pyd!00007FFE81F92BFE: (caller: 00007FFE81F79804) Exception(1) tid(3004c) 887A0006 The GPU will not respond to more commands, most likely because of an invalid command passed by the calling application.\r\n') happened during tuning.
Traceback (most recent call last):
  File "C:\Anaconda\envs\myolive\lib\site-packages\neural_compressor\quantization.py", line 234, in fit
    strategy.traverse()
  File "C:\Anaconda\envs\myolive\lib\site-packages\neural_compressor\strategy\auto.py", line 140, in traverse
    super().traverse()
  File "C:\Anaconda\envs\myolive\lib\site-packages\neural_compressor\strategy\strategy.py", line 508, in traverse
    q_model = self.adaptor.quantize(copy.deepcopy(tune_cfg), self.model, self.calib_dataloader, self.q_func)
  File "C:\Anaconda\envs\myolive\lib\site-packages\neural_compressor\utils\utility.py", line 304, in fi
    res = func(*args, **kwargs)
  File "C:\Anaconda\envs\myolive\lib\site-packages\neural_compressor\adaptor\onnxrt.py", line 1965, in quantize
    tmp_model = awq_quantize(
  File "C:\Anaconda\envs\myolive\lib\site-packages\neural_compressor\adaptor\ox_utils\weight_only.py", line 844, in awq_quantize
    output = session.run([input_name], inp)
  File "C:\Anaconda\envs\myolive\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 220, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : C:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\DmlCommandRecorder.cpp(371)\onnxruntime_pybind11_state.pyd!00007FFE81F92BFE: (caller: 00007FFE81F79804) Exception(1) tid(3004c) 887A0006 The GPU will not respond to more commands, most likely because of an invalid command passed by the calling application.

2024-04-22 11:11:55 [ERROR] Specified timeout or max trials is reached! Not found any quantized model which meet accuracy goal. Exit.
Traceback (most recent call last):
  File "C:\Olive\examples\directml\llm\llm.py", line 391, in <module>
    main()
  File "C:\Olive\examples\directml\llm\llm.py", line 349, in main
    optimize(
  File "C:\Olive\examples\directml\llm\llm.py", line 231, in optimize
    olive_run(olive_config)
  File "C:\Anaconda\envs\myolive\lib\site-packages\olive\workflows\run\run.py", line 283, in run
    return run_engine(package_config, run_config, data_root)
  File "C:\Anaconda\envs\myolive\lib\site-packages\olive\workflows\run\run.py", line 237, in run_engine
    engine.run(
  File "C:\Anaconda\envs\myolive\lib\site-packages\olive\engine\engine.py", line 264, in run
    run_result = self.run_accelerator(
  File "C:\Anaconda\envs\myolive\lib\site-packages\olive\engine\engine.py", line 346, in run_accelerator
    output_footprint = self.run_search(
  File "C:\Anaconda\envs\myolive\lib\site-packages\olive\engine\engine.py", line 531, in run_search
    should_prune, signal, model_ids = self._run_passes(
  File "C:\Anaconda\envs\myolive\lib\site-packages\olive\engine\engine.py", line 826, in _run_passes
    model_config, model_id = self._run_pass(
  File "C:\Anaconda\envs\myolive\lib\site-packages\olive\engine\engine.py", line 934, in _run_pass
    output_model_config = host.run_pass(p, input_model_config, data_root, output_model_path, pass_search_point)
  File "C:\Anaconda\envs\myolive\lib\site-packages\olive\systems\local.py", line 31, in run_pass
    output_model = the_pass.run(model, data_root, output_model_path, point)
  File "C:\Anaconda\envs\myolive\lib\site-packages\olive\passes\olive_pass.py", line 221, in run
    output_model = self._run_for_config(model, data_root, config, output_model_path)
  File "C:\Anaconda\envs\myolive\lib\site-packages\olive\passes\onnx\inc_quantization.py", line 588, in _run_for_config
    if q_model.is_large_model:
AttributeError: 'NoneType' object has no attribute 'is_large_model'

PatriceVignola (Contributor) commented

I'm not sure what this warning is about (it comes from INC), but you definitely don't need an NPU for the quantization. I think it's likely that your device is running out of memory here, since 16 GB of VRAM is barely enough to run the fp16 model normally, and quantization is more demanding. We have only confirmed that quantization works with RTX 4090 cards.

We are looking at different quantization options, since many of the AWQ implementations out there are hard to use on consumer hardware and generally require powerful server machines or GPUs to complete in a timely manner.

If all you're interested in is converting to 4-bit to test the performance of the model, you can play around with the script and change the quantization strategy here to RTN:

"algorithm": quant_strategy.upper(),

It's not something we have tested, though, since RTN is generally bad for LLMs.
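
For illustration only (an editorial sketch, not the tested configuration), the INC weight-only settings printed in the log above would look roughly like this with the algorithm switched from AWQ to RTN; field names follow neural_compressor's PostTrainingQuantConfig:

from neural_compressor import PostTrainingQuantConfig

# Rough sketch mirroring the op_type_dict printed in the log above, with the
# weight-only algorithm switched from AWQ to RTN (RTN needs no calibration
# forward passes). This is not the exact Olive pass config.
config = PostTrainingQuantConfig(
    approach="post_training_weight_only",
    backend="onnxrt_dml_ep",
    op_type_dict={
        ".*": {
            "weight": {
                "bits": [4],
                "group_size": [32],
                "scheme": ["asym"],
                "algorithm": ["RTN"],
            }
        }
    },
)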


jojo1899 commented Apr 22, 2024

I was able to quantize the Mistral-7B on the same hardware using examples/mistral/mistral_int4_optimize.json. But I could not run inference on the quantized model using DML EP (see this issue for more details). I will try using that quantized model with examples/directml/llm/run_llm_io_binding.py for inference.

Regarding the code in LLM Optimization with DirectML, although I could not quantize using AWQ, I could convert Mistral successfully using the following:
python llm.py --model_type=mistral-7b-chat
The log shows that it also successfully carried out inference using the prompt "What is the lightest element?". However, when I try to run inference using python run_llm_io_binding.py --model_type=mistral-7b-chat --prompt="What is the lightest element?", it does not work most of the time.

Here is the successful log from the Mistral conversion to ONNX format

C:\Olive\examples\directml\llm>python llm.py --model_type=mistral-7b-chat

Optimizing mistralai/Mistral-7B-Instruct-v0.1
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:07<00:00,  3.95s/it]
[2024-04-22 11:45:40,473] [INFO] [run.py:261:run] Loading Olive module configuration from: C:\Anaconda\envs\myolive\lib\site-packages\olive\olive_config.json
[2024-04-22 11:45:40,489] [INFO] [accelerator.py:336:create_accelerators] Running workflow on accelerator specs: gpu-dml
[2024-04-22 11:45:40,489] [INFO] [engine.py:106:initialize] Using cache directory: cache
[2024-04-22 11:45:40,489] [INFO] [engine.py:262:run] Running Olive on accelerator: gpu-dml
[2024-04-22 11:45:40,504] [INFO] [engine.py:864:_run_pass] Running pass convert:OnnxConversion
[2024-04-22 11:49:28,442] [INFO] [engine.py:951:_run_pass] Pass convert:OnnxConversion finished in 227.921767 seconds
[2024-04-22 11:49:28,457] [INFO] [engine.py:864:_run_pass] Running pass optimize:OrtTransformersOptimization
failed in shape inference <class 'AssertionError'>
Failed to run symbolic shape inference. Please file an issue in https://github.com/microsoft/onnxruntime.
failed in shape inference <class 'AssertionError'>
[2024-04-22 12:02:21,910] [INFO] [transformer_optimization.py:420:_replace_mha_with_gqa] Replaced 32 MultiHeadAttention nodes with GroupQueryAttention
[2024-04-22 12:02:32,629] [INFO] [engine.py:951:_run_pass] Pass optimize:OrtTransformersOptimization finished in 784.171654 seconds
[2024-04-22 12:02:32,654] [INFO] [engine.py:842:_run_passes] Run model evaluation for the final model...
[2024-04-22 12:02:42,560] [INFO] [footprint.py:101:create_pareto_frontier] Output all 3 models
[2024-04-22 12:02:42,560] [INFO] [footprint.py:120:_create_pareto_frontier_from_nodes] pareto frontier points: 1_OrtTransformersOptimization-0-d4c4ec660cc893c3eeab183690fc3aca-gpu-dml
{
  "latency-avg": 86.65992
}
[2024-04-22 12:02:42,560] [INFO] [engine.py:361:run_accelerator] Save footprint to footprints\mistralai_Mistral-7B-Instruct-v0.1_gpu-dml_footprints.json.
[2024-04-22 12:02:42,576] [INFO] [engine.py:279:run] Run history for gpu-dml:
[2024-04-22 12:02:42,629] [INFO] [engine.py:567:dump_run_history] run history:
+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+-----------------------------+----------------+---------------------------+
| model_id                                                                           | parent_model_id                                                                    | from_pass                   |   duration_sec | metrics                   |
+====================================================================================+====================================================================================+=============================+================+===========================+
| ce39a7112b2825df5404fbb628c489ab                                                   |                                                                                    |                             |                |                           |
+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+-----------------------------+----------------+---------------------------+
| 0_OnnxConversion-ce39a7112b2825df5404fbb628c489ab-46a1dd3a2459690b350e4070c8e2c14a | ce39a7112b2825df5404fbb628c489ab                                                   | OnnxConversion              |        227.922 |                           |
+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+-----------------------------+----------------+---------------------------+
| 1_OrtTransformersOptimization-0-d4c4ec660cc893c3eeab183690fc3aca-gpu-dml           | 0_OnnxConversion-ce39a7112b2825df5404fbb628c489ab-46a1dd3a2459690b350e4070c8e2c14a | OrtTransformersOptimization |        784.172 | {                         |
|                                                                                    |                                                                                    |                             |                |   "latency-avg": 86.65992 |
|                                                                                    |                                                                                    |                             |                | }                         |
+------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+-----------------------------+----------------+---------------------------+
[2024-04-22 12:02:42,639] [INFO] [engine.py:294:run] No packaging config provided, skip packaging artifacts
Optimized Model   : C:\Olive\examples\directml\llm\cache\models\1_OrtTransformersOptimization-0-d4c4ec660cc893c3eeab183690fc3aca-gpu-dml\output_model\model.onnx
Copying optimized model...
The optimized pipeline is located here: C:\Olive\examples\directml\llm\models\optimized\mistralai_Mistral-7B-Instruct-v0.1
The lightest element is hydrogen with an atomic number of 1 and atomic weight of approximately 1.008 g/mol.

Here are logs from eight inference attempts, of which only two worked.

**RUN 1 (FAILED)**
C:\Olive\examples\directml\llm>python run_llm_io_binding.py --device dml --model_type=mistral-7b-chat --prompt="The world in 2099 is"

**RUN 2 (FAILED)**
C:\Olive\examples\directml\llm>python run_llm_io_binding.py --device dml --model_type=mistral-7b-chat --prompt="What is the lightest element?"
2024-04-22 14:20:54.6846266 [E:onnxruntime:, inference_session.cc:2045 onnxruntime::InferenceSession::Initialize::<lambda_de9340899c8cfefde68f4d8c5936aa80>::operator ()] Exception during initialization: C:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\ExecutionProvider.cpp(938)\onnxruntime_pybind11_state.pyd!00007FFEAB9DC972: (caller: 00007FFEAB9DC752) Exception(2) tid(31a68) 887A0007 The GPU will not respond to more commands, most likely because some other application submitted invalid commands.
The calling application should re-create the device and continue.

Traceback (most recent call last):
  File "C:\Olive\examples\directml\llm\run_llm_io_binding.py", line 183, in <module>
    run_llm_io_binding(
  File "C:\Olive\examples\directml\llm\run_llm_io_binding.py", line 53, in run_llm_io_binding
    llm_session = onnxruntime.InferenceSession(
  File "C:\Anaconda\envs\myolive\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "C:\Anaconda\envs\myolive\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 483, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: C:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\ExecutionProvider.cpp(938)\onnxruntime_pybind11_state.pyd!00007FFEAB9DC972: (caller: 00007FFEAB9DC752) Exception(2) tid(31a68) 887A0007 The GPU will not respond to more commands, most likely because some other application submitted invalid commands.
The calling application should re-create the device and continue.

**RUN 3 (WORKED)**
C:\Olive\examples\directml\llm>python run_llm_io_binding.py --device dml --model_type=mistral-7b-chat --prompt="What is the lightest element?"
The lightest element is hydrogen with an atomic number of 1 and atomic weight of approximately 1.008 g/mol.

**RUN 4 (WORKED)**
C:\Olive\examples\directml\llm>python run_llm_io_binding.py --device dml --model_type=mistral-7b-chat --prompt="What is the lightest element?"
The lightest element is hydrogen with an atomic number of 1 and atomic weight of approximately 1.008 g/mol.

**RUN 5 (FAILED)**
C:\Olive\examples\directml\llm>python run_llm_io_binding.py --device dml --model_type=mistral-7b-chat --prompt="The world in 2099 is"
2024-04-22 14:28:01.8389389 [E:onnxruntime:, inference_session.cc:2045 onnxruntime::InferenceSession::Initialize::<lambda_de9340899c8cfefde68f4d8c5936aa80>::operator ()] Exception during initialization: C:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\ExecutionProvider.cpp(938)\onnxruntime_pybind11_state.pyd!00007FFEAB9DC972: (caller: 00007FFEAB9DC752) Exception(2) tid(31bf0) 887A0007 The GPU will not respond to more commands, most likely because some other application submitted invalid commands.
The calling application should re-create the device and continue.

Traceback (most recent call last):
  File "C:\Olive\examples\directml\llm\run_llm_io_binding.py", line 183, in <module>
    run_llm_io_binding(
  File "C:\Olive\examples\directml\llm\run_llm_io_binding.py", line 53, in run_llm_io_binding
    llm_session = onnxruntime.InferenceSession(
  File "C:\Anaconda\envs\myolive\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "C:\Anaconda\envs\myolive\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 483, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: C:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\ExecutionProvider.cpp(938)\onnxruntime_pybind11_state.pyd!00007FFEAB9DC972: (caller: 00007FFEAB9DC752) Exception(2) tid(31bf0) 887A0007 The GPU will not respond to more commands, most likely because some other application submitted invalid commands.
The calling application should re-create the device and continue.

**RUN 6 (FAILED)**
C:\Olive\examples\directml\llm>python run_llm_io_binding.py --device dml --model_type=mistral-7b-chat --prompt="How is the world in 2099?"
2024-04-22 14:28:20.0418613 [E:onnxruntime:, inference_session.cc:2045 onnxruntime::InferenceSession::Initialize::<lambda_de9340899c8cfefde68f4d8c5936aa80>::operator ()] Exception during initialization: C:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\ExecutionProvider.cpp(938)\onnxruntime_pybind11_state.pyd!00007FFEAB9DC972: (caller: 00007FFEAB9DC752) Exception(2) tid(3a3c0) 887A0007 The GPU will not respond to more commands, most likely because some other application submitted invalid commands.
The calling application should re-create the device and continue.

Traceback (most recent call last):
  File "C:\Olive\examples\directml\llm\run_llm_io_binding.py", line 183, in <module>
    run_llm_io_binding(
  File "C:\Olive\examples\directml\llm\run_llm_io_binding.py", line 53, in run_llm_io_binding
    llm_session = onnxruntime.InferenceSession(
  File "C:\Anaconda\envs\myolive\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "C:\Anaconda\envs\myolive\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 483, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: C:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\ExecutionProvider.cpp(938)\onnxruntime_pybind11_state.pyd!00007FFEAB9DC972: (caller: 00007FFEAB9DC752) Exception(2) tid(3a3c0) 887A0007 The GPU will not respond to more commands, most likely because some other application submitted invalid commands.
The calling application should re-create the device and continue.

**RUN 7 (FAILED)**
C:\Olive\examples\directml\llm>python run_llm_io_binding.py --device dml --model_type=mistral-7b-chat --prompt="What is the heaviest element?"

**RUN 8 (FAILED)**
C:\Olive\examples\directml\llm>python run_llm_io_binding.py --device dml --model_type=mistral-7b-chat --prompt="What is the lightest element?"
2024-04-22 14:39:18.5897298 [E:onnxruntime:, inference_session.cc:2045 onnxruntime::InferenceSession::Initialize::<lambda_de9340899c8cfefde68f4d8c5936aa80>::operator ()] Exception during initialization: C:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\ExecutionProvider.cpp(938)\onnxruntime_pybind11_state.pyd!00007FFEAB9DC972: (caller: 00007FFEAB9DC752) Exception(2) tid(3a3e0) 887A0007 The GPU will not respond to more commands, most likely because some other application submitted invalid commands.
The calling application should re-create the device and continue.

Traceback (most recent call last):
  File "C:\Olive\examples\directml\llm\run_llm_io_binding.py", line 183, in <module>
    run_llm_io_binding(
  File "C:\Olive\examples\directml\llm\run_llm_io_binding.py", line 53, in run_llm_io_binding
    llm_session = onnxruntime.InferenceSession(
  File "C:\Anaconda\envs\myolive\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "C:\Anaconda\envs\myolive\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 483, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: C:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\ExecutionProvider.cpp(938)\onnxruntime_pybind11_state.pyd!00007FFEAB9DC972: (caller: 00007FFEAB9DC752) Exception(2) tid(3a3e0) 887A0007 The GPU will not respond to more commands, most likely because some other application submitted invalid commands.
The calling application should re-create the device and continue.

Any tips on what is happening here?
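
One hedged workaround sketch (an editorial addition, not something suggested in the thread so far): the error text says the calling application should re-create the device and continue, so wrapping session creation in a small retry loop can help rule out transient device-removal failures:

import time

import onnxruntime as ort

def create_dml_session(model_path: str, retries: int = 3) -> ort.InferenceSession:
    # The 887A0007 "GPU will not respond" error surfaces as a RuntimeException
    # when the session is created; retrying re-creates the DML device, as the
    # error message suggests.
    last_err = None
    for _ in range(retries):
        try:
            return ort.InferenceSession(model_path, providers=["DmlExecutionProvider"])
        except Exception as err:
            last_err = err
            time.sleep(2)  # give the driver a moment to recover before retrying
    raise last_err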


jojo1899 commented Apr 24, 2024

UPDATE: I started using onnxruntime-genai-directml 0.2.0rc3 and it finally worked!!
I tried so many different things that it is hard to summarize them. Anyway, here are a couple of things I tried:

  1. Converting, optimizing, and quantizing the mistralai/Mistral-7B-Instruct-v0.1 Hugging Face model with the DmlExecutionProvider using the code in LLM Optimization with DirectML. I was able to convert and optimize the model, but not quantize it. See my error log earlier in this conversation.
  2. Quantizing the optimized ONNX model.
    The mistralai/Mistral-7B-Instruct-v0.1 Hugging Face model converted to ONNX format was in the cache/models/output_model/0_OnnxConversion-... directory and took 27 GB on disk. The optimized ONNX model was in the cache/models/output_model/1_OrtTransformersOptimization-... directory and took 13.5 GB on disk. I investigated ways to quantize the optimized ONNX model to INT4, but nothing seemed to work, as every method kept giving errors. Inference using the optimized ONNX model also didn't work for me. I believe the issues with quantization and inference were arising from something specific to 'optimizing' the ONNX model, but I am not sure about it.

Finally, I quantized and performed inference using onnxruntime-genai-directml 0.2.0rc3.
Quantization:
python -m onnxruntime_genai.models.builder -m mistralai/Mistral-7B-Instruct-v0.1 -e dml -p int4 -o ./models/mistral-int4
Inference:
python model-qa.py -m ./models/mistral-int4

The quantization was almost suspiciously fast (it took me 1-2 minutes); however, the quality of the quantized model is really good and I saw no weird responses from the LLM. The INT4 quantized model takes 3.97 GB on disk.
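
For reference, a rough sketch of what the generation loop looks like with onnxruntime-genai (an editorial sketch adapted from the 0.2.0-era model-qa.py style; method names may differ between releases, so treat it as illustrative):

import onnxruntime_genai as og

# Folder produced by the model builder command above.
model = og.Model("./models/mistral-int4")
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)
params.input_ids = tokenizer.encode("What is the lightest element?")

generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    # Stream-decode and print each new token as it is generated.
    print(tokenizer_stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
print()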
