Hi, we have a GPU EC2 instance (g4dn.xl) set up to run our face recognition AI model. An inference session usually takes 1-2 seconds per image, but whenever we launch a new instance from an AMI of an existing instance, the first inference run takes around 150 seconds. From the second run onwards it is back to 1-2 seconds. We want to understand why this happens only on the first run. We are working on autoscaling for our GPU instances, and this affects the start-up time of new instances. Please help us understand this.
Hi @AnkushRR,
This seems like a question for ONNXRuntime. My best guess is that on the first run, ONNXRuntime applies graph optimizations to your initial model, which takes additional time. The optimized model then replaces your original model in place, so later inference runs skip those optimizations and therefore run much faster.
To verify this guess, you can save the optimized model via `optimized_model_filepath`, run inference on it directly, and check whether the first run is now fast. If the first run still takes a long time, please raise the issue in ONNXRuntime to get the best help from the runtime experts.