WebNN should support NPU and QDQ operations #623
Comments
@wchao1115, thanks for this proposal that outlines key elements for NPU support. I'll schedule this important topic for discussion at our upcoming meeting. As you noted, these topics (NPU device type, support for quantized models) have been explored in the group before and have been awaiting implementation experience. The timing is now right for the group to reinvigorate this topic, with NPU platforms more widely available and in the hands of consumers. Most importantly, the group can now validate proposed spec designs with implementation experience, per our established work mode. I'm looking forward to this discussion. Meanwhile, questions and comments are welcome in this issue from everyone.
Hi, thanks for bringing this up! I'd like to highlight a couple of things based on my current implementation experience:
We've discussed three possible options for extending MLContextOptions::MLDeviceType: …

Other considerations

Error handling: If a requested device type does not exist at all, like asking for an NPU on a machine without one or a GPU on a headless server, then …

Ultimate fallback: If …

Quantized operators: These are necessary for NPU but are also independent, as they are useful for GPU and CPU too.

Feedback

Feedback is welcome below. I have my preferences, but I want to hear from you, and whether any other options/considerations are missing.
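For concreteness, here is a minimal sketch of what option 1 (a single new enum value) could look like from script, assuming a hypothetical "npu" value added to MLDeviceType; both the value and the fallback behavior are exactly what is under discussion here, not current spec:

```js
// Minimal sketch of option 1, assuming a hypothetical "npu" value is added
// to MLDeviceType. Neither the value nor its fallback semantics are in the
// spec yet; this only illustrates the shape of the API.
const context = await navigator.ml.createContext({ deviceType: 'npu' });

// Under option 1 the implementation could transparently fall back to another
// device (e.g. GPU or CPU) when no NPU is present, rather than rejecting.
const builder = new MLGraphBuilder(context);
```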
Option 3 seems the best to me; it is also used in e.g. OpenVINO, and it allows future interpretation for split/combined execution on multiple accelerators.
I actually like the simplicity of option 1, as long as we make it clear in the spec that the system may decide to fall back to other devices. The benefit of option 3 is use cases like "I want to use anything except GPU", to load-balance or something, but I feel we can explore that option when we hear more concrete needs from developers. A fourth option to satisfy the same need is: … For now, option 1 seems a good starting point?
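To make the "anything except GPU" use case concrete, an option-3-style preference list might look like the sketch below; the devicePreferences member is an invented name for illustration only, not part of any proposal quoted in this thread:

```js
// Hypothetical option-3-style API: an ordered list of acceptable devices.
// "devicePreferences" is an invented name; excluding the GPU is expressed
// by leaving it out of the list, and the order encodes fallback priority.
const context = await navigator.ml.createContext({
  devicePreferences: ['npu', 'cpu'],
});
```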
At the 2024-05-02 meeting, the W3C group agreed to start with option 1 (reserving the option to potentially expand it if implementation experience shows a need). Next step: a spec update.
Related to issues #128 and #302, we've been talking about supporting the NPU for the last few years. Now that more commercial NPU platforms have become available (e.g. with the recent arrival of the Intel Core Ultra NPU), it is time to formally define NPU support in the WebNN spec. There are two key elements to this specification: the NPU device type, and support for quantized operators, starting with the quantizeLinear and dequantizeLinear operators. These two will be enough to handle quantized models by pairing them up at the right places in the model graph, the so-called tensor-oriented QDQ format used in ONNX. Additionally, two more prominent quantized operators, one for convolution and another for matmul, will allow more quantized models not already expressed in the QDQ format to function, i.e. conv2dInt and matmulInt.
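As a rough illustration of the tensor-oriented QDQ pattern, the sketch below wraps a float conv2d between dequantize/quantize pairs. It assumes MLGraphBuilder methods named after the proposed quantizeLinear/dequantizeLinear operators; the exact signatures, data types, and descriptor fields are assumptions, not settled spec:

```js
// Sketch of the QDQ pattern, assuming the proposed quantizeLinear /
// dequantizeLinear builder methods exist on MLGraphBuilder.
// dequantizeLinear computes y = (x - zeroPoint) * scale, and quantizeLinear
// is its approximate inverse, y = clamp(round(x / scale) + zeroPoint).
const context = await navigator.ml.createContext({ deviceType: 'npu' });
const builder = new MLGraphBuilder(context);

// int8 input and weights with per-tensor scale/zero-point constants.
// Shapes and descriptor layout here are illustrative assumptions.
const qInput = builder.input('qInput', { dataType: 'int8', dimensions: [1, 1, 8, 8] });
const qWeights = builder.constant(
  { dataType: 'int8', dimensions: [1, 1, 3, 3] }, new Int8Array(9).fill(1));
const scale = builder.constant(
  { dataType: 'float32', dimensions: [1] }, new Float32Array([0.02]));
const zero = builder.constant(
  { dataType: 'int8', dimensions: [1] }, new Int8Array([0]));

// Dequantize to float, run the regular float conv2d, then requantize the
// result: the DQ -> float op -> Q sandwich used by the QDQ format.
const x = builder.dequantizeLinear(qInput, scale, zero);
const w = builder.dequantizeLinear(qWeights, scale, zero);
const y = builder.conv2d(x, w);
const qOutput = builder.quantizeLinear(y, scale, zero);

const graph = await builder.build({ qOutput });
```

The appeal of this pattern is that the graph stays expressible with existing float operators, while a backend that recognizes the dequantize/quantize boundaries can fuse the sandwich into a single native quantized kernel on the NPU.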