WebNN should support NPU and QDQ operations #623

Open
wchao1115 opened this issue Mar 27, 2024 · 6 comments

@wchao1115
Collaborator

Related to issues #128 and #302, we've been talking about supporting the NPU for the last few years. Now that more commercial NPU platforms are becoming available (e.g. with the recent arrival of the Intel Core Ultra NPU), it is time to formally define NPU support in the WebNN spec. There are two key elements to this specification:

  1. An ability to specify a device type for the NPU. Unlike more general-purpose devices such as the CPU and GPU, an NPU supports a limited, finite set of operations with no programmability. To keep model execution stable and predictable, the notion of a fallback device is needed to support NPU acceleration during model inference.
  2. A minimum set of operators required to support quantized models. Because most NPUs use much simpler, less power-hungry low-bit integer arithmetic units, models targeting the NPU almost always need to be quantized first. The bare minimum here is just two operators -- quantizeLinear and dequantizeLinear. These two are enough to handle quantized models by pairing them up at the right places in the model graph, the so-called tensor-oriented QDQ format used in ONNX (sketched below). Additionally, two more prominent quantized operators, one for convolution and another for matmul, would allow more quantized models not already expressed in the QDQ format to function, i.e. conv2dInt and matmulInt.
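
For illustration, here is a minimal sketch of the QDQ pattern in a WebNN graph. The quantizeLinear/dequantizeLinear builder methods follow the proposal above with an assumed (input, scale, zeroPoint) signature; the method names, signatures, and descriptor fields are illustrative, not final spec.

```js
const context = await navigator.ml.createContext({ deviceType: 'npu' });
const builder = new MLGraphBuilder(context);

const input = builder.input('input', { dataType: 'float32', dimensions: [1, 3, 8, 8] });
const filter = builder.constant(
    { dataType: 'float32', dimensions: [4, 3, 3, 3] }, new Float32Array(4 * 3 * 3 * 3));
const scale = builder.constant({ dataType: 'float32', dimensions: [1] }, new Float32Array([0.02]));
const zeroPoint = builder.constant({ dataType: 'int8', dimensions: [1] }, new Int8Array([0]));

// Quantize, then immediately dequantize: the float32 conv2d sits between
// Q/DQ pairs, and an NPU backend can fuse the pattern into an int8 convolution.
const q = builder.quantizeLinear(input, scale, zeroPoint);
const dq = builder.dequantizeLinear(q, scale, zeroPoint);
const conv = builder.conv2d(dq, filter);
const output = builder.quantizeLinear(conv, scale, zeroPoint);
```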
@anssiko
Member

anssiko commented Mar 27, 2024

@wchao1115, thanks for this proposal that outlines key elements for NPU support. I'll schedule this important topic for discussion at our upcoming meeting.

As you noted, these topics (an NPU device type, support for quantized models) have been explored in the group before and have been awaiting implementation experience. The timing is now right for the group to reinvigorate this topic, with NPU platforms more widely available and in the hands of consumers. Most importantly, the group can now validate proposed spec designs against implementation experience per our established work mode.

I'm looking forward to this discussion. Meanwhile, questions and comments are welcome in this issue from everyone.

@philloooo
Contributor

philloooo commented Mar 28, 2024

Hi, thanks for bringing this up! I'd like to highlight a couple of things based on my current implementation experience:

  1. I don't think we can assume the CPU and GPU device types always support everything and only the NPU is the outlier. In the Chromium implementation, TFLite supports only a subset of ops on its GPU backend. On CoreML, opset coverage likewise goes CPU > GPU > ANE. So if we want to provide fallback, we will need it for both GPU and NPU.
  2. On CoreML, there is no option to NOT provide fallback. The computeUnits options are: {cpu, gpu, ane}, {cpu}, {cpu, gpu}, {cpu, ane}. So there is no way to target just the ANE (a.k.a. the NPU).
  3. CoreML can also decide to execute an op on the CPU even when it is supported on the ANE, if it deems that more efficient for the case at hand. So the {cpu, ane} option doesn't actually mean "only fall back to CPU when something is not supported on the ANE"; it means "I will figure out the most efficient way to execute using CPU and ANE". So it doesn't seem to match the current proposal here well.
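
To make point 2 concrete, here is a rough sketch (a hypothetical helper, not Chromium code) of how a CoreML backend might map a requested WebNN deviceType onto the closest MLComputeUnits value:

```js
// Every MLComputeUnits choice includes the CPU, so a pure
// "NPU only, no fallback" request cannot be honored on this backend.
function toCoreMLComputeUnits(deviceType) {
  switch (deviceType) {
    case 'cpu': return 'cpuOnly';            // MLComputeUnits.cpuOnly = {cpu}
    case 'gpu': return 'cpuAndGPU';          // MLComputeUnits.cpuAndGPU = {cpu, gpu}
    case 'npu': return 'cpuAndNeuralEngine'; // MLComputeUnits.cpuAndNeuralEngine = {cpu, ane}
    default:    return 'all';                // MLComputeUnits.all = {cpu, gpu, ane}
  }
}
```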

@fdwr
Collaborator

fdwr commented Apr 18, 2024

We've discussed these possible options for extending MLContextOptions::MLDeviceType (each shape is sketched in code after the list):

  1. deviceType: "npu" (currently prototyped in Chromium)
    ➕ Very simple API
    ➕ Least to test
    ➕ Affords backends the most control for fallback, since only the primary device preference is specified.
    ➖ The app cannot specify the fallback preference, as the system instead decides any fallback devices (though do web apps really know better than the system?).

  2. deviceType: "npu"
    fallbackDeviceType: "gpu"
    ➕ More flexible, as app can state secondary preference (but not a 3rd preference)
    ➖ A little more complex API, but fallbackDeviceType would be optional anyway.
    ➖ More to test and verify. You have to consider which combinations are valid. e.g. Is {deviceType: "npu", fallbackDeviceType: "cpu"} valid? That would likely require the graphBuilder to partition the graph per node, because NPUs have a smaller core functionality and lack the operator coverage of more general-purpose ML devices. What about redundant statements like {deviceType: "gpu", fallbackDeviceType: "gpu"}? Would it ever make sense to fall back from a more capable device to a narrower one, like {deviceType: "gpu", fallbackDeviceType: "npu"}? What happens when you specify a combination like {deviceType: "npu", fallbackDeviceType: "gpu"} that isn't supported by the backend, like CoreML computeUnits?...

  3. deviceTypes: ["npu", "gpu"]
    ➕ Most flexible, allowing control and several devices in preferred order: ["npu", "gpu", "cpu"] (this is functionally similar to the bitflags in CoreML computeUnits).
    ➖ More complicated. More to test and verify.
    ➖ Platforms in practice don't actually support that much flexibility anyway. See the CoreML computeUnits values from Phillis, which allow only limited permutations. On Windows, the potential combinations will differ. If WebNN permits this much control but the browser ignores it, that is misleading.

  4. deviceType: "npu", excludesDeviceTypes: ["gpu"] (from Phillis below)
    Sometimes you care more about excluding a specific device, preferring to extend battery life over using the faster but power-hungry GPU. It's similar to option 3 with a different emphasis.
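
For concreteness, the option shapes might surface to web developers like this (only option 1 reflects the current Chromium prototype; the other shapes are hypothetical illustrations of the proposals above):

```js
await navigator.ml.createContext({ deviceType: 'npu' });                               // option 1
await navigator.ml.createContext({ deviceType: 'npu', fallbackDeviceType: 'gpu' });    // option 2
await navigator.ml.createContext({ deviceTypes: ['npu', 'gpu', 'cpu'] });              // option 3
await navigator.ml.createContext({ deviceType: 'npu', excludesDeviceTypes: ['gpu'] }); // option 4
```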

Other considerations

Error handling: If a device type does not exist at all, like asking for an NPU on a machine without one or a GPU on a headless server, then navigator.ml.createContext could fail, and the client should try again with a different device type. Such early failure is useful because it happens before you've constructed too much of your graph. Note that on Apple via CoreML, failing isn't really an option, as you can't ask for the GPU or NPU without also getting CPU fallback.
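
A possible client-side retry pattern, sketched under the assumption that createContext rejects when the requested device type is unavailable:

```js
// Hypothetical retry chain: prefer the NPU, fall back to the GPU, then
// settle for the default context if neither device exists.
let context;
try {
  context = await navigator.ml.createContext({ deviceType: 'npu' });
} catch {
  try {
    context = await navigator.ml.createContext({ deviceType: 'gpu' });
  } catch {
    context = await navigator.ml.createContext(); // default device type
  }
}
```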

Ultimate fallback: If navigator.ml.createContext succeeds for a given device type, then we should not introduce errors much later, during graph node construction or the build call, which would be very hard for the caller to unwind, or at least very inefficient, since the caller would have to recreate the graph all over again with a new device type. So there should be fallback behavior, and if any backend has incomplete coverage, then we want an "ultimate fallback" backend (like a catch-all universal font in font fallback) that handles every WebNN operator, which is typically the CPU backend. Currently (2024-04-19) the Chromium WebNN implementation is in the unusual state that the GPU backend has more complete operator coverage than the CPU one, but eventually the CPU backend will catch up.

Quantized operators: These are necessary for the NPU but are also independent of it, as they are useful for the GPU and CPU too.

Feedback

Feedback is welcome below. I have my preferences, but I want to hear from you, and whether any other options or considerations are missing.

@zolkis
Collaborator

zolkis commented Apr 24, 2024

Option 3 seems best to me; it is also used in e.g. OpenVINO, and it allows a future interpretation for split/combined execution across multiple accelerators.

@philloooo
Contributor

I actually like the simplicity of option 1, as long as we make it clear in the spec that the system may decide to fall back to other devices.

The benefit of option 3 is a use case like "I want to use anything except the GPU", for load balancing or the like. But I feel we can explore that option when we gain more concrete needs from developers. A 4th option that satisfies the same need is: deviceType: "npu", excludesDeviceTypes: ["gpu"].

For now, option 1 seems like a good starting point?

@fdwr
Collaborator

fdwr commented May 8, 2024

At the 2024-05-02 meeting, the W3C group agreed to start with option 1 (reserving the option to expand later if implementation experience shows the need). Next step: a spec update.
