Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add MLBuffer exploration doc #541

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

a-sully
Copy link
Contributor

@a-sully a-sully commented Jan 30, 2024

The included doc explores the design of #482. That issue is getting quite long and I believe there's a lot more to say, so for the sake of not overwhelming that issue I've created this doc.

Here's how I'd love for this to go:

  1. You read this doc because you're interested in making WebNN awesome
    • Pro tip: Use "View file" from the three-dot menu to view as formatted markdown
  2. You see something... interesting? Wrong? Exciting? Leave a comment on this PR! Inline comments on the doc are preferred
  3. We briefly discuss in that comment thread. From there, we either:
    • Open a new issue, if there's a non-trivial discussion to be had (perhaps we could get a "WebGPU interop" label? @anssiko @inexorabletash), or
    • Update the MLBuffer proposal to clarify the expected behavior
  4. Eventually, I close this PR. This PR is not intended to be merged, though I'd be happy to contribute its examples into an eventual explainer

I'm eager to hear your thoughts. Thanks!

@anssiko
Copy link
Member

anssiko commented Feb 2, 2024

I updated https://github.com/webmachinelearning/meetings/blob/main/telcons/2024-02-08-wg-agenda.md to include this exploration doc as an additional reference.

Also created webgpu interop label for us.

Copy link
Contributor

@huningxin huningxin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This explainer is very useful. Thanks for putting this together, @a-sully !

I left some comments for chained inference use case. Please take a look.


#### Questions:

- Can an `MLBuffer`'s size always be known at the time of buffer allocation?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose the size should be known before the allocation. WebNN only supports static shape tensor (i.e. the shape can be quired by MLOperand.shape()).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it seems reasonable that the size should always be known. I mostly just wanted to explicitly call it out as a constraint of this proposal :)

that the size of an `MLBuffer` must always be known at the time of buffer
allocation
- When will `inputMlBuffer` be deallocated if `destroy()` is not called?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we consider create MLBuffer based on its usage?

For the chained inference use case, a buffer usage could be one of the following three:

  1. Input: ArrayBufferMLBufferMLGraph
  2. Output: MLGraphMLBufferArrayBuffer
  3. Intermediate: MLGraphMLBufferMLGraph

Different backends/devices may arrange/optimize the buffer allocation depending on its usage. For example, the current Chromium WebNN XNNPACK and DirectML backends arrange the input and output buffers as following:

  • XNNPACK backend allocates input buffer with XNN_EXTRA_BYTES because XNNPACK may read beyond array bounds for performance reason. The output buffer doesn't need to allocate extra bytes, because XNNPACK never write beyond array bounds.
  • For UMA (unified memory architecture) GPU, DirectML backend allocates input CPU buffer / output CPU buffer and bind them as GPU graph execution input / output. DirectML backend also sets appropriate CPU page property optimized for CPU writing or reading according to the cache architecture.
  • For NUMA (Non-UMA) GPU, DirectML backend always allocates default GPU buffers as GPU graph execution input and output. For input data uploading, it allocates dedicated upload CPU buffer and does two-steps uploading from CPU to GPU. Output data reading back is through a dedicated readback CPU buffer and two-steps downloading from GPU to CPU.

For intermediate buffer:

  • XNNPACK backend may still need to allocate extra bytes because XNNPACK would read data from it.
  • DirectML backend may just allocate buffer in DEFAULT_HEAP because it doesn't need CPU access.

The following table captures the buffer allocation difference according to usage:

Buffer Usage XNNPACK / CPU DirectML / UMA GPU DirectML / NUMA GPU
Input buffer with XNN_EXTRA_BYTES CUSTOM buffer
(CPUPageProperty = CacheCoherentUMA ? WRITE_BACK : WRITE_COMBINE,
MEMORY_POOL_L0)
DEFAULT buffer + UPLOAD buffer
Output buffer without XNN_EXTRA_BYTES CUSTOM buffer
(CPUPageProperty = WRITE_BACK,
MEMORY_POOL_L0)
DEFAULT buffer + READBACK buffer
Intermediate buffer with XNN_EXTRA_BYTES DEFAULT buffer (MEMORY_POOL_L0) DEFAULT buffer

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the main question is whether we can keep these details implementation specific, and use MLBuffer to identify the buffers we pass between stages, eventually with dynamic labels such as input/output/intermediate? (in this case we will have to specify behavior in the spec in more details).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we consider create MLBuffer based on its usage?

"usage" can mean a lot of different things (readable/writable? input/output/intermediate? usable by WebGPU? mappable in some way?) but in broad strokes, yes I agree :P

We should strive to ensure the user agent has enough information when an MLBuffer is created to allocate its buffer in the most optimal place, according to how it will be used. It's worth enumerating those uses in detail. The table is a very useful start - thanks!

I think the main question is whether we can keep these details implementation specific

IMHO it's critical that these details are opaque to the website. If creating an "input" MLBuffer for the XNNPack backend, the website should not be able to detect that XNN_EXTRA_BYTES have been added. For example:

const mlBuffer = mlContext.createBuffer({ usage: INPUT, size: 100 });
console.assert(mlBuffer.size, 100);

Similarly:

const gpuBuffer = mlBuffer.mapAsGpuBuffer(gpuDevice);
console.assert(gpuBuffer.size, mlBuffer.size);

// ...

const arrayBuffer = mlBuffer.mapAsync();
console.assert(arrayBuffer.byteLength, mlBuffer.size);

eventually with dynamic labels such as input/output/intermediate

Could you elaborate more on this? :)

#### Questions:

- This approach is flexible enough to allow for graph execution on all backends.
Do we need a separate `compute()` method?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They might be different if the graph has multiple outputs. For example, the current Chromium WebNN DirectML backend implementation allocates one default buffer and one readback buffer big enough for all outputs. When compute() is done by GPU, it copies all results from the default buffer to readback buffer, and then to array buffers in one shot and resolves the promise.

For dispatch(), a naive implementation may need to do that (enqueue a copy from default buffer to readback buffer, wait for GPU copy done, copy data from readback buffer to array buffer, resolve promise) for each buffer upon user calls MLBuffer.mapAsync().

Copy link

@bbernhar bbernhar Feb 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to determine if it's a requirement [for WebNN] to have the developer transfer data to/from MLBuffer using the exact same APIs, regardless of backend. If so, I don't think we can rely on buffer mapping alone to cover upload/download for any device type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to determine if it's a requirement [for WebNN] to have the developer transfer data to/from MLBuffer using the exact same APIs, regardless of backend. If so, I don't think we can rely on buffer mapping alone to cover upload/download for any device type.

This discussion relates to Question 4 from #544:

  1. Can dispatch be exclusive to MLBuffer bindings? @bbernhar

At the core of this discussion seems to be the question of whether MLBuffer should be a device-agnostic buffer, or whether it's only expected to be allocated on the GPU or NPU for chained execution, or even just on the GPU in the case where we want WebGPU interop (effectively making it a GPUBuffer in disguise).

My earlier interpretation was that MLBuffer was proposing to be the former. In the case of WebGPU interop, for example, I would naively expect that as long as the appropriate usage flags are set, the user agent should be able to share memory if the memory can be shared (e.g. UMA GPUs); otherwise a copy would be made on both mapAsGpuBuffer() and unmapFromGpuBuffer()

Reading through some of these issues again and recalling the discussion in the last WG meeting, it seems like @bbernhar you might be proposing the latter? Could you please clarify? :)

To clarify some things on my end... I know that WebGPU has a very specific definition of "buffer mapping". I've been using the term much more loosely here (apologies if that's caused any confusion); in broad strokes to mean "transferring ownership", which may or may not require copies under the hood. For example:

  • For a CPU-based MLBuffer, "mapping" the MLBuffer to an ArrayBuffer (and vice versa) would behave very similarly to how compute() works today (see https://webmachinelearning.github.io/webnn/#mlnamedarraybufferviews-transfer). The user agent can make this "mapping" very cheap
  • For a dNPU-based MLBuffer, "mapping" the MLBuffer to a dGPU would require
    • on mapAsGpuBuffer(), copy the data into a new buffer on the GPU and set flags to e.g. mark the MLBuffer as inaccessible
    • on unmapFromGpuBuffer(), copy back the contents of the GPU buffer into the original buffer on the NPU and set flags to e.g. mark the MLBuffer as accessible once again

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @a-sully, I agree we'll need more than what I proposed to cover all the use-cases here.

Reading through some of these issues again and recalling the discussion in the last WG meeting, it seems like @bbernhar you might be proposing the latter? Could you please clarify? :)

Since we require CPU backed MLBuffer(s) to be equally efficient (avoid copies), I don't think my original proposal to have MLBuffer be a (simple) control (vs storage) object will work =(.

It makes sense to me to have WebNN rely on buffer mapping (with buffer usages) to determine what gets copied [to/from the context/device] before or after dispatch executes. It seems possible we could keep the buffer mapping part like WebGPU, though.

// GPU staging buffers
ml_context = ML.createContext({"gpu"});
input_buf = ml_context.createBuffer({size: 4, usage: MLBufferUsage.MAP_WRITE});
output_buf = ml_context.createBuffer({size: 4, usage: MLBufferUsage.MAP_READ});
// output_buf = ml_context.createBuffer({size: 4}); // pre-allocate on GPU only

// Populate MLBuffer input with source data
await input_buf.mapAsync(MLMapMode.WRITE);
const write_data = input_buf.getMappedRange();
/* use write_data */
input_buf.unmap();

// If dGPU, 
//  * input requires GPU copy due to MLBufferUsage.MAP_WRITE
//  * output requires GPU copy due to MLBufferUsage.MAP_READ.
// If iGPU,
//  * neither input or output requires GPU copy.
ml_context.dispatch({inputs: input_buf, outputs: output_buf});

// MLBuffer output was populated by dispatch()
await output_buf.mapAsync(MLMapMode.READ);
const read_data = output_buf.getMappedRange();
/* use read_data */
output_buf.unmap();

The "mappable buffer gets filled for you" magic could be the only WebNN-specific behavior (for us WebGPU folk). Notice, it's a bit weird to see a "map" usage also perform a "copy". WDYT about creating a new descriptor type to clarify this transfer or copy behavior?

Copy link
Contributor Author

@a-sully a-sully Feb 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we require CPU backed MLBuffer(s) to be equally efficient (avoid copies), I don't think my original proposal to have MLBuffer be a (simple) control (vs storage) object will work =(.

+1 I think this answers question 5 from #542:

  1. Does MLBuffer require explicit buffer usages (ex. input, output, or both)? @bbernhar

Now the question becomes what set of usage flags are needed?

It makes sense to me to have WebNN rely on buffer mapping (with buffer usages) to determine what gets copied [to/from the context/device] before or after dispatch executes. It seems possible we could keep the buffer mapping part like WebGPU, though.

+1. And this (familiar to developers!) WebGPU-like interface is flexible enough that we should be able to provide this same interface even if the buffer is not allocated on the GPU

The "mappable buffer gets filled for you" magic could be the only WebNN-specific behavior (for us WebGPU folk). Notice, it's a bit weird to see a "map" usage also perform a "copy". WDYT about creating a new descriptor type to clarify this transfer or copy behavior?

Seems reasonable. Hmmm... some ideas which come to mind:

  • rent - suggests that WebNN maintains ownership even when the buffer is used by JS or WebGPU. And the rental may be returned
  • lend - synonym to the above
  • transfer - analogous to ArrayBuffer.transfer(), which also may require a copy but is zero-copy when possible
  • import - parallels WebGPU, though perhaps a bit awkward
  • map - if we want to re-use familiar names like mapAsync() then maybe "map" isn't so bad, even if it hides a copy

Since the relationship is asymmetric (in that WebNN always owns the buffer), I think the naming should be, too. Some ideas for returning the buffer to WebNN include:

  • return
  • restore
  • takeBack
  • unmap (probably only if we go with "map" above)

Feel free to suggest other terms!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the sample code!

At a high level, I don't think we can:

  • avoid having hidden data-synchronization across devices, and also
  • have consistent usage patterns of an MLBuffer regardless of which device it's allocated on

I'll give an example:

Setting both [MAP_WRITE and MAP_READ] could be the equivalent of mappedatcreation

WebGPU only allows setting one or the other. Will this dual-mapping also be allowed for all MLBuffers, including those backed by a non-UMA GPU/NPU? In that case, surely cross-device copies are needed for each mapping and unmapping?

Meanwhile, there's a clear use case for allowing dual-mapping if the MLBuffer is backed by the CPU or on an UMA system where buffer copies could be avoided altogether.


Just to make sure we're on the same page here, could you provide some sample code of what this would look like for both chained inference and WebGPU interop?

With the above "high level" hypothesis in mind, could you provide an example of WebGPU interop, too? :)

If the MLBuffer's buffer is allocated in memory the GPU that we're renting to can access, then a copy is not necessary. If not, then it must be. Do we hide the copy as an implementation detail? Force the developer to specify a copy, even if it's not necessary? Something else?


// (abbreviated)
ml_bindings.setCopyOutputs({"output1"}); // copy back, explicit
await ml_buffer_output.mapAsync(MLMapMode.READ);
const read_data = ml_buffer_output.getMappedRange();

What exactly does setCopyOutputs do? Where is the data copied to/from? For each memory access type?


If you have a tentative set of MLBufferUsageFlags in mind, please do share! :)

I think only two usages are required here: MAP_WRITE and MAP_READ. Setting both could be the equivalent of mappedatcreation.

We could assume that:

  • MAP_WRITE -> input buffers
  • MAP_READ -> output buffers

but that doesn't hold true in the chained execution use case, where outputs become inputs. In your example above, ml_buffer_output is used as both an input and output buffer, which means that e.g. the implementation of createBuffer() must know to allocate XNN_EXTRA_BYTES - and also to chop off those extra bytes when passing it as an input on the second inference.

Lacking explicit input/output/intermediate flags, this implies that MLBuffers to be used with the XNNPACK backend must always allocate XNN_EXTRA_BYTES... which doesn't seem unreasonable. But if there are more quirks like this on other platforms, though, then we may want to consider adding explicit input/output/intermediate flags?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WebGPU only allows setting one or the other. Will this dual-mapping also be allowed for all MLBuffers, including those backed by a non-UMA GPU/NPU? In that case, surely cross-device copies are needed for each mapping and unmapping?

I believe you can create a GPUBuffer with combined usages (MAP_READ | MAP_WRITE). It's only when you call mapAsync, it must be READ or WRITE mode, not both. Because this mapping operation works like WebGPU, it occurs on the same device used to create the buffer so by definition, no cross-device copies are allowed. We don't need to copy data until dispatch or import anyway.

If the MLBuffer's buffer is allocated in memory the GPU that we're renting to can access, then a copy is not necessary. If not, then it must be. Do we hide the copy as an implementation detail? Force the developer to specify a copy, even if it's not necessary? Something else?

Good point to clarify. I would expect importExternalBuffer to perform the copy as an implementation detail. This operation is about transferring a MLBuffer, which a copy is allowed. After importing, I believe we can simply GPUBuffer.destroy() it, which restores access to the shared MLBuffer with WebNN.

What exactly does setCopyOutputs do? Where is the data copied to/from? For each memory access type?

It tells WebNN to copy data between the context/device and CPU after dispatch executes, if required. So, it depends on the context device-type used to create the buffer and if it needs to bound to the CPU for I/O.

must always allocate XNN_EXTRA_BYTES... which doesn't seem unreasonable

+1. We have a similar situation on GPU backed MLBuffers - they also require alignment which always allocates extra bytes anyway.

Admittedly, there are downsides to this approach:

  • Can't call mapAsync anywhere you want.
  • Usages + bindings require validation.

But if the goal is have WebNN mapping behave like WebGPU, it's totally doable IMHO.

Copy link

@bbernhar bbernhar Feb 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For completion purposes, here's how the other API approach could work. Let's refer to the previous idea as "storage bindings". Under this design, mapAsync "hides" the copy from the web developer. Note: interop does not change so it's left out here intentionally.

ml_context = ML.createContext({"gpu"});

// MLBuffer data is created on the device used by the context.
ml_buffer_input = ml_context.createBuffer({size: 4});
ml_buffer_output = ml_context.createBuffer({size: 4});

// Upon calling Unmap(), after a mapAsync(WRITE):
// * if CPU/iGPU, a copy to GPU IS NOT required.
// * Else dGPU/NPU, a copy to GPU IS required.
await ml_buffer_input.mapAsync(MLMapMode.WRITE);
const write_data = ml_buffer_input.getMappedRange();
// Write something into write_data
ml_buffer_input.unmap()

// Or use context.writeBuffer, regardless of device-type.

// Dispatch executes the copy required by calling Unmap().
ml_context.dispatch(graph, inputs, outputs);

// Upon calling mapAsync(READ):
// * if CPU/iGPU/NPU, copy to GPU IS NOT required.
// * Else dGPU, a copy to GPU IS required.
await ml_buffer_output.mapAsync(MLMapMode.READ);
const read_data = ml_buffer_output.getMappedRange();
// Read something into Read_data
ml_buffer_output.unmap();

// So long as you don't use mapAsync, re-using outputs as inputs doesn't copy (ie. chained-inference).
ml_context.dispatch(graph, inputs, outputs);

Note, since this approach hides device data placement/movement, the WebNN developer shouldn't freely use mapAsync(READ) for output, like they could in WebGPU, as it could (inadvertently) convert into the more expensive readBuffer operation (internally).

It's also not obvious to me what happens to the (implicit) staging buffer that's created upon mapAsync. Similarly, deciding use of writeBuffer vs mapAsync(WRITE) for input is not obvious since there is no mappable/staging buffer usage.

The approach is simpler (for implementations, namely) but comes at the cost of requiring the WebNN developer to reference the spec to actually understand what's going on and not writing code as they could elsewhere.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General feedback on this line of thought:

[@huningxin] Should we consider create MLBuffer based on its usage?

Yes, I think it is reasonable to have enums on MLBuffer creation that help the browser optimize memory usage.

[@bbernhar] I was thinking we could have a descriptor/binding set [for dispatch], more like what we have in WebGPU.

I think we should avoid having stateful binding APIs, if possible. They tend to act like global variables and are difficult to integrate with when you have multiple frameworks active in an application. Even in WebGPU, binding objects are created atomically from a dictionary and are immutable, not stateful.

Similarly, introducing a MLDispatchMode also makes it challenging for web developers when dispatch code is far away from code which calls mapAsync and writeBuffer. If you submit work with MLDispatchMode.Default and subsequently call mapAsync, the promise will reject because the buffer is in the wrong state. For debugging, it's nice to sprinkle mapAsync calls in the middle of a bunch of chained dispatches (without changing the dispatches themsleves) to see where things have gone south.

I would prefer that we keep things simple and have the inputs and outputs to dispatch be included with the actual dispatch API like what was originally proposed instead of introducing a binding API, especially a stateful binding API.

[@a-sully]
At a high level, I don't think we can:

  • avoid having hidden data-synchronization across devices, and also
  • have consistent usage patterns of an MLBuffer regardless of which device it's allocated on

For what it's worth, the WebGPU CG has still not speced UMA buffer transfers. See the outstanding issue Cannot upload/download to UMA storage buffer without an unnecessary copy and unnecessary memory, which has been active since late 2021.

In general, I tend to agree with Austin. I think we should gain implementation experience with doing copies on discrete hardware in unmap (or dispatch) and see where it takes us. Both this and writeBuffer will require temporary buffer memory that needs to be managed by the browser.

We haven't yet touched on how WebNN interacts with WebGPU for hybrid devices. If the web developer creates a "high performance" WebNN context while the WebGPU device is on the "low power" adapter, the browser will need to copy between the two when transferring buffers between the APIs. Power preference is just an request so the web developer is not guaranteed to get what they request, especially over the entire course of their web session.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The simplicity of overloading mapAsync introduces inconsistent runtime behavior.

I believe that's the primary reason WebGPU and others don't overload it. For example, the WebGPU developer doesn't need to worry about OOMing when calling mapAsync or Umap since mappable buffers don't use a staging buffer and are just mapped. Debugging OOM between WebNN and WebGPU becomes inconsistent, the WebNN developer can't assert when mapAsync behaves like readBuffer or if MLBuffer's memory usage doubled. I could see how more established APIs like ONNX runtime preferred storage bindings, they like WebGPU, also want runtime and API consistency.

need to create an intermediate "readback" buffer to facilitate the transfer.
This may require two copies:\
*  high-GPU-bandwidth buffer → "readback" buffer →
`ArrayBuffer`*\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For UMA GPU, the implementation may eliminate the "readback" buffer and just need one copy.

@a-sully
Copy link
Contributor Author

a-sully commented Feb 21, 2024

Taking a step back, there are a few things I'm noticing.

On the one hand, this exploration doc has been very helpful in clarifying ambiguities in this proposal, generating new ideas, and distilling the key unresolved questions. On the other hand, I'm worried that we (myself included!) might be losing sight of the forest for the trees :)

I'd like to refocus our approach to designing MLBuffer on following goals:

  1. Prove out that the MLBuffer concept is feasible to implement on all platforms,
  2. Prove out that MLBuffer provides meaningful performance wins for the two use cases we've identified, and
  3. Avoid baking in any assumptions which would preclude adopting further optimizations in the future

My tentative suggestions are:

  • Start with the initially-proposed readBuffer() and writeBuffer() APIs as the only way to read/write data to an MLBuffer from script. This implies that:
    • writeBuffer() will always require at least one copy
    • Each readback to script will create a new ArrayBuffer (as well as intermediate copies as necessary) - though we could also consider a "bring your own buffer" version of this similar to how compute() outputs currently work as a reasonable improvement to avoid allocations
  • Take a phased approach to supporting WebGPU <-> WebNN interop:
    • Add a way for a developer to query for interop support: e.g. mlBuffer.canExportTo(gpuDevice)
    • Consider adding an MLBufferUsageFlag to indicate an intent to lend the buffer to WebGPU
    • Start with supporting renting an MLBuffer to a GPUDevice only if the MLBuffer was created from an MLContext that was created from that same GPUDevice (or however we decide to specify MLContext creation - see Simplify MLContext creation #322). Once we are confident that this can be implemented on all platforms... (and can get some perf results!)
    • Support UMA systems where we can still share the buffer with zero copies
    • Support systems where a copy would be required to share the MLBuffer with the xPU
      • Included in this bucket is that the key WebNN <-> WebGPU interop use case of integrating a dNPU into real-time video processing (see Integration with real-time video processing #226). To fully take advantage of the dNPU, it seems critical to support this use case without requiring synchronization from JS
      • Regarding the copy: we can discuss how to expose this copy to the developer (e.g. perhaps canExportTo() could somehow indicate whether a copy is required? TBD!) once we have a better grasp on how this will be implemented. In the meantime, we should avoid making design and implementation decisions which would prevent us from making this copy implicit in the "import"
  • Punt on the following features:
    • Buffer mapping to JS. Regarding goal (2) above, I'd like to assert that we can still get meaningful performance signals without this:
      • In the chained inference use case, the costs of additional copies to/from script should be dwarfed by the benefits of no longer needing to round-trip to the CPU between inferences (especially for repeated inferences)
      • In the primary use case for WebGPU interop - where WebNN to runs inference on a video frame from WebGPU - JS buffers are not involved at all
      • We expect WebNN to primarily be used by web ML frameworks (e.g. ONNX Runtime), which overwhelmingly use WASM. However, (I recently became aware that) mapAsync() is not actually zero-copy for WASM. This is why writeBuffer() is recommended for uploading data to WebGPU from WASM today
      • As a general observation (though of course there are exceptions), ML models tend to require large amounts of data to be uploaded (e.g. weights) and a comparatively small amount of data to be read back as outputs. An additional copy for readback of a small buffer may not matter much
      • Proposals such as WASM Memory Control could allow for bypassing JS memory altogether for data uploads
    • Minimizing buffer copies for UMA systems (including read-and-write-mappable buffers). These same questions apply to WebGPU, and as Cannot upload/download to UMA storage buffer without an unnecessary copy and unnecessary memory use gpuweb/gpuweb#2388 shows, there is not an obvious solution. I would prefer to co-design a solution with WebGPU, or fast-follow once they decide on a path forward

WDYT? @bbernhar @huningxin @RafaelCintron @zolkis

@bbernhar
Copy link

@a-sully I don't have a concern with the proposed plan to use readBuffer. I agree with the overall rational that solving the zero-copy problem should not be a goal for MLBuffer, at-least not initially.

@huningxin
Copy link
Contributor

@a-sully , your proposed plan sounds good to me.

Also +1 to readBuffer() having

a "bring your own buffer" version of this similar to how compute() outputs currently work as a reasonable improvement to avoid allocations

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants