Define chunk size for ReadableStream created by blob::stream() #144

Open
kevin-matthew opened this issue Jan 3, 2020 · 12 comments

@kevin-matthew

Currently regarded as 'Issue 1'. I have recently come across a project that forces me to assume the chunk size of the stream. Testing with Chromium and Firefox, the chunk size appears to be 0x10000 (65536) bytes. I cannot find a reason why it's this particular number.

As I'm making a wasm module, I must allocate memory space when going through files 5 GB+ in size. I will allocate 0x10000 bytes for now, but only by assumption... if there's a browser out there that does not follow this assumption, there will be fatal bugs.

I'm not sure whether this is w3c/FileAPI's or streams.spec.whatwg.org's jurisdiction, but neither of them specifies an exact number.
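
For illustration, a minimal sketch of how this chunk size can be observed (blob here is assumed to be a Blob obtained elsewhere, e.g. an XHR response):

// Minimal sketch: log the size of each chunk produced by blob.stream().
// `blob` is assumed to be a Blob obtained elsewhere (e.g. an XHR response).
async function logChunkSizes(blob) {
  const reader = blob.stream().getReader();
  let result;
  while (!(result = await reader.read()).done) {
    // In Chromium and Firefox every chunk except the last appears to be 0x10000 bytes.
    console.log(result.value.byteLength);
  }
}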

@annevk
Member

annevk commented Jan 3, 2020

Is this with BYOB streams? Could you explain the issue in some more depth perhaps? I would have expected the chunk size to be implementation-dependent and perhaps to also depend on the hardware in use, but maybe that's not ideal. fetch() would have a similar issue.

@kevin-matthew
Author

So here is an in-depth explanation.

I'm making a wasm component that deals with downloading files asynchronously. When the file is downloaded, I then need to process the entire file inside the browser. The wasm module handles the processing; however, wasm has limited memory, so files as large as 5 GB must remain in JavaScript's memory space (or wherever downloaded file data is stored, I'm not sure). In order to process the data -- for example, the aforementioned 5 GB -- I must stream it through the wasm module.
To do this, I:

  1. Send an XMLHttpRequest with responseType = "blob" that will in turn GET a large file.
  2. Once downloaded (readyState == 4), get the XMLHttpRequest's response object - which will be a Blob - then call its stream().
  3. With that stream, call getReader() (with no arguments), which will return a ReadableStreamDefaultReader. (Note: I would like to use a BYOB reader, however it's not supported at all at this moment.)
  4. With that ReadableStreamDefaultReader, every call to read() yields a chunk that is always 0x10000 bytes in length; the chunk can be smaller, but only if it is the last amount of data being returned for the file.

Now, none of this is a problem (except the fact that I am forced to use recursive-async engineering for a simple read command). However, in step 4 I'm assuming that my wasm module will only ever need a buffer that's 0x10000 bytes in length. That number is not specified anywhere. It would be handy if it were... then I -- as a wasm developer -- would know exactly how much memory I need to allocate for all applications, making my wasm very efficient.
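
A condensed sketch of steps 1-4 (illustrative only; processChunk stands in for the call into the wasm module):

// Condensed sketch of steps 1-4 above; `processChunk` is a stand-in for the wasm call.
function downloadAndProcess(url, processChunk) {
  const xhr = new XMLHttpRequest();
  xhr.responseType = "blob";
  xhr.onload = async () => {
    // xhr.response is a Blob once the request has completed successfully.
    const reader = xhr.response.stream().getReader();
    let result;
    while (!(result = await reader.read()).done) {
      // result.value is a Uint8Array, observed to be 0x10000 bytes except for the final chunk.
      processChunk(result.value);
    }
  };
  xhr.open("GET", url);
  xhr.send();
}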

@annevk
Member

annevk commented Jan 5, 2020

cc @ricea

@ricea

ricea commented Jan 6, 2020

This is under the FileAPI's jurisdiction.

It's implementation-defined, and difficult to put tight constraints on without forcing implementations to do inefficient things. I hope Firefox and Chromium arrived at the same size by coincidence rather than reverse-engineering.

An implementation that returned 1 byte chunks would clearly be unreasonably inefficient. An implementation which returned the whole blob as a single chunk would be unreasonably unscalable. So it clearly is possible to define some bounds on what is a "reasonable" size.

I would recommend using dynamic allocation to store the chunks in wasm if possible, and assume that implementations will behave reasonably.

In the standard, it would probably be good to enforce "reasonable" behaviour by saying that no chunk can be >1M in size and no non-terminal chunk can be <512 bytes. Maybe that second constraint can be phrased more carefully to allow for ring-buffer implementations that may occasionally produce small chunks but mostly don't.

Alternatively, the standard could be extremely prescriptive and require 65536-byte non-terminal chunks, based on the assumption that any new implementation can be made to comply without too much loss of efficiency.
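
For illustration, a minimal sketch of the dynamic-allocation approach recommended above, assuming the wasm module exports hypothetical malloc, free, and process functions along with its linear memory:

// Minimal sketch of the dynamic-allocation approach; `malloc`, `free`,
// `process`, and `memory` are hypothetical wasm exports.
async function pump(blob, wasm) {
  const reader = blob.stream().getReader();
  let result;
  while (!(result = await reader.read()).done) {
    const chunk = result.value;
    // Allocate exactly as much wasm memory as this chunk needs,
    // rather than assuming a fixed 0x10000-byte chunk size.
    const ptr = wasm.malloc(chunk.byteLength);
    new Uint8Array(wasm.memory.buffer, ptr, chunk.byteLength).set(chunk);
    wasm.process(ptr, chunk.byteLength);
    wasm.free(ptr);
  }
}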

@annevk
Member

annevk commented Jan 6, 2020

What's the chance of us regretting such limits in 10 years? On the other hand, if applications are already depending on existing limits, maybe this will eventually require a toggle of sorts.

@kevin-matthew
Author

To express my opinion, I think defining a maximum of 0x10000 bytes would be a perfectly acceptable idea, and certainly future-proof. For comparison, the max size of a UDP packet is around that size as well (0x10000 - 1, to be exact), and no one has complained about it since its definition in 1980.

However, keep in mind that for my particular application I would only need a maximum defined, as that maximum alone would allow me to optimize heap allocation. Another solution, besides defining a maximum for implementations, would be to allow developers like me to pass an argument that defines the maximum at run time... but at that point, we'd be redefining the BYOB implementation.

@domenic
Contributor

domenic commented Jan 6, 2020

I agree with @ricea that it's better to write your application code to be resilient to larger (or smaller) chunk sizes, e.g. slicing the buffers as appropriate.

If you need control over the buffer sizes, then BYOB readers are the way to go, and we should not change the behavior of the default reader just because folks haven't implemented BYOB readers yet. Instead, we should take this as a potential signal to up the priority of BYOB.
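
For illustration, the BYOB pattern referred to above looks roughly like this; it assumes blob streams are byte streams and that getReader({ mode: "byob" }) works on them, which was not the case when this issue was filed:

// Rough sketch of the BYOB pattern; `consume` is a hypothetical processing function.
async function readWithOwnBuffer(blob) {
  const reader = blob.stream().getReader({ mode: "byob" });
  let buffer = new ArrayBuffer(0x10000); // caller-chosen size
  while (true) {
    const { value, done } = await reader.read(new Uint8Array(buffer));
    if (done) break;
    // `value` is a view filled with up to 0x10000 bytes.
    consume(value);
    buffer = value.buffer; // reuse the buffer handed back by the reader
  }
}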

@annevk
Member

annevk commented Jan 6, 2020

Theoretically I agree, but practically, if people are going to write code assuming limits and don't do the due diligence of checking whether that is future-proof, we might well be stuck and have to define such a limit in the future.

@guest271314

Is the 65536 re a File from <input type="file">? I also observed the same result when the input is a File object.

Cannot the chunks be split into the desired size using TypedArray.slice() or TypedArray.subarray() (with remainder temporarily stored and prepended to the next chunk) client-side?
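
A sketch of that client-side re-chunking, carrying the remainder over to the next read (chunkSize is whatever size the consumer needs):

// Sketch of client-side re-chunking: split incoming chunks into `chunkSize`
// pieces and carry any remainder over until the next read.
function rechunk(chunkSize) {
  let carry = new Uint8Array(0);
  return new TransformStream({
    transform(value, controller) {
      // Prepend whatever was left over from the previous chunk.
      const merged = new Uint8Array(carry.length + value.length);
      merged.set(carry);
      merged.set(value, carry.length);
      let offset = 0;
      while (merged.length - offset >= chunkSize) {
        controller.enqueue(merged.subarray(offset, offset + chunkSize));
        offset += chunkSize;
      }
      carry = merged.subarray(offset);
    },
    flush(controller) {
      if (carry.length) controller.enqueue(carry); // terminal, possibly shorter, chunk
    },
  });
}

// Usage: blob.stream().pipeThrough(rechunk(0x10000))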

@guest271314

However, in step 4 I'm assuming that my wasm module will only ever need a buffer that's 0x10000 bytes in length.

The wasm module will need to handle the last bytes of the file anyway; therefore, even with a defined chunk size for blob::stream(), chunks cannot be guaranteed to be any particular size, given that

the chunk can be smaller, but only if it is the last amount of data being returned for the file.

particularly where the input is not always the same file.

I encountered a similar case where the value (a Uint8Array) from a ReadableStream can be any arbitrary length during the read, with the last read having the length of the remainder of the file.

There is also an edge case where, if "Disable cache" is checked in the Network tab of DevTools in Chromium, operations that slice and splice the input into a specific length can have unexpected results, i.e., the file is not sliced to the desired size: https://bugs.chromium.org/p/chromium/issues/detail?id=1063524#c1.

FWIW, the solution I am currently using (https://github.com/guest271314/AudioWorkletStream/blob/master/audioWorklet.js#L10) handles input to an AudioWorkletProcessor from a ReadableStream, where the value is a Uint8Array that needs to be passed to a Uint16Array (which requires a byte length that is a multiple of 2). This means it is necessary to slice() and splice() the input, carry over the remainder until the next value is read, and prepend the carried-over bytes to the next Uint8Array until the length is a multiple of 2. AudioWorkletProcessor uses Float32Arrays for input and output, and if the Float32Array being set has a length less than 128, glitches and gaps in playback are observable because the remainder of the Float32Array is filled with 0's; alternatively, when the length set is greater than 128, a

RangeError: offset is out of bounds

is thrown - which happens to occur when "Disable cache" is checked in the Network tab of DevTools.
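
As a compact illustration of that carry-over (aligning the bytes to a multiple of 2 before constructing the Uint16Array; names are illustrative):

// Keep the byte length a multiple of 2 before viewing the data as a Uint16Array.
let leftover = new Uint8Array(0);
function toUint16(value) {
  const bytes = new Uint8Array(leftover.length + value.length);
  bytes.set(leftover);
  bytes.set(value, leftover.length);
  const usable = bytes.length - (bytes.length % 2); // largest even prefix
  leftover = bytes.slice(usable);                   // carry the odd byte, if any
  return new Uint16Array(bytes.buffer, 0, usable / 2);
}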

@guest271314

Another option for handling an arbitrary Uint8Array length at read() is to process exactly 1 byte at a time.

@lemanschik

lemanschik commented Nov 9, 2023

Isn't this enough?

new Response(
  Uint8Array.from(new Array(1024)).buffer
).body.pipeThrough(
  new TransformStream({ type: "byte", transform: (chunk, c) => chunk.map((byte) => c.enqueue(byte || "1")) })
).pipeThrough(
  new TransformStream({ type: "byte" }, new ByteLengthQueuingStrategy({ highWaterMark: 512, size: (c) => console.log(512, c) || 512 }))
).pipeTo(new WritableStream({ write(c) { console.log(c.length) } }))

I'm just asking for a friend :)

Why isn't the default behavior that a TransformStream, which is a pass-through stream, converts it to bytes?
