Define chunk size for ReadableStream created by blob::stream() #144

Open
kevin-matthew opened this issue Jan 3, 2020 · 12 comments

@kevin-matthew

Currently regarded as 'Issue 1'. I have recently come across a project that forces me to assume the chunk size of the stream. Testing with Chromium and Firefox, the chunk size appears to be 0x10000 (65536) bytes. I cannot find a reason why it's this particular number.

As I'm making a wasm module, I must allocate memory space when going through files 5 GB+ in size. I will allocate 0x10000 bytes for now, but only by assumption... if there's a browser out there that does not follow this assumption, there will be fatal bugs.

I'm not sure whether this is w3c/FileAPI's or streams.spec.whatwg.org's jurisdiction, but neither of them specifies an exact number.
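
For illustration, a minimal sketch of how this chunk size can be observed (blob here is assumed to be a Blob obtained elsewhere, e.g. an XHR response):

// Minimal sketch: log the size of each chunk produced by blob.stream().
// `blob` is assumed to be a Blob obtained elsewhere (e.g. an XHR response).
async function logChunkSizes(blob) {
  const reader = blob.stream().getReader();
  let result;
  while (!(result = await reader.read()).done) {
    // In Chromium and Firefox every chunk except the last appears to be 0x10000 bytes.
    console.log(result.value.byteLength);
  }
}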

@annevk
Member

annevk commented Jan 3, 2020

Is this with BYOB streams? Could you explain the issue in some more depth perhaps? I would have expected the chunk size to be implementation-dependent and perhaps to also depend on the hardware in use, but maybe that's not ideal. fetch() would have a similar issue.

@kevin-matthew
Author

So here is an in-depth explanation.

I'm making a wasm component that deals with downloading files asynchronously. When the file is downloaded, I then need to process the entire file inside the browser. The wasm module handles the processing; however, wasm has limited memory, so files as large as 5 GB must remain in JavaScript's memory space (or wherever downloaded file data is stored, I'm not sure). In order to process the data -- for example, the aforementioned 5 GB -- I must stream it through the wasm module.
To do this, I:

  1. Send an XMLHttpRequest with responseType = "blob" that will in turn GET a large file.
  2. Once downloaded (readyState == 4), get the XMLHttpRequest's response object - which will be a Blob - then call its stream().
  3. With that stream, call getReader() (with no arguments), which will return a ReadableStreamDefaultReader. (Note: I would like to use a BYOB reader, however it's not supported at all at this moment.)
  4. With that ReadableStreamDefaultReader, every call to read() yields a chunk that is always 0x10000 bytes in length; the chunk can be smaller, but only if it is the last amount of data being returned for the file.

Now, none of this is a problem (except the fact that I am forced to use recursive-async engineering for a simple read command). However, in step 4 I'm assuming that my wasm module will only ever need a buffer that's 0x10000 bytes in length. That number is not specified anywhere. It would be handy if it were... then I -- as a wasm developer -- would know exactly how much memory I need to allocate for all applications, making my wasm very efficient.
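
A condensed sketch of steps 1-4 (illustrative only; processChunk stands in for the call into the wasm module):

// Condensed sketch of steps 1-4 above; `processChunk` is a stand-in for the wasm call.
function downloadAndProcess(url, processChunk) {
  const xhr = new XMLHttpRequest();
  xhr.responseType = "blob";
  xhr.onload = async () => {
    // xhr.response is a Blob once the request has completed successfully.
    const reader = xhr.response.stream().getReader();
    let result;
    while (!(result = await reader.read()).done) {
      // result.value is a Uint8Array, observed to be 0x10000 bytes except for the final chunk.
      processChunk(result.value);
    }
  };
  xhr.open("GET", url);
  xhr.send();
}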

@annevk
Member

annevk commented Jan 5, 2020

cc @ricea

@ricea

ricea commented Jan 6, 2020

This is under the FileAPI's jurisdiction.

It's implementation-defined, and difficult to put tight constraints on without forcing implementations to do inefficient things. I hope Firefox and Chromium arrived at the same size by coincidence rather than reverse-engineering.

An implementation that returned 1 byte chunks would clearly be unreasonably inefficient. An implementation which returned the whole blob as a single chunk would be unreasonably unscalable. So it clearly is possible to define some bounds on what is a "reasonable" size.

I would recommend using dynamic allocation to store the chunks in wasm if possible, and assume that implementations will behave reasonably.

In the standard, it would probably be good to enforce "reasonable" behaviour by saying that no chunk can be >1M in size and no non-terminal chunk can be <512 bytes. Maybe that second constraint can be phrased more carefully to allow for ring-buffer implementations that may occasionally produce small chunks but mostly don't.

Alternatively, the standard could be extremely prescriptive and require 65536-byte non-terminal chunks, based on the assumption that any new implementation can be made to comply without too much loss of efficiency.
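
For illustration, a minimal sketch of the dynamic-allocation approach recommended above, assuming the wasm module exports hypothetical malloc, free, and process functions along with its linear memory:

// Minimal sketch of the dynamic-allocation approach; `malloc`, `free`,
// `process`, and `memory` are hypothetical wasm exports.
async function pump(blob, wasm) {
  const reader = blob.stream().getReader();
  let result;
  while (!(result = await reader.read()).done) {
    const chunk = result.value;
    // Allocate exactly as much wasm memory as this chunk needs,
    // rather than assuming a fixed 0x10000-byte chunk size.
    const ptr = wasm.malloc(chunk.byteLength);
    new Uint8Array(wasm.memory.buffer, ptr, chunk.byteLength).set(chunk);
    wasm.process(ptr, chunk.byteLength);
    wasm.free(ptr);
  }
}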

@annevk
Member

annevk commented Jan 6, 2020

What's the chance of us regretting such limits in 10 years? On the other hand, if applications are already depending on existing limits, maybe this will eventually require a toggle of sorts.

@kevin-matthew
Author

To express my opinion, I think defining a maximum of 0x10000 bytes would be a perfectly acceptable idea, and certainly future-proof. For comparison, the max size of a UDP packet is around that size as well (0x10000 - 1, to be exact), and no one has complained about it since its definition in 1980.

However, keep in mind that for my particular application I would only need a maximum defined, as that maximum alone would allow me to optimize heap allocation. Another solution, besides defining a maximum for implementations, would be to allow developers like me to pass an argument that defines the maximum at run time... but at that point, we'd be redefining the BYOB implementation.

@domenic
Contributor

domenic commented Jan 6, 2020

I agree with @ricea that it's better to write your application code to be resilient to larger (or smaller) chunk sizes, e.g. slicing the buffers as appropriate.

If you need control over the buffer sizes, then BYOB readers are the way to go, and we should not change the behavior of the default reader just because folks haven't implemented BYOB readers yet. Instead, we should take this as a potential signal to up the priority of BYOB.
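
For illustration, the BYOB pattern referred to above looks roughly like this; it assumes blob streams are byte streams and that getReader({ mode: "byob" }) works on them, which was not the case when this issue was filed:

// Rough sketch of the BYOB pattern; `consume` is a hypothetical processing function.
async function readWithOwnBuffer(blob) {
  const reader = blob.stream().getReader({ mode: "byob" });
  let buffer = new ArrayBuffer(0x10000); // caller-chosen size
  while (true) {
    const { value, done } = await reader.read(new Uint8Array(buffer));
    if (done) break;
    // `value` is a view filled with up to 0x10000 bytes.
    consume(value);
    buffer = value.buffer; // reuse the buffer handed back by the reader
  }
}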

@annevk
Member

annevk commented Jan 6, 2020

Theoretically I agree, but practically, if people are going to write code assuming limits and don't do the due diligence of checking whether that is future-proof, we might well be stuck and have to define such a limit in the future.

@guest271314

Is the 65536 re a File from <input type="file">? I also observed the same result when the input is a File object.

Cannot the chunks be split into the desired size using TypedArray.slice() or TypedArray.subarray() (with remainder temporarily stored and prepended to the next chunk) client-side?
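
A sketch of that client-side re-chunking, carrying the remainder over to the next read (chunkSize is whatever size the consumer needs):

// Sketch of client-side re-chunking: split incoming chunks into `chunkSize`
// pieces and carry any remainder over until the next read.
function rechunk(chunkSize) {
  let carry = new Uint8Array(0);
  return new TransformStream({
    transform(value, controller) {
      // Prepend whatever was left over from the previous chunk.
      const merged = new Uint8Array(carry.length + value.length);
      merged.set(carry);
      merged.set(value, carry.length);
      let offset = 0;
      while (merged.length - offset >= chunkSize) {
        controller.enqueue(merged.subarray(offset, offset + chunkSize));
        offset += chunkSize;
      }
      carry = merged.subarray(offset);
    },
    flush(controller) {
      if (carry.length) controller.enqueue(carry); // terminal, possibly shorter, chunk
    },
  });
}

// Usage: blob.stream().pipeThrough(rechunk(0x10000))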

@guest271314

However, in step 4 I'm assuming that my wasm module will only ever need a buffer that's 0x10000 bytes in length.

The wasm module will need to handle the last bytes of the file anyway; therefore, even with a defined chunk size for blob::stream(), chunks cannot be guaranteed to be any particular size, given that

the chunk can be smaller, but only if it is the last amount of data being returned for the file.

particularly where the input is not always the same file.

I encountered a similar case where the value (a Uint8Array) from a ReadableStream can be any arbitrary length during the read, with the last read having the length of the remainder of the file.

There is also an edge case where, if "Disable cache" is checked in the Network tab of DevTools in Chromium, operations that slice and splice the input into a specific length can have unexpected results, i.e., the file is not sliced to the desired size: https://bugs.chromium.org/p/chromium/issues/detail?id=1063524#c1.

FWIW, the solution I am currently using (https://github.com/guest271314/AudioWorkletStream/blob/master/audioWorklet.js#L10) handles input to an AudioWorkletProcessor from a ReadableStream, where the value is a Uint8Array that needs to be passed to a Uint16Array (which requires a byte length that is a multiple of 2). This means it is necessary to slice() and splice() the input, carry over the remainder until the next value is read, and prepend the carried-over bytes to the next Uint8Array until the length is a multiple of 2. AudioWorkletProcessor uses Float32Arrays for input and output, and if the Float32Array being set has a length less than 128, glitches and gaps in playback are observable because the remainder of the Float32Array is filled with 0's; alternatively, when the length set is greater than 128, a

RangeError: offset is out of bounds

is thrown - which happens to occur when "Disable cache" is checked in the Network tab of DevTools.
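
As a compact illustration of that carry-over (aligning the bytes to a multiple of 2 before constructing the Uint16Array; names are illustrative):

// Keep the byte length a multiple of 2 before viewing the data as a Uint16Array.
let leftover = new Uint8Array(0);
function toUint16(value) {
  const bytes = new Uint8Array(leftover.length + value.length);
  bytes.set(leftover);
  bytes.set(value, leftover.length);
  const usable = bytes.length - (bytes.length % 2); // largest even prefix
  leftover = bytes.slice(usable);                   // carry the odd byte, if any
  return new Uint16Array(bytes.buffer, 0, usable / 2);
}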

@guest271314

Another option for handling an arbitrary Uint8Array length at read() is to process exactly 1 byte at a time.

@lemanschik

lemanschik commented Nov 9, 2023

Isn't this enough?

new Response(
  Uint8Array.from(new Array(1024)).buffer
).body.pipeThrough(
  new TransformStream({ type: "byte", transform: (chunk, c) => chunk.map((byte) => c.enqueue(byte || "1")) })
).pipeThrough(
  new TransformStream({ type: "byte" }, new ByteLengthQueuingStrategy({ highWaterMark: 512, size: (c) => console.log(512, c) || 512 }))
).pipeTo(new WritableStream({ write(c) { console.log(c.length) } }))

I'm just asking for a friend :)

Why isn't the default behavior that a TransformStream, which is a pass-through stream, converts it to bytes?
