
Proper way to handle the return value of push() in the _transform() implementation of a transform stream #1791

Closed
kimamula opened this issue Feb 23, 2019 · 9 comments


@kimamula

  • Node.js Version: any
  • OS: any
  • Scope (install, code, runtime, meta, other?): code
  • Module (and version) (if relevant): stream

The documentation for transform._transform() says that "The transform.push() method may be called zero or more times".

The backpressure document says that you must respect the return value of .push() and should stop calling it when it returns false.

However, it is not clear what I should do if .push() returns false when I have more data to push in the transform._transform() implementation.

@gireeshpunathil
Member

I guess that, given transformers are always intermediary streams (connected to neither the source nor the sink), backpressure management does not come under their purview; instead it is a consideration for the connected streams at either end of the transformer.

cc @nodejs/streams to get an opinion.

@MHebes

MHebes commented Dec 15, 2020

> However, it is not clear what I should do if .push() returns false when I have more data to push in the transform._transform() implementation.

You should wait for a "drain" event before pushing the remainder of your data. Prevent more data from arriving by not calling the callback passed to _transform until you have finished pushing.

See this comment for a good example.

@mcollina
Member

> However, it is not clear what I should do if .push() returns false when I have more data to push in the transform._transform() implementation.

I recommend just calling .push() for all the data that is produced within ._transform().

@brucedjones

@mcollina this is directly contradictory to the comment from @MHebes, could you elaborate?

@mcollina
Member

> @mcollina this is directly contradictory to the comment from @MHebes, could you elaborate?

A transform stream sits between two other streams; essentially you are moving data: readable -> transform -> writable. Transform implements buffering on both sides. You have data coming in from the readable inside _transform(); the alternative to using .push() is to buffer manually. But why implement one more level of buffering? Just use the one provided by Node core. The return value of .push() is for direct Readable implementors, not Transform users.

I don't know how much there is to elaborate here. FYI, I'm one of the maintainers of streams themselves.

@sfriesel

> however why implement one more level of buffering?

@mcollina Because a transform does not just move data; it can also inflate it. A decompressor can increase the amount of data by orders of magnitude, so by having the writing side control the flow, backpressure control goes out the window. See for example this bug: regular/unbzip2-stream#17. The transformer buffers ~1MB of input to be sure to get at least one full compressed block. But when the stream is ending, that last megabyte may decompress to hundreds of MB of output, all of which has to be pushed.

@sfriesel

Or, as another example: decompression-bomb detection. The internal zlib transformer gained the ability to limit output size (nodejs/node@27253, nodejs/node#33516). If backpressure on the Readable side were properly bounded, the consumer of any type of Transform stream could protect itself against decompression bombs. In zlib the maximum compression ratio is ~1:1000; it is far higher in more effective compression formats.

@mcollina
Member

@sfriesel I stand by my recommendation in the generic case. Unless you have some very specific problem you want to solve, stay away from adding another level of buffering because you are likely to implement it wrongly. In the few cases this is desperately needed, there is .unshift() to push back data upstream and avoid implementing yet another level of buffering.

Given your expertise on the subject, would you like to work on a few PRs to help with this?

@dobesv

dobesv commented Dec 6, 2023

Another case where this seems to be a problem: adaltas/node-csv#408

It doesn't seem like there's actually any way to implement the back pressure yourself without reimplementing Transform completely.

I wonder whether Transform could somehow be augmented to call _transform again after push() has returned false. It would also have to work after _flush somehow.

7 participants