csv-parse stream will process the whole buffer even with back pressure. #408

Open
dobesv opened this issue Dec 6, 2023 · 6 comments

dobesv commented Dec 6, 2023

Describe the bug

Using csv-parse@5.5.2, I have found that if I provide a buffer or stream input to the parser, it will always emit every row from a chunk it receives, even if there is back-pressure. Back-pressure is only applied between the chunks it receives from its input.

To Reproduce

import assert from 'assert'
import { parse, Parser } from 'csv-parse'
import { pipeline, Readable, Writable, WritableOptions } from 'stream';


// Create the parser
const parser: Parser = parse();
parser.on('data', (row) => {
    console.log('parser row', row, (parser as any)._readableState.length);
})

const bufs = [];
for(let i=0; i < 100000; i++) {
    bufs.push(Buffer.from(`a${i}, b${i}, ${i}\r\n`));
}
const inputBuffer = Buffer.concat(bufs);
const input = Readable.from([inputBuffer]);
input.on('data', (chunk) => {
    console.log('input chunk', chunk.length);
})

class BackpressureWritable extends Writable {
    count: number;
    threshold: number;

    constructor(options: WritableOptions) {
        super(options);
        this.count = 0;
        this.threshold = 10;
    }

    // @ts-ignore
    write(chunk, encoding, callback) {
        const result = super.write(chunk, encoding, callback);
        console.log(`write(${chunk.toString()}) => ${result}`);
        return result;
    }

    _write(chunk: any, encoding: string, callback: any) {
        this.count++;
        console.log(`_write(${chunk.toString()})`);

        setTimeout(callback, this.count); // Simulating delay to handle backpressure
    }
}


const output = new BackpressureWritable({objectMode: true, highWaterMark: 1});
pipeline(input, parser, output, () => {
    console.log('pipeline output');
});

output.on('finish', () => {
    console.log('finish');
});

If you run the above script you will see that (parser as any)._readableState.length immediately grows to include every row: all the rows are buffered into the Readable half of the parser regardless of back-pressure.

In some cases a user of the library may want to pass in a buffer of many megabytes, expecting it to be processed in small batches (say, bounded by the stream's high-water mark). With this library, however, all the rows are parsed immediately, using far more memory than necessary.

In order to fix this, the library should check the return value of push, and if it is false it should pause parsing even if it has enough data buffered from the input to read another record. I'm not actually sure how to know when it is OK to call push again, though. The documentation isn't clear on this point.

See also

dobesv commented Dec 6, 2023

I messed around with this for a while trying to figure out a solution, but this seems to be a case that just isn't handled by the Transform API. After the last buffer is provided to the transform, it calls _flush and closes the stream, and you don't get another chance to push the remaining data even if there was back-pressure. Kind of an odd gap in the API.

Anyway, my workaround for now is to split up the incoming buffers so that they are always small, which seems to work OK. But with this issue lingering, I suppose others will be bitten by it in the future.
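For reference, here is a rough sketch of that workaround. The RechunkTransform name and the 16 KB slice size are made up and this is not part of csv-parse; the idea is just to slice each incoming Buffer into small pieces before it reaches the parser, so the parser only parses a few rows per chunk and the ordinary between-chunk back-pressure applies.

import { Transform, TransformCallback } from 'stream';

// Hypothetical sketch of the workaround described above (not part of csv-parse):
// slice each incoming Buffer into small pieces before it reaches the parser.
// The large input Buffer is already in memory, so pushing the slices eagerly is
// fine; the point is that the parser now only parses a few rows per small chunk,
// and normal Transform back-pressure applies between those chunks.
class RechunkTransform extends Transform {
    constructor(private readonly chunkSize = 16 * 1024) {
        super();
    }

    _transform(chunk: Buffer, _encoding: BufferEncoding, callback: TransformCallback) {
        for (let offset = 0; offset < chunk.length; offset += this.chunkSize) {
            this.push(chunk.subarray(offset, offset + this.chunkSize));
        }
        callback();
    }
}

Piping the input through something like this before the parser keeps the parser's readable buffer around its high-water mark instead of holding every parsed row.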

wdavidw (Member) commented Dec 6, 2023

Is it not the responsibility of the stream reader (input in your case) to provide smaller chunks? If someone doesn't control the chunk size of their input, they could insert a custom transform between the input and the parser. From the parser's standpoint, processing all the data it receives seems fair.
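For example, with a re-chunking transform along the lines of the hypothetical RechunkTransform sketched in the previous comment, the wiring would look roughly like this:

pipeline(input, new RechunkTransform(), parser, output, (err) => {
    if (err) console.error('pipeline failed', err);
});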

dobesv commented Dec 6, 2023

I don't think it's unreasonable to expect that, even if you pass the entire large dataset as a single string or Buffer, it would still be processed in chunks to reduce memory usage. A 300 MB Buffer of CSV data could easily use several GB of RAM if all the rows are parsed in advance. Putting 300 MB in RAM might not be a big deal, but one would expect the rows to be parsed in small batches so that the additional memory usage stays minimal.

At any rate, this issue took me about six hours to figure out and work around yesterday, so I filed it here because I can imagine others being unpleasantly surprised by it.

The workaround isn't super hard to implement once you know it's needed, but it's not very obvious that it is.

dobesv commented Dec 6, 2023

With a bit more research, I think resolving this would require implementing Duplex directly instead of using Transform. Kind of a pain, but the Node.js Transform API doesn't have any built-in concept of one-sided back-pressure; it basically assumes the input and output are of a similar size and passes all back-pressure upstream.
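Purely for illustration (this is not how csv-parse is implemented, the LineSplittingDuplex name is invented, and it splits on newlines rather than doing real CSV parsing), a minimal Duplex along those lines could buffer raw input on the writable side, push rows only while push() keeps returning true, and hold the write callback until the current chunk's rows have drained:

import { Duplex } from 'stream';

// Minimal sketch of the Duplex approach described above. This is not csv-parse's
// actual implementation; it splits lines instead of really parsing CSV. Rows are
// pushed only while push() keeps returning true, and the write callback is held
// until the current chunk's rows have drained, so readable-side back-pressure is
// respected even inside one large input Buffer.
class LineSplittingDuplex extends Duplex {
    private tail = '';                        // unterminated remainder of the input
    private rows: string[] = [];              // parsed rows waiting to be pushed
    private pendingWriteCallback: ((error?: Error | null) => void) | null = null;
    private inputEnded = false;

    constructor() {
        super({ readableObjectMode: true });
    }

    _write(chunk: Buffer, _encoding: BufferEncoding, callback: (error?: Error | null) => void) {
        const parts = (this.tail + chunk.toString()).split(/\r?\n/);
        this.tail = parts.pop() ?? '';
        this.rows.push(...parts.filter((line) => line.length > 0));
        // Hold the callback: it is only invoked once this chunk's rows have been
        // consumed, which is what propagates back-pressure to the writable side.
        this.pendingWriteCallback = callback;
        this.drainRows();
    }

    _final(callback: (error?: Error | null) => void) {
        if (this.tail.length > 0) this.rows.push(this.tail);
        this.inputEnded = true;
        this.drainRows();
        callback();
    }

    _read() {
        // The consumer asked for more data: resume pushing buffered rows.
        this.drainRows();
    }

    private drainRows() {
        while (this.rows.length > 0) {
            // Stop as soon as the readable buffer is full; _read() will resume us.
            if (!this.push(this.rows.shift())) return;
        }
        const cb = this.pendingWriteCallback;
        this.pendingWriteCallback = null;
        if (cb) cb();                          // accept the next input chunk
        if (this.inputEnded) this.push(null);  // signal end of the readable side
    }
}

With something like this, push() stopping at the high-water mark is enough to keep only a bounded number of rows buffered at a time, even when the whole input arrives as one Buffer.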

hperrin commented Feb 19, 2024

In case you're still looking for a solution here, I wrote my own back-pressure-aware transform stream. You can find it here:

https://github.com/sciactive/nephele/blob/37bac308b75fd660ebbbe0d93ed86b504237d20c/packages/plugin-encryption/src/BackPressureTransform.ts

And here's an example of it being used:

https://github.com/sciactive/nephele/blob/37bac308b75fd660ebbbe0d93ed86b504237d20c/packages/plugin-encryption/src/EncryptionProxyResource.ts#L115

I'm thinking about making this a separate NPM package.

hperrin commented Feb 19, 2024

Ok, here it is:

https://www.npmjs.com/package/@sciactive/back-pressure-transform

Try it out and see if it works for you.
