
Request for a Reliable Transformer to Parse Large CSV Files in Browser Environment #398

Open
master-maintenance1-peer-connect opened this issue Aug 16, 2023 · 5 comments


Summary

Please provide an implementation of a Transformer that can parse large CSV files in a browser environment.

Motivation

I tried to implement such a CsvParseTransformer myself.
However, for large files that contain long values with embedded line breaks in a single column, and which therefore put pressure on the browser's memory, I had to rely on internal APIs that csv-parse does not export, as shown below. This raises concerns about breakage in future releases.
In addition, I was unable to hook an AbortController into the CsvParseTransformer, so I could not purge its internal buffer on abort, and I could not find a csv-parse-sanctioned way to interrupt the CsvTransformAPI via an AbortController.

import { transform as CsvTransformAPI } from '../../node_modules/csv-parse/lib/api/index.js'

Alternative

Therefore, I would like you to provide a Transformer that is guaranteed to work continuously.
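
To make the request concrete, the sketch below shows one possible shape for such an officially supported Transformer. The export name is purely hypothetical and does not exist in csv-parse today; it only illustrates the desired surface.

```ts
// Hypothetical shape only: csv-parse does not currently export anything like this.
import type { Options as CsvParseOptions, Info as CsvParseInfo } from 'csv-parse';

// A WHATWG-streams Transformer maintained by the library itself, so that user
// code never has to reach into csv-parse/lib/api internals.
declare function csvParseTransformer(
    options: CsvParseOptions
): Transformer<Uint8Array, { info: CsvParseInfo; record: string[] }>;

// Desired call site (using the hypothetical export above):
// file.stream().pipeThrough(new TransformStream(csvParseTransformer({ bom: true, info: true })));
```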

**CsvParseTransformer Draft**

import { transform as CsvTransformAPI } from '../../node_modules/csv-parse/lib/api/index.js'
import { CsvError } from '../../node_modules/csv-parse/lib/api/CsvError.js';
import type { Options as csvParseOptions, Info as csvParseInfo } from 'csv-parse';

export type ICsvBufferTransformData = { info: csvParseInfo, record: string[] };
export class CsvParseTransformer implements Transformer<Uint8Array, ICsvBufferTransformData> {
    csvTransform: { // @TODO: relies directly on the internal contract of csv-parse/lib/api transform.parse, which is fragile
        parse: (nextBuf: Uint8Array | undefined, isEnd: boolean,
            pushFunc: (data: ICsvBufferTransformData) => void,
            closeFunc: () => void) => CsvError | Error | undefined
    };
    constructor(csvParseOption: csvParseOptions) {
        this.csvTransform = CsvTransformAPI(csvParseOption);
    }
    transform(chunk: Uint8Array, controller: TransformStreamDefaultController<ICsvBufferTransformData>) {
        // console.log("at CsvParseTransformer.transform", chunk)
        try {
            const err = this.csvTransform.parse(chunk, false, (data: ICsvBufferTransformData) => {
                // console.log("at CsvParseTransformer.transform parsed:", data)
                controller.enqueue(data);
            }, () => controller.terminate());
            if (err) {
                // console.error("ERROR at CsvParseTransformer transform continue csvTransform.parse", err);
                controller.error(err);
            }
        } catch (err) {
            console.error("ERROR at CsvParseTransformer transform continue csvTransform.parse catch", err);
            controller.error(err);
        }
    }
    flush(controller: TransformStreamDefaultController<ICsvBufferTransformData>) {
        try {
            const err = this.csvTransform.parse(undefined, true, (data: ICsvBufferTransformData) => {
                controller.enqueue(data)
            }, () => controller.terminate());
            if (err) {
                // console.error("ERROR at CsvParseTransformer transform last csvTransform.parse", err);
                controller.error(err);
            }
        } catch (err) {
            console.error("ERROR at CsvParseTransformer transform last csvTransform.parse catch", err);
            controller.error(err);
        }
    }
}

**Example usage of CsvParseTransformer**
It is convenient to be able to transfer data (from an `<input type=file>`) to a RESTful API using a Streams pipeline.

export function csvFile_upload2_kintone(file: File,
    csvUpDownTaskConfig1: ICSVUpDownTaskConfig, csvHeaders: string[],
    on_record: (data: { info: csvParseInfo, record: any }, context) => { info: csvParseInfo, record: any },
    on_write: (recordCount: number) => void) {
    const readableStrategy_countQueuing = Kintone_API_records_limit;
    const writableStrategy_highWaterMark = Math.max((csvHeaders?.length || 10) * 255 * readableStrategy_countQueuing, CSV_File_buffer_limit_size);

    const abortController = new AbortController();
    const kintoneSvWriter = new kintoneStreamFactory_uploadRecords(csvUpDownTaskConfig1, abortController, on_write);
    const csvFileStream = file.stream();
    return csvFileStream.pipeThrough<ICsvBufferTransformData>(
        new TransformStream(new CsvParseTransformer(({
            autoParseDate: false,
            delimiter: Kintone_SEPARATOR,
            encoding: "utf-8",
            bom: csvUpDownTaskConfig1.csvFileWithBOM,
            escape: csvUpDownTaskConfig1.csvStringEscape,
            trim: true,
            record_delimiter: Kintone_LINE_BREAK, //"\r\n",
            relax_column_count: true,
            relax_quotes: true,
            skip_empty_lines: true,
            max_record_size: writableStrategy_highWaterMark,
            on_record: on_record,
            from: 2,
            columns: false,
            info: true
        }))
            ,
            new ByteLengthQueuingStrategy({ highWaterMark: writableStrategy_highWaterMark }),
            new CountQueuingStrategy({ highWaterMark: readableStrategy_countQueuing })
        ))
        .pipeThrough<KintoneRecordForParameter>(new TransformStream(new csv2KintoneRecordsTransform(csvUpDownTaskConfig1)))
        .pipeThrough<KintoneRecordForParameter[]>(new TransformStream(new CsvRecordBufferingTransformer(Kintone_API_records_limit)))
        .pipeTo(kintoneSvWriter.getWriter(), { signal: abortController.signal })
        .finally(() => {
            try {
                csvFileStream.getReader().cancel();
                console.log("at csvFile_upload2_kintone csvFileStream.getReader().cancel() success")
            } catch (_) { }
        })
}
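
For completeness, a caller might wire the function above to a file input roughly as sketched below. The config object, header list, and callbacks are placeholders for values the surrounding application would provide.

```ts
// Sketch of a call site; taskConfig, headers, and the callbacks are placeholders.
declare const taskConfig: ICSVUpDownTaskConfig; // provided elsewhere by the application
declare const headers: string[];                // expected CSV column headers

const input = document.querySelector<HTMLInputElement>('input[type=file]')!;
input.addEventListener('change', async () => {
    const file = input.files?.[0];
    if (!file) return;
    await csvFile_upload2_kintone(
        file,
        taskConfig,
        headers,
        (data) => data,                                      // on_record: pass records through unchanged
        (count) => console.log(`${count} records written`)   // on_write: simple progress reporting
    );
});
```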

PabloReszczynski commented Nov 6, 2023

I think it would be a good idea to pass a max_buffer_size configuration parameter to the parser constructor so that we can better control how much memory the parser is allowed to use.
I'm having a similar problem parsing CSV files in a worker at the edge, where there are hard memory limits.
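
For illustration, the proposed option might be used as sketched below; note that max_buffer_size is not an existing csv-parse option, this only shows the intended call shape.

```ts
import { parse, Options } from 'csv-parse';

// Hypothetical option, sketched only to illustrate the proposal: cap the
// parser's internal buffer instead of letting memory grow without bound.
// max_buffer_size is NOT an existing csv-parse option.
type ProposedOptions = Options & { max_buffer_size?: number };

const options: ProposedOptions = {
    relax_column_count: true,
    max_buffer_size: 4 * 1024 * 1024, // e.g. a 4 MiB ceiling
};

// Not expected to work today; included only to show the intended usage.
const parser = parse(options as Options);
```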

wdavidw (Member) commented Nov 6, 2023

What do you mean by max_buffer_size? I don't find any reference to this parameter in the Node.js stream API. Any option passed to the parser is also passed to the underlying stream.
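
If that is the case, standard Node.js stream options should be usable alongside the CSV options to bound buffering; for example, an untested sketch (highWaterMark is a Node.js stream option, not a csv-parse one, hence the cast):

```ts
import { parse, Options } from 'csv-parse';

// Untested sketch: per the comment above, options given to the parser are also
// handed to the underlying stream.Transform, so a standard highWaterMark should
// bound how much data the stream buffers before applying backpressure.
const parser = parse({
    bom: true,
    relax_column_count: true,
    highWaterMark: 64 * 1024, // bytes buffered before backpressure kicks in
} as Options & { highWaterMark: number }); // highWaterMark is not in csv-parse's Options type
```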


ermi-ltd commented Apr 24, 2024

Hi,

I'd like to add a +1 to this, as we're encountering a similar issue. We're trying to parse large input files in the browser via the ESM streaming API without excessive memory pressure, and we're hitting issues.

wdavidw (Member) commented Apr 24, 2024

Would you be able to share a reproducible script in JS (no TS)?

@ermi-ltd

@wdavidw - I'll work on something when I'm back at my desk for ya.
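
Until a full script is available, a minimal reproduction might look something like the sketch below, in plain JS as requested. It assumes the csv-parse browser ESM build resolved via a bundler or import map; the point it demonstrates is that with the public browser API the whole file has to be materialised in memory before parsing starts, which is the memory pressure described above.

```js
// Minimal sketch of the problem (plain JS). Assumes a bundler or import map
// resolves the csv-parse browser ESM build.
import { parse } from 'csv-parse/browser/esm/sync';

const input = document.querySelector('input[type=file]');

input.addEventListener('change', async () => {
  const file = input.files[0];
  if (!file) return;

  // The entire file is read into memory before parsing; there is no supported
  // way to feed it to the parser chunk by chunk from a WHATWG stream.
  const text = await file.text();
  const records = parse(text, {
    bom: true,
    relax_column_count: true,
    skip_empty_lines: true,
  });
  console.log(records.length, 'records parsed');
});
```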
