
Request for a Reliable Transformer to Parse Large CSV Files in Browser Environment #398

Open
master-maintenance1-peer-connect opened this issue Aug 16, 2023 · 5 comments


Summary

Please provide an implementation of a Transformer that can parse large CSV files in a browser environment.

Motivation

I tried to implement such a CsvParseTransformer myself.
However, for large files that contain long values with embedded line breaks in a single column, and which therefore put pressure on the browser's memory, I had to rely on internal APIs that csv-parse does not export, as shown below. This raises concerns about breakage in future releases.
In addition, I was unable to hook an AbortController into the CsvParseTransformer, so I could not purge its internal buffer on abort, and I could not find a csv-parse-sanctioned way to interrupt the CsvTransformAPI via an AbortController.

import { transform as CsvTransformAPI } from '../../node_modules/csv-parse/lib/api/index.js'

Alternative

Therefore, I would like you to provide a Transformer that is guaranteed to work continuously.
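
To make the request concrete, the sketch below shows one possible shape for such an officially supported Transformer. The export name is purely hypothetical and does not exist in csv-parse today; it only illustrates the desired surface.

```ts
// Hypothetical shape only: csv-parse does not currently export anything like this.
import type { Options as CsvParseOptions, Info as CsvParseInfo } from 'csv-parse';

// A WHATWG-streams Transformer maintained by the library itself, so that user
// code never has to reach into csv-parse/lib/api internals.
declare function csvParseTransformer(
    options: CsvParseOptions
): Transformer<Uint8Array, { info: CsvParseInfo; record: string[] }>;

// Desired call site (using the hypothetical export above):
// file.stream().pipeThrough(new TransformStream(csvParseTransformer({ bom: true, info: true })));
```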

**CsvParseTransformer Draft**

import { transform as CsvTransformAPI } from '../../node_modules/csv-parse/lib/api/index.js'
import { CsvError } from '../../node_modules/csv-parse/lib/api/CsvError.js';
import type { Options as csvParseOptions, Info as csvParseInfo } from 'csv-parse';

export type ICsvBufferTransformData = { info: csvParseInfo, record: string[] };
export class CsvParseTransformer implements Transformer<Uint8Array, ICsvBufferTransformData> {
    csvTransform: { // @TODO: relies directly on the internal contract of csv-parse/lib/api transform.parse, which is fragile
        parse: (nextBuf: Uint8Array | undefined, isEnd: boolean,
            pushFunc: (data: ICsvBufferTransformData) => void,
            closeFunc: () => void) => CsvError | Error | undefined
    };
    constructor(csvParseOption: csvParseOptions) {
        this.csvTransform = CsvTransformAPI(csvParseOption);
    }
    transform(chunk: Uint8Array, controller: TransformStreamDefaultController<ICsvBufferTransformData>) {
        // console.log("at CsvParseTransformer.transform", chunk)
        try {
            const err = this.csvTransform.parse(chunk, false, (data: ICsvBufferTransformData) => {
                // console.log("at CsvParseTransformer.transform parsed:", data)
                controller.enqueue(data);
            }, () => controller.terminate());
            if (err) {
                // console.error("ERROR at CsvParseTransformer transform continue csvTransform.parse", err);
                controller.error(err);
            }
        } catch (err) {
            console.error("ERROR at CsvParseTransformer transform continue csvTransform.parse catch", err);
            controller.error(err);
        }
    }
    flush(controller: TransformStreamDefaultController<ICsvBufferTransformData>) {
        try {
            const err = this.csvTransform.parse(undefined, true, (data: ICsvBufferTransformData) => {
                controller.enqueue(data)
            }, () => controller.terminate());
            if (err) {
                // console.error("ERROR at CsvParseTransformer transform last csvTransform.parse", err);
                controller.error(err);
            }
        } catch (err) {
            console.error("ERROR at CsvParseTransformer transform last csvTransform.parse catch", err);
            controller.error(err);
        }
    }
}

**Example usage of CsvParseTransformer**
It is convenient to be able to transfer data (from an `<input type=file>`) to a RESTful API using a Streams pipeline.

export function csvFile_upload2_kintone(file: File,
    csvUpDownTaskConfig1: ICSVUpDownTaskConfig, csvHeaders: string[],
    on_record: (data: { info: csvParseInfo, record: any }, context) => { info: csvParseInfo, record: any },
    on_write: (recordCount: number) => void) {
    const readableStrategy_countQueuing = Kintone_API_records_limit;
    const writableStrategy_highWaterMark = Math.max((csvHeaders?.length || 10) * 255 * readableStrategy_countQueuing, CSV_File_buffer_limit_size);

    const abortController = new AbortController();
    const kintoneSvWriter = new kintoneStreamFactory_uploadRecords(csvUpDownTaskConfig1, abortController, on_write);
    const csvFileStream = file.stream();
    return csvFileStream.pipeThrough<ICsvBufferTransformData>(
        new TransformStream(new CsvParseTransformer(({
            autoParseDate: false,
            delimiter: Kintone_SEPARATOR,
            encoding: "utf-8",
            bom: csvUpDownTaskConfig1.csvFileWithBOM,
            escape: csvUpDownTaskConfig1.csvStringEscape,
            trim: true,
            record_delimiter: Kintone_LINE_BREAK, //"\r\n",
            relax_column_count: true,
            relax_quotes: true,
            skip_empty_lines: true,
            max_record_size: writableStrategy_highWaterMark,
            on_record: on_record,
            from: 2,
            columns: false,
            info: true
        }))
            ,
            new ByteLengthQueuingStrategy({ highWaterMark: writableStrategy_highWaterMark }),
            new CountQueuingStrategy({ highWaterMark: readableStrategy_countQueuing })
        ))
        .pipeThrough<KintoneRecordForParameter>(new TransformStream(new csv2KintoneRecordsTransform(csvUpDownTaskConfig1)))
        .pipeThrough<KintoneRecordForParameter[]>(new TransformStream(new CsvRecordBufferingTransformer(Kintone_API_records_limit)))
        .pipeTo(kintoneSvWriter.getWriter(), { signal: abortController.signal })
        .finally(() => {
            try {
                csvFileStream.getReader().cancel();
                console.log("at csvFile_upload2_kintone csvFileStream.getReader().cancel() success")
            } catch (_) { }
        })
}
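
For completeness, a caller might wire the function above to a file input roughly as sketched below. The config object, header list, and callbacks are placeholders for values the surrounding application would provide.

```ts
// Sketch of a call site; taskConfig, headers, and the callbacks are placeholders.
declare const taskConfig: ICSVUpDownTaskConfig; // provided elsewhere by the application
declare const headers: string[];                // expected CSV column headers

const input = document.querySelector<HTMLInputElement>('input[type=file]')!;
input.addEventListener('change', async () => {
    const file = input.files?.[0];
    if (!file) return;
    await csvFile_upload2_kintone(
        file,
        taskConfig,
        headers,
        (data) => data,                                      // on_record: pass records through unchanged
        (count) => console.log(`${count} records written`)   // on_write: simple progress reporting
    );
});
```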

PabloReszczynski commented Nov 6, 2023

I think it would be a good idea to pass a max_buffer_size configuration parameter to the parser constructor so that we can better control how much memory the parser is allowed to use.
I'm having a similar problem parsing CSV files in a worker at the edge, where there are hard memory limits.
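
For illustration, the proposed option might be used as sketched below; note that max_buffer_size is not an existing csv-parse option, this only shows the intended call shape.

```ts
import { parse, Options } from 'csv-parse';

// Hypothetical option, sketched only to illustrate the proposal: cap the
// parser's internal buffer instead of letting memory grow without bound.
// max_buffer_size is NOT an existing csv-parse option.
type ProposedOptions = Options & { max_buffer_size?: number };

const options: ProposedOptions = {
    relax_column_count: true,
    max_buffer_size: 4 * 1024 * 1024, // e.g. a 4 MiB ceiling
};

// Not expected to work today; included only to show the intended usage.
const parser = parse(options as Options);
```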

wdavidw (Member) commented Nov 6, 2023

What do you mean by max_buffer_size? I don't find any reference to this parameter in the Node.js stream API. Any option passed to the parser is also passed to the underlying stream.
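
If that is the case, standard Node.js stream options should be usable alongside the CSV options to bound buffering; for example, an untested sketch (highWaterMark is a Node.js stream option, not a csv-parse one, hence the cast):

```ts
import { parse, Options } from 'csv-parse';

// Untested sketch: per the comment above, options given to the parser are also
// handed to the underlying stream.Transform, so a standard highWaterMark should
// bound how much data the stream buffers before applying backpressure.
const parser = parse({
    bom: true,
    relax_column_count: true,
    highWaterMark: 64 * 1024, // bytes buffered before backpressure kicks in
} as Options & { highWaterMark: number }); // highWaterMark is not in csv-parse's Options type
```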


ermi-ltd commented Apr 24, 2024

Hi,

I'd like to add a +1 to this, as we're encountering a similar issue. We're trying to parse large input files in the browser via the ESM streaming API without excessive memory pressure, and we're hitting issues.

wdavidw (Member) commented Apr 24, 2024

Would you be able to share a reproducible script in JS (no TS)?

@ermi-ltd

@wdavidw - I'll work on something when I'm back at my desk for ya.
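
Until a full script is available, a minimal reproduction might look something like the sketch below, in plain JS as requested. It assumes the csv-parse browser ESM build resolved via a bundler or import map; the point it demonstrates is that with the public browser API the whole file has to be materialised in memory before parsing starts, which is the memory pressure described above.

```js
// Minimal sketch of the problem (plain JS). Assumes a bundler or import map
// resolves the csv-parse browser ESM build.
import { parse } from 'csv-parse/browser/esm/sync';

const input = document.querySelector('input[type=file]');

input.addEventListener('change', async () => {
  const file = input.files[0];
  if (!file) return;

  // The entire file is read into memory before parsing; there is no supported
  // way to feed it to the parser chunk by chunk from a WHATWG stream.
  const text = await file.text();
  const records = parse(text, {
    bom: true,
    relax_column_count: true,
    skip_empty_lines: true,
  });
  console.log(records.length, 'records parsed');
});
```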
