
CSV parsing is broken in 5.4.1 #998

Open

rpr-ableton opened this issue Apr 19, 2023 · 14 comments

@rpr-ableton
Hi everyone,

First of all, big thanks for this great library that's been a big help in my work 🙌

I noticed when going from 5.3.2 to 5.4.1 that some CSV files can't be parsed successfully anymore. I'll attach below a small CSV, a Node.js script that reproduces the issue, and the output.

example.csv:

colName_2,colName_3,another_col,entity,some_bool,period,date,value_eur
"foo, bar","foo, bar",foo_bar_baz_V1,bar baz,0,2022_02,02/01/22,1234.45678
"foo, bar","foo, bar",foo_bar_baz_V1,bar baz,0,2022_03,03/01/22,1234.45678
"foo, bar","foo, bar",foo_bar_baz_V2,bar baz,0,2022_11,11/01/22,-12345.67891
"foo, bar","foo, bar",foo_bar_baz_V2,bar baz,0,2022_12,12/01/22,-12345.67891

script.js:

const papa = require('papaparse');
const fs = require('fs');

const readStream = fs.createReadStream(`${__dirname}/example.csv`);
papa.parse(readStream, {
    header: true,
    step: (res, parser) => {
        // Pause on each row, print its values in sorted key order, then resume.
        parser.pause();
        Object.keys(res.data)
            .sort()
            .forEach((key) => {
                console.log(res.data[key]);
            });
        parser.resume();
    },
    error: (err) => {
        console.error(err);
    },
    complete: () => {
        console.log('finished');
    },
});

Output with papaparse 5.4.1:

$ node script.js 
foo_bar_baz_V1
foo, bar
foo, bar
02/01/22
bar baz
2022_02
0
1234.45678
[ '-12345.67891' ]
foo, bar
foo, bar
foo_1, bar"_1,foo_bar_baz_V1,bar baz,0,2022_03,03/01/22,1234.45678
"foo, bar
2022_11
foo_bar_baz_V2
0
bar baz
11/01/22
foo, bar"_1,foo_bar_baz_V2,bar baz,0,2022_12,12/01/22,-12345.67891


 bar"
finished

Let me know if you need any additional information to better capture the issue.

Thanks a bunch.

@pokoli
Collaborator

pokoli commented Apr 19, 2023

Hi @rpr-ableton

Could you please describe what the issue is?

@rpr-ableton
Author

The symptoms are:

  • some values from the CSV are parsed as arrays containing a single value
  • some values get _1 appended to them for no apparent reason
  • the CSV file is not parsed correctly, as you can see in the output where multiple values get concatenated together

Maybe to illustrate the point better, here's the output from the same script but using papaparse 5.3.2:

$ node script.js 
foo_bar_baz_V1
foo, bar
foo, bar
02/01/22
bar baz
2022_02
0
1234.45678
foo_bar_baz_V1
foo, bar
foo, bar
03/01/22
bar baz
2022_03
0
1234.45678
foo_bar_baz_V2
foo, bar
foo, bar
11/01/22
bar baz
2022_11
0
-12345.67891
foo_bar_baz_V2
foo, bar
foo, bar
12/01/22
bar baz
2022_12
0
-12345.67891
finished

@pokoli
Collaborator

pokoli commented Apr 19, 2023

Hi @rpr-ableton,

We need to isolate the problems with minimal reproducible cases.
I do not see any reason why a value would be converted to an array. On the other hand, the _1 appended to the value seems related to the duplicate-headers feature and should be fixed by 824bbd9.
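
For context, a minimal sketch of what the duplicate-headers feature is supposed to do on plain string input (the expected output is an assumption based on the _1 suffix convention reported above):

const Papa = require('papaparse');

// Two columns share the name "col"; the feature should rename the second
// occurrence in the header only, never in data rows.
const csv = 'col,col,value\n1,2,3';
const { data } = Papa.parse(csv, { header: true });
console.log(data[0]);
// Expected (assumption): { col: '1', col_1: '2', value: '3' }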

I will need a minimum test scenario in order to be able to fix it.

BTW: We have included some changes on master that are not in the latest release. Could you please test with master?

@rpr-ableton
Author

rpr-ableton commented Apr 19, 2023

The issues are the same whether I'm using master or 5.4.1.

This is the minimal CSV file that reproduces all the errors I mentioned above:

colName_2,colName_3,value_eur
"foo, bar","foo, bar",1234.45678
"foo, bar","foo, bar",-12345.67891
"foo, bar","foo, bar",-12345.67891

This one reproduces both the _1 appending and the incorrect parsing (but not the unexpected array):

colName_2,colName_3,value_eur
"foo, bar","foo, bar",1234.45678
"foo, bar","foo, bar",-12345.67891

I haven't managed to produce anything smaller than this that still triggers the errors. It also seems that the two errors reproduced by the second CSV are tightly connected, as I can't reproduce one without the other.

@tjstavenger-pnnl

tjstavenger-pnnl commented May 2, 2023

Any update on addressing this issue? I've recently run into what sounds like a similar problem, where a CSV gets parsed with _1 appended to some of our values. The same rows with those _1 values have other columns that are read into the wrong fields of the JavaScript object. We're using papaparse 5.4.0. I'll see if I can downgrade to 5.3.2 in the meantime.

@tjstavenger-pnnl

Follow-up: downgrading to papaparse 5.3.2 corrected the problem.

@tjstavenger-pnnl

The _1 value strikes me as similar to data issues found in #985.

@raukaute

I am experiencing the same issue. It only occurred when using an async iterator on the stream; using listeners works without issues.

@ulo

ulo commented May 30, 2023

I also hit some data-consistency issues when parsing CSV files in Node using streaming mode with "header: true" to get objects. While it works fine with version 5.3.2, starting from 5.4.0 papaparse changes the content of my parsed objects.

As there is no issue when using "header: false", I could narrow it down to the following commit: c1cbe16.

My guess is that this code block looks for duplicate headers on each new streaming chunk, not just on the first line of the first chunk, and thereby modifies actual data rows rather than the header...
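
To illustrate the suspected mechanism, here is a hypothetical reconstruction (not PapaParse's actual source) of a duplicate-renaming pass. If such a pass runs on a data row instead of the header row, repeated values get a numeric suffix, matching the stray _1 in the broken output above:

// Hypothetical sketch of a duplicate-header renaming pass.
function renameDuplicates(fields) {
  const counts = {};
  return fields.map((field) => {
    if (field in counts) {
      counts[field] += 1;
      return `${field}_${counts[field]}`; // where the stray "_1" would come from
    }
    counts[field] = 0;
    return field;
  });
}

// Applied to a data row from example.csv rather than the header:
console.log(renameDuplicates(['foo, bar', 'foo, bar', '1234.45678']));
// -> [ 'foo, bar', 'foo, bar_1', '1234.45678' ]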

My solution is to stick with version 5.3.2 for now.

@pokoli
Collaborator

pokoli commented May 30, 2023

@ulo What you describe should be fixed by 824bbd9, which is already released in 5.4.1.

@ulo

ulo commented May 30, 2023

Thanks for the hint @pokoli!
Indeed, it is apparently not an issue of chunks streaming into Papaparse. I usually create a ReadStream using fs, and there the default chunk size of 64 KB is much bigger than my actual table, so only one chunk streams into Papaparse.

However, after looking further into Papaparse's source code, the issue is more likely with streaming the parsed objects out of Papaparse. I found that the central parse loop runs fine for the first 16 rows but then gets paused. It resumes once the destination of my pipeline has consumed the first 16 objects, but upon resuming, baseIndex is still 0, and therefore the header renaming (introduced in version 5.4.0) is applied to the actual data starting from row 17.
The 16 objects correspond to Node's default highWaterMark of 16 for object-mode streams. After the first batch of 16 objects, they seem to be processed one by one in the pipeline.

Here is a minimal working example: I build a pipeline whose first function generates one big string representing a TSV table of 20 rows, with id1 and id2 both set to the row index, and then parse it into objects using papaparse. I consume the pipeline by writing each parsed object to the console:

import { Stream } from "stream"
import * as papa from "papaparse"

async function main() {
  const pipeline = Stream.pipeline(
    // Source: one big string, a 21-line TSV (header + 20 data rows).
    async function* () {
      yield ["id1\tid2", ...Array.from({ length: 20 }, (v, i) => `${i}\t${i}`)].join("\n")
    },
    // Duplex stream that parses the TSV into objects.
    papa.parse(papa.NODE_STREAM_INPUT, { delimiter: "\t", newline: "\n", header: true }),
    (err) => err && console.error(err)
  )

  // Consume the parsed objects one by one.
  for await (const entry of pipeline) {
    console.log(entry)
  }
}
main()

While the example produces the expected output in version 5.3.2, starting from version 5.4.0 the output looks like this:

{ id1: '0', id2: '0' }
{ id1: '1', id2: '1' }
{ id1: '2', id2: '2' }
{ id1: '3', id2: '3' }
{ id1: '4', id2: '4' }
{ id1: '5', id2: '5' }
{ id1: '6', id2: '6' }
{ id1: '7', id2: '7' }
{ id1: '8', id2: '8' }
{ id1: '9', id2: '9' }
{ id1: '10', id2: '10' }
{ id1: '11', id2: '11' }
{ id1: '12', id2: '12' }
{ id1: '13', id2: '13' }
{ id1: '14', id2: '14' }
{ id1: '15', id2: '15' }
{ id1: '16', id2: '16_1' }
{ id1: '', id2: '17' }
{ id1: '18', id2: '18_1' }
{ id1: '', id2: '19' }

@jlebras

jlebras commented Sep 20, 2023

CSV parsing is still broken on Node.js, using v5.4.1 with header: true.
We get the wrong number of rows and invalid row objects filled with "_1", "_2", etc., plus concatenated columns that should be distinct values.
The CSV file used is valid and parses correctly with other CSV tools.

EDIT: The bug appears when using stream mode like this:

import { NODE_STREAM_INPUT, parse } from 'papaparse';

// Some read stream returned from fs.createReadStream
const csvFileReadStream = ...
csvFileReadStream.pipe(parse(NODE_STREAM_INPUT, options))

Using

parse(csvFileReadStream, options)

seems to work fine (though TypeScript complains about the first argument of parse, but that may be some misconfiguration on our side).
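
For completeness, a self-contained sketch of the variant that works for us (hypothetical file path; it passes the ReadStream directly to parse, as in the reporter's original script, instead of piping through NODE_STREAM_INPUT):

const fs = require('fs');
const { parse } = require('papaparse');

// Pass the ReadStream as the first argument instead of piping it into
// parse(NODE_STREAM_INPUT, ...); this code path parses correctly for us.
const stream = fs.createReadStream('./example.csv');
parse(stream, {
  header: true,
  complete: (results) => console.log(results.data),
});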

@noahw3

noahw3 commented Oct 30, 2023

I'm experiencing exactly what @ulo described above. Streaming on 5.4.1 with header: true, the first 16 rows come back fine. The 17th row and beyond are parsed incorrectly.

Converting the stream to a string and parsing that works as expected.
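
A sketch of that workaround, assuming the file fits in memory (hypothetical path):

const fs = require('fs');
const Papa = require('papaparse');

// Read the whole file into a string first, then parse synchronously;
// this avoids the streaming code path where rows 17+ get corrupted.
const csv = fs.readFileSync('./example.csv', 'utf8');
const { data, errors } = Papa.parse(csv, { header: true });
console.log(errors.length ? errors : data);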

winklerj added a commit to blendededge/csv-component-oih that referenced this issue Nov 28, 2023
We were seeing an issue with the CSV being inconsistently parsed incorrectly, which seems to be a regression in papaparse@5.4.1.
Downgrading the version to fix the issue. mholt/PapaParse#998
@basememara

I tried the latest master branch as of now (016effe) and the issue still persists; I had to revert to 5.3.2 :(
