
CSV parsing is broken in 5.4.1 #998

Open

rpr-ableton opened this issue Apr 19, 2023 · 14 comments

@rpr-ableton
Hi everyone,

First of all, big thanks for this great library that's been a big help in my work 🙌

I noticed when going from 5.3.2 to 5.4.1 that some CSV files can't be parsed successfully anymore. I'll attach below a small CSV, a Node.js script that reproduces the issue, and the output.

example.csv:

colName_2,colName_3,another_col,entity,some_bool,period,date,value_eur
"foo, bar","foo, bar",foo_bar_baz_V1,bar baz,0,2022_02,02/01/22,1234.45678
"foo, bar","foo, bar",foo_bar_baz_V1,bar baz,0,2022_03,03/01/22,1234.45678
"foo, bar","foo, bar",foo_bar_baz_V2,bar baz,0,2022_11,11/01/22,-12345.67891
"foo, bar","foo, bar",foo_bar_baz_V2,bar baz,0,2022_12,12/01/22,-12345.67891

script.js:

const papa = require('papaparse');
const fs = require('fs');

const readStream = fs.createReadStream(`${__dirname}/example.csv`);
papa.parse(readStream, {
    header: true,
    step: (res, parser) => {
        // Pause on each row, print its values in sorted key order, then resume.
        parser.pause();
        Object.keys(res.data)
            .sort()
            .forEach((key) => {
                console.log(res.data[key]);
            });
        parser.resume();
    },
    error: (err) => {
        console.error(err);
    },
    complete: () => {
        console.log('finished');
    },
});

Output with papaparse 5.4.1:

$ node script.js 
foo_bar_baz_V1
foo, bar
foo, bar
02/01/22
bar baz
2022_02
0
1234.45678
[ '-12345.67891' ]
foo, bar
foo, bar
foo_1, bar"_1,foo_bar_baz_V1,bar baz,0,2022_03,03/01/22,1234.45678
"foo, bar
2022_11
foo_bar_baz_V2
0
bar baz
11/01/22
foo, bar"_1,foo_bar_baz_V2,bar baz,0,2022_12,12/01/22,-12345.67891


 bar"
finished

Let me know if you need any additional information to better capture the issue.

Thanks a bunch.

@pokoli
Collaborator

pokoli commented Apr 19, 2023

Hi @rpr-ableton

Could you please describe what the issue is?

@rpr-ableton
Author

The symptoms are:

  • some values from the CSV are parsed as arrays containing a single value
  • some values get _1 appended to them for no apparent reason
  • the CSV file is not parsed correctly, as you can see in the output where multiple values get concatenated together

Maybe to illustrate the point better, here's the output from the same script but using papaparse 5.3.2:

$ node script.js 
foo_bar_baz_V1
foo, bar
foo, bar
02/01/22
bar baz
2022_02
0
1234.45678
foo_bar_baz_V1
foo, bar
foo, bar
03/01/22
bar baz
2022_03
0
1234.45678
foo_bar_baz_V2
foo, bar
foo, bar
11/01/22
bar baz
2022_11
0
-12345.67891
foo_bar_baz_V2
foo, bar
foo, bar
12/01/22
bar baz
2022_12
0
-12345.67891
finished

@pokoli
Collaborator

pokoli commented Apr 19, 2023

Hi @rpr-ableton,

We need to isolate the problems with minimal reproducible cases.
I do not see any reason why a value would be converted to an array. On the other hand, the _1 appended to the value seems related to the duplicate-headers feature and should be fixed by 824bbd9.
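
For context, a minimal sketch of what the duplicate-headers feature is supposed to do on plain string input (the expected output is an assumption based on the _1 suffix convention reported above):

const Papa = require('papaparse');

// Two columns share the name "col"; the feature should rename the second
// occurrence in the header only, never in data rows.
const csv = 'col,col,value\n1,2,3';
const { data } = Papa.parse(csv, { header: true });
console.log(data[0]);
// Expected (assumption): { col: '1', col_1: '2', value: '3' }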

I will need a minimum test scenario in order to be able to fix it.

BTW: We have included some changes on master that are not in the latest release. Could you please test with master?

@rpr-ableton
Author

rpr-ableton commented Apr 19, 2023

The issues are the same whether I'm using master or 5.4.1.

This is the minimal CSV file that reproduces all the errors I mentioned above:

colName_2,colName_3,value_eur
"foo, bar","foo, bar",1234.45678
"foo, bar","foo, bar",-12345.67891
"foo, bar","foo, bar",-12345.67891

This one reproduces both the _1 appending and the incorrect parsing (but not the unexpected array):

colName_2,colName_3,value_eur
"foo, bar","foo, bar",1234.45678
"foo, bar","foo, bar",-12345.67891

I haven't managed to produce anything smaller than this that still triggers the errors. It also seems that the two errors reproduced by the second CSV are tightly connected, as I can't reproduce one without the other.

@tjstavenger-pnnl

tjstavenger-pnnl commented May 2, 2023

Any update on addressing this issue? I've recently run into what sounds like a similar problem, where a CSV gets parsed with _1 appended to some of our values. The same rows with those _1 values have other columns that are read into the wrong fields of the JavaScript object. We're using papaparse 5.4.0. I'll see if I can downgrade to 5.3.2 in the meantime.

@tjstavenger-pnnl

Follow-up: downgrading to papaparse 5.3.2 corrected the problem.

@tjstavenger-pnnl

The _1 value strikes me as similar to data issues found in #985.

@raukaute

I am experiencing the same issue. It only occurred when using an async iterator on the stream; using listeners works without issues.

@ulo

ulo commented May 30, 2023

I also hit some data-consistency issues when parsing CSV files in Node using streaming mode with "header: true" to get objects. While it works fine with version 5.3.2, starting from 5.4.0 papaparse changes the content of my parsed objects.

As there is no issue when using "header: false", I could narrow it down to the following commit: c1cbe16.

My guess is that this code block looks for duplicate headers on each new streaming chunk, not just on the first line of the first chunk, and thereby modifies actual data rows rather than the header...
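
To illustrate the suspected mechanism, here is a hypothetical reconstruction (not PapaParse's actual source) of a duplicate-renaming pass. If such a pass runs on a data row instead of the header row, repeated values get a numeric suffix, matching the stray _1 in the broken output above:

// Hypothetical sketch of a duplicate-header renaming pass.
function renameDuplicates(fields) {
  const counts = {};
  return fields.map((field) => {
    if (field in counts) {
      counts[field] += 1;
      return `${field}_${counts[field]}`; // where the stray "_1" would come from
    }
    counts[field] = 0;
    return field;
  });
}

// Applied to a data row from example.csv rather than the header:
console.log(renameDuplicates(['foo, bar', 'foo, bar', '1234.45678']));
// -> [ 'foo, bar', 'foo, bar_1', '1234.45678' ]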

My solution is to stick with version 5.3.2 for now.

@pokoli
Collaborator

pokoli commented May 30, 2023

@ulo What you describe should be fixed by 824bbd9, which is already released in 5.4.1.

@ulo

ulo commented May 30, 2023

Thanks for the hint @pokoli!
Indeed, it is apparently not an issue of chunks streaming into Papaparse. I usually create a ReadStream using fs, and there the default chunk size of 64 KB is much bigger than my actual table, so only one chunk streams into Papaparse.

However, after looking further into Papaparse's source code, the issue is more likely with streaming the parsed objects out of Papaparse. I found that the central parse loop runs fine for the first 16 rows but then gets paused. It resumes once the destination of my pipeline has consumed the first 16 objects, but upon resuming, baseIndex is still 0, and therefore the header renaming (introduced in version 5.4.0) is applied to the actual data starting from row 17.
The 16 objects correspond to Node's default highWaterMark of 16 for object-mode streams. After the first batch of 16 objects, they seem to be processed one by one in the pipeline.

Here is a minimal working example: I build a pipeline whose first function generates one big string representing a TSV table of 20 rows, with id1 and id2 both set to the row index, and then parse it into objects using papaparse. I consume the pipeline by writing each parsed object to the console:

import { Stream } from "stream"
import * as papa from "papaparse"

async function main() {
  const pipeline = Stream.pipeline(
    // Source: one big string, a 21-line TSV (header + 20 data rows).
    async function* () {
      yield ["id1\tid2", ...Array.from({ length: 20 }, (v, i) => `${i}\t${i}`)].join("\n")
    },
    // Duplex stream that parses the TSV into objects.
    papa.parse(papa.NODE_STREAM_INPUT, { delimiter: "\t", newline: "\n", header: true }),
    (err) => err && console.error(err)
  )

  // Consume the parsed objects one by one.
  for await (const entry of pipeline) {
    console.log(entry)
  }
}
main()

While the example produces the expected output in version 5.3.2, starting from version 5.4.0 the output looks like this:

{ id1: '0', id2: '0' }
{ id1: '1', id2: '1' }
{ id1: '2', id2: '2' }
{ id1: '3', id2: '3' }
{ id1: '4', id2: '4' }
{ id1: '5', id2: '5' }
{ id1: '6', id2: '6' }
{ id1: '7', id2: '7' }
{ id1: '8', id2: '8' }
{ id1: '9', id2: '9' }
{ id1: '10', id2: '10' }
{ id1: '11', id2: '11' }
{ id1: '12', id2: '12' }
{ id1: '13', id2: '13' }
{ id1: '14', id2: '14' }
{ id1: '15', id2: '15' }
{ id1: '16', id2: '16_1' }
{ id1: '', id2: '17' }
{ id1: '18', id2: '18_1' }
{ id1: '', id2: '19' }

@jlebras

jlebras commented Sep 20, 2023

CSV parsing is still broken on Node.js, using v5.4.1 with header: true.
We get the wrong number of rows and invalid row objects filled with "_1", "_2", etc., plus concatenated columns that should be distinct values.
The CSV file used is valid and parses correctly with other CSV tools.

EDIT: The bug appears when using stream mode like this:

import { NODE_STREAM_INPUT, parse } from 'papaparse';

// Some read stream returned from fs.createReadStream
const csvFileReadStream = ...
csvFileReadStream.pipe(parse(NODE_STREAM_INPUT, options))

Using

parse(csvFileReadStream, options)

seems to work fine (though TypeScript complains about the first argument of parse, but that may be some misconfiguration on our side).
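
For completeness, a self-contained sketch of the variant that works for us (hypothetical file path; it passes the ReadStream directly to parse, as in the reporter's original script, instead of piping through NODE_STREAM_INPUT):

const fs = require('fs');
const { parse } = require('papaparse');

// Pass the ReadStream as the first argument instead of piping it into
// parse(NODE_STREAM_INPUT, ...); this code path parses correctly for us.
const stream = fs.createReadStream('./example.csv');
parse(stream, {
  header: true,
  complete: (results) => console.log(results.data),
});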

@noahw3

noahw3 commented Oct 30, 2023

I'm experiencing exactly what @ulo described above. Streaming on 5.4.1 with header: true, the first 16 rows come back fine. The 17th row and beyond are parsed incorrectly.

Converting the stream to a string and parsing that works as expected.
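
A sketch of that workaround, assuming the file fits in memory (hypothetical path):

const fs = require('fs');
const Papa = require('papaparse');

// Read the whole file into a string first, then parse synchronously;
// this avoids the streaming code path where rows 17+ get corrupted.
const csv = fs.readFileSync('./example.csv', 'utf8');
const { data, errors } = Papa.parse(csv, { header: true });
console.log(errors.length ? errors : data);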

winklerj added a commit to blendededge/csv-component-oih that referenced this issue Nov 28, 2023
We were seeing an issue with the CSV being inconsistently parsed incorrectly, which seems to be a regression in papaparse@5.4.1.
Downgrading the version to fix the issue. mholt/PapaParse#998
@basememara

I tried the latest master branch as of now (016effe) and the issue still persists; I had to revert to 5.3.2 :(
