Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issues with ignore_last_delimiters #330

Open
SystemParadox opened this issue Mar 22, 2022 · 9 comments
Open

Issues with ignore_last_delimiters #330

SystemParadox opened this issue Mar 22, 2022 · 9 comments

Comments

@SystemParadox
Copy link

SystemParadox commented Mar 22, 2022

Describe the bug

ignore_last_delimiters: true seems to have some issues:

  1. If you don't use relax_quotes: true then it fails with CsvError$1: Invalid Closing Quote: got "," at line 4 instead of delimiter, record delimiter, trimable character (if activated) or comment
  2. If you use trim: true or rtrim: true then it fails with CsvError$1: Invalid Closing Quote: found non trimable byte after quote at line 4
  3. If you avoid the two issues above then the final string ends up with extra quotes and the final delimiter, e.g. "'123123-12312312", (the double quotes are part of the string)

To Reproduce

const fs = require('fs');
const csv = require('csv');

const INPUT = `
Date, Type, Description, Value, Balance, Account Name, Account Number

11/02/2022,BAC,"'FOO",10.00,33432.80,"'REF","'123123-12312312",
`;

const stream = csv.parse(INPUT, {
    skip_empty_lines: true,
    ltrim: true,
    rtrim: true,
    columns: true,
    relax_quotes: true,
    ignore_last_delimiters: true,
});

Additional context

  • node version: 16.14.0
  • csv version: 6.0.5
@wdavidw
Copy link
Member

wdavidw commented Mar 23, 2022

Looking at your input, it seems like a combinaison of skip_empty_lines and relax_column_count will do, see my sample code:

const {parse} = require('csv/sync');

const input = `
Date, Type, Description, Value, Balance, Account Name, Account Number

11/02/2022,BAC,"'FOO",10.00,33432.80,"'REF","'123123-12312312",
`;

const output = parse(input, {
  skip_empty_lines: true,
  relax_column_count: true
});

@SystemParadox
Copy link
Author

Thanks, that's a better workaround as it avoids issue (3).

@wdavidw wdavidw closed this as completed Mar 23, 2022
@SystemParadox
Copy link
Author

Wait, I don't think this should be closed, it's just a workaround. As documented, ignore_last_delimiters is the correct solution and it's still bugged in various ways.

@wdavidw wdavidw reopened this Mar 24, 2022
@wdavidw
Copy link
Member

wdavidw commented Apr 5, 2022

I confirm that we don't handle yet quotes with ignore_last_delimiters and it generates an unexpected error. Looking at how to implement this, there are divergent ways to handle it, does a,"b",c,"d" leads to ["a","b,c,d"] or ["a","b\",c,\"d"]. I am not yet confident which one is the best, here are my thoughts:

  • First one is already handled by relax_column_count, that what it is made for. Agreed that it generates multiple fields and not one, ["a","b","c","d"] instead of ["a","b,c,d"] but this is easily handled post-parsing.
  • The ignore_last_delimiters is here to support SSA, no idea how/if SSA support quotes but if it does, we shall probably handle quotes the same way. However, I haven't find a tool to edit/generate SSA files yet.

@SystemParadox
Copy link
Author

The confusion/complication here seems to be trying to use ignore_last_delimiters to support this SSA subtitles case. If I say "ignore last delimiters", I expect them to be ignored, not included as part of the last column.

I would strongly suggest that this behaviour is moved to a separate option, or as a string option to ignore_last_delimiters like this:

  • ignore_last_delimiters: false (default behaviour)
  • ignore_last_delimiters: true (ignore any trailing delimiters beyond the expected column count)
  • ignore_last_delimiters: 'include-in-last-column'

Additionally, I should point out that relax_column_count is not desirable for this case - I want a strict column count as regards to data, but I just want to ignore any trailing delimiters. If there is data past the trailing delimiter then it's not a trailing delimiter it's an extra column and I'm expecting an invalid column count error.

@SystemParadox
Copy link
Author

Thinking about this a bit further, I would suggest deprecating ignore_last_delimiters and creating these two options:

  • ignore_trailing_delimiters (to strip off trailing delimiters)
  • merge_extra_columns (to support the SSA use case)

@wdavidw
Copy link
Member

wdavidw commented Apr 5, 2022

Speaking about ignore_trailing_delimiters:

  1. what you are suggesting is a,"b",c,"d" yields to ["a","b,c,d"], right ?
  2. If so, I am concerned about how to handle a,"b""c","d,e". The first field contains escaped quotes, the second one contains a field delimiter.

@SystemParadox
Copy link
Author

No that's exactly what ignore_trailing_delimiters shouldn't do. Trailing delimeters are trailing delimiters only. a,"b",c,"d" doesn't have any trailing delimiters because all delimiters are followed by other characters. Trailing delimiters means a,b,c,d, or a,b,c,d,,,,,.

This option should never affect the parsing/escaping of quotes and it should never cause data to be ignored. It only affects how many empty string values are produced by trailing commas at the end of a row.

a,"b""c","d,e" is ['a','b"c','d,e'] and should always be so unless you've changed the quote characters or disabled escaped quotes. In my opinion there shouldn't be any other options that change this behaviour. It's confusing, it's ambiguous and I sure hope nobody ever needs or wants this - even the SSA parsing can be done in a much simpler and safer way than this.

To be clear, this is what ignore_trailing_delimiters should do:

Example 1

// input
A,B,C,D
a,b,c,d,
// output
{ A: 'a', B: 'b', C: 'c', D: 'd' }

Example 2 (any number of extra trailing delimiters are ignored)

// input
A,B,C,D
a,b,c,d,,,,,,,
// output
{ A: 'a', B: 'b', C: 'c', D: 'd' }

Example 3 (ignore trailing delimiters in the column definition)

// input
A,B,C,D,
a,b,c,d
// output
{ A: 'a', B: 'b', C: 'c', D: 'd' }

Example 4 (too many columns should still be an error)

// input
A,B,C,D
a,b,c,d,e
// output
Error: Row 1: Expected 4 columns, got 5

Example 5 (keep trailing delimiters up to the expected number of columns)

// input
A,B,C,D
a,b,c,
// output
{ A: 'a', B: 'b', C: 'c', D: '' }

Example 6 (too few columns should still be an error)

// input
A,B,C,D
a,b,c
// output
Error: Row 1: Expected 4 columns, got 3

@wdavidw
Copy link
Member

wdavidw commented Apr 6, 2022

OK, this is much more clear.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants