Issues with ignore_last_delimiters #330

SystemParadox · 2022-03-22T22:52:10Z

Describe the bug

ignore_last_delimiters: true seems to have some issues:

If you don't use relax_quotes: true then it fails with CsvError$1: Invalid Closing Quote: got "," at line 4 instead of delimiter, record delimiter, trimable character (if activated) or comment
If you use trim: true or rtrim: true then it fails with CsvError$1: Invalid Closing Quote: found non trimable byte after quote at line 4
If you avoid the two issues above then the final string ends up with extra quotes and the final delimiter, e.g. "'123123-12312312", (the double quotes are part of the string)

To Reproduce

const fs = require('fs');
const csv = require('csv');

const INPUT = `
Date, Type, Description, Value, Balance, Account Name, Account Number

11/02/2022,BAC,"'FOO",10.00,33432.80,"'REF","'123123-12312312",
`;

const stream = csv.parse(INPUT, {
    skip_empty_lines: true,
    ltrim: true,
    rtrim: true,
    columns: true,
    relax_quotes: true,
    ignore_last_delimiters: true,
});

Additional context

node version: 16.14.0
csv version: 6.0.5

The text was updated successfully, but these errors were encountered:

wdavidw · 2022-03-23T08:39:29Z

Looking at your input, it seems like a combinaison of skip_empty_lines and relax_column_count will do, see my sample code:

const {parse} = require('csv/sync');

const input = `
Date, Type, Description, Value, Balance, Account Name, Account Number

11/02/2022,BAC,"'FOO",10.00,33432.80,"'REF","'123123-12312312",
`;

const output = parse(input, {
  skip_empty_lines: true,
  relax_column_count: true
});

SystemParadox · 2022-03-23T13:03:25Z

Thanks, that's a better workaround as it avoids issue (3).

SystemParadox · 2022-03-23T23:49:25Z

Wait, I don't think this should be closed, it's just a workaround. As documented, ignore_last_delimiters is the correct solution and it's still bugged in various ways.

wdavidw · 2022-04-05T07:39:44Z

I confirm that we don't handle yet quotes with ignore_last_delimiters and it generates an unexpected error. Looking at how to implement this, there are divergent ways to handle it, does a,"b",c,"d" leads to ["a","b,c,d"] or ["a","b\",c,\"d"]. I am not yet confident which one is the best, here are my thoughts:

First one is already handled by relax_column_count, that what it is made for. Agreed that it generates multiple fields and not one, ["a","b","c","d"] instead of ["a","b,c,d"] but this is easily handled post-parsing.
The ignore_last_delimiters is here to support SSA, no idea how/if SSA support quotes but if it does, we shall probably handle quotes the same way. However, I haven't find a tool to edit/generate SSA files yet.

SystemParadox · 2022-04-05T17:08:36Z

The confusion/complication here seems to be trying to use ignore_last_delimiters to support this SSA subtitles case. If I say "ignore last delimiters", I expect them to be ignored, not included as part of the last column.

I would strongly suggest that this behaviour is moved to a separate option, or as a string option to ignore_last_delimiters like this:

ignore_last_delimiters: false (default behaviour)
ignore_last_delimiters: true (ignore any trailing delimiters beyond the expected column count)
ignore_last_delimiters: 'include-in-last-column'

Additionally, I should point out that relax_column_count is not desirable for this case - I want a strict column count as regards to data, but I just want to ignore any trailing delimiters. If there is data past the trailing delimiter then it's not a trailing delimiter it's an extra column and I'm expecting an invalid column count error.

SystemParadox · 2022-04-05T17:12:40Z

Thinking about this a bit further, I would suggest deprecating ignore_last_delimiters and creating these two options:

ignore_trailing_delimiters (to strip off trailing delimiters)
merge_extra_columns (to support the SSA use case)

wdavidw · 2022-04-05T20:00:56Z

Speaking about ignore_trailing_delimiters:

what you are suggesting is a,"b",c,"d" yields to ["a","b,c,d"], right ?
If so, I am concerned about how to handle a,"b""c","d,e". The first field contains escaped quotes, the second one contains a field delimiter.

SystemParadox · 2022-04-05T20:47:04Z

No that's exactly what ignore_trailing_delimiters shouldn't do. Trailing delimeters are trailing delimiters only. a,"b",c,"d" doesn't have any trailing delimiters because all delimiters are followed by other characters. Trailing delimiters means a,b,c,d, or a,b,c,d,,,,,.

This option should never affect the parsing/escaping of quotes and it should never cause data to be ignored. It only affects how many empty string values are produced by trailing commas at the end of a row.

a,"b""c","d,e" is ['a','b"c','d,e'] and should always be so unless you've changed the quote characters or disabled escaped quotes. In my opinion there shouldn't be any other options that change this behaviour. It's confusing, it's ambiguous and I sure hope nobody ever needs or wants this - even the SSA parsing can be done in a much simpler and safer way than this.

To be clear, this is what ignore_trailing_delimiters should do:

Example 1

// input
A,B,C,D
a,b,c,d,
// output
{ A: 'a', B: 'b', C: 'c', D: 'd' }

Example 2 (any number of extra trailing delimiters are ignored)

// input
A,B,C,D
a,b,c,d,,,,,,,
// output
{ A: 'a', B: 'b', C: 'c', D: 'd' }

Example 3 (ignore trailing delimiters in the column definition)

// input
A,B,C,D,
a,b,c,d
// output
{ A: 'a', B: 'b', C: 'c', D: 'd' }

Example 4 (too many columns should still be an error)

// input
A,B,C,D
a,b,c,d,e
// output
Error: Row 1: Expected 4 columns, got 5

Example 5 (keep trailing delimiters up to the expected number of columns)

// input
A,B,C,D
a,b,c,
// output
{ A: 'a', B: 'b', C: 'c', D: '' }

Example 6 (too few columns should still be an error)

// input
A,B,C,D
a,b,c
// output
Error: Row 1: Expected 4 columns, got 3

wdavidw · 2022-04-06T07:09:49Z

OK, this is much more clear.

wdavidw closed this as completed Mar 23, 2022

wdavidw reopened this Mar 24, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issues with ignore_last_delimiters #330

Issues with ignore_last_delimiters #330

SystemParadox commented Mar 22, 2022 •

edited

wdavidw commented Mar 23, 2022

SystemParadox commented Mar 23, 2022

SystemParadox commented Mar 23, 2022

wdavidw commented Apr 5, 2022

SystemParadox commented Apr 5, 2022

SystemParadox commented Apr 5, 2022

wdavidw commented Apr 5, 2022

SystemParadox commented Apr 5, 2022

wdavidw commented Apr 6, 2022

Issues with ignore_last_delimiters #330

Issues with ignore_last_delimiters #330

Comments

SystemParadox commented Mar 22, 2022 • edited

wdavidw commented Mar 23, 2022

SystemParadox commented Mar 23, 2022

SystemParadox commented Mar 23, 2022

wdavidw commented Apr 5, 2022

SystemParadox commented Apr 5, 2022

SystemParadox commented Apr 5, 2022

wdavidw commented Apr 5, 2022

SystemParadox commented Apr 5, 2022

Example 1

Example 2 (any number of extra trailing delimiters are ignored)

Example 3 (ignore trailing delimiters in the column definition)

Example 4 (too many columns should still be an error)

Example 5 (keep trailing delimiters up to the expected number of columns)

Example 6 (too few columns should still be an error)

wdavidw commented Apr 6, 2022

SystemParadox commented Mar 22, 2022 •

edited