Question: Is it possible to return raw record even for error rows in csv-parse? #292

Open
hemanthreddyk opened this issue Sep 6, 2021 · 5 comments


hemanthreddyk commented Sep 6, 2021

Use case
We want to collect the raw record and the error reason for every faulty row and build a new CSV file from them.
Given this CSV file, end users can refer to the error message, correct the rows, and re-upload the file.

What's possible today?
'skip_lines_with_error: true' -- we are able to gather the row numbers of all the faulty records.
'raw: true' -- gives the raw record only for valid rows

The current implementation emits a 'skip' event as soon as it finds an error.
Is it possible to wait for the iterator to reach the end of the row and then emit the event along with the raw record?
@wdavidw Do you see any other way to achieve this use case with the current implementation?

@hemanthreddyk
Author

I did achieve this by making a few changes in the code.
The idea is to push all the errors into an array, let's say 'recordErrors', which is maintained in the state object.

__error(msg){
  const {skip_lines_with_error} = this.options
  const err = typeof msg === 'string' ? new Error(msg) : msg
  if(skip_lines_with_error){
    this.state.recordHasError = true
    this.emit('skip', err)
    this.state.recordErrors.push(err)   // push the error into state
    return undefined
  }else{
    return err
  }
}

When control reaches the end of the corresponding row, all the collected errors are emitted along with the raw record.

__onRecord() {
     .
     .
     .
    if(this.state.recordHasError === true){
      this.__emitRecordErrors()  // emit an event with aggregated row errors
      this.__resetRecord()
      this.state.recordHasError = false
      return
    }
    .
    .
    .
}
__emitRecordErrors() {
  const { raw, encoding } = this.options

  this.emit('aggregatedRowError', Object.assign(
    { errors: this.state.recordErrors },
    raw === true ? {raw: this.state.rawBuffer.toString(encoding)}: {}
  ))
  this.state.recordErrors = []
}
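
The aggregation idea above can be tried in isolation, without touching the parser. The following is only a minimal sketch built on Node's EventEmitter: the class RowErrorAggregator and its methods consume, reportError and endOfRecord are hypothetical names invented for illustration and are not part of csv-parse. It shows that buffering errors until the end of the row yields a single event carrying the complete raw text.

```javascript
const { EventEmitter } = require('events')

// Hypothetical stand-in for the parser: it models only the error bookkeeping.
class RowErrorAggregator extends EventEmitter {
  constructor() {
    super()
    this.recordErrors = [] // errors seen so far in the current row
    this.rawBuffer = ''    // raw text consumed so far in the current row
  }
  // Called for every piece of raw text consumed in the current row.
  consume(text) {
    this.rawBuffer += text
  }
  // Called each time an error is detected inside the current row.
  reportError(msg) {
    this.recordErrors.push(typeof msg === 'string' ? new Error(msg) : msg)
  }
  // Called once the end of the row is reached.
  endOfRecord() {
    if (this.recordErrors.length > 0) {
      // One event per faulty row, with every error and the full raw text
      this.emit('aggregatedRowError', {
        errors: this.recordErrors,
        raw: this.rawBuffer
      })
    }
    // Reset the per-row state for the next record
    this.recordErrors = []
    this.rawBuffer = ''
  }
}

const events = []
const aggregator = new RowErrorAggregator()
aggregator.on('aggregatedRowError', (info) => events.push(info))

// A row with two errors produces a single event.
aggregator.consume('"Sara",""","')
aggregator.reportError('first error')
aggregator.consume('F"')
aggregator.reportError('second error')
aggregator.endOfRecord()

// A valid row produces no event.
aggregator.consume('Jack,8,M')
aggregator.endOfRecord()

console.log(events.length, events[0].errors.length, events[0].raw)
```

The design choice here is to defer emission to the end of the row, so a listener never sees a half-filled buffer.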

Instead of disturbing the existing skip_lines_with_error functionality, I feel it's better to add a new option altogether so that nothing breaks for existing users.

Please review this, and if you feel this is a valid use case, please consider it as an enhancement.

BTW, I really appreciate such a wonderful library. Thanks!

@wdavidw
Member

wdavidw commented Sep 30, 2021

Could you provide one or more test cases reproducing what you expect? Keep them as simple as possible; it will help me guarantee that I understand the case correctly.

@hemanthreddyk
Author

hemanthreddyk commented Sep 30, 2021

Input File

Name,Id,Gender
Hydrogen,1,M
Helium,2,M,test
"Lithion",3,"F"
"Beryllium",""","F"
"Boron",5,"M"
Carbon",6,"F"

Parser Options

{
  columns: true,
  skip_lines_with_error: true,
  raw: true
}

When skip_lines_with_error is set, the library emits a skip event from which we learn the row number of the faulty record. Our users asked why, instead of just returning row numbers, we could not give them a CSV file containing the row number, the error reason, and the corresponding raw record for every faulty row.
Something like

Output file with faulty rows

Row Number,Error Message,Name,Id,Gender
3,"CSV_RECORD_DONT_MATCH_COLUMNS_LENGTH",Helium,2,M,test
5,"CSV_INVALID_CLOSING_QUOTE -- CSV_RECORD_DONT_MATCH_COLUMNS_LENGTH","Beryllium",""","F"
7,"INVALID_OPENING_QUOTE",Carbon",6,"F"

This would save them from manually going through a huge input file to locate the faulty rows by row number; they could simply correct the raw records by referring to the error reasons and re-upload the file.

To create such a file we need access to the raw record of faulty rows, but the library currently does not provide it.
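
Once the raw record and error codes of each faulty row are available, assembling the report itself is straightforward. The sketch below assumes the caller has already collected entries of the shape { line, codes, raw }; that shape and the helper name buildErrorReport are hypothetical, chosen only for this illustration. The raw record is appended untouched so that users can correct it in place.

```javascript
// Hypothetical helper: builds the report from entries collected by the caller.
function buildErrorReport(header, entries) {
  const lines = [`Row Number,Error Message,${header}`]
  for (const { line, codes, raw } of entries) {
    // Join multiple error codes and quote the resulting message field
    const message = `"${codes.join(' -- ').replace(/"/g, '""')}"`
    // The raw record is appended as-is, including its original quoting
    lines.push(`${line},${message},${raw}`)
  }
  return lines.join('\n')
}

const report = buildErrorReport('Name,Id,Gender', [
  { line: 3, codes: ['CSV_RECORD_DONT_MATCH_COLUMNS_LENGTH'], raw: 'Helium,2,M,test' },
  { line: 7, codes: ['INVALID_OPENING_QUOTE'], raw: 'Carbon",6,"F"' }
])
console.log(report)
```

Because faulty raw records may themselves be malformed CSV, the resulting file is meant for human correction rather than for strict re-parsing.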

@wdavidw
Member

wdavidw commented Oct 21, 2021

It will be shipped with the next major version which is being prepared.

@hemanthreddyk
Author

@wdavidw Thank you very much for releasing this feature.
However, this only solves half of my use case.
As I mentioned above, the current implementation emits a 'skip' event as soon as it finds an error.
If there are multiple errors in the same row, multiple skip events are emitted, each with a different raw buffer.

const { parse } = require('csv-parse')

const parser = parse({
  columns: true,
  skip_records_with_error: true,
  raw: true
})
parser.on( 'skip', function(error, raw){
  console.log(error.lines, error.raw)
})
parser.write(`
Name,Id,Gender
"Sara",""","F"
Jack",8,"M"
`.trim())
parser.end()

If you run the above snippet, the output looks like

2 "Sara",""","
2 "Sara",""","F"

3 Jack"

As you can see in the output, the raw record for row 3 is only partial.

The idea I propose is to collect all the errors in a given row and emit a new event, let's say 'aggregatedRowError', at the end of the row.
I added a snippet in my comment above; please refer to it for more detail.

With this change, the complete raw buffer of the entire row would be available at the time the event is emitted.
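
Until such an event exists, a partial user-land approximation is to group the skip notifications by line number and keep the longest raw buffer seen for each line. This is only a sketch: the helper groupSkipsByLine is hypothetical, the input objects mimic the `lines` and `raw` values printed in the output above, and the messages are placeholders. It merges the duplicate notifications for row 2 but cannot recover the missing tail of row 3, which is why an end-of-row event would still be needed.

```javascript
// Hypothetical helper: merges the per-error skip notifications of one row.
function groupSkipsByLine(skips) {
  const rows = new Map()
  for (const { lines, raw, message } of skips) {
    const row = rows.get(lines) || { messages: [], raw: '' }
    row.messages.push(message)
    // Later notifications carry a longer buffer; keep the longest one seen
    if (raw.length > row.raw.length) row.raw = raw
    rows.set(lines, row)
  }
  return rows
}

// `lines` and `raw` are copied from the output shown above;
// the messages are placeholders, not real csv-parse messages.
const grouped = groupSkipsByLine([
  { lines: 2, raw: '"Sara",""","', message: 'first error' },
  { lines: 2, raw: '"Sara",""","F"', message: 'second error' },
  { lines: 3, raw: 'Jack"', message: 'third error' }
])
console.log(grouped)
```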

@wdavidw wdavidw reopened this Nov 24, 2021