Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Skip lines future (Feature Suggestion) #738 #1021

Merged
merged 1 commit into from Oct 9, 2023

Conversation

bhuvaneshwararaja
Copy link
Contributor

@bhuvaneshwararaja bhuvaneshwararaja commented Sep 24, 2023

Issues No : #738

Description:

In this feature enhancement, Added the capability to skip a defined number of lines when converting a CSV (Comma-Separated Values) file to JSON format. This functionality offers users more flexibility when dealing with CSV files with header information or metadata at the beginning, allowing for a more efficient and precise conversion process.

Changes Made:

  1. Introduced a new configuration parameter, skipFirstNLines, which allows users to specify the number of lines to skip at the beginning of the CSV file during the conversion to JSON. This parameter supports values from 0 and above.
  2. Implemented validation for the skipFirstNLines configuration parameter to ensure that it does not negatively impact the parsing process, even if it is set to 0 or given a negative value.
  3. Enhanced the user interface of the player.html file by adding an input field that allows users to input the desired value for skipFirstNLines.
  4. Updated the player.js script to retrieve and utilize the skipFirstNLines input from the user and incorporate it into the parser's configuration.

Test cases:

Included test cases to validate the functionality of the skipFirstNLines feature, ensuring that it behaves correctly under various scenarios.

Test 1: "Skip First N number of lines , with header",

Test 2: "Skip First N number of lines , without header",

Test 3: "Skip First N lines with value as 0 with header false",

Test 4: "Skip First N lines with value as 0 with header true",

Test 5: "Skip First N lines with value less then 0, with header true",

Test 6: "Without Skip First N Lines config",

@pokoli
Copy link
Collaborator

pokoli commented Sep 25, 2023

What you are proposing is already implemented using the preview config option.

@pokoli pokoli closed this Sep 25, 2023
@bhuvaneshwararaja
Copy link
Contributor Author

Hi @pokoli ,
I have gone through the documentation (https://www.papaparse.com/docs) for the preview options
and got to know that will help to parse the expected number of rows in csv
example: If we have 20 rows in csv, If the preview value is 10 . Out of 20 only 10 rows will be parsed .

But this PR will add a config to skip the N number of rows in the beginning of the csv.
Example: If we have 20 rows in csv , If skipNLines value is 10 . Out of 20 , first 10 rows will be skipped .

Please check it once again , it is relevant to the issue #738 as well.

@pokoli pokoli reopened this Sep 25, 2023
@pokoli
Copy link
Collaborator

pokoli commented Sep 25, 2023

Ok, then we need to document this new flag to explain which is the ussage.

I think the flag will be better named as skipFirstLines.

Could you have a look at the test suite? it seems you introduced some failing tests.

@bhuvaneshwararaja
Copy link
Contributor Author

Hi @pokoli , Thanks
Sure i will check those test suite , I will update the PR with doc

@bhuvaneshwararaja
Copy link
Contributor Author

Hi @pokoli , I have updated the PR

papaparse.js Outdated Show resolved Hide resolved
tests/test-cases.js Outdated Show resolved Hide resolved
tests/test-cases.js Outdated Show resolved Hide resolved
papaparse.js Outdated Show resolved Hide resolved
papaparse.js Show resolved Hide resolved
@pokoli pokoli merged commit e2d570e into mholt:master Oct 9, 2023
3 checks passed
@janisdd
Copy link
Contributor

janisdd commented Oct 10, 2023

Just a small note: This does not work for quoted fields...

1,"B\nB",C,A,"B\nB",C,A,"B\nB",C,A,"B\nB",C
2,"B\nB",C,A,"B\nB",C,A,"B\nB",C,A,"B\nB",C
3,"B\nB",C,A,"B\nB",C,A,"B\nB",C,A,"B\nB",C
4,"B\nB",C,A,"B\nB",C,A,"B\nB",C,A,"B\nB",C
5,"B\nB",C,A,"B\nB",C,A,"B\nB",C,A,"B\nB",C

5 lines with 12 columns

Papa.parse('1,"B\nB",C,A,"B\nB",C,A,"B\nB",C,A,"B\nB",C\n2,"B\nB",C,A,"B\nB",C,A,"B\nB",C,A,"B\nB",C\n3,"B\nB",C,A,"B\nB",C,A,"B\nB",C,A,"B\nB",C\n4,"B\nB",C,A,"B\nB",C,A,"B\nB",C,A,"B\nB",C\n5,"B\nB",C,A,"B\nB",C,A,"B\nB",C,A,"B\nB",C', 
{skipFirstNLines: 5})

skips the "real" first line (5 because \n after each line).

@pokoli
Copy link
Collaborator

pokoli commented Oct 10, 2023

@janisdd It will be great if you can fill an issue/merge request to make this work with quotes also.

@janisdd
Copy link
Contributor

janisdd commented Oct 10, 2023

Well, no... I don't think this feature should be added to the library. Here are my thoughts:

  • it is not clear how it interacts with other settings (e.g. header, skipEmptyLines, to be fair, the interaction of other options is not good documented either)
  • can be solved by 1 line of code (see issue) and also works for quoted fields (via parsing everything)
  • even if implemented to work with quoted fields, one has to parse the first N lines and then throw them away
    • this is because we only know it's a quoted field when fully parsed
    • so, skipping is essentially just parsing + discarding -> can be done by the user (in 1 line)

However, here are some notes if someone wants to implement it anyway:

  • have a look at how the setting preview is implemented
  • ... essentially just not add the row in
    function pushRow(row)
    as long as some skip counter is too small

I'm not sure if this would also work for streaming/chunking (or whatever, I just use the normal parsing). But as far as i know there is only one core logic/function for the actual parsing... which means this should also work for streaming

@jkruke
Copy link

jkruke commented Mar 6, 2024

I think this feature hasn't been implemented as wanted in the referenced issue #738 !
The example CSV was:

This is a data file generated by some old software.
Next line will contain a headers of parameters.
Temperature, Humidity, Voltage
22.5, 45.5, 220
23.0, 44.0, 219

Expected output should be (to be fair, it wasn't precisely specified):

Temperature, Humidity, Voltage
22.5, 45.5, 220
23.0, 44.0, 219

But the actual output with skipFirstNLines=2 is (according to the test cases):

This is a data file generated by some old software.
23.0

Analogue to the following test case in the code:
https://github.com/mholt/PapaParse/pull/1021/files#diff-e0ce8cb4901057c1880bee545909a64f38c7383b4d41982d6f2db9a8ec81eac7R1588

{
		description: "Skip First N number of lines , with header and 3 rows",
		input: 'a,b,c,d\n1,2,3,4\n4,5,6,7',
		config: { header: true, skipFirstNLines: 1 },
		expected: {
			data: [{a: '4', b: '5', c: '6', d: '7'}],
			errors: []
		}
	}

This test case is not realistic because it does not reflect a CSV with some preamble lines to be skipped.
should be rather: input: 'to-be-ignored\na,b,c,d\n1,2,3,4', and expected.data: [{a: '1', b: '2', c: '3', d: '4'}].

If someone wanted to skip rows of the record set, it should not be done during the parsing phase, but later when working with the data sets by doing some simple postprocessing such as data = data.slice(1).

@bhuvaneshwararaja, @pokoli could you please take a look at it and give me some feedback about my thoughts? :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants