CSVLint's Provided Feedback
In a ticket Sumica asked for the following: 'a guide on how to publish open data at the ODI'
There is a need for technical guidance on publishing the data, including:
* using GitHub for hosting the data
* using CSVLint to ensure that the CSV file is clean
* getting an Open Data Certificate as part of the publishing process
This page of rough notes takes 'using CSVLint to ensure that the CSV file is clean' as its aim and explores how the errors and warnings reported in the current dialect check loop could be improved.
At present the CSVlint FAQ/help states the following:
> That won't fix all the problems: we won't delete empty lines or try to fix up values that are in the wrong format. We can't change the way your server provides CSV either, so you'll still be warned if it's not using the right Content-Type header.
SF feels that 'dialect check' is nomenclature that might be off-putting for the lay user.
The dialect check loop works both for CSV files referenced by hyperlink (URI) and for CSV files uploaded by a user.
| | Errors | Warnings | Messages |
|---|---|---|---|
| Structure | | | |
| Schema | | | |
| Context | | | |
If there are structure errors then the dialect validation loop must continue.
Need a list of the structure errors and warnings that a dialect can FIX!
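The grid above could be filled in by tallying messages per category and level. A minimal sketch, assuming each message carries a level, a category (structure, schema or context) and a type. The `Message` struct and both helpers are hypothetical stand-ins written for these notes, not csvlint.rb's own classes, and the categories given to the sample messages are illustrative.

```ruby
# Hypothetical stand-in for a validation message; not csvlint.rb's own class.
Message = Struct.new(:level, :category, :type)

CATEGORIES = [:structure, :schema, :context].freeze
LEVELS = [:error, :warning, :info].freeze

# Build a category => level => count grid, with zeroes for empty cells.
def tally_feedback(messages)
  grid = CATEGORIES.to_h { |c| [c, LEVELS.to_h { |l| [l, 0] }] }
  messages.each { |m| grid[m.category][m.level] += 1 }
  grid
end

# Per the note above: the loop has to continue while a structure error remains.
def structure_errors?(grid)
  grid[:structure][:error] > 0
end

messages = [
  Message.new(:error, :structure, :ragged_rows),
  Message.new(:warning, :structure, :check_options),
  Message.new(:warning, :context, :no_encoding)
]
grid = tally_feedback(messages)
puts grid[:structure].inspect
puts structure_errors?(grid)
```

Keeping the counts in one grid means the same structure can drive both the summary table and the decision to keep looping.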
Currently the dialect validation loop is triggered ONLY WHEN the following condition within views>show.html.erb is true:
`if @result.warnings.select { |warning| warning.type == :check_options }.any?`
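To make the trigger concrete, here is a minimal sketch of that condition run outside the view. `Feedback` and `Result` are hypothetical stand-ins for whatever `@result` holds in the app. Only the `select ... any?` expression is taken from the notes.

```ruby
# Hypothetical stand-ins for the objects the view receives.
Feedback = Struct.new(:type)
Result = Struct.new(:errors, :warnings)

# Same test as the view: only a :check_options warning triggers the loop.
def dialect_loop_triggered?(result)
  result.warnings.select { |warning| warning.type == :check_options }.any?
end

single_column = Result.new([], [Feedback.new(:check_options)])
ragged = Result.new([Feedback.new(:ragged_rows)], [])

puts dialect_loop_triggered?(single_column) # true
puts dialect_loop_triggered?(ragged)        # false
```

The second case shows the gap these notes are concerned with: a file that reports only errors never reaches the dialect loop, because the condition looks at warnings alone.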
The lists below repeat the csvlint.rb README, with strikethroughs on the errors that cannot be remedied by a dialect.
The following types of error can be reported:
* :wrong_content_type -- content type is not text/csv
* :ragged_rows -- row has a different number of columns (than the first row in the file) MAYBE VIA HEADER ROW?
* :blank_rows -- completely empty row, e.g. blank line or a line where all column values are empty
* :invalid_encoding -- encoding error when parsing row, e.g. because of invalid characters MAYBE?
* ~~:not_found -- HTTP 404 error when retrieving the data~~ Fatal Error not worth triggering dialect
* :stray_quote -- missing or stray quote MAYBE?
* :unclosed_quote -- unclosed quoted field MAYBE?
* :whitespace -- a quoted column has leading or trailing whitespace
* :line_breaks -- line breaks were inconsistent or incorrectly specified
* :undeclared_header -- if there is no machine-readable description of whether a header is present (e.g. in a dialect or Content-Type header)
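Two of these errors depend on reading the Content-Type header, so a small sketch may help. It assumes the optional `header=present` / `header=absent` parameter that RFC 4180 registers for `text/csv`, and a dialect hash with a `"header"` key. `content_type_errors` is a hypothetical helper written for these notes and is not how csvlint.rb implements the checks.

```ruby
# Hypothetical helper: returns the error types suggested by a Content-Type
# header value and an optional dialect hash. Not csvlint.rb's implementation.
def content_type_errors(content_type, dialect = nil)
  errors = []
  media_type, *params = content_type.to_s.split(";").map(&:strip)

  # An absent header is covered by the :no_content_type warning,
  # so only a header that is present and wrong is flagged here.
  errors << :wrong_content_type if media_type && media_type.downcase != "text/csv"

  # RFC 4180 allows "header=present" or "header=absent" as a parameter.
  header_param = params.any? { |p| p.downcase.match?(/\Aheader\s*=\s*(present|absent)\z/) }
  header_in_dialect = !dialect.nil? && dialect.key?("header")
  errors << :undeclared_header unless header_param || header_in_dialect
  errors
end

p content_type_errors("text/csv; header=present")
p content_type_errors("text/plain")
p content_type_errors("text/csv", { "header" => true })
```

The last call illustrates why :undeclared_header is a natural candidate for the dialect loop: supplying a dialect is enough to clear it, with no change needed on the server.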
The following types of warning can be reported:
* :no_encoding -- the Content-Type header returned in the HTTP request does not have a charset parameter. FATAL - server side problem
* :encoding -- the character set is not UTF-8
* :no_content_type -- file is being served without a Content-Type header
* :excel -- no Content-Type header and the file extension is .xls. Should be fatal error
* :check_options -- CSV file appears to contain only a single column
* :inconsistent_values -- inconsistent values in the same column. Reported if <90% of values seem to have same data type (either numeric or alphanumeric including punctuation)
* :empty_column_name -- a column in the CSV header has an empty name
* :duplicate_column_name -- a column in the CSV header has a duplicate name
* :title_row -- if there appears to be a title field in the first row of the CSV
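The 90% threshold behind :inconsistent_values can be sketched as follows. This is a simplified reading of the description above, not csvlint.rb's actual code: the `numeric?` test, the decision to skip blank cells and the helper names are all assumptions.

```ruby
# Assumption: a value counts as numeric if it looks like an integer or decimal.
def numeric?(value)
  value.match?(/\A-?\d+(\.\d+)?\z/)
end

# True when fewer than threshold_percent of the non-blank values in a column
# share one type (numeric vs. everything else).
def inconsistent_values?(values, threshold_percent = 90)
  present = values.map { |v| v.to_s.strip }.reject(&:empty?)
  return false if present.empty?

  numeric_count = present.count { |v| numeric?(v) }
  dominant = [numeric_count, present.size - numeric_count].max
  # Integer arithmetic avoids floating-point surprises at exactly 90%.
  dominant * 100 < present.size * threshold_percent
end

mostly_numbers = %w[1 2 3 4 5 6 7 8 9 n/a]  # 9 of 10 numeric
mixed = %w[1 2 3 four five 6 7 eight 9 10]  # 7 of 10 numeric

puts inconsistent_values?(mostly_numbers) # false
puts inconsistent_values?(mixed)          # true
```

A column with a single placeholder such as "n/a" in ten rows sits exactly on the threshold and passes, while three stray text values in ten are enough to raise the warning.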
Only a subset of the above warnings and errors are applicable to uploaded files
error type | csvlint.rb scenario |
---|---|
:title_row | features/validation_warnings.feature:69 |
:check_options | features/csv_options.feature:141 |
:excel | not catered for |
:undeclared_header | not catered for |
:nonrfc_line_breaks | features/validation_info.feature:3, features/validation_info.feature:11, features/validation_info.feature:19 |
:line_breaks | features/validation_errors.feature:138, features/validation_errors.feature:146, features/validation_errors.feature:154 |
:unclosed_quote | validation_errors.feature:17 |
:invalid_encoding | features/validation_errors.feature:87, features/validation_errors.feature:102 |
:blank_rows | features/validation_errors.feature:43, features/validation_errors.feature:58, features/validation_errors.feature:73 |