
CSVlints Provided Feedback

Stephen Fortune edited this page Jul 23, 2015 · 18 revisions

In a ticket Sumica asked for the following: 'a guide on how to publish open data at the ODI'

There is a need for technical guidance on publishing the data, including:

* using GitHub for hosting the data
* using CSVLint to ensure that the CSV file is clean
* getting an Open Data Certificate as part of the publishing process

This page of rough notes takes 'using CSVLint to ensure that the CSV file is clean' as its aim and explores how the errors and warnings reported in the current dialect check loop could be improved.

At present the CSVlint FAQ/help states the following:

> That won't fix all the problems: we won't delete empty lines or try to fix up values that are in the wrong format. We can't change the way your server provides CSV either, so you'll still be warned if it's not using the right Content-Type header.

SF feels that 'dialect check' is nomenclature that might be off-putting for the lay user.

The dialect check loop works for both hyperlink (URI) CSV files and for CSV files uploaded by a user

| | Errors | Warnings | Messages |
|---|---|---|---|
| Structure | | | |
| Schema | | | |
| Context | | | |

If there are structure Errors then the dialect validation loop must continue

Need a list of the structure errors and warnings that dialect can FIX!

Currently the dialect validation loop is triggered ONLY WHEN

if @result.warnings.select { |warning| warning.type == :check_options }.any?
  
within views>show.html.erb
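The condition above can be tried outside the view with stand-in objects. This is only a rough sketch: `LintWarning`, `LintResult` and `dialect_loop_triggered?` are hypothetical names made up for illustration, and the only thing assumed about a real warning is that it responds to `type`.

```ruby
# Hypothetical stand-ins for the validator result used in the view.
# Only `warnings` and `type` are assumed; nothing else is modelled.
LintWarning = Struct.new(:type)
LintResult  = Struct.new(:warnings)

# Same test as the view condition, written with `any?` taking a block
# instead of `select { ... }.any?`.
def dialect_loop_triggered?(result)
  result.warnings.any? { |warning| warning.type == :check_options }
end

single_column = LintResult.new([LintWarning.new(:check_options), LintWarning.new(:title_row)])
other_issues  = LintResult.new([LintWarning.new(:encoding)])

puts dialect_loop_triggered?(single_column)
puts dialect_loop_triggered?(other_issues)
```

With these stand-ins the loop would be offered for the first result only, which makes the current limitation visible: a file with structure errors but no `:check_options` warning never reaches the dialect step.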

Error Reporting

This repeats the csvlint.rb README, with strikethroughs on the errors that cannot be remedied by dialect.

Errors

The following types of error can be reported:

* :wrong_content_type -- content type is not text/csv
* :ragged_rows -- row has a different number of columns (than the first row in the file) MAYBE VIA HEADER ROW?
* :blank_rows -- completely empty row, e.g. blank line or a line where all column values are empty
* :invalid_encoding -- encoding error when parsing row, e.g. because of invalid characters MAYBE?
* ~~:not_found -- HTTP 404 error when retrieving the data~~ Fatal Error not worth triggering dialect
* :stray_quote -- missing or stray quote MAYBE?
* :unclosed_quote -- unclosed quoted field MAYBE?
* :whitespace -- a quoted column has leading or trailing whitespace
* :line_breaks -- line breaks were inconsistent or incorrectly specified
* :undeclared_header -- if there is no machine-readable description of whether a header is present (e.g. in a dialect or Content-Type header)
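One possible way to turn the annotations in the list above into a decision is sketched here. The two constant lists simply mirror the "MAYBE" and "Fatal Error" notes; the method names and the rule itself (offer a dialect retry only when nothing fatal was reported and at least one "maybe" error is present) are assumptions for discussion, not current csvlint behaviour.

```ruby
# Error types marked "Fatal Error" in the notes above.
FATAL_ERRORS = [:not_found].freeze
# Error types marked "MAYBE" in the notes above (unconfirmed).
MAYBE_FIXABLE_ERRORS = [:ragged_rows, :invalid_encoding, :stray_quote, :unclosed_quote].freeze

# Split reported error types into the three groups used in these notes.
def triage_errors(error_types)
  fatal, rest = error_types.partition { |type| FATAL_ERRORS.include?(type) }
  maybe, unclassified = rest.partition { |type| MAYBE_FIXABLE_ERRORS.include?(type) }
  { fatal: fatal, maybe_fixable: maybe, unclassified: unclassified }
end

# Hypothetical rule: retry with new dialect options only if it might help.
def offer_dialect_retry?(error_types)
  triage = triage_errors(error_types)
  triage[:fatal].empty? && triage[:maybe_fixable].any?
end

p triage_errors([:not_found, :ragged_rows, :blank_rows])
p offer_dialect_retry?([:ragged_rows, :blank_rows])
```

Anything left in `unclassified` is exactly the set of error types that still needs a decision in these notes.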

Warnings

The following types of warning can be reported:

* :no_encoding -- the Content-Type header returned in the HTTP request does not have a charset parameter FATAL - server side problem
* :encoding -- the character set is not UTF-8
* :no_content_type -- file is being served without a Content-Type header
* :excel -- no Content-Type header and the file extension is .xls should be fatal error
* :check_options -- CSV file appears to contain only a single column
* :inconsistent_values -- inconsistent values in the same column. Reported if <90% of values seem to have same data type (either numeric or alphanumeric including punctuation)
* :empty_column_name -- a column in the CSV header has an empty name
* :duplicate_column_name -- a column in the CSV header has a duplicate name
* :title_row -- if there appears to be a title field in the first row of the CSV
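To make the 90% rule for :inconsistent_values concrete, here is a minimal sketch. It is not csvlint's implementation: the numeric pattern, the choice to ignore empty cells, and the method name `inconsistent_values?` are all assumptions made for illustration.

```ruby
# Assumed definition of "numeric": optional sign, digits, optional decimal part.
NUMERIC_PATTERN = /\A[-+]?\d+(\.\d+)?\z/

# Returns true when fewer than `threshold` of the non-empty values in a
# column share the dominant type (numeric vs. everything else).
def inconsistent_values?(values, threshold: 0.9)
  present = values.reject { |value| value.nil? || value.strip.empty? }
  return false if present.empty?

  numeric  = present.count { |value| value.strip.match?(NUMERIC_PATTERN) }
  dominant = [numeric, present.size - numeric].max
  dominant.to_f / present.size < threshold
end

mostly_numeric = %w[1 2 3 4 5 6 7 8 9 n/a]   # 9 of 10 numeric
mixed          = %w[1 2 3 4 5 6 7 8 n/a tbc] # 8 of 10 numeric

puts inconsistent_values?(mostly_numeric)
puts inconsistent_values?(mixed)
```

Under these assumptions a column that is exactly 90% numeric is not flagged, because the rule as quoted is "less than 90%"; the second column falls to 80% and would be.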

Only a subset of the above warnings and errors are applicable to uploaded files

| Error type | csvlint.rb scenario |
|---|---|
| :title_row | features/validation_warnings.feature:69 |
| :check_options | features/csv_options.feature:141 |
| :excel | not catered for |
| :undeclared_header | not catered for |
| :nonrfc_line_breaks | features/validation_info.feature:3, :11, :19 |
| :line_breaks | features/validation_errors.feature:138, :146, :154 |
| :unclosed_quote | validation_errors.feature:17 |
| :invalid_encoding | features/validation_errors.feature:87, :102 |
| :blank_rows | features/validation_errors.feature:43, :58, :73 |