Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement column types to allow constraining input #7

Open
3 of 4 tasks
krassowski opened this issue Jun 3, 2020 · 6 comments
Open
3 of 4 tasks

Implement column types to allow constraining input #7

krassowski opened this issue Jun 3, 2020 · 6 comments
Labels
enhancement New feature or request

Comments

@krassowski
Copy link
Collaborator

krassowski commented Jun 3, 2020

Idea: implement a column-wise validation of the user input using the column types feature of jExcel.

I frequently find myself having to work with the spreadsheets having mistakes in data entry process (like dates being in UK and US format). Having the (optional) column types could alleviate at least some of the problem (i.e. decrease probability of myself and you from creating invalid files; cannot help about the other Excel users though).

The design idea is to have a bar above the column names with a drop-down allowing to choose column type. By default all would be "text", and the bar would be hidden. On pressing a toolbar button the bar with type drop-downs would show up and the user would be able to choose the column type.

Questions to the community/users/myself:

  • should the information about column types be stored in the CSV/TSV file?
  • if yes, is there any standard specifying how to store the column type information?
  • should we allow full-featured html entries (such as photos, see the jExcel example)? How would we save those?
  • column types have options (e.g. mask for currency, or choices for dropdown); how those should be specified? Should those be saved as well?
@krassowski
Copy link
Collaborator Author

krassowski commented Jun 4, 2020

Column type specification formats for CSV files in use

Separate .csvt file

This idea is used by qgis (which is a tool from the geospatial data world) and is a second csv file which has a single line and specifies the types of the columns.

Something-else-separated suffix after column name

neo4j uses colon-separated format of name:type for database imports: documentation.

:ID , name, joined:date, active:boolean, points:int

A similar idea but using | characters is used by spatialkey

Prefixes before headers

Another tool, flexmonster uses prefixes before column names; those look similar to C format specifiers.

Official recommendations and proposed standards

The stance of a UK government group

The UK Government Digital Service recommends not to include any additional data in CSV files:

This standard does not cover extra information about the tabular data, such as the column types, or table contents validation. You should keep this type of information separate from a CSV file to avoid it conflicting with the data structure as described in RFC 4180.

However, I could not find any such recommendation in RFC 4180. The neo4j format appears to be RFC 4180 compatible. I think the only problem would be if someone wanted to store this information in a new line (i.e. it would no longer be compatible).

W3C

The CSV on the Web: A Primer from W3C Working Group, from 25 February 2016 states:

There is no mechanism within CSV to indicate the type of data in a particular column, or whether values in a particular column must be unique. It is therefore hard to validate and prone to errors such as missing values or differing data types within a column.

And then proposes a schema for describing and validating CSV files. For the following file:

"country","name (en)","latitude","longitude"
"at","Austria","47.6965545","13.34598005"
"be","Belgium","50.501045","4.47667405"

they propose a schema like this:

{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "countries.csv",
  "tableSchema": {
    "columns": [{
      "titles": "country",
      "datatype": "string"
    },{
      "titles": "name (en)",
      "datatype": "string"
    },{
      "titles": "latitude",
      "datatype": "number"
    },{
      "titles": "longitude",
      "datatype": "number"
    }]
  }
}

It appears to be a very advanced schema, with a huge potential for validation purposes. The document is a final draft and no recommendation nor endorsment of it was issued by W3C, so it is not an official standard.

Repo: https://github.com/w3c/csvw
The follow-up happens in https://www.w3.org/community/csvw/.

There is also another, older schema format: https://specs.frictionlessdata.io//tabular-data-package/

@krassowski
Copy link
Collaborator Author

I am leaning towards either implementing the W3C-proposed schema, or towards .csvt. The former is more flexible, the latter is easier. UX-wise this should be optional - I would not want to have a new file created for every csv I edit. Not sure how I would communicate it in the UI.

The idea with adding a second header may be sub-optimal because it would be a pain to parse. Similarly the prefixes/suffixes. A lot can go wrong.

As I need the tool now (now is better than never), I will proceed with the .csvt approach (unless there is any feedback soonish). The file will be only created after the user changes a type of a column.
In the future the schema approach can be added and the user could be allowed to configure which of the approaches to use in the settings. And the user could disable writing the csvt/schema files too.

@krassowski
Copy link
Collaborator Author

Maybe there could be a checkbox (off by default):

☐ Save the column types (csvt)

which would be only shown if the file was detected or if a type of any column was changed?

@krassowski
Copy link
Collaborator Author

Also, this would be useless if viewing an Excel spreadsheet, so should have a condition on the filename not ending with xls/xlsx (or on mimetype not being Excel)

@krassowski
Copy link
Collaborator Author

should we allow full-featured html entries (such as photos, see the jExcel example)? How would we save those?

Maybe we should just allow any html (thus also allowing formating, things like bold and italic).
Maybe this should be opt-in. I do not know how to communicate this in UI. Maybe we should have an option "clean all formating"

Note: HTML can be easier to parse than Excel.

@krassowski
Copy link
Collaborator Author

krassowski commented Jun 5, 2020

Proof of concept implemented in eb3c879 (v0.3):

Screenshot from 2020-06-05 15-47-52

The types are not stored yet and styling requires work - but it is just POC.

@krassowski krassowski added the enhancement New feature or request label Jun 13, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant