Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSV reader infers Date64 type for fields like "2020-03-19 00:00:00" that it can't parse to Date64 #3744

Closed
sacundim opened this issue Feb 22, 2023 · 5 comments · Fixed by #3746
Labels
arrow Changes to the arrow crate bug

Comments

@sacundim
Copy link

sacundim commented Feb 22, 2023

Describe the bug

Starting with version 0.6.1 of csv2parquet (which upgraded its arrow-rs dependency from 24.0 to 30.0.1), input CSV files that were converting just fine are now failing with errors like this one:

Error: External(ParseError("Error while parsing value 2020-03-19 00:00:00 for column 4 at line 2"))

The csv2parquet tool is a very thin wrapper around arrow-rs that basically

  1. Calls arrow::csv::reader::infer_file_schema to infer a schema for the input file;
  2. Builds an arrow::csv::Reader from that schema and uses it to read the file;
  3. Outputs a Parquet file.

The tool has a command line option that prints out the inferred schemas, and input values like that are being inferred into Date64:

    {
      "name": "Sample Date",
      "data_type": "Utf8",
      "nullable": false,
      "dict_id": 0,
      "dict_is_ordered": false,
      "metadata": {}
    }

And I've written a unit test case that demonstrates that

  1. Arrow-rs is inferring Date64 type for input values like "2020-03-19 00:00:00"
  2. ...but attempting to parse them as Date64 fails

To Reproduce

I wrote a couple of very simple test cases illustrating the problem:

My test_can_parse_inferred_date64 in arrow-csv/src/reader/mod.rs specifically shows how infer_field_schema returns Date64 for the example string and yet when we feed it to parse_item::<Date64Type> we get None.

Expected behavior

I see that the various Timestamp types are able to parse strings like that correctly, so I don't understand whether the more correct behavior for this library would be to infer a Timestamp type for these strings instead of the Date64 type. It does seem clear to me that however that it should be possible to parse these strings as Date64.

@sacundim sacundim added the bug label Feb 22, 2023
sacundim pushed a commit to sacundim/arrow-rs that referenced this issue Feb 22, 2023
…elds like "2020-03-19 00:00:00" that it can't parse to Date64)
@alexhallam
Copy link

I came to make the same comment.

here is a sample csv

"","origin","year","month","day","hour","temp","dewp","humid","wind_dir","wind_speed","wind_gust","precip","pressure","visib","time_hour"
"1","EWR",2013,1,1,1,39.02,26.06,59.37,270,10.35702,NA,0,1012,10,2013-01-01 01:00:00
"2","EWR",2013,1,1,2,39.02,26.96,61.63,250,8.05546,NA,0,1012.3,10,2013-01-01 02:00:00
"3","EWR",2013,1,1,3,39.02,28.04,64.43,240,11.5078,NA,0,1012.5,10,2013-01-01 03:00:00

The last column time_hour will not parse.

Example

Here is some code.

use arrow::array::{Array, StringArray};
use arrow::compute::cast;
use arrow::datatypes::DataType::{
    Boolean, Date32, Date64, Float64, Int64, Interval, List, Time32, Time64, Timestamp, Utf8,
};
use arrow::ipc::Time;
use arrow::record_batch::RecordBatch;
use arrow_csv::reader;
use std::fs::File;

fn main{
    let path = "data/weather.csv".to_owned();

    // infer the schema using arrow_csv::reader
    let schema = reader::infer_schema_from_files(&[path.clone()], 44, Some(1000), true);
    let schema_data_types = reader::infer_schema_from_files(&[path.clone()], 44, Some(1000), true);

    //for each feild in the schema, match on the data type and push a string to a vectotr `Vec<String>`.
    let data_types: Vec<String> = schema_data_types
        .expect("Schema should be infered")
        .fields()
        .iter()
        .map(|field| {
            let data_type = field.data_type();
            match data_type {
                Boolean => "<bool>".to_string(),
                Int64 => "<int>".to_string(),
                Float64 => "<dbl>".to_string(),
                Utf8 => "<chr>".to_string(),
                List(_) => "<list>".to_string(),
                Date32 => "<date>".to_string(),
                Date64 => "<date64>".to_string(),
                Timestamp(_, _) => "<ts>".to_string(),
                Time32(_) => "<time>".to_string(),
                Time64(_) => "<time64>".to_string(),
                _ => "<_>".to_string(),
            }
        })
        .collect();

    // print the data types
    println!("data types {:?}", data_types);

    let file = File::open(path).unwrap();
    let mut reader = reader::Reader::new(
        file,
        Arc::new(schema.expect("Schema should be infered")),
        true,
        Some(44),
        1024,
        None,
        None,
        None,
    );
}
    // convert reader to record batch
    let record_batch: RecordBatch = reader.next().unwrap().unwrap().clone();

    // print record batch
    println!("{:?}", record_batch);

The Error

data types ["<int>", "<chr>", "<int>", "<int>", "<int>", "<int>", "<dbl>", "<dbl>", "<dbl>", "<chr>", "<dbl>", "<chr>", "<dbl>", "<chr>", "<dbl>", "<date64>"]
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: ParseError("Error while parsing value 2013-01-01 02:00:00 for column 15 at line 2")'

@alexhallam
Copy link

alexhallam commented Feb 22, 2023

Expected Behavior

Does not matter much to me. I just need (what I would call a date time) "2013-01-01 02:00:00" to not error. It would by nice if there was a date_time enum option, but I can use regex and parse this myself as needed.

@tustvold
Copy link
Contributor

I think the CSV reader should be inferring Timestamp for such columns, it is unclear to me that it should ever infer Date64, as the semantics of such a type are somewhat unclear, at least to me.

I have posted an email to the arrow mailing list to try to clarify what the Date64 type is for - https://lists.apache.org/thread/q036r1q3cw5ysn3zkpvljx3s9ho18419

@alexhallam
Copy link

alexhallam commented Feb 22, 2023

@tustvold I think that is sensible. I would call 2013-01-01 02:00:00 a Timestamp before I called it a Date64.

tustvold added a commit that referenced this issue Feb 23, 2023
* Infer 2020-03-19 00:00:00 as timestamp not Date64 in CSV (#3744)

* Update inference logic
@tustvold tustvold added the arrow Changes to the arrow crate label Feb 24, 2023
@tustvold
Copy link
Contributor

label_issue.py automatically added labels {'arrow'} from #3746

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants