Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parsing datehour string without zero padded hour gives wrong output #8849

Closed
2 tasks done
dikesh opened this issue May 15, 2023 · 13 comments · Fixed by #9044
Closed
2 tasks done

Parsing datehour string without zero padded hour gives wrong output #8849

dikesh opened this issue May 15, 2023 · 13 comments · Fixed by #9044
Labels
bug Something isn't working python Related to Python Polars

Comments

@dikesh
Copy link

dikesh commented May 15, 2023

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

I am trying to parse datehour where hour is not zero padded and both the expressions Expr.str.strptime and Expr.str.to_datetime parses hour to zero when hour has single digit

e.g.
2023-05-04|7 -> 2023-05-04 00:00:00
2023-05-04|10 -> 2023-05-04 10:00:00

I tried following expressions but no luck

pl.col('dateHour').str.to_datetime('%Y-%m-%d|%H')
pl.col('dateHour').str.to_datetime('%Y-%m-%d|%k')
pl.col('dateHour').str.to_datetime('%Y-%m-%d|%-H')
pl.col('dateHour').str.to_datetime('%Y-%m-%d|%-k')

Reproducible example

import polars as pl

df = pl.from_dict({'dateHour': [
        "2023-05-04|7",
        "2023-05-04|8",
        "2023-05-04|9",
        "2023-05-04|10",
        "2023-05-04|11",
        "2023-05-05|12",
]})

print(df.with_columns(pl.col('dateHour').str.to_datetime('%Y-%m-%d|%H').alias('parsedDateHour')))

# Output

# shape: (6, 2)
# ┌───────────────┬─────────────────────┐
# │ dateHour      ┆ parsedDateHour      │
# │ ---           ┆ ---                 │
# │ str           ┆ datetime[μs]        │
# ╞═══════════════╪═════════════════════╡
# │ 2023-05-04|7  ┆ 2023-05-04 00:00:00 │
# │ 2023-05-04|8  ┆ 2023-05-04 00:00:00 │
# │ 2023-05-04|9  ┆ 2023-05-04 00:00:00 │
# │ 2023-05-04|10 ┆ 2023-05-04 10:00:00 │
# │ 2023-05-04|11 ┆ 2023-05-04 11:00:00 │
# │ 2023-05-05|12 ┆ 2023-05-05 12:00:00 │
# └───────────────┴─────────────────────┘

Expected behavior

# Output after zero padding
print(df.with_columns(
    pl.col('dateHour')
    .str.split('|').arr.eval(pl.element().str.zfill(2)).arr.join('|')
    .str.to_datetime('%Y-%m-%d|%H').alias('parsedDateHour')
))


# Expected output

# shape: (6, 2)
# ┌───────────────┬─────────────────────┐
# │ dateHour      ┆ parsedDateHour      │
# │ ---           ┆ ---                 │
# │ str           ┆ datetime[μs]        │
# ╞═══════════════╪═════════════════════╡
# │ 2023-05-04|7  ┆ 2023-05-04 07:00:00 │
# │ 2023-05-04|8  ┆ 2023-05-04 08:00:00 │
# │ 2023-05-04|9  ┆ 2023-05-04 09:00:00 │
# │ 2023-05-04|10 ┆ 2023-05-04 10:00:00 │
# │ 2023-05-04|11 ┆ 2023-05-04 11:00:00 │
# │ 2023-05-05|12 ┆ 2023-05-05 12:00:00 │
# └───────────────┴─────────────────────┘

Installed versions

--------Version info---------
Polars:      0.17.13
Index type:  UInt32
Platform:    Linux-5.19.0-41-generic-x86_64-with-glibc2.35
Python:      3.11.3 (main, Apr  5 2023, 14:14:37) [GCC 11.3.0]

----Optional dependencies----
numpy:       1.24.3
pandas:      <not installed>
pyarrow:     11.0.0
connectorx:  <not installed>
deltalake:   <not installed>
fsspec:      <not installed>
matplotlib:  <not installed>
xlsx2csv:    <not installed>
xlsxwriter:  <not installed>
@dikesh dikesh added bug Something isn't working python Related to Python Polars labels May 15, 2023
@ritchie46
Copy link
Member

Please look at the docs: https://docs.rs/chrono/latest/chrono/format/strftime/index.html

Your pattern doesn't seem correct.

%H | 00 | Hour number (00–23), zero-padded to 2 digits.

@dikesh
Copy link
Author

dikesh commented May 15, 2023

@ritchie46 I had looked at the doc. But I am not able to understand how to parse hour with single digit as shared in sample. I have also shared the expressions that I tested. Am I missing something?

@MarcoGorelli
Copy link
Collaborator

Got it, this happens because chrono accepts it:

chrono::NaiveDate::parse_from_str("2020-01-01|01", "%Y-%m-%d|%H")

outputs

Ok(2020-01-01)

I'll take a look later, I think it this should just error, as NaiveDateTime wouldn't accept it

@ritchie46
Copy link
Member

Isn't the issue that the hours don't have leading zero's, so the fmt doesn't fit?

@dikesh
Copy link
Author

dikesh commented May 15, 2023

I assumed it should work like python's datetime parser but I think chrono has different behaviour.

from datetime import datetime as dt

df.with_columns(
    pl.col('dateHour').str.to_datetime('%Y-%m-%d|%H').alias('parsedWithChrono'),
    pl.col('dateHour').apply(lambda dh: dt.strptime(dh, '%Y-%m-%d|%H')).alias('parsedWithoutChrono'),
)

Output:

shape: (6, 3)
┌───────────────┬─────────────────────┬─────────────────────┐
│ dateHour      ┆ parsedWithChrono    ┆ parsedWithoutChrono │
│ ---           ┆ ---                 ┆ ---                 │
│ str           ┆ datetime[μs]        ┆ datetime[μs]        │
╞═══════════════╪═════════════════════╪═════════════════════╡
│ 2023-05-04|7  ┆ 2023-05-04 00:00:00 ┆ 2023-05-04 07:00:00 │
│ 2023-05-04|8  ┆ 2023-05-04 00:00:00 ┆ 2023-05-04 08:00:00 │
│ 2023-05-04|9  ┆ 2023-05-04 00:00:00 ┆ 2023-05-04 09:00:00 │
│ 2023-05-04|10 ┆ 2023-05-04 10:00:00 ┆ 2023-05-04 10:00:00 │
│ 2023-05-04|11 ┆ 2023-05-04 11:00:00 ┆ 2023-05-04 11:00:00 │
│ 2023-05-05|12 ┆ 2023-05-05 12:00:00 ┆ 2023-05-05 12:00:00 │
└───────────────┴─────────────────────┴─────────────────────┘

@MarcoGorelli
Copy link
Collaborator

Isn't the issue that the hours don't have leading zero's, so the fmt doesn't fit?

that's fine, it's allowed:

chrono::NaiveDateTime::parse_from_str("2020-01-01 5:35", "%Y-%m-%d %H:%M")

outputs

Ok(2020-01-01T05:35:00)

The issue is that NaiveDateTime::parse_from_str fails with the given format (as expected), but then the fallback NaiveDate::parse_from_str passes, resulting in the hour component being dropped

@ritchie46
Copy link
Member

Ah, right.. That's bad indeed.

@MarcoGorelli
Copy link
Collaborator

looks like chrono would be open to raising on their side for this chronotope/chrono#1075 (comment)

that might be a better solution - I'll see if I can get this in there

@MarcoGorelli
Copy link
Collaborator

Opinions in chrono seem to be differing between developers

So, possible solutions could be:

  1. "hack" around chrono, and raise if the format string contains ("%H" or "%I") but not "%M". Not hard, but I feel very uneasy about this, some other unforeseen issue could well come up as a result of this (plus it adds complexity)
  2. parse datetime as datetimes, and dates as date, as I'm suggesting here: fix(rust, python)!: don't parse inputs without hour and minute components as pl.Datetime (use pl.Date instead) #8850 (may need to be a breaking change for 0.18)
  3. keep the current output, and hope that chrono introduces a new ParseErrorKind which would allow to work around this

I think I like the sound of 2) the most, as it's simple and predictable

@ritchie46
Copy link
Member

Couldn't we check in the Date parser if it consumed the whole fmt? I like the idea that Datetime can parse anything (dates and datetimes).

And that Date only can parse dates, and if it contains any time it should error suggesting to use Datetime instead. Would that be possible?

@MarcoGorelli
Copy link
Collaborator

that would be nice, I'm just a bit wary about working around chrono, I'm worried this is going to lead to subtle hard-to-debug issues later...but I'll see, maybe it's not as bad as I'm thinking

@ritchie46
Copy link
Member

Yes, I don't mean that we work around chrono. Maybe we can get the parsed length from chrono ad check if it consumed the whole string. If it it didn't we fail. I thought chrono had an error showing that as well.

The ParserErrorKind now is public: chronotope/chrono#588

@MarcoGorelli
Copy link
Collaborator

MarcoGorelli commented May 25, 2023

I might be missing something, but I don't any way to get this from chrono:

use chrono;
fn main() {
    println!("{:?}", chrono::NaiveDateTime::parse_from_str("2020-01-01 17", "%Y-%m-%d %H"));
    println!("{:?}", chrono::NaiveDateTime::parse_from_str("2020-01-01", "%Y-%m-%d"));
    println!("{:?}", chrono::NaiveDate::parse_from_str("2020-01-01 17", "%Y-%m-%d %H"));
    println!("{:?}", chrono::NaiveDate::parse_from_str("2020-01-01", "%Y-%m-%d"));
}

outputs

Err(ParseError(NotEnough))
Err(ParseError(NotEnough))
Ok(2020-01-01)
Ok(2020-01-01)

and it's not clear how to distinguish them. I've pasted a minimal reproducer of the issue in chronotope/chrono#1075 anyway, hope there's a non-hacky and performant solution 🤞 (it looks like opinions differ among their devs though, so it may be unlikely to change)

If Datetime should accept %Y-%m-%d, then two solutions that some to mind are:

  • add an extra statement in transform_datetime_*s like
            if nd.map(|nd| nd.format(fmt).to_string()) == Some(val.to_string()) {
                return None
            };
    . I haven't timed it, but it looks like there would be performance implications, and am not sure they'd be justified for such an edge case?
  • in as_datetime, add a check like
              if hour_pattern.is_match(&fmt) && !minute_pattern.is_match(&fmt) {
                  // (hopefully) temporary hack. Ideally, chrono would return a ParseKindError indicating
                  // if `fmt` is too long for NaiveDate. If that's implemented, then this check could
                  // be removed, and that error could be matched against in `transform_datetime_*s`
                  // See https://github.com/chronotope/chrono/issues/1075.
                  polars_bail!(ComputeError: "Invalid format string: found hour, but not minute directive");
              }
    This would only need doing once per Series, so no real performance implications. Just concerned some other edge case is gonna appear, but hopefully by then there'll be a robust solution in Chrono which can be used

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working python Related to Python Polars
Projects
None yet
3 participants