Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parsing numeric literals can cause TooShort/Invalid #1399

Open
t-mart opened this issue Feb 1, 2024 · 8 comments
Open

Parsing numeric literals can cause TooShort/Invalid #1399

t-mart opened this issue Feb 1, 2024 · 8 comments

Comments

@t-mart
Copy link

t-mart commented Feb 1, 2024

For a format/input string item that starts with a numeric literal, the various parse_from_str methods can return a ParseError error result of TooShort or Invalid if that literal comes directly after a numeric specifier/value with variable parsing width (PW=∞).

In other words, this problem can occur when, for a consecutive pair of format items, a and b, a is a specifier value of variable width and b is a literal that starts with a digit:

  • %s1 timestamp followed by literal digit
  • %Y1 year followed by literal digit

Of particular note is that this same sequence of (specifier, literal) manifests as errors of two different variants:

TooShort Example:

#[test]
fn test_timestamp() {
    use chrono::{DateTime, format::ParseErrorKind::TooShort};

    let literals = ['1', 'a'];
    for literal in literals {
        let input = format!("1{literal}");
        let fmt = format!("%s{literal}");

        let parsed = DateTime::parse_from_str(&input, &fmt);

        if literal.is_ascii_digit() {
            assert_eq!(parsed.unwrap_err().kind(), TooShort);
        } else {
            assert!(parsed.is_ok());
        }
    }
}

This same behavior exists for the other (PW=∞) specifiers, such as Year (e.g., NaiveDate::parse_from_str("01-01-+100001", "%m-%d-%Y1").

Invalid Example

Note that this case contains the same specifier-literal pair as above, but the error is different. Probably because the pair is not at the end of the strings.

#[test]
fn test_year2() {
    use chrono::{NaiveDate, format::ParseErrorKind::Invalid};

    let literals = ['1', 'a'];
    for literal in literals {
        let input = format!("+10000{literal}-01-01");
        let fmt = format!("%Y{literal}-%m-%d");

        let parsed = NaiveDate::parse_from_str(&input, &fmt);

        if literal.is_ascii_digit() {
            assert_eq!(parsed.unwrap_err().kind(), Invalid);
        } else {
            assert!(parsed.is_ok());
        }
    }
}

Cause?

I believe this has to do with the fact that scan::number is greedy against any next digit(s), even if that may consume a next literal. In general, the parse algorithm does not appear to backtrack to find another possible match location. This is an understandable compromise. Allowing for this case would essentially be implementing a regex engine.

I do concede that this kind of input is pathological and would rarely occur in practical usage. But the current implementation feels wrong that it can do this just depending on the "numericality" of a literal, at least without it being documented. And, while there are a few mentions that the parser is greedy, it seems to only consider the cases where the input is missing obvious text from its format. (On the other hand, maybe it's just ok to say the parser is greedy as we currently do, and call it a day. Not sure.)

Options

If this finding is valid, (and indeed, if we do not want to implement a regex-engine-style approach) here are a few ways we could handle it:

  • During input parsing, reject numerically-prefixed literals after variable-length specifiers (Year, IsoYear, and Timestamp) by returning Invalid or some new error variant (e.g. GreedyValue). We could determine this with minimal (?) overhead by something like:

    // pseudocode
    // after parsing a last item's text and before parsing the current one...
    if last_item_parsed_with_variable_width && cur_item_is_literal && cur_item_literal_starts_with_number:
      return Err(GreedyValue);
    

    Note that Year only exhibits this behavior when parsed with a leading +/-, so regular 4-digit years would never error this way.

  • The same as above, but during format parsing (i.e., StrftimeItems::new(fmt)). This would short-circuit on formats where the input could exhibit this behavior, but would be a false positive in the 4-digit Year case (no +/-), which we can only know once we start parsing the input.

  • Or, just call this out in the documentation: Update TooShort and Invalid's documentation to include this possibility, and maybe in some other high-level documentation places as well. This would be easiest and obviously cause no performance overhead, but maybe not as correct as we could be.

Versions

  • rustc: rustc 1.77.0-nightly (5518eaa94 2024-01-29)
  • chrono: 0.4.33
@jtmoon79
Copy link
Contributor

jtmoon79 commented Feb 2, 2024

Wow, nice report here @t-mart !

I recommend providing a PR (assuming @pitdicker @djc are also interested and would like to see a PR first). Pick the option that you feel would be best to implement.

I recommend at least two commits to the PR to demonstrate the behavioral change

  1. add any necessary new test cases (the test cases probably already exist somewhere in src/format/parse.rs)
  2. code change + test case adjustments

Chrono maintainers may decide on a different option but tangible code is a great starting point for a longer discussion.


Overall, I agree that the errors from the various parse methods are broad. When implementing a parse function call, I often have to "throw spaghetti at the wall" until I get something sticking because the error offers little insight.

@pitdicker
Copy link
Collaborator

pitdicker commented Feb 3, 2024

@t-mart Thank you for the report. We had an issue and closed PR related to the problem of some specifies accepting a variable number of digits without look-ahead: #560 and #561.

The main difficulty is that we support iterating over the format string while iterating over the input string. So we currently don't know the next formatting specifier while parsing some input.

I'll have a better look at this issue later, maybe tomorrow. Would you be willing to work on it?

@pitdicker
Copy link
Collaborator

In general, the parse algorithm does not appear to backtrack to find another possible match location. This is an understandable compromise. Allowing for this case would essentially be implementing a regex engine.

The linked old PR shows a compromise. If backtracking can be limited to only one or a few functions that do the integer parsing I don't think it is a bad solution.

@pitdicker
Copy link
Collaborator

pitdicker commented Feb 3, 2024

The main difficulty is that we support iterating over the format string while iterating over the input string. So we currently don't know the next formatting specifier while parsing some input.

One option is to add some complexity to the strftime parser to read ahead when it encounters a variable-with formatting specifier. If it is followed by a fixed number of digits (either other formatting specifiers or plain digits), it stores this number of fixed digits as extra information in the formatting item. If a variable-with specifier is followed by another variable-with specifier it is an error.

Then the parser can use this information to leave some digits remaining when parsing a variable-with specifier.

Disadvantage is the double parsing of parts of the format string. And that doing so is unnecessary for the other use: formatting. But that may not turn out to be much of a problem in practice. The double parsing of a few characters in the format string would only happen in cases where we now fail to parse an input string at all.

@t-mart
Copy link
Author

t-mart commented Feb 5, 2024

Thank you for your consideration and guidance, everyone. I intend to PR on this. Give me a few days to get to some free time.

@DimDroll
Copy link

@t-mart thanks for leading effort on this issue.

Can you please share if you had time to look into it?

vector.dev community uses chrono for datetime parsing and interest in this feature is growing with those cases:
vectordotdev/vrl#117
vectordotdev/vrl#790
which is mentioned by vector.dev maintainer here:
#560
as you seem to be aware.

No pressure, just curios on whether slipped through the cracks or any help is needed.
Thanks again.

@t-mart
Copy link
Author

t-mart commented Apr 15, 2024

Can you please share if you had time to look into it?

Thanks for the update/reminder, @DimDroll.

Regrettably, I have not had time to start work here. Please jump in!

@pitdicker
Copy link
Collaborator

Fox expectations: this is something that is not an easy fix that can be done in a couple of hours of work (at least it is not easy for me 😄). Not even sure if it can even be done in a backwards-compatible way.

But feel free to dive in and come up with a plan.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants