Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support String Coercion in Raw JSON Reader #3736

Merged
merged 2 commits into from Feb 21, 2023

Conversation

rguerreiromsft
Copy link
Contributor

@rguerreiromsft rguerreiromsft commented Feb 20, 2023

Which issue does this PR close?

None. This is a feature request.

Rationale for this change

infer_json_schema coerces primitive values into string, and it couldn't work when reading the same json with the given schema.

        let schema = Schema::new(vec![
            Field::new("c1", DataType::Utf8, true),
            Field::new("c2", DataType::Utf8, true),
        ]);

        let inferred_schema = infer_json_schema_from_iterator(
            vec![
                Ok(serde_json::json!({"c1": 1, "c2": 1})),
                Ok(serde_json::json!({"c1": "a", "c2": 0})),
                Ok(serde_json::json!({"c1": true, "c2": "ok"})),
            ]
            .into_iter(),
        )
        .unwrap();

        assert_eq!(inferred_schema, schema);

What changes are included in this PR?

Add flag to make RawReader behave the same as infer_json_schema.

Are there any user-facing changes?

Yes, the user can opt-in a coerce_primitive flag.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Feb 20, 2023
@rguerreiromsft
Copy link
Contributor Author

rguerreiromsft commented Feb 20, 2023

I'm not really happy with the Loose configuration I implemented because the string conversion to bool doesn't make much sense. And bool doesn't follow javascript either, where it considers an empty list to be false.

Maybe a best implementation for this is to make bools and numbers as they used to be, and consider the loose behaviour to just parse everything to string, because in that case we don't lose any information.

The good thing about this implementation is that if the end-user wants to get their hands dirty, they can implement their custom converter and get it tuned to their needs.

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As written I don't think TapeConverter can be implemented outside the crate as Tape is not public. I happen to think this is a good thing as I would rather what are effectively implementation details not be exposed outside of the crate.

I wonder if we could rework the generics slightly so that TapeConverter is a type-erased implementation detail of the various ArrayDecoder, instead of plumbing it the whole way through?

I do also wonder if we really need generics for this, or if a simple strict flag is all that is needed....

@@ -87,14 +92,15 @@ impl Display for TapeElement {
///
/// [simdjson]: https://github.com/simdjson/simdjson/blob/master/doc/tape.md
#[derive(Debug)]
pub struct Tape<'a> {
pub struct Tape<'a, C: TapeConverter> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would make for a cleaner interface if the converter were instead given the tape instead of Tape being parameterised on TapeConverter. This would allow ArrayDecoder and by extension RawDecoder / RawReader to not need a generic

TapeElement::True => Ok("true".into()),
TapeElement::False => Ok("false".into()),
TapeElement::Number(idx) => Ok(tape.get_string(idx).into()),
TapeElement::String(idx) => Ok(format!("\"{}\"", tape.get_string(idx)).into()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this will correctly escape newlines or quotes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Do you think using serde_json is a good approach to solve this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be completely honest I struggle to understand the use-case, the performance will be bad and it seems like a rather strange API... This sort of schema coercion I'm not sure really belongs in the reader

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use case is that we don't always control the json format we're reading from, but we can control the Schema.

And it's the behaviour given by infer_json_schema:

        let schema = Schema::new(vec![
            Field::new("c1", DataType::Utf8, true),
            Field::new("c2", DataType::Utf8, true),
        ]);

        let inferred_schema = infer_json_schema_from_iterator(
            vec![
                Ok(serde_json::json!({"c1": 1, "c2": 1})),
                Ok(serde_json::json!({"c1": "a", "c2": 0})),
                Ok(serde_json::json!({"c1": true, "c2": "ok"})),
            ]
            .into_iter(),
        )
        .unwrap();

        assert_eq!(inferred_schema, schema);

Copy link
Contributor

@tustvold tustvold Feb 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And it's the behaviour given by infer_json_schema:

infer_json_schema coerces primitives, I'm not aware of it coercing objects or lists to strings? If it is, that is not intentional

The use case is that we don't always control the json format we're reading from, but we can control the Schema.

Does each file have a consistent schema though, i.e. does a given column have a mix of objects and primitives in the same file? The assumption of arrow typically is that schema evolution occurs at the file granularity, and can therefore be handled at a higher level...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, no, it's just for primitives, and only to strings, if I'm not mistaken. Let me see what it coerces between number and bool.

I probably just got creative there and asked why not? But thinking about it, what I really need is to be able to read json with the schema that was inferred before, and use it to save to Parquet.

Maybe we don't even need a flag for it, shouldn't this be the default behaviour since infer_json_schema does it by default?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, confirmed that if you mismatch numbers and bools, it will also consider them to be strings.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since infer_json_schema does it by default

I think having an option that disables strict enforcement along the lines of the schema inference makes sense 👍, I do think it should still be optional as some workflows care about schema consistency and would rather fail loud 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then I'll keep it very simple. Just a bool in a struct for now.

}
}

fn obj_to_json_string<'a, C: TapeConverter>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally think this behaviour is a little bit strange, I think it should be possible to opt-in to coercing primitives, e.g. string <-> bool, separately from opting into this potentially more surprising behaviour?

@rguerreiromsft
Copy link
Contributor Author

rguerreiromsft commented Feb 20, 2023

I do also wonder if we really need generics for this, or if a simple strict flag is all that is needed....

I was thinking about this when implementing too. I tried to make it a Box before but it didn't like that to_primitive is generic. So it couldn't be object safe.

I was going to offer another PR with just the flag. Instead I'll rework this one and squash everything into one commit at the end. I agree with you that leaking implementation details makes it harder to evolve without breaking changes.

I can create a few different flags that the user can opt-in:

pub enum DecoderParserStrategy {
    PrimitiveToString, // Accepts 1 as "1" and true as "true"
    ObjectToJsonString, // Accepts [123] as "[123]" and {"a": 123} as "{\"a\": 123}"
    NumberToBool, // Accepts 0 as false, otherwise it's true
    BoolToNumber, // true becomes 1 and false becomes 0
}

Then the user can provide these flags using an array, and by default none of them are enabled.

I'll also make sure it can parse \n and " from a string, maybe the best option is to use serde_json to do that.

@tustvold
Copy link
Contributor

tustvold commented Feb 20, 2023

Then the user can provide these flags using an array, and by default none of them are enabled.

I would recommend a struct of booleans, as opposed to an array, given these will need to be checked on the "hot path". I would also recommend keeping it simple, with a single coerce_primitives flag or something... I think complex schema coercion logic is possibly better handled at a higher level...

@rguerreiromsft
Copy link
Contributor Author

@tustvold, thank you so much for all the guidance. I have updated this PR with the suggested changes.

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thank you. I will double-check the benchmarks tomorrow prior to getting this in

@tustvold tustvold changed the title Add converter to allow custom parsing of json data Support String Coercion in Raw JSON Reader Feb 21, 2023
@tustvold
Copy link
Contributor

tustvold commented Feb 21, 2023

I've confirmed this does not seem to have a discernible impact on the benchmarks 🎉 Thank you for sticking with this

@tustvold tustvold merged commit e33dbe2 into apache:master Feb 21, 2023
@ursabot
Copy link

ursabot commented Feb 21, 2023

Benchmark runs are scheduled for baseline = 25da74c and contender = e33dbe2. e33dbe2 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants