Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JSON support #137

Open
mildbyte opened this issue Oct 3, 2022 · 0 comments
Open

JSON support #137

mildbyte opened this issue Oct 3, 2022 · 0 comments
Labels
enhancement New feature or request

Comments

@mildbyte
Copy link
Contributor

mildbyte commented Oct 3, 2022

Support for querying / writing JSON values, similar to SQLite's support (https://www.sqlite.org/json1.html) and PG's support (https://www.postgresql.org/docs/14/functions-json.html).

Ideally should be built as a separate crate (e.g. datafusion-json / seafowl-datafusion-json) that defines DataFusion UDFs/UDAFs to manipulate JSON data, but if it isn't implementable as that, we can keep it in Seafowl or contribute as a patch to DataFusion.

Minimum useful set of functions

  • json_extract (SQLite-style) / json_extract_path (PG-style): get a value out of a JSON value
  • json_set: replace/add a value inside of a JSON value
  • json_group_array+json_group_object (SQLite-style) / json_agg+json_object_agg (PG-style): aggregation function that turns a table into a JSON object/array
  • json_object/json_array: create an inline JSON object/array from the function arguments

Investigate how to "tag" a JSON value using Arrow's type system

SQLite tries to get away with treating JSON values as strings, but sometimes we do need to distinguish between a string and a string that is actually a JSON object.

There's a difference between

json_set('{"a":2,"c":4}', '$.a', '{"b": 42}')

and

json_set('{"a":2,"c":4}', '$.a',  json('{"b": 42}'))

In the first case, we're setting the value of a to be the string {"b": 42}. In the second case, we're setting it to a JSON object. We might need to define an equivalent function / Arrow datatype (even if we ultimately store JSON values as strings in the Parquet file, we need to occasionally distinguish between strings and strings that are JSON values in our type system).

We could use the Arrow struct type (https://github.com/apache/arrow-rs/blob/master/arrow-schema/src/datatype.rs#L169), though it requires the types of the values to be predefined. We could instead make the struct have a single field, json: Utf8, that Seafowl will know to treat as JSON.

Some inspiration from Snowflake? https://docs.snowflake.com/en/sql-reference/data-types-semistructured.html

Table-valued functions (set-returning functions)

SQLite JSON functions like json_each/json_tree and PG JSON functions like json_each/json_array_elements (that return SETOF) actually return a table that we can then join on.

User-defined table functions are currently unsupported in DataFusion (apache/datafusion#1604 / apache/datafusion#1608): we could contribute support for them and then build JSON support on top of that, or implement it using DataFusion's custom logical plan / optimizer rules if possible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant