JSON Field #1251

Closed
fulmicoton opened this issue Jan 4, 2022 · 3 comments

@fulmicoton
Collaborator

fulmicoton commented Jan 4, 2022

JSON field

This feature is motivated by the need for dynamic schema in quickwit.

A json field has a field value of type serde_json::Value.

Given a json object, we emit tokens of the form
<json path><\u001e><typecode><value>

The typecode is one of the following:

  • s: text
  • i: i64
  • u: u64
  • f: f64
  • d: Date

Encoding a token requires two kinds of separators.

  • The segments of the json path are separated using codepoint 0.

This choice has the benefit of aligning the lexicographical order
with the depth-first search order.
Note that we end with a trailing \0 separator for the same reason.
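
To make the ordering argument concrete, here is a minimal sketch of the path encoding. The names `encode_path` and `sorted_paths` are hypothetical helpers introduced for illustration; they are not tantivy functions. The sketch assumes terms are compared byte-wise, as in a sorted term dictionary.

```rust
/// Encodes a json path: every segment is followed by a `\0` separator,
/// including the last one (the trailing separator).
/// Hypothetical helper for illustration, not tantivy's actual code.
fn encode_path(segments: &[&str]) -> String {
    let mut encoded = String::new();
    for segment in segments {
        encoded.push_str(segment);
        encoded.push('\0');
    }
    encoded
}

/// Encodes all paths and sorts them byte-wise, the way a sorted
/// term dictionary would order them.
fn sorted_paths(paths: &[&[&str]]) -> Vec<String> {
    let mut encoded: Vec<String> = paths.iter().map(|p| encode_path(p)).collect();
    encoded.sort();
    encoded
}
```

With the paths `attr`, `attr.color` and `attr-x`, the encoded forms keep `attr` and its child `color` next to each other, because `\0` sorts before every other byte. With "." as a separator and no trailing separator, `attr-x` would land between `attr` and `attr.color`, since "-" sorts before ".".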

We also need a separator to separate the field path from the value.
To do so, we use the Record Separator codepoint (30, i.e. \u001e).

For instance the following JSON

{
	"timestamp": 234243,
	"type": "start-proc",
	"body": "hello happy tax payer",
	"attr": {
		"color": "blue"
	}
}

Generates the following FieldValues.

- timestamp<\u0000><\u001e>u<bigendian_for_234243>
- type<\u0000><\u001e>sstart
- type<\u0000><\u001e>sproc
- body<\u0000><\u001e>shello
- body<\u0000><\u001e>shappy
- body<\u0000><\u001e>stax
- body<\u0000><\u001e>spayer
- attr<\u0000>color<\u0000><\u001e>sblue
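
The token layout can be sketched as follows. This is only an illustration under stated assumptions: `encode_term`, `text_term` and `u64_term` are hypothetical names, every value is assumed to carry its typecode, and the u64 payload is assumed to be its 8 big-endian bytes so that byte order matches numeric order.

```rust
const SEGMENT_SEP: u8 = 0x00; // codepoint 0, ends every path segment
const VALUE_SEP: u8 = 0x1e; // Record Separator (30), splits path from value

/// Builds `<json path><\u001e><typecode><value bytes>`.
/// Hypothetical helper for illustration, not tantivy's actual code.
fn encode_term(path: &[&str], typecode: u8, value: &[u8]) -> Vec<u8> {
    let mut term = Vec::new();
    for segment in path {
        term.extend_from_slice(segment.as_bytes());
        term.push(SEGMENT_SEP);
    }
    term.push(VALUE_SEP);
    term.push(typecode);
    term.extend_from_slice(value);
    term
}

/// One term per text token, using typecode `s`.
fn text_term(path: &[&str], token: &str) -> Vec<u8> {
    encode_term(path, b's', token.as_bytes())
}

/// A u64 term, using typecode `u` and a big-endian payload.
fn u64_term(path: &[&str], value: u64) -> Vec<u8> {
    encode_term(path, b'u', &value.to_be_bytes())
}
```

For example, `text_term(&["attr", "color"], "blue")` yields the bytes of `attr<\u0000>color<\u0000><\u001e>sblue`.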

Important note!!!
It would have been tempting to put the type before the path.
sbody<\u0000><\u001e>tax

However, on the search side there is a benefit to being able to scan all types associated with a path in a single read.
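
A rough sketch of that single read, assuming the term dictionary behaves like a sorted set of byte strings. A `BTreeSet` stands in for it here, and `terms_for_path` is a hypothetical helper, not a tantivy API.

```rust
use std::collections::BTreeSet;

/// Returns every term stored under `path`, whatever its typecode.
/// Because the path comes first, all of them share the prefix
/// `<path><\u001e>` and sit next to each other in sorted order.
fn terms_for_path<'a>(dict: &'a BTreeSet<Vec<u8>>, path: &[&str]) -> Vec<&'a [u8]> {
    // Build the common prefix: segments with trailing \0, then the
    // Record Separator.
    let mut prefix = Vec::new();
    for segment in path {
        prefix.extend_from_slice(segment.as_bytes());
        prefix.push(0x00);
    }
    prefix.push(0x1e);
    // Seek to the first term >= prefix, then read while the prefix matches.
    dict.range(prefix.clone()..)
        .take_while(|term| term.starts_with(&prefix))
        .map(|term| term.as_slice())
        .collect()
}
```

Had the typecode been placed before the path, the same lookup would need one seek per typecode.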

Ambiguity: Write Side

Number

Numbers are ambiguous in JSON.
We try to interpret them as u64, i64, f64 in that order.

This method is unfortunately ambiguous.
For instance, one document could happen to have a positive value for a given field, so that it gets mapped as a u64.
A second document could then have a negative value for the very same field, which gets mapped as an i64.

This is a pitfall we will have to live with.
On the search side, we use the same logic to map the user value to the number type and value.
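
The detection order can be sketched as below. This is a minimal illustration that works on the raw textual form of the number, which is roughly what the query side receives; the `Number` enum and `infer_number` are hypothetical names, and the actual code may rely on serde_json's own number type instead.

```rust
/// The three numeric interpretations, tried in this order.
#[derive(Debug, PartialEq)]
enum Number {
    U64(u64),
    I64(i64),
    F64(f64),
}

/// Tries u64, then i64, then f64. Returns None if nothing parses.
/// Hypothetical helper for illustration only.
fn infer_number(raw: &str) -> Option<Number> {
    if let Ok(value) = raw.parse::<u64>() {
        return Some(Number::U64(value));
    }
    if let Ok(value) = raw.parse::<i64>() {
        return Some(Number::I64(value));
    }
    raw.parse::<f64>().ok().map(Number::F64)
}
```

Under this sketch, `42` and `-42` end up with different types for the same field, which is exactly the pitfall described above.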

Date

Similarly to numbers, we cannot really know if a string is a date or not.

The presence of the date type implies that we do date detection.
Any value that matches a datetime pattern will be interpreted as a date.

Ambiguity: Read side

We apply the same logic on the query side.
Unfortunately, combined with tokenization, ambiguity can arise here as well.

Conflate

Optionally, a user can flag the field as conflated. In that case, in addition to the indexing described above, we also index all token values at the root.

Query parsing

Explicit matching

The JSON field itself has a name.

The user can query the fields explicitly as follows:

json.body:hello

To target a nested object, the user can use "." to extend the path.

json.attr.color:blue
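
As an illustration of the explicit syntax, the following sketch splits such a query into the json field name, the path segments and the value. `parse_json_query` is a hypothetical helper; the real query parser handles much more (quoting, phrases, operators).

```rust
/// Splits `json.attr.color:blue` into ("json", ["attr", "color"], "blue").
/// Returns None when there is no `:` or when a part is empty.
/// Hypothetical helper for illustration only.
fn parse_json_query(query: &str) -> Option<(&str, Vec<&str>, &str)> {
    let (full_path, value) = query.split_once(':')?;
    let mut segments = full_path.split('.');
    // The first segment is the name of the json field itself.
    let field = segments.next()?;
    if field.is_empty() || value.is_empty() {
        return None;
    }
    Some((field, segments.collect(), value))
}
```

The resulting segments can then be encoded with the same separators as on the write side.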

Default field

If the json field is defined as a default search field, then it behaves a tad differently from other default fields.

attr.color:blue is now a match.
blue is also a match.

Side effect

. becomes forbidden in a field name.
\u001e and \u0000 become forbidden in a field value.

Two schemas on the same field

We need to record both int and text terms in the same posting list.
They may have different recording options (we don't want positions for ints), which is not supported by tantivy at the moment.

On the index writer side,
we can have two posting writers, one for text and one for everything else.
The notion of posting writer is internal to the segment writer, so we can afford a little bit of complexity here.

On the segment serialization & read side, we just avoid recording and serializing positions for ints.

@fulmicoton
Collaborator Author

  • string type detector
  • number
  • indexing
  • query parser

fulmicoton added a commit that referenced this issue Feb 21, 2022
fulmicoton added a commit that referenced this issue Feb 22, 2022
fulmicoton added a commit that referenced this issue Feb 23, 2022
fulmicoton added a commit that referenced this issue Feb 24, 2022
fulmicoton added a commit that referenced this issue Feb 24, 2022
- Removed useless copy when ingesting JSON.
- Disabled range query on default fields

Closes #1251
fulmicoton added a commit that referenced this issue Feb 24, 2022
- Removed useless copy when ingesting JSON.
- Bugfix in phrase query with a missing field norms.
- Disabled range query on default fields

Closes #1251
@adityapandey9

Hi, @fulmicoton Range Query is not yet supported for the JSON field. It is only Term Query, right? Do you have any plans to support it?

@PSeitz
Contributor

PSeitz commented Feb 5, 2024

All queries should support JSON eventually. PRs welcome
