JSON Field #1251

fulmicoton · 2022-01-04T07:23:39Z

JSON field

This feature is motivated by the need for a dynamic schema in Quickwit.

A JSON field has a field value of type serde_json::Value.

Given a JSON object, we emit tokens of the form
<json path><\u001e><typecode><value>

The typecode is one of the following:

s: text
i: i64
u: u64
f: f64
d: Date

The JSON path requires two kinds of separators.

The JSON path is encoded as follows.

We use codepoint 0 as a segment separator.

This choice has the benefit of aligning the lexicographical order
with the depth-first search order.
Note that the path ends with a trailing \0 separator for the same reason.

We also need a separator between the field path and the value.
To do so, we use the Record Separator codepoint (30, i.e. \u001e).
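To illustrate the ordering argument, here is a minimal Python sketch. It is not tantivy's implementation: the helper names encode_path and term_prefix are hypothetical, and the sample keys are made up for the illustration.

```python
SEGMENT_SEP = b"\x00"  # codepoint 0, appended after every path segment
VALUE_SEP = b"\x1e"    # Record Separator (30), placed between path and value


def encode_path(segments):
    """Encode a list of JSON keys, with a trailing NUL after each key."""
    return b"".join(seg.encode("utf-8") + SEGMENT_SEP for seg in segments)


def term_prefix(segments):
    """Encoded path followed by the separator that introduces the value."""
    return encode_path(segments) + VALUE_SEP


# Hypothetical keys: "attr2" shares a textual prefix with "attr".
paths = [["attr"], ["attr", "color"], ["attr2"], ["attr", "size"]]

# Sorting the encoded bytes keeps "attr" and its children together,
# because NUL sorts before any other byte that could continue a key.
ordered = sorted(paths, key=encode_path)
```

Under these assumptions, ordered lists attr, then attr.color and attr.size, and only then attr2, which is the depth-first order.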

For instance, the following JSON

{
    "timestamp": 234243,
    "type": "start-proc",
    "body": "hello happy tax payer",
    "attr": {
        "color": "blue"
    }
}

generates the following FieldValues.

- timestamp<\u0000><\u001e>u<bigendian_for_234243>
- type<\u0000><\u001e>sstart
- type<\u0000><\u001e>sproc
- body<\u0000><\u001e>shello
- body<\u0000><\u001e>shappy
- body<\u0000><\u001e>stax
- body<\u0000><\u001e>spayer
- attr<\u0000>color<\u0000><\u001e>sblue
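The walk from a JSON object to its terms can be sketched as follows. This is an illustrative Python sketch, not tantivy code: emit_terms and tokenize are hypothetical names, the tokenizer (lowercase, split on non-word characters) is an assumption, and only strings, nested objects and non-negative integers are handled.

```python
import re


def tokenize(text):
    # Assumption: lowercase the text and split on non-word characters.
    return re.findall(r"\w+", text.lower())


def emit_terms(obj, path=b""):
    """Return one term per (path, typed token) found in a JSON object."""
    terms = []
    for key, value in obj.items():
        child = path + key.encode("utf-8") + b"\x00"
        if isinstance(value, dict):
            # Nested object: extend the path and recurse.
            terms.extend(emit_terms(value, child))
        elif isinstance(value, str):
            for token in tokenize(value):
                terms.append(child + b"\x1e" + b"s" + token.encode("utf-8"))
        elif isinstance(value, int) and not isinstance(value, bool) and value >= 0:
            # Non-negative integers are stored as 8 big-endian bytes.
            terms.append(child + b"\x1e" + b"u" + value.to_bytes(8, "big"))
        else:
            raise TypeError(f"value type not covered by this sketch: {value!r}")
    return terms


doc = {
    "timestamp": 234243,
    "type": "start-proc",
    "body": "hello happy tax payer",
    "attr": {"color": "blue"},
}
terms = emit_terms(doc)
```

With the example document, this produces eight terms: one for timestamp, two for type, four for body and one for attr.color.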

Important note!!!
It would have been tempting to put the type before the path, for instance:
sbody<\u0000><\u001e>tax

However, on the search side there is a benefit to being able to scan all types associated with a path in a single read.
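A small sketch of that single read, assuming the term dictionary can be modelled as a sorted Python list of byte strings. The function name scan_path and the sample terms are hypothetical.

```python
from bisect import bisect_left


def scan_path(sorted_terms, path_prefix):
    """Return the contiguous run of terms that start with path_prefix."""
    start = bisect_left(sorted_terms, path_prefix)
    hits = []
    for term in sorted_terms[start:]:
        if not term.startswith(path_prefix):
            break  # terms are sorted, so the run is over
        hits.append(term)
    return hits


# Hypothetical dictionary where "body" holds both a text and a u64 value.
terms = sorted([
    b"attr\x00color\x00\x1esblue",
    b"body\x00\x1eshello",
    b"body\x00\x1eu" + (7).to_bytes(8, "big"),
    b"type\x00\x1esstart",
])

prefix = b"body\x00\x1e"
hits = scan_path(terms, prefix)
# The type code is the first byte after the prefix.
typecodes = [term[len(prefix):len(prefix) + 1] for term in hits]
```

Because the type code comes after the path, both typed values of body sit next to each other. With the type first, they would be in two distant ranges of the dictionary.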

Ambiguity: Write Side

Number

Numbers are ambiguous in JSON.
We try to interpret them as u64, i64, f64 in that order.

This method is unfortunately ambiguous.
For instance, one document could happen to have a positive value for a given field, which gets mapped as a u64.
A second document could then have a negative value for the very same field, which gets mapped as an i64, so the same path ends up with two different type codes.

This is a pitfall we will have to live with.
On the search side, we use the same logic to map the user value to the number type and value.
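The inference order and its pitfall can be sketched in a few lines of Python. This is only an illustration of the rule stated above: infer_number is a hypothetical name, and Python's json module stands in for serde_json.

```python
import json

U64_MAX = 2**64 - 1
I64_MIN = -(2**63)


def infer_number(value):
    """Try u64, then i64, then f64, and return (typecode, value)."""
    if isinstance(value, bool):
        raise TypeError("booleans are not numbers")
    if isinstance(value, int):
        if 0 <= value <= U64_MAX:
            return ("u", value)
        if I64_MIN <= value < 0:
            return ("i", value)
    # Everything else, including out-of-range integers, falls back to f64.
    return ("f", float(value))


# The same field in three documents ends up with three different type codes.
first = infer_number(json.loads("5"))
second = infer_number(json.loads("-5"))
third = infer_number(json.loads("5.0"))
```

Since the query side applies the same function to the user value, a query for 5 only looks at the u term and would not see a document that stored 5.0.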

Date

Similarly to numbers, we cannot really know whether a string is a date or not.

The presence of the Date type implies that we do date detection.
Any value that matches a datetime pattern will be interpreted as a date.
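A minimal sketch of such a detection step. The issue does not say which datetime patterns are recognized, so the single UTC pattern below is an assumption, as is the name classify_string.

```python
from datetime import datetime, timezone

# Assumption: only this one UTC timestamp pattern counts as a date.
DATE_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def classify_string(value):
    """Return ("d", datetime) if the string parses as a date, else ("s", value)."""
    try:
        parsed = datetime.strptime(value, DATE_FORMAT)
    except ValueError:
        return ("s", value)
    return ("d", parsed.replace(tzinfo=timezone.utc))


as_date = classify_string("2022-01-04T07:23:39Z")
as_text = classify_string("start-proc")
```

Any string that happens to match the pattern is indexed as a date, even when the author of the document meant plain text.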

Ambiguity: Read side

We apply the same logic at query side.
Unfortunately, combined with tokenization, ambiguity can hit here.

Conflate

Optionally, a user can flag the field as conflated. In that case, in addition to the indexing described above, we also index all token values at the root.
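One possible reading of "at the root" is a term with an empty path. The sketch below follows that assumption, which the issue does not confirm. The name conflate is hypothetical, and JSON keys are assumed never to contain \u001e.

```python
def conflate(terms):
    """Return the original terms plus a copy of each one with an empty path."""
    out = list(terms)
    for term in terms:
        # Split at the first Record Separator: path on the left, typed value on the right.
        _path, sep, typed_value = term.partition(b"\x1e")
        out.append(sep + typed_value)
    return out


indexed = conflate([b"attr\x00color\x00\x1esblue"])
```

With this layout, a query for the bare token blue can be answered by looking up the root term, without knowing that the value lives under attr.color.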

Query parsing

Explicit matching

The JSON field itself has a name.

The user can query the fields explicitly as follows

json.body:hello

To target an inner struct object, the user can use the "." to extend the query.

json.attr.color:blue

Default field

If the json field is defined as a default search field, then it behaves a tad differently from other default fields.

The query attr.color:blue is now a match, without the json. prefix.
The bare query blue is also a match.
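The two query forms can be sketched with a small parser. This is not tantivy's query parser: parse_clause and its parameters are hypothetical, and the sketch assumes the JSON field is named json as in the examples above.

```python
def parse_clause(clause, json_field="json", is_default=False):
    """Split a query clause into (path segments, value)."""
    # Split on the first ":" so that values may themselves contain colons.
    head, sep, value = clause.partition(":")
    if not sep:
        if not is_default:
            raise ValueError("a bare value needs a default search field")
        return ([], head)  # no path: match the value anywhere
    segments = head.split(".")
    if segments[0] == json_field:
        return (segments[1:], value)  # explicit form: json.attr.color:blue
    if is_default:
        return (segments, value)      # default field: attr.color:blue
    raise ValueError(f"unknown field: {segments[0]}")


explicit = parse_clause("json.attr.color:blue")
implicit = parse_clause("attr.color:blue", is_default=True)
bare = parse_clause("blue", is_default=True)
```

The sketch also shows why "." has to be reserved: a key that contains a dot could not be told apart from a nested path.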

Side effect

"." becomes forbidden in a field name.
\u001e and \u0000 become forbidden in a field value.

Two schemas on the same field

We need to record both int and text terms in the same posting list.
They may have different recording options (we don't want positions for ints), which is not supported by tantivy at the moment.

On the index writer side, we can have two posting writers, one for text and one for everything else.
The notion of a posting writer is internal to the segment writer, so we can afford a little bit of complexity here.

On the segment serialization & read side, we just avoid recording and serializing positions for ints.
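A toy model of the two posting writers, to make the split concrete. The class and attribute names are hypothetical and the data layout is deliberately simplistic. It only shows positions being kept for text terms and dropped for everything else.

```python
from collections import defaultdict


class JsonPostingsWriter:
    """Toy model: text terms record positions, other terms only doc ids."""

    def __init__(self):
        self.text_postings = defaultdict(dict)   # term -> {doc_id: [positions]}
        self.other_postings = defaultdict(list)  # term -> [doc_id, ...]

    def record(self, doc_id, term, position=None):
        # The type code is the first byte after the Record Separator.
        _path, _sep, typed_value = term.partition(b"\x1e")
        if typed_value[:1] == b"s":
            self.text_postings[term].setdefault(doc_id, []).append(position)
        else:
            docs = self.other_postings[term]
            if not docs or docs[-1] != doc_id:
                docs.append(doc_id)


writer = JsonPostingsWriter()
text_term = b"body\x00\x1eshello"
int_term = b"count\x00\x1eu" + (1).to_bytes(8, "big")
writer.record(0, text_term, position=0)
writer.record(0, text_term, position=3)
writer.record(0, int_term)
```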


fulmicoton · 2022-01-05T02:38:13Z

string type detector
number
indexing
query parser


adityapandey9 · 2024-02-05T03:05:51Z

Hi, @fulmicoton Range Query is not yet supported for the JSON field. It is only Term Query, right? Do you have any plans to support it?

PSeitz · 2024-02-05T03:34:16Z

All queries should support JSON eventually. PRs welcome

fulmicoton added commits that referenced this issue, Feb 21-24, 2022 (all titled "Added JSON Type")

Feb 21, 2022: ec98aa3, a3142d9, b4c67b0, 307124d, f62a393, 5faa4c1
Feb 22, 2022: 80c9f2e, 0b5be9d, dc253cd
Feb 23, 2022: 66171c6, a0b14cf, b9ff6df, a6c7fac, 6165f5f, c1d4cc2
Feb 24, 2022: d489464, 1874da9, 90ffb6b, 5638ffc, 07a438d, d8f4260, 22443ba, d3a6d60, a3fd6bb, 24aa372

Every commit message ends with "Closes #1251". Starting with 5638ffc, the message also lists "Removed useless copy when ingesting JSON" and "Disabled range query on default fields". Starting with 22443ba, it adds "Bugfix in phrase query with a missing field norms".

fulmicoton closed this as completed in d7b46d2 Feb 24, 2022

aalexandrov mentioned this issue Aug 26, 2024

Support FuzzyTerm, Regex, PhrasePrefix queries on JSON values paradedb/paradedb#1553

Closed

2 tasks
