Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added datetime support #1396

Merged
merged 1 commit into from Jul 12, 2022
Merged

Added datetime support #1396

merged 1 commit into from Jul 12, 2022

Conversation

evanxg852000
Copy link
Contributor

@evanxg852000 evanxg852000 commented Jun 22, 2022

Implements the new DateTime field supporting feature in Tantivy.
Closes #1408

@evanxg852000
Copy link
Contributor Author

evanxg852000 commented Jun 23, 2022

Questions: @fulmicoton

@codecov-commenter
Copy link

codecov-commenter commented Jun 23, 2022

Codecov Report

Merging #1396 (e61a59f) into main (2406d92) will decrease coverage by 0.00%.
The diff coverage is 90.53%.

@@            Coverage Diff             @@
##             main    #1396      +/-   ##
==========================================
- Coverage   94.23%   94.22%   -0.01%     
==========================================
  Files         236      237       +1     
  Lines       43685    43939     +254     
==========================================
+ Hits        41165    41400     +235     
- Misses       2520     2539      +19     
Impacted Files Coverage Δ
fastfield_codecs/src/bitpacked.rs 100.00% <ø> (ø)
src/fastfield/multivalued/mod.rs 97.01% <ø> (ø)
src/query/more_like_this/more_like_this.rs 67.54% <0.00%> (ø)
src/schema/flags.rs 47.36% <0.00%> (-8.89%) ⬇️
src/schema/mod.rs 100.00% <ø> (ø)
src/lib.rs 96.48% <85.48%> (-1.14%) ⬇️
src/schema/field_type.rs 80.75% <90.00%> (-0.14%) ⬇️
src/fastfield/mod.rs 94.41% <90.90%> (-0.10%) ⬇️
src/schema/date_time_options.rs 91.33% <91.33%> (ø)
src/aggregation/intermediate_agg_result.rs 98.23% <100.00%> (+0.01%) ⬆️
... and 29 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2406d92...e61a59f. Read the comment docs.

@fulmicoton
Copy link
Collaborator

  • I will handle Date field deprecation in a following PR. The idea is to make use of the new DateTime field underneath with defaults that correspond to deprecating Date field. WDT?

Yes if it is possible that would be great. If it is not possible let me know.

If possible, I'd keep DateTime without timezone in tantivy and have a notion of timezone in quickwit... but I might not have thought this through so let me know what would be the obstacle. (Query parsing I assume?)

@evanxg852000
Copy link
Contributor Author

evanxg852000 commented Jun 24, 2022

If possible, I'd keep DateTime without timezone in tantivy and have a notion of timezone in quickwit... but I might not have thought this through so let me know what would be the obstacle. (Query parsing I assume?)

We can leave timezone thing for quickwit, this just means we will let the Tantivy users know that when a timezone is not specified in the input, we assume UTC. I think that's also good for a library

fn main() -> tantivy::Result<()> {
// # Defining the schema
let mut schema_builder = Schema::builder();
let mut date_formats = HashSet::new();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

surely this needs to be a ordered. HashSet is a very bad idea here.

@@ -58,13 +65,14 @@ pub enum Type {
Json = b'j',
}

const ALL_TYPES: [Type; 9] = [
const ALL_TYPES: [Type; 10] = [
Type::Str,
Type::U64,
Type::I64,
Type::F64,
Type::Bool,
Type::Date,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old Date type was actually a DateTime (yes the name sucks).

Is the plan to add DateTime and deprecate the older Date, or to replace right away?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So all these two types DateTime and PreciseDateTime exist because my plan is to keep Type::Date working with a deprecated label. If we can drop the old DateTime type and accept the breaking change let me know your opinion.


// Accepted input format for parsing input as DateTime.
#[serde(default = "default_input_formats")]
input_formats: HashSet<DateTimeFormat>,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This cannot be an HashSet. Order matters here.

}

impl fmt::Debug for PreciseDateTime {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {

OffsetDateTime::from_unix_timestamp_nanos(timestamp as i128)
}
}
.map_err(|error| error.to_string())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.map_err(|error| error.to_string())
.map_err(ToString::to_string)

}

/// Convert to UTC `OffsetDateTime`
pub fn into_utc(self) -> OffsetDateTime {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea

}

/// Returns the underlying timestamp.
pub fn get_timestamp(&self) -> i64 {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name does not make it clear that this is with the given precision...
get_truncated_timestamp get_inner_timestamp...

/// Converts to the underlying timestamp.
/// TODO: kept for backward compatibility, should we remove it?
pub const fn into_timestamp(self) -> i64 {
let Self { timestamp, .. } = self;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.timestamp would have worked too

}

impl DateTime {
/// Create new instance from UNIX timestamp.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Create new instance from UNIX timestamp.
/// Create new instance from UNIX timestamp (in seconds).

(I agree UNIX timestamp generally suggests seconds, but we might as well be explicit here)

input_formats: HashSet<DateTimeFormat>,

// Internal storage precision, used to avoid storing
// very large numbers when not needed. This optimizes compression.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it apply to

  • indexing
  • stored
  • fast
  • index sorting
    or only
  • fast
    or only
    -fast
    -indexing?

It is not an obvious question and it should be detailed here.


impl From<DateTimeOptions> for DateTimeParsersHolder {
fn from(opts: DateTimeOptions) -> Self {
let mut string_parsers: VecDeque<StringDateTimeParser> = VecDeque::new();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not Vec?

for (index, parser) in self.string_parsers.iter().enumerate() {
if let Ok(date_time) = parser(&value) {
// Move successful parser in front of queue.
self.string_parsers.swap(0, index);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see...
I think Vec has swap too.

I am not sure this is a good idea.

So with the HashSet defintion, the output of quickwit right now is not really deterministic at compile time.
Date format can be ambiguous ( think about MM-DD-YYY vs DD-MM-YYYY and 01/03/2021).
Which means that our output depends on the order of our date formats.

By using a HashSet, you make the output of quickwit dependent on the implementation of hashset.
(A change in the rust standard library fro instance, would break unit tests.)
We need to stay away from HashSet.

I would also stay away from the reordering logic, but I wonder...
So it has two functions I believe:

  • a) performance.
  • b) best effort to deal with the ambiguity described above.

a) can be partially addressed by extracting ez to check predicate for each format.
For instance on the date string length.

  • If the date is not exactly 10chrs. skip format XXXXX.

b) is tricky. With your solution we can "eventually" find the right format when we reach a date that is unambiguous (e.g. 24/02/2021). On the other hand, I think we should give all the tools to the user to tell use the right thing. I'd rather not do the swapping logic and just rely on the priority logic.

#[derive(Clone)]
pub struct DateTimeParsersHolder {
/// Functions for parsing string input as datetime.
string_parsers: VecDeque<StringDateTimeParser>,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Vec?

pub DateTime,
);

impl Deref for PreciseDateTime {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the inner DateTime is incorrect right? The precision here is arbitrary.
We don't want to expose it.

// It only exists to provide support the underlying fast
// field value for both `[Value::Date]` and `[Value::DateTime]` variants.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct PreciseDateTime(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename to TruncatedTimestamp maybe.

fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
f.debug_struct("DateTime")
.field("timestamp", &self.0.timestamp)
.field("precision", &self.0.precision)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the precision is wrong isn't it? Also we don't want to print DateTime here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Precision is always correct, whether you use DateTime::from_unix_timestamp(u64) or DateTime::from_utc_with_precision(utc_dt, precison) constructor. Maybe the names should be:

  • DateTime -> DateTimeWithSecondPrecision (old DateTime)
  • PreciseDateTime -> DateTimeWithCustomPrecision (for new fast field)

@fulmicoton
Copy link
Collaborator

I don't understand the logic around PreciseDateTime.

In the FastField we store truncated values I assume? (e.g. a milliseconds precision timestamp for instance).
Then I understand that you want a type on which you can implement FastValue. This means you create the type by taking the int without really knowing the precision. The schema is then needed to deal with this.
Then why do we expose the inner DateTime? That seems error-prone.

@evanxg852000
Copy link
Contributor Author

evanxg852000 commented Jun 28, 2022

So PreciseDateTime is the new date time with actual precision, currently, DateTime API was enhanced to conform to the old DateTime that's why I kept the old API and added few things. This allowed using DateTime like before and not breaking existing code. PreciseDateTime is the one that actually makes use of the new API. As I said in the comment it only exist because I could implement both fastfield types for the same DateTime. I don't want to rip something without us agreeing so DateTime behaves as the way it use to with a precison -> seconds and the timestamp is second. when using the DateTime field, precison can change to millis or ... .
We always store the computed timestamp value then when extracting the date the precision is used.
In both DateTime (seconds precision) and PreciseDateTime (any precision) the precison is always correct. I think I need to find a better name.
By exposing the inner DateTime, user only needs to worry about that type that already existed.

@fulmicoton
Copy link
Collaborator

In both DateTime (seconds precision) and PreciseDateTime (any precision) the precison is always correct.

@evanxg852000 I don't see how this is possible... In the FastValue of PreciseDateTime, from assumes that we have seconds in the fast field, and the conversion to a fastvalue is taking whatever truncated value we had.

@@ -325,7 +327,7 @@ impl<'a> JsonTermWriter<'a> {
self.close_path_and_set_type(T::to_type());
self.term_buffer
.as_mut()
.extend_from_slice(val.to_u64().to_be_bytes().as_slice());
.extend_from_slice(val.to_u64(None).to_be_bytes().as_slice());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fulmicoton I am not applying the precision here to DateTime. my reasoning is that this is all indexing though we are pushing fast values. Am I wrong?

Copy link
Collaborator

@fulmicoton fulmicoton Jul 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great question!

FastValue is a poorly named trait.
It is not only used for fast values, it is also used for the inverted index.

In reality it offers a strictly monotonic mapping from the type to U64 (and its reciprocal).

We could be named MonoMapU64 or something like that.
Can you fill a ticket to do this rename?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Back to your question... in the json_term_writer we should apply some truncation on dates.

I suggest we always truncate with second precision?

We should probably truncate to seconds always for regular (=not Json) date fields too; and only use the precision setting for fast fields.

Besides this is the current behavior of tantivy in a sense.

@evanxg852000 evanxg852000 changed the title Added datetime support [W.I.P] Added datetime support Jul 7, 2022
@evanxg852000 evanxg852000 marked this pull request as ready for review July 7, 2022 16:36
src/lib.rs Outdated
}

/// Convert to UNIX timestamp in microseconds.
pub const fn into_timestamp_nanos(self) -> i64 {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we should ditch the nanos one. It could suggest we have nanos precision.

src/lib.rs Outdated
/// Create new from UNIX timestamp in nanoseconds.
pub const fn from_timestamp_nanos(nanoseconds: i128) -> Self {
Self {
timestamp_micros: (nanoseconds as i64 / 1_000),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should divide into an i128 first and then convert to i64. See for yourself. Can you add the corresponding unit test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inlined it and added a test for it.

src/lib.rs Outdated
}

/// Create new from UNIX timestamp in nanoseconds.
pub const fn from_timestamp_nanos(nanoseconds: i128) -> Self {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove that one (and inline the implem in from_utc)

@@ -62,7 +63,9 @@ pub trait FastValue: Clone + Copy + Send + Sync + PartialOrd + 'static {
/// Converts a value to u64.
///
/// Internally all fast field values are encoded as u64.
fn to_u64(&self) -> u64;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand you use the field type for truncation on datetime, but it seems too convoluted. Let's not change the to_u64.

Instead, let's put the truncation at the fast field writer level.

@@ -305,7 +335,7 @@ impl IntFastFieldWriter {
pub fn add_document(&mut self, doc: &Document) {
match doc.get_first(self.field) {
Some(v) => {
self.add_val(super::value_to_u64(v));
self.add_val(super::value_to_u64(v, &self.field_type));
Copy link
Collaborator

@fulmicoton fulmicoton Jul 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can put the truncation logic here.

let mut val = 
    super::value_to_u64(v, self.field_type);
if let Some(precision) = self.precision_opt {
    val = precision.truncate(val);
}

What's a bit lame is that we will have to do it elsewhere for indexing.

@@ -248,7 +248,7 @@ impl SegmentWriter {
FieldType::Date(_) => {
for value in values {
let date_val = value.as_date().ok_or_else(make_schema_error)?;
term_buffer.set_u64(date_val.to_u64());
term_buffer.set_u64(date_val.to_u64(None));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See discussion in json_term_writer.
We want some truncation here.

I suggest we truncate to the second all of the time.


/// DateTime Precision
#[derive(Clone, Copy, Debug, Hash, PartialEq, Eq, PartialOrd, Ord, Serialize, Deserialize)]
pub enum DatePrecision {
Copy link
Collaborator

@fulmicoton fulmicoton Jul 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"precision": "Milliseconds"
does not look great in a JSON.

`"precision": "milliseconds" would look better IMHO

can you add attriubte to make the serde serialization look nicer?
(serde lower should do the job.)

@@ -72,7 +72,7 @@ impl HistogramComputer {
return;
}
let delta = value - self.min_value;
let delta_u64 = delta.to_u64();
let delta_u64 = delta.to_u64(None);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can remove that line anyway I think.

delta_u64 = delta

Copy link
Collaborator

@fulmicoton fulmicoton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comments inline

@evanxg852000 evanxg852000 force-pushed the support-datetime-field branch 2 times, most recently from 506197e to 22df8d6 Compare July 10, 2022 16:01

impl Default for DateOptions {
fn default() -> Self {
Self {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have clippy warnings here. Either use derive(Default) or add an annotation to make sure Clippy keeps quiet.
I'm ok with both solutions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

used Default, shorter & the recommended Clippy way

@fulmicoton
Copy link
Collaborator

Good job!

@fulmicoton fulmicoton merged commit a4be239 into main Jul 12, 2022
@fulmicoton fulmicoton deleted the support-datetime-field branch July 12, 2022 01:04
@twe4ked
Copy link

twe4ked commented Sep 17, 2022

Thanks for this! Any chance of a release with this included soon? 🙏

@PSeitz
Copy link
Contributor

PSeitz commented Sep 18, 2022

@twe4ked The next release will be a breaking release, so we try to add more stuff into it. Maybe a release in around 1 month

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add precision to datetime field and datetime field internally expressed in microsecs.
6 participants