add iterator over documents in docstore #1044

Merged May 18, 2021 (2 commits)

PSeitz (Contributor) commented May 17, 2021

When profiling, I saw that around 8% of the merge time was spent on lookups in the skip index. Since a merge reads the documents sequentially, we can replace the random access with an iterator over the documents.

Merge time on a sorted index, before/after:
24 s / 19 s

Merge time on an unsorted index, before/after:
15 s / 13.5 s

So we can expect 10-20% faster merges.
This iterator is also important if we add sorting based on a field in the documents.
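
To make the difference between the two access patterns concrete, here is a minimal sketch. It assumes a toy block-based store; `ToyDocStore`, `first_doc_ids`, `get`, and `iter` are hypothetical names for illustration and are not tantivy's actual types or API. Random access has to search the skip index on every call, while the iterator simply walks the blocks in storage order.

```rust
/// Toy stand-in for a block-based doc store (hypothetical, not tantivy's types).
/// Each block holds consecutive documents; `first_doc_ids[i]` is the id of the
/// first document in block `i` and plays the role of the skip index.
pub struct ToyDocStore {
    first_doc_ids: Vec<u32>,
    blocks: Vec<Vec<String>>,
}

impl ToyDocStore {
    pub fn new(blocks: Vec<Vec<String>>) -> Self {
        let mut first_doc_ids = Vec::with_capacity(blocks.len());
        let mut next_doc_id = 0u32;
        for block in &blocks {
            first_doc_ids.push(next_doc_id);
            next_doc_id += block.len() as u32;
        }
        ToyDocStore { first_doc_ids, blocks }
    }

    /// Random access: every call searches the skip index to find the block.
    pub fn get(&self, doc_id: u32) -> Option<&str> {
        // Number of blocks whose first doc id is <= doc_id; the last of
        // those is the block that may contain the document.
        let block_idx = self
            .first_doc_ids
            .partition_point(|&first| first <= doc_id)
            .checked_sub(1)?;
        let offset = (doc_id - self.first_doc_ids[block_idx]) as usize;
        self.blocks[block_idx].get(offset).map(String::as_str)
    }

    /// Sequential access: walks the blocks in storage order without
    /// consulting the skip index at all.
    pub fn iter(&self) -> impl Iterator<Item = &str> + '_ {
        self.blocks.iter().flatten().map(String::as_str)
    }
}
```

In this sketch, calling `get` for ids 0, 1, 2, ... yields the same documents as `iter`, but repeats the index search for each document; a real store would additionally avoid re-reading or decompressing a block per lookup, which this toy does not model.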

@PSeitz PSeitz requested a review from fulmicoton May 17, 2021 17:45
- let store_reader = &store_readers[reader_with_ordinal.ordinal as usize];
- let raw_doc = store_reader.get_raw(*old_doc_id)?;
+ let store_reader = &mut document_iterators[reader_with_ordinal.ordinal as usize];
+ let raw_doc = store_reader.next().expect(&format!(
Collaborator

can we return an error here? (the error message is great.)
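
One way to act on this suggestion is to convert the exhausted iterator into an error with `Option::ok_or_else` rather than panicking. The sketch below is only illustrative: `MergeError`, `next_raw_doc`, and the message text are hypothetical, and the project would use its own error type instead.

```rust
use std::fmt;

/// Hypothetical error type for this sketch; the real crate has its own.
#[derive(Debug, PartialEq)]
pub struct MergeError(pub String);

impl fmt::Display for MergeError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}

impl std::error::Error for MergeError {}

/// Pulls the next raw document from a per-segment iterator and reports
/// exhaustion as an error instead of panicking.
pub fn next_raw_doc<I>(
    docs: &mut I,
    segment_ord: usize,
    doc_id: u32,
) -> Result<Vec<u8>, MergeError>
where
    I: Iterator<Item = Vec<u8>>,
{
    // The message is only built when the iterator is actually exhausted.
    docs.next().ok_or_else(|| {
        MergeError(format!(
            "unexpected missing document in doc store: segment {}, doc id {}",
            segment_ord, doc_id
        ))
    })
}
```

The caller can then propagate the failure with `?`, keeping the descriptive message while leaving the decision about how to handle it to the merge code.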

@fulmicoton
Collaborator

a) this is great.
b) @ppodolsky will like that.

/// Iterator over all RawDocuments in the order in which they are stored in the doc store.
/// Use this if you want to extract all Documents from the doc store.
/// The delete_bitset has to be forwarded from the `SegmentReader`, or the results may be wrong.
pub fn iter_raw<'a: 'b, 'b>(
Collaborator

this should probably be pub(crate)?

let mut num_skipped = 0;
(0..last_docid)
.filter_map(move |doc_id| {
// filter_map is only used to resolve lifetime issues between the two closures on
Collaborator

I am incapable of reading this :)

But I think this is OK, considering it has no coupling outside of this function.
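
For readers who find the excerpt hard to follow, the underlying idea of walking doc ids in order and dropping deleted ones can be sketched in a simplified form. This is not the PR's implementation: `iter_alive` is a hypothetical helper, and a plain `&[bool]` stands in for the delete bitset that the real code receives from the `SegmentReader`.

```rust
/// Yields `(doc_id, document)` pairs in storage order, skipping documents
/// marked as deleted. `deleted[doc_id] == true` stands in for the delete
/// bitset (hypothetical representation for this sketch).
pub fn iter_alive<'a>(
    docs: &'a [String],
    deleted: &'a [bool],
) -> impl Iterator<Item = (u32, &'a str)> + 'a {
    docs.iter().enumerate().filter_map(move |(doc_id, doc)| {
        // Ids without an entry in `deleted` are treated as alive.
        let is_deleted = deleted.get(doc_id).copied().unwrap_or(false);
        if is_deleted {
            None
        } else {
            Some((doc_id as u32, doc.as_str()))
        }
    })
}
```

The sketch also shows why the bitset must match the segment: if it is missing or belongs to another segment, deleted documents would be yielded as if they were alive.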

@fulmicoton fulmicoton merged commit a400262 into quickwit-oss:main May 18, 2021