Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sort Index/Docids By Field #1026

Merged
merged 53 commits into from
May 17, 2021
Merged

Sort Index/Docids By Field #1026

merged 53 commits into from
May 17, 2021

Conversation

PSeitz
Copy link
Contributor

@PSeitz PSeitz commented Apr 29, 2021

closes #1014

add sort info to IndexSettings
generate docid mapping for sorted field (only fastfield)
remap singlevalue fastfield
move docid mapping to serialization step (less intermediate data for mapping)
add support for docid mapping in multivalue fastfield
add docid mapping for both directions old->new (used in postings) and new->old (used in fast field)
handle mapping in postings recorder
warn instead of info for MAX_TOKEN_LEN
handle index sort in docstore, by saving all the docs in a temp docstore file (SegmentComponent::TempStore). On serialization the docid mapping is used to create a docstore in the correct order by reader the old docstore.

add docstore sort tests
refactor tests
@PSeitz PSeitz requested a review from fulmicoton May 4, 2021 14:58
///
/// let mut schema_builder = Schema::builder();
/// let id_field = schema_builder.add_text_field("id", STRING);
/// let title_field = schema_builder.add_text_field("title", TEXT);
/// let body_field = schema_builder.add_text_field("body", TEXT);
/// let schema = schema_builder.build();
/// let settings = IndexSettings::default();
/// let settings = IndexSettings{sort_by_field: IndexSortByField{field:"title".to_string(), order:Order::Asc}};
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this example is not possible currently and should be changed (or implemented)

src/core/index_meta.rs Outdated Show resolved Hide resolved
src/core/segment.rs Outdated Show resolved Hide resolved
src/fastfield/writer.rs Outdated Show resolved Hide resolved
src/fastfield/writer.rs Outdated Show resolved Hide resolved
@@ -278,7 +279,8 @@ impl Recorder for TfAndPositionRecorder {
}
if let Some(doc_id_map) = doc_id_map {
// this simple variant to remap may consume to much memory
doc_id_and_positions.push((doc_id_map.get_new_doc_id(doc), buffer_positions.to_vec()));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remind me why lending is not an option for doc_id_and_positions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lending of buffer_positions? this is just a temp vec which gets cleared on every loop

src/indexer/merger.rs Outdated Show resolved Hide resolved
src/indexer/merger.rs Outdated Show resolved Hide resolved
fix deserialization
update changelog
forward error on merge failed
cache store readers, to utilize lru cache (4x faster performance, due to less decompress calls on the block)
unset flag on deserialization and after finalize of a segment
set flag when creating new instances
@PSeitz PSeitz changed the title Sort Index/Docids By Field (WIP) Sort Index/Docids By Field May 17, 2021
// not just stacked. The field serializer expects monotonically increasing
// docids, so we collect and sort them first, before writing.
//
// I think this is not strictly necessary, it would be possible to
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds very expensive. Did you test it on a dataset?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

merging was 2x slower with sorting on a dataset I tested, but I have a change in the pipeline, which will increase merge speed in both cases :).
Currently the overhead of the sorting is the same as in recorder.rs

Copy link
Contributor Author

@PSeitz PSeitz May 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0,9% of the merge time is spend in the vector push of the posting, in the webserver dataset (tokenized url field with positions)

let term_freq = u32_it.next().unwrap_or(self.current_tf);
serializer.write_doc(doc as u32, term_freq, &[][..]);
if let Some(doc_id_map) = doc_id_map {
let mut doc_id_and_tf = vec![];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why was lending not possible here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the bufferlender returns u32, but we need (u32, u32). I could extend buffer lender though.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. Or push them two by two.

) {
let (buffer_u8, buffer_positions) = buffer_lender.lend_all();
self.stack.read_to_end(heap, buffer_u8);
let mut u32_it = VInt32Reader::new(&buffer_u8[..]);
let mut doc_id_and_positions = vec![];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why was lending not possible here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is of type (u32, Vec), we can't lend the Vec, since it needs to be owned.

I should also note, that I couldn't find any recorder serialize methods in the perf profile.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sort DocId by key
2 participants