
Re-enable IVF random sampling #2225

Closed

Conversation

tfeher
Contributor

@tfeher tfeher commented Mar 14, 2024

Random sampling of the training set for IVF methods was reverted in #2144 due to the large memory usage of the subsample method.

PR #2155 implements a new random sampling method. Using that, we can now enable random sampling for the IVF methods (#2052 and #2077); therefore this PR reverts #2144 and adjusts the code to utilize the new sampling method.
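To illustrate why the new approach uses less memory: instead of materializing a permuted copy of the whole dataset, only a set of row indices is drawn and the selected rows are gathered. The following is a minimal Python/NumPy sketch of that idea; it is not the actual RAFT implementation (which lives in #2155 and is C++/CUDA), and the function name is hypothetical.

```python
import numpy as np

def subsample_training_set(dataset, n_train, seed=0):
    """Illustrative sketch: draw n_train row indices without replacement
    and gather only those rows, so the extra memory is O(n_train) rows
    rather than a full copy/permutation of the dataset."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(dataset.shape[0], size=n_train, replace=False)
    return dataset[idx]

# Example: keep 100 of 1000 rows as the training set.
data = np.arange(10_000, dtype=np.float32).reshape(1000, 10)
train = subsample_training_set(data, 100)
```

Sampling without replacement keeps the training set free of duplicate rows, which matters for k-means-style codebook training.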

@tfeher tfeher requested review from a team as code owners March 14, 2024 08:40
@tfeher tfeher added the improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change), and Vector Search labels and removed the cpp and python labels Mar 14, 2024
@tfeher tfeher self-assigned this Mar 14, 2024
@tfeher tfeher requested a review from achirkin March 14, 2024 08:42
@tfeher
Contributor Author

tfeher commented Mar 14, 2024

Note: this will only compile once #2155 is merged.

Contributor

@achirkin achirkin left a comment


LGTM. Love to see the IVF build code shrinking!

@github-actions github-actions bot removed the CMake label Mar 19, 2024
@tfeher tfeher changed the base branch from branch-24.04 to branch-24.06 March 21, 2024 23:37
* PER_CLUSTER. In both cases, we will use `pq_book_size * max_train_points_per_pq_code` training
* points to train each codebook.
*/
uint32_t max_train_points_per_pq_code = 256;
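A quick worked example of the cap described in the doc comment above. Assuming `pq_book_size` is the number of entries per codebook, i.e. `2**pq_bits` (an assumption for illustration; the helper below is hypothetical, not part of the API):

```python
def max_train_points(pq_bits=8, max_train_points_per_pq_code=256):
    """Training points used per codebook, per the doc comment:
    pq_book_size * max_train_points_per_pq_code."""
    pq_book_size = 2 ** pq_bits  # codes per codebook, assumed 2**pq_bits
    return pq_book_size * max_train_points_per_pq_code

# With the defaults (pq_bits=8), the cap is 256 * 256 = 65536 points.
```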
Member


Why 256 here? Have we tested this empirically across many datasets to verify this is a good default?

Contributor Author


The default value is inspired by FAISS, which also uses 256 as the default. We tested on DEEP-100M here: #2052 (comment). I will share results on other datasets.

Contributor Author

@tfeher tfeher left a comment


Closing this in favor of rapidsai/cuvs#122

@tfeher tfeher closed this May 15, 2024
Labels
cpp, improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change), python, Vector Search
3 participants