
Re-enable IVF random sampling #2225

Closed

Conversation

tfeher
Contributor

@tfeher tfeher commented Mar 14, 2024

Random sampling of the training set for IVF methods was reverted in #2144 due to the large memory usage of the subsample method.

PR #2155 implements a new random sampling method. Using that, we can now enable random sampling for the IVF methods (#2052 and #2077); therefore this PR reverts #2144 and adjusts the code to utilize the new sampling method.
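To illustrate why the new approach uses less memory: instead of materializing a permuted copy of the whole dataset, only a set of row indices is drawn and the selected rows are gathered. The following is a minimal Python/NumPy sketch of that idea; it is not the actual RAFT implementation (which lives in #2155 and is C++/CUDA), and the function name is hypothetical.

```python
import numpy as np

def subsample_training_set(dataset, n_train, seed=0):
    """Illustrative sketch: draw n_train row indices without replacement
    and gather only those rows, so the extra memory is O(n_train) rows
    rather than a full copy/permutation of the dataset."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(dataset.shape[0], size=n_train, replace=False)
    return dataset[idx]

# Example: keep 100 of 1000 rows as the training set.
data = np.arange(10_000, dtype=np.float32).reshape(1000, 10)
train = subsample_training_set(data, 100)
```

Sampling without replacement keeps the training set free of duplicate rows, which matters for k-means-style codebook training.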

@tfeher tfeher requested review from a team as code owners March 14, 2024 08:40
@tfeher tfeher added the improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change), and Vector Search labels and removed the cpp and python labels Mar 14, 2024
@tfeher tfeher self-assigned this Mar 14, 2024
@tfeher tfeher requested a review from achirkin March 14, 2024 08:42
@tfeher
Contributor Author

tfeher commented Mar 14, 2024

Note: this will only compile once #2155 is merged.

Contributor

@achirkin achirkin left a comment


LGTM. Love to see the IVF build code shrinking!

@github-actions github-actions bot removed the CMake label Mar 19, 2024
@tfeher tfeher changed the base branch from branch-24.04 to branch-24.06 March 21, 2024 23:37
* PER_CLUSTER. In both cases, we will use `pq_book_size * max_train_points_per_pq_code` training
* points to train each codebook.
*/
uint32_t max_train_points_per_pq_code = 256;
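A quick worked example of the cap described in the doc comment above. Assuming `pq_book_size` is the number of entries per codebook, i.e. `2**pq_bits` (an assumption for illustration; the helper below is hypothetical, not part of the API):

```python
def max_train_points(pq_bits=8, max_train_points_per_pq_code=256):
    """Training points used per codebook, per the doc comment:
    pq_book_size * max_train_points_per_pq_code."""
    pq_book_size = 2 ** pq_bits  # codes per codebook, assumed 2**pq_bits
    return pq_book_size * max_train_points_per_pq_code

# With the defaults (pq_bits=8), the cap is 256 * 256 = 65536 points.
```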
Member


Why 256 here? Have we tested this empirically across many datasets to verify this is a good default?

Contributor Author


The default value is inspired by FAISS, which also uses 256 as the default. We tested on DEEP-100M here: #2052 (comment). I will share results on other datasets.

Contributor Author

@tfeher tfeher left a comment


Closing this in favor of rapidsai/cuvs#122

@tfeher tfeher closed this May 15, 2024
Labels
cpp, improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change), python, Vector Search
3 participants