[Feature] Add vector similarity search support #16603

Open
AKlaus opened this issue Jun 6, 2023 · 3 comments

Comments

@AKlaus

AKlaus commented Jun 6, 2023

When storing feature vectors (numeric arrays, e.g. [2.01, 20.85, 14.05]) in the DB, I'd like to query for other records (with arrays of the same dimension) that are similar to the selected one(s), along with a calculated similarity score (e.g. so I could tell that array 1 is 80% similar to array 2).

The result set is expected to come from a nearest neighbour search (e.g. the k-NN algorithm with cosine similarity, which is the most popular choice, though there are many others).
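To make the request concrete, here is a minimal sketch of exact (brute-force) k-NN with cosine similarity in plain Python. It is not RavenDB code: the function names, the document IDs, and the sample vectors other than the one quoted above are assumptions made for illustration. Note that cosine similarity ranges from -1 to 1, so turning it into a percentage is a presentation choice, not part of the algorithm.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b, in the range [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def nearest_neighbours(
    query: Sequence[float],
    records: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Score every record against the query and return the k best matches.

    This scans all records, which is exactly the cost an index would avoid.
    """
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in records.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical records; only the first vector comes from the issue text.
records = {
    "items/1": [2.01, 20.85, 14.05],
    "items/2": [4.02, 41.70, 28.10],     # same direction as items/1
    "items/3": [-2.01, -20.85, -14.05],  # opposite direction
    "items/4": [20.0, 1.0, 0.5],
}

query = records["items/1"]
others = {doc_id: vec for doc_id, vec in records.items() if doc_id != "items/1"}
top = nearest_neighbours(query, others, k=2)
```

The full scan costs one dot product per stored record, which is fine for small collections but is the part that an approximate index is meant to replace.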

Current options in Raven

RavenDB already provides a Vector Index, which is mainly used for text similarity. There may be a way to extend it to handle numeric values, but nothing seems to be available out of the box.

Other DB solutions:

  • CosmosDB offers a Vector Index for calculating similarity scores. I like that it has options for the similarity metric (cosine, Euclidean distance), but unfortunately it has a very strict limit on the number of dimensions (2,000 max). BTW, CosmosDB also offers pgvector on Azure Cosmos DB for PostgreSQL, which can be used for the same purpose.
  • pgvector, a PostgreSQL extension with a .NET Lib, supporting up to 16K vector dimensions.
  • Milvus, a vector database written in Go, with a UI as a third-party project and a .NET Lib, supporting up to 32K vector dimensions.
  • Weaviate, a vector DB written in Go, with a GraphQL interface and a .NET Lib.
  • Qdrant (a vector DB written in Rust, 10K stars, has a paid cloud), no .NET Lib.
  • Vald (ANN algo only, 1K stars, Go), no .NET Lib.

@ayende
Member

ayende commented Jun 6, 2023

Hi,
We are gearing up for the 6.0 release, expected in about 2-3 months.
That means we'll only be able to look into this in the 2024 timeframe.

The vector option you are talking about there isn't used for this, I'm afraid.

For reference, you can use something like: https://github.com/nmslib/hnswlib

Or you can try using SimHash and then querying with suggestions on the expected value.

The key issue from our perspective is how to index that so we'll not need to scan through all the results.
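As an illustration of how a hash can stand in for a full scan, here is a rough sketch of the random-hyperplane variant of SimHash, which is designed for real-valued vectors and approximates cosine similarity. Everything in it is an assumption made for the example (the class and function names, the 16-bit signature, the fixed seed), and it is not how RavenDB indexes anything. Each bit records which side of a random hyperplane the vector falls on, so vectors pointing in similar directions tend to share most bits and can be looked up by signature.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def make_hyperplanes(dim: int, bits: int, seed: int = 42) -> List[List[float]]:
    """Draw `bits` random hyperplane normals with Gaussian components."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def simhash(vector: Sequence[float], hyperplanes: List[List[float]]) -> int:
    """Pack one bit per hyperplane: 1 if the dot product is non-negative, else 0."""
    signature = 0
    for plane in hyperplanes:
        dot = sum(x * p for x, p in zip(vector, plane))
        signature = (signature << 1) | (1 if dot >= 0 else 0)
    return signature


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")


class SimHashIndex:
    """Hypothetical in-memory index that buckets document IDs by signature."""

    def __init__(self, dim: int, bits: int = 16, seed: int = 42) -> None:
        self.bits = bits
        self.hyperplanes = make_hyperplanes(dim, bits, seed)
        self.buckets: Dict[int, List[str]] = defaultdict(list)

    def add(self, doc_id: str, vector: Sequence[float]) -> None:
        self.buckets[simhash(vector, self.hyperplanes)].append(doc_id)

    def candidates(self, vector: Sequence[float]) -> List[str]:
        """Return only the documents in the query's bucket, without scanning the rest."""
        return list(self.buckets.get(simhash(vector, self.hyperplanes), []))


index = SimHashIndex(dim=3)
base = [2.01, 20.85, 14.05]
index.add("items/1", base)

scaled = [2 * x for x in base]   # same direction, so the same signature
flipped = [-x for x in base]     # opposite direction, so the bits are inverted
```

An exact-bucket lookup like this misses neighbours whose signature differs by even one bit, so a practical design would likely probe nearby buckets or keep several tables with different hyperplanes, then re-rank the candidates with the exact similarity score.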

@AKlaus
Author

AKlaus commented Jun 7, 2023

Thank you for the transparency.

I agree that the index part is the trickiest (sure, it's based on calculating hashes of the vectors, but the devil is in the details). Then there's the vast field of possible algorithms for calculating the similarity scores. Predicting which approach will be the most widely used is hard, and I'm no expert, but at first glance the CosmosDB and pgvector approaches would satisfy the majority of cases.

Thanks for the link. nmslib seems to be good for approximate search methods (I haven't used it, though).
If I were to use Python, I'd go with the sklearn.neighbors package, which comes with the kNN algorithm out of the box and supports aNN (approximate nearest neighbours) via an nmslib plugin.

As for SimHash (also MinHash and other hash-based algorithms), my gut feeling is they require significant work in tuning the hash function (introduce weights into the computation, etc.) to handle my feature vectors. Nonetheless, they aren't off the table.

@VenkateshSrini

@ayende,
With generative AI everywhere, chat is becoming one of the main user interaction points for applications, and that is where this feature is needed. Mongo is building it using Atlas, but I'm facing lots of hurdles just trying to test it. I think RavenDB would hit the sweet spot if it could ship this and make developers' lives easier.
