State-of-the-art pretrained model for sentence similarity/clustering? #2600

Open
raphael-milliere opened this issue Apr 17, 2024 · 1 comment

Comments

@raphael-milliere

I've been looking for up-to-date information about how various pre-trained models fare for sentence similarity and clustering tasks (e.g. with BERTopic), rather than semantic search.

According to the official pre-trained model evaluations, all-mpnet-base-v2 is best overall, while sentence-t5-xxl is best for sentence similarity. However, both of these models are quite old. Surely there are better pre-trained models available for sentence similarity and/or clustering?

Looking at the MTEB leaderboard, mxbai-embed-large-v1 appears to be the leading open weights model currently. Should I expect this model to be superior to all-mpnet-base-v2 or sentence-t5-xxl, given that I'm not constrained by compute?

Thanks in advance!

@tomaarsen
Collaborator

tomaarsen commented Apr 17, 2024

Hello!

Although the original sentence-transformers models like all-mpnet-base-v2 hold up quite well, recent community models like mxbai-embed-large-v1 should indeed outperform them. You can check the sentence similarity and clustering scores on MTEB (probably filtering out models with more than 1B parameters), and you'll get a good idea of what should work well.

You're on the right track :)

Oh, one last note: if you have some evals/tests ready for BERTopic, then you can always experiment with a few different models. They're mostly fairly small and efficient, so it should be quite simple to try out a few to get a feel for them. No leaderboard will ever beat running models on your own data.

  • Tom Aarsen
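
The suggestion to try a few models on your own data could be sketched roughly as follows. This is a minimal illustration, not something from the thread: the helper names (`cosine`, `spearman`, `compare_encoders`) are hypothetical, and scoring by Spearman correlation between cosine similarity and human-labelled similarity is an assumption (it is a common choice for sentence similarity, but a clustering metric may suit BERTopic better). Each encoder is assumed to be any callable that maps a list of sentences to a list of vectors.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]
Encoder = Callable[[List[str]], Sequence[Vector]]  # assumed interface


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return float(dot / (norm_a * norm_b)) if norm_a and norm_b else 0.0


def _ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman rank correlation (Pearson correlation of the ranks)."""
    rx, ry = _ranks(xs), _ranks(ys)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y) if var_x and var_y else 0.0


def compare_encoders(
    encoders: Dict[str, Encoder],
    pairs: List[Tuple[str, str]],
    gold: Sequence[float],
) -> Dict[str, float]:
    """Score each encoder by how well its cosine similarities track `gold`."""
    left = [a for a, _ in pairs]
    right = [b for _, b in pairs]
    scores = {}
    for name, encode in encoders.items():
        sims = [cosine(u, v) for u, v in zip(encode(left), encode(right))]
        scores[name] = spearman(sims, gold)
    return scores
```

To compare real models, the `encode` method of a loaded `SentenceTransformer` (for example one loaded with the name `all-mpnet-base-v2`) should fit the assumed encoder interface, with `pairs` and `gold` drawn from a small hand-labelled sample of your own sentences.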
