State-of-the-art pretrained model for sentence similarity/clustering? #2600

raphael-milliere · 2024-04-17T15:48:15Z

I've been looking for up-to-date information about how various pre-trained models fare for sentence similarity and clustering tasks (e.g. with BERTopic), rather than semantic search.

According to the official pre-trained model evaluations, all-mpnet-base-v2 is best overall, while sentence-t5-xxl is best for sentence similarity. However, both of these models are quite old. Surely there are better pre-trained models available for sentence similarity and/or clustering?

Looking at the MTEB leaderboard, mxbai-embed-large-v1 currently appears to be the leading open-weights model. Should I expect it to outperform all-mpnet-base-v2 or sentence-t5-xxl, given that I'm not constrained by compute?

Thanks in advance!

tomaarsen · 2024-04-17T16:26:09Z

Hello!

Although the original sentence-transformers models like all-mpnet-base-v2 hold up quite well, recent community models like mxbai-embed-large-v1 should indeed outperform them. You can look at the Sentence Similarity and Clustering scores on MTEB (probably filtering out models above 1B parameters), and you'll get a good idea of what should work well.

You're on the right track :)

Oh, one last note: if you have some evals/tests ready for BERTopic, then you can always experiment with a few different models. They're mostly fairly small and efficient, so it should be quite simple to try out a few to get a feel for them. No leaderboard will ever beat running models on your own data.
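As a rough sketch of what such an experiment could look like, the snippet below ranks a few models on a small hand-labeled set of sentences. The metric (`separation_gap`: mean same-label cosine similarity minus mean cross-label similarity) and the helper names are assumptions made up for illustration, not part of sentence-transformers or MTEB. The Hub IDs in the usage comment are assumed to be the usual ones for the models mentioned above.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def separation_gap(embeddings, labels):
    """Mean same-label similarity minus mean cross-label similarity.

    Higher is better. Illustrative metric only (an assumption, not an MTEB score).
    Needs at least one same-label pair and one cross-label pair.
    """
    same, cross = [], []
    for i, j in combinations(range(len(labels)), 2):
        sim = cosine(embeddings[i], embeddings[j])
        (same if labels[i] == labels[j] else cross).append(sim)
    return mean(same) - mean(cross)


def load_encoder(model_name):
    """Return a function that maps a list of sentences to a list of vectors."""
    # Imported lazily so the scoring helpers above work without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return lambda sentences: model.encode(sentences).tolist()


def compare_models(model_names, sentences, labels, load=load_encoder):
    """Score every model on the same labeled sentences and return (name, score), best first."""
    scores = {
        name: separation_gap(load(name)(sentences), labels)
        for name in model_names
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Example usage (downloads the models on first run; replace the data with your own):
# sentences = ["The cat sat on the mat.", "A kitten rests on a rug.",
#              "Stocks fell sharply today.", "Markets dropped this afternoon."]
# labels = ["pets", "pets", "finance", "finance"]
# for name, score in compare_models(
#     ["sentence-transformers/all-mpnet-base-v2", "mixedbread-ai/mxbai-embed-large-v1"],
#     sentences, labels,
# ):
#     print(f"{name}: {score:.3f}")
```

With only a handful of sentences the ranking will be noisy, so treat this as a sanity check. If you already have BERTopic evals, it is probably better to plug each model into those instead.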

Tom Aarsen
