I've been looking for up-to-date information about how various pre-trained models fare for sentence similarity and clustering tasks (e.g. with BERTopic), rather than semantic search.
According to the official pre-trained model evaluations, all-mpnet-base-v2 is best overall, while sentence-t5-xxl is best for sentence similarity. However, both of these models are quite old. Surely there are better pre-trained models available for sentence similarity and/or clustering?
Looking at the MTEB leaderboard, mxbai-embed-large-v1 currently appears to be the leading open-weights model. Should I expect this model to be superior to all-mpnet-base-v2 or sentence-t5-xxl, given that I'm not constrained by compute?
Thanks in advance!
Although the original sentence-transformers models like all-mpnet-base-v2 hold up quite well, recent community models like mxbai-embed-large-v1 should indeed outperform them. You can sort the MTEB leaderboard by the sentence similarity (STS) and Clustering tasks (and probably filter out models with more than 1B parameters), and you'll get a good idea of what should work well.
You're on the right track :)
Oh, one last note: if you have some evals/tests ready for BERTopic, then you can always experiment with a few different models. They're mostly fairly small and efficient, so it should be quite simple to try out a few to get a feel for them. No leaderboard will ever beat running models on your own data.