most_similar() return the k most similar vectors #4364

bintay · 2019-10-02T22:28:41Z

Description

Modified the most_similar method of Vectors so that it returns the top n most similar vectors rather than the single most similar vector.

The new value returned looks like (keys, best_rows, scores) where keys, best_rows, and scores are arrays containing an array for each query. So scores[0] would be an array of scores corresponding to the n vectors most similar to the 0th query.

In theory, performance should not be impacted much: argmax is replaced by argpartition, and both run in linear time with respect to the number of elements. If the user chooses to sort the n most similar entries by setting the sort parameter to true, that adds an O(n log n) step over those n entries. However, I haven't tested the performance in practice yet and I might be missing something.
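For illustration, here is a minimal NumPy sketch of the argpartition approach described above. It is not the code from this PR: the helper name `top_n_similar`, the toy arrays, and the choice to return only `best_rows` and `scores` (leaving out `keys`) are assumptions made for the example.

```python
import numpy as np


def top_n_similar(queries, data, n=1, sort=True):
    """Hypothetical helper (not spaCy's implementation): top-n rows by cosine similarity."""
    # Normalise rows so that a dot product equals cosine similarity.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = data / np.linalg.norm(data, axis=1, keepdims=True)
    sims = q @ d.T  # shape: (n_queries, n_rows)

    # argpartition moves the n largest values of each row into the last
    # n columns in linear time, but leaves them in no particular order.
    best_rows = np.argpartition(sims, -n, axis=1)[:, -n:]
    scores = np.take_along_axis(sims, best_rows, axis=1)

    if sort:
        # Optional extra step: order the n candidates from most to least similar.
        order = np.argsort(-scores, axis=1)
        best_rows = np.take_along_axis(best_rows, order, axis=1)
        scores = np.take_along_axis(scores, order, axis=1)
    return best_rows, scores


# Toy data, chosen only for this example.
data = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
queries = np.array([[1.0, 0.1], [0.0, 2.0]])
best_rows, scores = top_n_similar(queries, data, n=2)
```

Both returned arrays are 2d with one row per query, so `scores[0]` holds the scores of the n rows most similar to the 0th query, matching the shape described above.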

This could break existing code, since the return type changes from a tuple of three 1d arrays to a tuple of three 2d arrays. We could have it use the old method when n = 1 and the new method otherwise, but then it could be confusing to use and the documentation could get messy. We could also keep the old method and add a new n_most_similar method.

This is my first time contributing to an open source project, so let me know your thoughts!

Types of change

Enhancement (#3697)

Checklist

I have submitted the spaCy Contributor Agreement.
I ran the tests, and all new and existing tests passed.
My changes don't require a change to the documentation, or if they do, I've added all required information.

ines · 2019-10-02T23:10:49Z

Thanks so much 🙏 This is really good timing btw, because I was just working on a refactor of sense2vec, where I want to use that method properly.

Also, I think that one test is failing because Vocab.prune_vectors calls into Vectors.most_similar, so we just need to adjust the method call there.

bintay · 2019-10-03T01:55:02Z

Yeah it's because scores[i], etc are arrays now but it was expecting a number. Changing them to scores[i][0] fixed that error.

Now the test is actually failing: neighbour is "dog" instead of "cat". Because all 3 of the vectors in the test are just scaled versions of each other, their cosine similarity is 1, so either should be valid. I guess argmax just happened to return "cat" whereas argpartition returns "dog".
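A small standard-library sketch of why the tie happens. The vector values below are made up for illustration and are not the ones from the test; the only property that matters is that each is a scaled copy of the same direction.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical vectors: all three point in the same direction.
vectors = {
    "cat": [1.0, 1.0, 1.0],
    "dog": [2.0, 2.0, 2.0],
    "kitten": [3.0, 3.0, 3.0],
}

query = vectors["kitten"]
sims = {key: cosine(query, vec) for key, vec in vectors.items() if key != "kitten"}

# Both candidates score (up to rounding) exactly 1, so which one is
# reported as "most similar" depends only on how ties are broken.
tied = math.isclose(sims["cat"], sims["dog"])
```

Since scaling a vector does not change its direction, any tie-breaking rule is equally valid here, which is why a test that expects one particular neighbour is fragile.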

I updated the test with more "realistic" vectors for now. Let me know if we actually want to test the case where all the vectors point in the same direction.

honnibal · 2019-10-03T12:09:40Z

Thanks! Yeah I think the change to the test is good, that makes sense. Merging 🎉

bintay added 5 commits October 2, 2019 16:47

most_similar return n-most similar vectors

718dee1

updated most_similar comment

f86a587

add bintay contributor agreement

57268f7

sign bintay contributor agreement

212aa46

fix most_similar documentation typo

6d9add2

ines added enhancement feat / vectors labels Oct 2, 2019

bintay added 2 commits October 2, 2019 20:34

fixed error in prune_vectors

2bd3882

updated prune_vectors test

17a217e

honnibal merged commit 1db79a3 into explosion:master Oct 3, 2019
