Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

most_similar() return the k most similar vectors #4364

Merged
merged 7 commits into from Oct 3, 2019
Merged

most_similar() return the k most similar vectors #4364

merged 7 commits into from Oct 3, 2019

Conversation

bintay
Copy link
Contributor

@bintay bintay commented Oct 2, 2019

Description

Modified the most_similar method of Vectors so that it returns the top n most similar vectors rather than the single most similar vector.

The new value returned looks like (keys, best_rows, scores) where keys, best_rows, and scores are arrays containing an array for each query. So scores[0] would be an array of scores corresponding to the n vectors most similar to the 0th query.

In theory, performance should not be impacted much — argmax is replaced by argpartition, both of which run in linear time with respect to the number of elements. If the user chooses to sort the n most similar entries by setting the sort parameter to true, it will add an O(n lg n) step to sort the entries. However, I haven't tested the performance in practice yet and I might be missing something.

This could break existing code since the return type is changed from a tuple of three 1d arrays to a tuple of three 2d arrays. We could have it use the old method when n = 1 and the new method otherwise, but then it could be confusing to use & documentation could get messy. Could also keep the old method and add a new n_most_similar method.

This is my first time contributing to an open source project, so let me know your thoughts!

Types of change

Enhancement (#3697)

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@ines ines added enhancement Feature requests and improvements feat / vectors Feature: Word vectors and similarity labels Oct 2, 2019
@ines
Copy link
Member

ines commented Oct 2, 2019

Thank so much 🙏 This is really good timing btw, because I was just working on a refactor of sense2vec, where I want to use that method properly.

Also, I think that one test is failing because Vocab.prune_vectors calls into Vectors.most_similar, so we just need to adjust the method call there.

@bintay
Copy link
Contributor Author

bintay commented Oct 3, 2019

Yeah it's because scores[i], etc are arrays now but it was expecting a number. Changing them to scores[i][0] fixed that error.

Now the test is actually failing — neighbour is "dog" instead of "cat". Because all 3 of the vectors in the test are just scaled versions of each other, their cosine similarity is 1 so either should be valid. I guess argmax just happened to return "cat" whereas argpartition returns "dog".

I updated the test with more "realistic" vectors for now. Let me know if we actually want to test the case where all the vectors point in the same direction.

@honnibal
Copy link
Member

honnibal commented Oct 3, 2019

Thanks! Yeah I think the change to the test is good, that makes sense. Merging 🎉

@honnibal honnibal merged commit 1db79a3 into explosion:master Oct 3, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Feature requests and improvements feat / vectors Feature: Word vectors and similarity
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants