Fuzzy lexical matching backend #629

Open
osma opened this issue Oct 4, 2022 · 0 comments

osma commented Oct 4, 2022

The MLLM lexical backend (as well as STWFSA) tries to match subject labels to document text, but the matching is quite strict. I think it could help in some cases to be able to perform fuzzy matching as well, for example matching subject labels even when there are small differences in spelling (e.g. color vs colour, or Chehov vs Chekhov).

This could either be its own backend (maybe called "flm", for fuzzy lexical matching?), or perhaps just an option in the MLLM backend that lets the user choose between traditional crisp matching and fuzzy matching. When finding fuzzy matches, the match similarity could be included as one of the features used for candidate selection.
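To make the idea concrete, here is a rough sketch of what a crisp/fuzzy switch could look like, with the similarity score returned so it could serve as a candidate feature. It is not based on the actual MLLM code: the function names, the `fuzzy` and `threshold` parameters, and the whitespace tokenization are all assumptions made for illustration, and it uses `difflib` from the standard library as a stand-in for a real fuzzy matching library.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1] for two strings (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_labels(text, labels, fuzzy=False, threshold=0.85):
    """Match subject labels against token n-grams of the text.

    Hypothetical helper: returns a dict mapping each matched label to its
    best similarity score. With fuzzy=False only exact (case-insensitive)
    matches are kept; with fuzzy=True anything at or above threshold counts.
    """
    tokens = text.split()
    min_score = threshold if fuzzy else 1.0
    matches = {}
    for label in labels:
        n = len(label.split())
        # Slide a window of n tokens over the text
        for i in range(len(tokens) - n + 1):
            candidate = " ".join(tokens[i:i + n]).strip(".,;:!?")
            score = similarity(candidate, label)
            if score >= min_score:
                matches[label] = max(score, matches.get(label, 0.0))
    return matches


text = "The colour of the stage design in plays by Chehov."
labels = ["color", "Chekhov", "stage design"]
crisp = match_labels(text, labels)              # only "stage design"
fuzzy = match_labels(text, labels, fuzzy=True)  # all three labels
```

In this toy example, crisp matching finds only "stage design", while fuzzy matching also picks up "color" and "Chekhov", each with a score below 1.0 that a downstream classifier could use as a feature.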

One question is how to implement the matching efficiently. There are libraries like TheFuzz (formerly known as FuzzyWuzzy) and fuzzysearch which could perhaps be used. The most promising one I found is RapidFuzz, which seems to be under very active development, promises to be very fast, and is MIT licensed. This could be an ideal library for the purpose. However, it relies on C++ code, so we would have to consider making this an optional feature instead of a core dependency.
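One way the optional dependency could be handled, sketched below under assumptions: RapidFuzz is used when it is installed, and otherwise the code falls back to the much slower pure-Python `difflib`. The `label_similarity` function and the `HAVE_RAPIDFUZZ` flag are hypothetical names; `rapidfuzz.fuzz.ratio` is the library's normalized similarity function, which returns a score between 0 and 100.

```python
from difflib import SequenceMatcher

try:
    # Optional, C++-backed dependency; assumed to be installed separately
    from rapidfuzz import fuzz
    HAVE_RAPIDFUZZ = True
except ImportError:
    HAVE_RAPIDFUZZ = False


def label_similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1], using RapidFuzz when available."""
    if HAVE_RAPIDFUZZ:
        # fuzz.ratio returns a score between 0 and 100
        return fuzz.ratio(a, b) / 100.0
    # Pure-Python fallback from the standard library (slower)
    return SequenceMatcher(None, a, b).ratio()


score_close = label_similarity("chehov", "chekhov")  # small spelling difference
score_far = label_similarity("chehov", "colour")     # unrelated strings
```

The two code paths are not guaranteed to give identical scores for every input, so any threshold would need to be tuned against whichever implementation is actually in use.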

Naturally, some benchmarking would be needed to find out whether this is actually a good idea at all. It's also possible that fuzzy matching doesn't give any benefit over the current matching.
