
BERT Backend #625

Open
lunactic opened this issue Sep 26, 2022 · 2 comments

Comments

lunactic commented Sep 26, 2022

Hello

I am currently working at the Swiss National Library, experimenting with Annif for the automatic generation of Dewey numbers.
In that process I started experimenting with BERT approaches as explained here: https://www.sbert.net/examples/applications/semantic-search/README.html#semantic-search

First tests indicate that this approach could work very well. Would this be interesting to the whole Annif community? If yes, I could check whether I can find the time to create a PR that implements this as a backend for Annif.

The approach I would follow is to create the embeddings for the training corpus when annif train is used and store them as a pickle file for later use.
The methodology would also allow for "retraining", meaning embeddings of new documents could simply be appended to the existing training corpus.
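
A minimal sketch of the idea, assuming the sentence-transformers semantic-search utilities; the model name, file paths and corpus texts below are placeholders, not Annif's backend API:

```python
# Hypothetical sketch of the proposed workflow (not Annif's actual backend code).
import pickle

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model, assumption only

# "train": embed the training documents once and persist the embeddings.
corpus_texts = ["text of training document 1 ...", "text of training document 2 ..."]
corpus_embeddings = model.encode(corpus_texts, convert_to_tensor=True)
with open("corpus_embeddings.pkl", "wb") as f:
    pickle.dump(corpus_embeddings, f)

# "suggest": embed the incoming text and return its nearest training documents.
with open("corpus_embeddings.pkl", "rb") as f:
    corpus_embeddings = pickle.load(f)
query_embedding = model.encode("text of a new document", convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=5)[0]
for hit in hits:
    print(hit["corpus_id"], hit["score"])
```

Retraining would then simply mean encoding the new documents and appending their embeddings to the pickled corpus.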


osma commented Sep 26, 2022

Hello @lunactic, thank you for the suggestion!

There is already some work being done to integrate Annif with language models, mainly by integrating the XTransformer model from PECOS in PR #540 by @mo-fu. But what you propose seems somewhat different.

The idea of semantic search is not new; it is actually already implemented in the simplest Annif backend, tfidf. Of course it doesn't use a language model: it just converts the text from the training documents (aggregated by subject, so e.g. all text related to the subject "cars" is concatenated into a single virtual "document" representing that subject) into tf-idf vector space, and at suggest time the input is also converted into a similar vector and the nearest neighbors (subjects) are returned. Conceptually, what you propose seems similar, except that instead of simple tf-idf vectors you would use embeddings from BERT or some other language model.
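
As a conceptual illustration only (using scikit-learn and made-up subject texts; this is not Annif's actual tfidf implementation), the idea looks roughly like this:

```python
# Sketch of tf-idf nearest-neighbor subject suggestion; subject texts are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# One virtual "document" per subject, aggregating all training text for that subject.
subjects = ["cars", "bicycles"]
subject_texts = [
    "cars engines motor vehicles traffic driving fuel",
    "bicycles cycling pedals wheels helmets commuting",
]

vectorizer = TfidfVectorizer()
subject_vectors = vectorizer.fit_transform(subject_texts)

# At suggest time: vectorize the input text and rank subjects by similarity.
query_vector = vectorizer.transform(["a document about traffic and motor vehicles"])
scores = cosine_similarity(query_vector, subject_vectors)[0]
for subject, score in sorted(zip(subjects, scores), key=lambda pair: -pair[1]):
    print(subject, score)
```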

Do you have any idea how accurate this kind of model could be, for example for Dewey classification? Did you compare it with other approaches? I've had quite good results on DDC classification with SVC and Omikuji Bonsai, which both achieve pretty similar accuracies. If your approach (which would undoubtedly be far more resource-intensive) turned out to be more accurate than this baseline, that would be interesting and would support the idea of integrating it with Annif.

As I understand it, XTransformer is specifically tailored for extreme multi-label classification problems, which are typically very challenging because of large vocabularies (many classes/labels), big training corpora with skewed distributions, etc. You may want to look at that as well: the PR is already usable, and documentation for how to use it can be found in the comments on GitHub.


mo-fu commented Oct 7, 2022

Just wanted to add some reading material for semantic search on dense word vectors:

As mentioned by @osma, this does not yet handle the label distribution issue of extreme multi-label (XML) problems, but it can probably be combined with the clustering techniques in Parabel/Bonsai. The Omikuji library even has the option to only learn the label tree.
