Use subsets of vocabularies in ensembles #596

Open
nwagner84 opened this issue Jul 15, 2022 · 5 comments

@nwagner84

I have a question about the usage of vocabularies in an ensemble. Given a vocabulary V which is used in an ensemble, and two vocabularies V1 and V2 which are used by different backends of the ensemble (e.g. omikuji and mllm): V1 and V2 are subsets of V, each with a different subset of the concepts and a different set of labels (gold standard (TSV) and TTL). The concept URIs are stable across all versions of V.

  1. Is it possible to use such a setup in order to tweak the backends used in the ensemble?
  2. Does the vocabulary V act as an allow-list? Are the suggestions of the backends (e.g. omikuji, mllm) tested against the vocabulary V?

Background: We want to aggressively tweak the vocabulary (reduce concepts and manipulate labels) for the mllm backend in order to improve its results.

@juhoinkinen
Member

I'll try to give an answer, although I'm not very sure about the details right now.

  1. Is it possible to use such a setup in order to tweak the backends used in the ensemble?

It is possible to use such a setup for an ensemble (but not for a neural-network ensemble). I cannot say anything about the benefits and complications of this.

  2. Does the vocabulary V act as an allow-list? Are the suggestions of the backends (e.g. omikuji, mllm) tested against the vocabulary V?

Yes, the vocabulary loaded into the ensemble project acts as an allow-list. A warning is shown every time a source project feeds an unknown URI to the ensemble as a suggestion.
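
To make the allow-list behaviour concrete, here is a minimal sketch in plain Python. It is not Annif's actual code: the function name `filter_by_allow_list`, the logger name and the example URIs are all hypothetical, and it assumes that an unknown URI is dropped after the warning is logged.

```python
import logging

# Hypothetical logger name, used only for this illustration.
logger = logging.getLogger("ensemble_sketch")


def filter_by_allow_list(suggestions, allowed_uris):
    """Keep only (uri, score) pairs whose URI is in the ensemble vocabulary.

    Assumption: unknown URIs are dropped after a warning has been logged.
    """
    kept = []
    for uri, score in suggestions:
        if uri in allowed_uris:
            kept.append((uri, score))
        else:
            logger.warning("Unknown subject URI <%s>, dropping suggestion", uri)
    return kept


# Vocabulary V of the ensemble project (made-up URIs).
vocab_v = {"http://example.org/c1", "http://example.org/c2"}

# Suggestions coming from a source project; c9 is not part of V.
from_backend = [("http://example.org/c1", 0.9), ("http://example.org/c9", 0.4)]

filtered = filter_by_allow_list(from_backend, vocab_v)
print(filtered)
```

In this sketch only the suggestion for `c1` survives, and one warning is logged for `c9`.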

I would be quite careful when implementing such a tweaked setup right now, because there is a (small) chance that it would not work in the future if the inner operations of Annif change. Of course, experimenting with various vocabulary tweaks like the ones you have in mind can provide valuable insights, and we would like to hear whether or not they are successful.

As general background on the subject (I think you have already seen the discussion), an actual feature for allow-/deny-listing subjects has been proposed in issue 538. Maybe that feature would be more suitable for your case. Please let us know if you have any suggestions or thoughts on the feature.

@osma
Member

osma commented Aug 1, 2022

It is possible to use such a setup for an ensemble (but not for a neural-network ensemble). I cannot say anything about the benefits and complications of this.

Whether this works or not depends a lot on which backends are involved. Let me explain the background a bit. Annif internally represents the results of a suggest operation using two alternative classes: VectorSuggestionResult and ListSuggestionResult. The first one uses a fixed vector representation, basically a long string of numbers whose length is the size of the vocabulary. The second one instead represents only the top K suggestions as a list which includes the URI and score. Different backends use different representations depending on which one is the most convenient to produce and consume. They can be converted to each other, although it takes some computation.
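
The two representations and the conversion between them can be sketched roughly as follows. This is a simplified illustration with plain Python lists, not the real `VectorSuggestionResult` and `ListSuggestionResult` classes; the helper names `vector_to_list` and `list_to_vector` and the example URIs are made up for this sketch.

```python
def vector_to_list(vector, uris, limit):
    """Convert a full score vector into the top `limit` (uri, score) pairs."""
    scored = [(uris[i], score) for i, score in enumerate(vector) if score > 0.0]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


def list_to_vector(pairs, uris):
    """Convert (uri, score) pairs into a vector with one slot per concept."""
    index = {uri: i for i, uri in enumerate(uris)}
    vector = [0.0] * len(uris)
    for uri, score in pairs:
        vector[index[uri]] = score
    return vector


# A tiny vocabulary; position i in a vector refers to uris[i].
uris = [
    "http://example.org/c1",
    "http://example.org/c2",
    "http://example.org/c3",
]

vector = [0.0, 0.7, 0.2]                     # vector form: one score per concept
top = vector_to_list(vector, uris, limit=2)  # list form: only the top K
roundtrip = list_to_vector(top, uris)        # back to the vector form
print(top, roundtrip)
```

Note that the vector form only makes sense together with the vocabulary it was built for, because each position is tied to one specific concept.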

The vector representation cannot cope with the situation above, where V1 and V2 are subsets of V. So any backend that uses this (including the NN ensemble) will not work. But if you can avoid that, then it probably works, although this was not really something that Annif was originally designed for.

Anyway, I think supporting this more generally - making it possible to use different flavors of a vocabulary in an ensemble and its source backends - would be a nice goal which shouldn't be too hard to implement. It may be enough to adjust the NN ensemble a little bit to fix the vector size mismatch. But more generally, ensembles should be prepared to accept suggestion results with a different vocabulary and map them by URI, regardless of the representation (vector or list).
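
As a rough sketch of what mapping by URI could look like (this is only an illustration of the idea, not an existing Annif function; `remap_vector`, `average` and the example data are hypothetical), each source result is re-indexed into the ensemble vocabulary V before the scores are combined:

```python
def remap_vector(scores, source_uris, target_uris):
    """Re-index scores from a source vocabulary into a target vocabulary by URI.

    Concepts missing from the source get a score of 0.0; source concepts
    that are unknown to the target are silently skipped in this sketch.
    """
    target_index = {uri: i for i, uri in enumerate(target_uris)}
    remapped = [0.0] * len(target_uris)
    for uri, score in zip(source_uris, scores):
        position = target_index.get(uri)
        if position is not None:
            remapped[position] = score
    return remapped


def average(vectors):
    """Element-wise mean of equally long score vectors."""
    return [sum(column) / len(column) for column in zip(*vectors)]


# Ensemble vocabulary V and two subsets V1, V2 with different sizes and ordering.
v = ["ex:c1", "ex:c2", "ex:c3", "ex:c4"]
v1 = ["ex:c1", "ex:c3"]
v2 = ["ex:c4", "ex:c2", "ex:c3"]

scores_v1 = [0.5, 0.25]       # e.g. from a backend trained on V1
scores_v2 = [0.75, 0.5, 1.0]  # e.g. from a backend trained on V2

aligned_v1 = remap_vector(scores_v1, v1, v)
aligned_v2 = remap_vector(scores_v2, v2, v)
combined = average([aligned_v1, aligned_v2])
print(combined)
```

After the remapping both vectors have the length of V, so the size mismatch disappears and the ensemble can combine them position by position.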

@osma osma added this to the Short term milestone Aug 1, 2022
@osma osma changed the title Usage of vocabularies in ensembles Use subsets of vocabularies in ensembles Aug 1, 2022
@nwagner84
Author

@osma, @juhoinkinen Thank you for the detailed explanation.

It's great that you are considering implementing this! Please let me know if you need help in testing this change.

@juhoinkinen
Member

Commenting here with our own use case (Finto AI) in mind.

YSO-places is a subset of YSO (when used in the vocabulary of Finto AI YSO projects), and there could be a specialized model for suggesting only concepts from YSO-places. There could even be a specialized backend for this; the idea came from the upcoming special issue "Geographic Information Extraction from Texts" of the Information Processing & Management journal.

@san-uh

san-uh commented May 5, 2023

Thanks @juhoinkinen for sharing the interesting use case on the topic!
I would like to add that the German National Library, as an Annif user, has a strong need for the ability to use backends with different vocabulary sets in an ensemble. We would be very grateful for an implementation as announced above. Let us know if we can help with testing etc.
