Use subsets of vocabularies in ensembles #596

nwagner84 · 2022-07-15T06:46:27Z

I have a question about the usage of vocabularies in an ensemble. Given a vocabulary V which is used in an ensemble and two vocabularies V1 and V2, which are used by different backends of the ensemble (e.g. omikuji and mllm). V1 and V2 are subsets of V, with different subsets of the concepts and a different set of labels (gold standard (TSV) and TTL). The concept URIs are stable in all versions of V.

Is it possible to use such a setup in order to tweak the backends used in the ensemble?
Does the vocabulary V act as an allow-list? Are the suggestions of the backends (e.g. omikuji, mllm) tested against the vocabulary V?

Background: We want to aggressively tweak the vocabulary (reduce concepts and manipulate labels) for the mllm backend, to improve the results.

juhoinkinen · 2022-07-16T13:18:34Z

I'll try to give an answer, although I'm not very sure about the details right now.

Is it possible to use such a setup in order to tweak the backends used in the ensemble?

It is possible to use such a setup for an ensemble (but not for a neural-network ensemble). I cannot say anything about the benefits and complications of this.

Does the vocabulary V act as an allow-list? Are the suggestions of the backends (e.g. omikuji, mllm) tested against the vocabulary V?

Yes, the vocabulary loaded to the ensemble project acts as an allow-list. A warning is shown every time an unknown URI is fed to the ensemble from a source project as a suggestion.
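To illustrate the allow-list behaviour described above, here is a minimal sketch in plain Python. It is not Annif's actual code: the function name `filter_by_allow_list`, the `(uri, score)` tuple format and the assumption that unknown URIs are dropped after the warning are all hypothetical and only meant to show the idea.

```python
import warnings


def filter_by_allow_list(suggestions, allowed_uris):
    """Keep suggestions whose URI is in the ensemble vocabulary.

    `suggestions` is a list of (uri, score) tuples from one source project.
    `allowed_uris` is the set of concept URIs in the ensemble vocabulary V.
    Both the names and the data format are assumptions for illustration.
    """
    kept = []
    for uri, score in suggestions:
        if uri in allowed_uris:
            kept.append((uri, score))
        else:
            # Assumed behaviour: warn and skip the unknown URI.
            warnings.warn(f"Unknown URI from source project: {uri}")
    return kept


if __name__ == "__main__":
    vocab_v = {"http://example.org/c1", "http://example.org/c2"}
    from_backend = [
        ("http://example.org/c1", 0.9),
        ("http://example.org/zz", 0.7),  # not part of V
    ]
    print(filter_by_allow_list(from_backend, vocab_v))
```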

I would be quite careful when implementing such a tweaked setup right now, because there is a (small) chance that it would not work in the future if the inner operations in Annif change. Of course, experimenting with various vocabulary tweaks like the ones you have in mind can provide valuable insights, and we would like to hear whether or not they are successful.

For general background on the subject (I think you have seen the discussion already), an actual feature for allow-/deny-listing subjects has been proposed in issue 538. Maybe that feature would be more suitable for your case. Please let us know if you have any suggestions or thoughts about the feature.

osma · 2022-08-01T14:02:01Z

It is possible to use such a setup for an ensemble (but not for a neural-network ensemble). I cannot say anything about the benefits and complications of this.

Whether this works or not depends a lot on which backends are involved. Let me explain the background a bit. Annif internally represents the results of a suggest operation using two alternative classes: VectorSuggestionResult and ListSuggestionResult. The first one uses a fixed vector representation, basically a long string of numbers whose length is the size of the vocabulary. The second one instead represents only the top K suggestions as a list which includes the URI and score. Different backends use different representations depending on which one is the most convenient to produce and consume. They can be converted to each other, although it takes some computation.
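As a rough illustration of the two representations and the conversion between them, here is a sketch in plain Python. These functions are hypothetical stand-ins, not the real VectorSuggestionResult and ListSuggestionResult classes; the function names, the `(uri, score)` tuples and the example.org URIs are assumptions.

```python
# The vocabulary fixes the meaning of each vector position.
VOCAB = [
    "http://example.org/c1",
    "http://example.org/c2",
    "http://example.org/c3",
    "http://example.org/c4",
]


def list_to_vector(suggestions, vocab):
    """Turn a list of (uri, score) pairs into a fixed-length score vector."""
    index = {uri: i for i, uri in enumerate(vocab)}
    vector = [0.0] * len(vocab)
    for uri, score in suggestions:
        vector[index[uri]] = score
    return vector


def vector_to_list(vector, vocab, limit):
    """Turn a score vector into the top `limit` (uri, score) pairs."""
    ranked = sorted(zip(vocab, vector), key=lambda pair: pair[1], reverse=True)
    # Zero scores carry no information, so they are left out of the list.
    return [(uri, score) for uri, score in ranked[:limit] if score > 0.0]


if __name__ == "__main__":
    as_list = [("http://example.org/c2", 0.9), ("http://example.org/c4", 0.5)]
    as_vector = list_to_vector(as_list, VOCAB)
    print(as_vector)
    print(vector_to_list(as_vector, VOCAB, limit=10))
```

Note that the vector form only makes sense together with the exact vocabulary that defines its positions, while the list form carries its URIs along.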

The vector representation cannot cope with the situation above, where V1 and V2 are subsets of V. So any backend that uses this (including the NN ensemble) will not work. But if you can avoid that, then it probably works, although this was not really something that Annif was originally designed for.
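A toy example (plain Python with made-up URIs, not Annif code) may show why position-based vectors break down here: the same concept ends up at a different index in each vocabulary, and the vectors do not even have the same length.

```python
# Hypothetical vocabularies: V is the ensemble vocabulary,
# V1 and V2 are the subsets used by two source backends.
V = [
    "http://example.org/c1",
    "http://example.org/c2",
    "http://example.org/c3",
    "http://example.org/c4",
]
V1 = ["http://example.org/c1", "http://example.org/c3"]
V2 = ["http://example.org/c3", "http://example.org/c4"]

concept = "http://example.org/c3"

# Position of the same concept in each vocabulary's score vector.
positions = [vocab.index(concept) for vocab in (V, V1, V2)]

# Length of the score vector each vocabulary would produce.
lengths = [len(vocab) for vocab in (V, V1, V2)]

if __name__ == "__main__":
    print(positions)
    print(lengths)
```

Combining such vectors position by position would mix up scores of unrelated concepts, so the URI has to be the key instead.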

Anyway, I think supporting this more generally - making it possible to use different flavors of a vocabulary in an ensemble and its source backends - would be a nice goal which shouldn't be too hard to implement. It may be enough to adjust the NN ensemble a little bit to fix the vector size mismatch. But more generally, ensembles should be prepared to accept suggestion results with a different vocabulary and map them by URI, regardless of the representation (vector or list).
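A possible shape for the "map them by URI" idea, as a sketch only: the function `merge_by_uri`, the `(uri, score)` tuple format and the plain averaging (with a missing suggestion counted as a zero score) are assumptions for illustration, not how Annif's ensembles are implemented.

```python
from collections import defaultdict


def merge_by_uri(source_results, ensemble_vocab):
    """Average scores per URI across sources with differing vocabularies.

    `source_results` holds one list of (uri, score) pairs per source backend.
    URIs that are not in `ensemble_vocab` are ignored. A source that does not
    suggest a concept contributes a score of zero to its average.
    """
    allowed = set(ensemble_vocab)
    totals = defaultdict(float)
    for suggestions in source_results:
        for uri, score in suggestions:
            if uri in allowed:
                totals[uri] += score
    count = len(source_results)
    merged = [(uri, total / count) for uri, total in totals.items()]
    return sorted(merged, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    vocab_v = ["http://example.org/a", "http://example.org/b", "http://example.org/c"]
    from_v1 = [("http://example.org/a", 0.8), ("http://example.org/b", 0.4)]
    from_v2 = [("http://example.org/b", 0.6), ("http://example.org/x", 0.9)]
    print(merge_by_uri([from_v1, from_v2], vocab_v))
```

Because the merge is keyed on URIs, it does not matter whether a source result started out as a vector or a list, as long as it can be converted to URI and score pairs first.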

nwagner84 · 2022-08-08T05:31:53Z

@osma, @juhoinkinen Thank you for the detailed explanation.

It's great that you are considering implementing this! Please let me know if you need help testing this change.

juhoinkinen · 2023-04-28T08:48:28Z

Commenting here with our own use case (Finto AI) in mind.

YSO-places is a subset of YSO (when used in the vocabulary of Finto AI YSO projects), and there could be a specialized model for suggesting only concepts from YSO-places. There could even be a specialized backend for this; the idea came from the upcoming special issue "Geographic Information Extraction from Texts" of the Information Processing & Management journal.

san-uh · 2023-05-05T10:35:02Z

Thanks @juhoinkinen for sharing the interesting use case on the topic!
I would like to add that the German National Library as an Annif user has a strong need for the possibility to use backends with different vocabulary sets in the ensemble. We would be very grateful for an implementation as announced above. Let us know if we can help with tests etc.

osma added the enhancement label Aug 1, 2022

osma added this to the Short term milestone Aug 1, 2022

osma changed the title ~~Usage of vocabularies in ensembles~~ Use subsets of vocabularies in ensembles Aug 1, 2022

annakasprzik mentioned this issue Sep 13, 2023

Dealing with overrepresented concepts / blacklisting #735

Open
