Dealing with overrepresented concepts / blacklisting #735

annakasprzik · 2023-09-13T15:07:19Z

Several institutions have observed that some models / ensembles struggle with concepts that are overrepresented in the training data, so that they are suggested way too often. One fix for that is to define rules that limit the contexts in which those concepts can be suggested. Could we implement something in Annif that allows specifying those rules?

CC @schlawiner @Lakshmi-bashyam

related: #538 ; #596

osma · 2023-09-13T18:41:55Z

Thank you for the suggestion. Indeed this seems like a recurring problem, so a generic mechanism could be useful.

This was one of the ideas discussed in issue #538, especially in #538 (comment). But there were maybe too many ideas thrown around, and so far nothing has been implemented. So let's keep this issue focused on the problem of overrepresented concepts only, and on one possible solution: making it possible to block problematic concepts. It seems that both ZBW and ZPID have already decided to use such a mechanism, implemented outside Annif.

I think this configuration example from #538 (comment) is still valid:

[omikuji_stw_en]
vocab=stw_9_10
exclude_concepts=http://zbw.eu/stw/descriptor/19073-6,http://zbw.eu/stw/descriptor/17829-1
backend=omikuji

and the meaning of this would be that the two concepts (USA and Theory) listed in exclude_concepts are ignored both when reading/processing training data and when generating suggestions, but only for this particular project. There could still be other projects using the same vocabulary and the setting would of course not affect those. So in an ensemble, it would be possible to block specific concepts on the level of a particular backend project, if it has a tendency to suggest certain concepts too often without good reason.
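To make the intended semantics concrete, here is a minimal sketch of how a per-project exclude_concepts value could be applied in both places (training data and suggestions). Nothing here is existing Annif code: the function names, the (uri, score) tuple representation and the limit parameter are assumptions made purely for illustration.

```python
from typing import FrozenSet, Iterable, List, Tuple


def parse_exclude_concepts(value: str) -> FrozenSet[str]:
    """Split a comma-separated exclude_concepts value into a set of URIs."""
    return frozenset(uri.strip() for uri in value.split(",") if uri.strip())


def filter_training_subjects(
    subject_uris: Iterable[str], excluded: FrozenSet[str]
) -> List[str]:
    """Drop excluded concepts from the subjects of one training document."""
    return [uri for uri in subject_uris if uri not in excluded]


def filter_suggestions(
    suggestions: Iterable[Tuple[str, float]], excluded: FrozenSet[str], limit: int
) -> List[Tuple[str, float]]:
    """Drop excluded concepts first, then keep the top `limit` by score.

    Filtering before truncating means an excluded concept does not use up
    one of the `limit` result slots.
    """
    kept = [(uri, score) for uri, score in suggestions if uri not in excluded]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:limit]
```

Because the excluded set would be built from one project's configuration section, other projects that share the same vocabulary would be unaffected.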

As noted in #538, it would make sense to avoid the term "blacklisting" due to connotations. I think "exclude", "block" or "deny" are all valid alternatives.

osma · 2023-09-22T13:40:29Z

I've thought about the best way to implement something like this in Annif code.

I think this should be a general mechanism and ideally no changes to individual backend implementations should be necessary. This means that the setting should be handled on the level of AnnifProject. One possibility is that SubjectIndex would be made aware of the blocked/excluded concepts, similar to how it already handles deprecated concepts.
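As a rough illustration of the backend-independent idea, the toy classes below treat excluded concepts the same way as deprecated ones at the index level and zero out their scores afterwards. ToySubjectIndex and mask_scores are hypothetical stand-ins written for this sketch; they are not the actual SubjectIndex or AnnifProject APIs.

```python
from typing import Iterable, List, Sequence, Tuple


class ToySubjectIndex:
    """Hypothetical stand-in for a subject index (not Annif's SubjectIndex)."""

    def __init__(
        self,
        uris: Iterable[str],
        deprecated: Iterable[str] = (),
        excluded: Iterable[str] = (),
    ) -> None:
        self._uris = list(uris)
        # Excluded concepts are handled exactly like deprecated ones.
        self._inactive = set(deprecated) | set(excluded)

    def __len__(self) -> int:
        return len(self._uris)

    @property
    def active(self) -> List[Tuple[int, str]]:
        """(subject_id, uri) pairs that are neither deprecated nor excluded."""
        return [
            (subject_id, uri)
            for subject_id, uri in enumerate(self._uris)
            if uri not in self._inactive
        ]


def mask_scores(scores: Sequence[float], index: ToySubjectIndex) -> List[float]:
    """Zero the score of every inactive subject, whatever backend produced it."""
    active_ids = {subject_id for subject_id, _ in index.active}
    return [
        score if subject_id in active_ids else 0.0
        for subject_id, score in enumerate(scores)
    ]
```

With this kind of design, a backend would only ever see or report active subjects, so no backend-specific changes would be needed.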

For the configuration, this could be implemented as an extra option to the vocab setting. There is already a mechanism to set the vocabulary language using a setting such as vocab=lcsh(en). We could extend that to take another parameter, like this:

vocab=stw(en,exclude=http://zbw.eu/stw/descriptor/19073-6 http://zbw.eu/stw/descriptor/17829-1)

When there's no need to set the language, this could work as well:

vocab=stw(exclude=http://zbw.eu/stw/descriptor/19073-6 http://zbw.eu/stw/descriptor/17829-1)

One minor syntax consideration here is that commas are already used to separate different parameters, so it's not possible to use commas as a separator between concept URIs. Above I've used spaces instead, but other symbols such as | (pipe) could work as well - as long as they are not used in URIs.
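A minimal sketch of parsing this extended syntax could look like the following. It assumes that parameters are separated by commas, that a bare parameter is the language, and that exclude= carries whitespace-separated URIs. VocabSpec and parse_vocab_spec are hypothetical names for this illustration, not existing Annif functions.

```python
import re
from typing import FrozenSet, NamedTuple, Optional


class VocabSpec(NamedTuple):
    vocab_id: str
    language: Optional[str]
    exclude: FrozenSet[str]


# vocabulary id, optionally followed by a parenthesised parameter list
_SPEC_RE = re.compile(r"^(?P<id>[\w-]+)(?:\((?P<params>.*)\))?$")


def parse_vocab_spec(spec: str) -> VocabSpec:
    """Parse values such as 'stw', 'lcsh(en)' or 'stw(en,exclude=URI1 URI2)'."""
    match = _SPEC_RE.match(spec.strip())
    if match is None:
        raise ValueError(f"invalid vocabulary specification: {spec!r}")

    language: Optional[str] = None
    exclude: FrozenSet[str] = frozenset()
    params = match.group("params") or ""

    # Commas separate parameters, so URIs inside exclude= use whitespace.
    for param in (part.strip() for part in params.split(",")):
        if not param:
            continue
        key, sep, value = param.partition("=")
        if not sep:
            language = param  # a bare parameter is the language code
        elif key.strip() == "exclude":
            exclude = frozenset(value.split())
        else:
            raise ValueError(f"unknown vocabulary parameter: {key.strip()!r}")

    return VocabSpec(match.group("id"), language, exclude)
```

Switching to | as the separator would only change value.split() to value.split("|").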

juhoinkinen · 2024-04-19T11:54:26Z

Just throwing in the idea: could the denylisting be (also) "dynamic", in the sense that the suggest request could include a parameter containing the concepts that are not wanted at that particular time? I think there could be some users of Annif API that could benefit from this.

This could be useful for e.g. university repositories, as very many theses and dissertations get the "final projects (education)" concept as an unwanted and redundant suggestion. I assume they now exclude that concept in their own system(?) so that it is not shown to the student.

Another use case would be to restrict the suggestions using the ontology hierarchy, e.g. to only physical objects or some other group of concepts. There could even be a UI component where a user could select the allowed or denied concepts in the hierarchy tree. That would be cool, but maybe not so useful.
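Both ideas (a per-request deny list and a restriction to part of the hierarchy) could be sketched roughly as below. This assumes suggestions are (uri, score) pairs and that the hierarchy is available as a mapping from each concept URI to its broader concepts. The function names and parameters are hypothetical and not part of the Annif API.

```python
from typing import Dict, FrozenSet, Iterable, List, Optional, Sequence, Tuple


def descends_from(
    uri: str, roots: FrozenSet[str], broader: Dict[str, Sequence[str]]
) -> bool:
    """Return True if uri is one of roots or has an ancestor among them."""
    seen = set()
    stack = [uri]
    while stack:
        current = stack.pop()
        if current in roots:
            return True
        if current in seen:
            continue  # guard against cycles in the hierarchy
        seen.add(current)
        stack.extend(broader.get(current, ()))
    return False


def filter_for_request(
    suggestions: Iterable[Tuple[str, float]],
    exclude: FrozenSet[str] = frozenset(),
    allowed_roots: Optional[FrozenSet[str]] = None,
    broader: Optional[Dict[str, Sequence[str]]] = None,
) -> List[Tuple[str, float]]:
    """Apply request-time exclusions and an optional hierarchy restriction."""
    hierarchy = broader or {}
    kept = []
    for uri, score in suggestions:
        if uri in exclude:
            continue
        if allowed_roots is not None and not descends_from(
            uri, allowed_roots, hierarchy
        ):
            continue
        kept.append((uri, score))
    return kept
```

A repository client could then pass the URI of the "final projects (education)" concept in exclude with each request instead of filtering it out on its own side.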

osma added the enhancement label Sep 13, 2023

osma added this to the Short term milestone Sep 13, 2023

osma mentioned this issue Sep 13, 2023

Single Concept Classifier for handling label inbalance #538

Open

2 tasks

osma mentioned this issue Sep 22, 2023

optimization: load a vocabulary only once even if used in different languages #736

Merged
