Dealing with overrepresented concepts / blacklisting #735

Open
annakasprzik opened this issue Sep 13, 2023 · 3 comments

Comments

@annakasprzik

annakasprzik commented Sep 13, 2023

Several institutions have observed that some models / ensembles struggle with concepts that are overrepresented in the training data, so that those concepts are suggested far too often. One fix is to define rules that limit the contexts in which those concepts can be suggested. Could we implement something in Annif that allows specifying such rules?

CC @schlawiner @Lakshmi-bashyam

related: #538 ; #596

@osma
Member

osma commented Sep 13, 2023

Thank you for the suggestion. Indeed this seems like a recurring problem, so a generic mechanism could be useful.

This was one of the ideas discussed in issue #538, especially in #538 (comment). There were perhaps too many ideas thrown around, and so far nothing has been implemented. So let's keep this issue focused only on the problem of overrepresented concepts and on one possible solution, a way to block problematic concepts, since it seems that both ZBW and ZPID have already decided to use such a mechanism implemented outside Annif.

I think this configuration example from #538 (comment) is still valid:

[omikuji_stw_en]
vocab=stw_9_10
exclude_concepts=http://zbw.eu/stw/descriptor/19073-6,http://zbw.eu/stw/descriptor/17829-1
backend=omikuji

and the meaning of this would be that the two concepts (USA and Theory) listed in exclude_concepts are ignored both when reading/processing training data and when generating suggestions, but only for this particular project. There could still be other projects using the same vocabulary and the setting would of course not affect those. So in an ensemble, it would be possible to block specific concepts on the level of a particular backend project, if it has a tendency to suggest certain concepts too often without good reason.
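To make the intended behaviour at suggestion time concrete, here is a minimal sketch using only the Python standard library. It assumes the proposed `exclude_concepts` option (not an existing Annif setting); `read_excluded`, `filter_suggestions`, the `(uri, score)` tuples and the `example.org` URI are hypothetical stand-ins and not Annif internals.

```python
import configparser

# Project definition as proposed above; exclude_concepts is NOT an existing option.
CONFIG = """
[omikuji_stw_en]
vocab=stw_9_10
exclude_concepts=http://zbw.eu/stw/descriptor/19073-6,http://zbw.eu/stw/descriptor/17829-1
backend=omikuji
"""


def read_excluded(config_text, project_id):
    """Return the set of excluded concept URIs for one project (empty if unset)."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    raw = parser.get(project_id, "exclude_concepts", fallback="")
    return {uri.strip() for uri in raw.split(",") if uri.strip()}


def filter_suggestions(suggestions, excluded):
    """Drop (uri, score) pairs whose URI is excluded, keeping the original order."""
    return [(uri, score) for uri, score in suggestions if uri not in excluded]


excluded = read_excluded(CONFIG, "omikuji_stw_en")
suggestions = [
    ("http://zbw.eu/stw/descriptor/19073-6", 0.91),  # excluded in this project
    ("http://example.org/concept/other", 0.55),      # placeholder URI
]
print(filter_suggestions(suggestions, excluded))
```

Because the set is read per project section, another project using the same vocabulary without the option would get an empty set and be unaffected.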

As noted in #538, it would make sense to avoid the term "blacklisting" due to connotations. I think "exclude", "block" or "deny" are all valid alternatives.

@osma
Member

osma commented Sep 22, 2023

I've thought about the best way to implement something like this in Annif code.

I think this should be a general mechanism and ideally no changes to individual backend implementations should be necessary. This means that the setting should be handled on the level of AnnifProject. One possibility is that SubjectIndex would be made aware of the blocked/excluded concepts, similar to how it already handles deprecated concepts.
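As a rough illustration of that idea, the toy class below treats excluded concepts the same way as deprecated ones when listing usable subjects. It is only a sketch: `ToySubjectIndex` is hypothetical and does not reproduce Annif's actual `SubjectIndex` API, and marking deprecated subjects with a `None` label is an assumption made for this example.

```python
class ToySubjectIndex:
    """Hypothetical stand-in for a subject index that knows about excluded URIs."""

    def __init__(self, subjects, excluded_uris=()):
        # subjects: list of (uri, label); label None marks a deprecated subject
        # (an assumption of this sketch).
        self._subjects = list(subjects)
        self._excluded = set(excluded_uris)

    def __len__(self):
        # Excluded and deprecated subjects still occupy a slot, so IDs stay stable.
        return len(self._subjects)

    def active(self):
        """Yield (subject_id, uri, label) for subjects that may be used."""
        for subject_id, (uri, label) in enumerate(self._subjects):
            if label is None:
                continue  # deprecated
            if uri in self._excluded:
                continue  # excluded for this project
            yield subject_id, uri, label


index = ToySubjectIndex(
    [
        ("http://example.org/a", "A"),
        ("http://example.org/b", None),  # deprecated
        ("http://example.org/c", "C"),
    ],
    excluded_uris=["http://example.org/c"],
)
print(list(index.active()))
```

Keeping the numeric subject IDs unchanged is the point of doing it this way: backends that only consult the list of active subjects would not need to know why a subject is missing.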

For the configuration, this could be implemented as an extra option to the vocab setting. There is already a mechanism to set the vocabulary language using a setting such as vocab=lcsh(en). We could extend that to take another parameter, like this:

vocab=stw(en,exclude=http://zbw.eu/stw/descriptor/19073-6 http://zbw.eu/stw/descriptor/17829-1)

When there's no need to set the language, this could work as well:

vocab=stw(exclude=http://zbw.eu/stw/descriptor/19073-6 http://zbw.eu/stw/descriptor/17829-1)

One minor syntax consideration here is that commas are already used to separate different parameters, so it's not possible to use commas as a separator between concept URIs. Above I've used spaces instead, but other symbols such as | (pipe) could work as well - as long as they are not used in URIs.
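A parser for this extended syntax could look roughly like the sketch below. It assumes the space-separated `exclude=` form shown above; `parse_vocab_spec` and its return shape are hypothetical and not taken from the Annif code base.

```python
import re

# vocab id, optionally followed by a parenthesised, comma-separated parameter list
VOCAB_SPEC = re.compile(r"^(?P<vocab_id>[\w-]+)(?:\((?P<params>.*)\))?$")


def parse_vocab_spec(spec):
    """Split a vocab setting into (vocab_id, language, excluded_uris)."""
    match = VOCAB_SPEC.match(spec.strip())
    if match is None:
        raise ValueError(f"invalid vocab spec: {spec!r}")
    language = None
    excluded = []
    params = match.group("params") or ""
    for param in params.split(","):
        param = param.strip()
        if not param:
            continue
        if param.startswith("exclude="):
            # URIs are separated by whitespace because commas separate parameters.
            excluded = param[len("exclude="):].split()
        else:
            # A bare parameter is taken to be the language, as in vocab=lcsh(en).
            language = param
    return match.group("vocab_id"), language, excluded


print(parse_vocab_spec(
    "stw(en,exclude=http://zbw.eu/stw/descriptor/19073-6 "
    "http://zbw.eu/stw/descriptor/17829-1)"
))
```

A URI that itself contains a comma would be split incorrectly by this sketch, which is the same limitation described above for the separator choice.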

@juhoinkinen
Member

Just throwing in an idea: could the denylisting (also) be "dynamic", in the sense that the suggest request could include a parameter listing the concepts that are not wanted at that particular time? I think some users of the Annif API could benefit from this.

This could be useful for e.g. university repositories, where a great many theses and dissertations get the "final projects (education)" concept as an unwanted and redundant suggestion. I assume they currently exclude that concept in their own systems(?) so that it is not shown to the student.
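A request-level exclusion could be combined with a project-level one along these lines. This is a sketch under assumptions: the function name, the `request_excluded` parameter and the dictionary shape of a suggestion are all hypothetical, and no such parameter exists in the Annif API today.

```python
def apply_exclusions(suggestions, project_excluded=(), request_excluded=(), limit=10):
    """Remove blocked concepts first, then sort by score and apply the limit."""
    blocked = set(project_excluded) | set(request_excluded)
    kept = [s for s in suggestions if s["uri"] not in blocked]
    kept.sort(key=lambda s: s["score"], reverse=True)
    return kept[:limit]


# Placeholder URIs; only the label comes from the example above.
suggestions = [
    {"uri": "http://example.org/final-projects", "label": "final projects (education)", "score": 0.95},
    {"uri": "http://example.org/topic-1", "label": "topic 1", "score": 0.60},
    {"uri": "http://example.org/topic-2", "label": "topic 2", "score": 0.40},
]
result = apply_exclusions(
    suggestions,
    request_excluded=["http://example.org/final-projects"],
    limit=2,
)
print([s["label"] for s in result])
```

Filtering before the limit is applied matters here: otherwise the unwanted concept would use up one of the slots and the caller would receive fewer suggestions than requested.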

Another use case would be to restrict the suggestions using the ontology hierarchy, e.g. to only physical objects or certain groups. There could even be a UI component where a user could select the allowed or denied concepts in the hierarchy tree. That would be cool, but maybe not so useful.
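Restricting by hierarchy amounts to collecting everything below a chosen concept and keeping only suggestions inside that set. The sketch below assumes the hierarchy is available as a plain dictionary mapping a concept to its narrower concepts; the function names and the short identifiers are hypothetical placeholders.

```python
def descendants(narrower, root):
    """Return root plus every concept reachable through narrower links."""
    seen = set()
    stack = [root]
    while stack:
        uri = stack.pop()
        if uri in seen:
            continue  # guards against cycles and shared children
        seen.add(uri)
        stack.extend(narrower.get(uri, ()))
    return seen


def restrict_to_branch(suggestions, narrower, root):
    """Keep only (uri, score) pairs that fall under the chosen branch."""
    allowed = descendants(narrower, root)
    return [(uri, score) for uri, score in suggestions if uri in allowed]


# Toy hierarchy with placeholder identifiers instead of real concept URIs.
narrower = {"root": ["a", "b"], "a": ["a1"], "other": ["x"]}
print(restrict_to_branch([("a1", 0.9), ("x", 0.8), ("b", 0.3)], narrower, "root"))
```

The same `descendants` set could just as well be used as a deny list, which would cover the "denied concepts in the hierarchy tree" variant.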
