Use as source for "annif index" one file (.csv, .tsv) #639

Open
hekl opened this issue Nov 1, 2022 · 1 comment

Comments

@hekl

hekl commented Nov 1, 2022

In my use case I index short bits of text with "annif index". Producing thousands of short text files, processing them, and then reading the output text files back into a database is rather clumsy. By short bits of text I mean anything from one word to sixty words at most; my own texts are under ten words. In this case only a few keywords are needed, I would say four at most, and annif index already lets you specify the number of keywords.
The input .csv file would have one line per text, containing an identifier (either a code or a URI) and the text. The .csv output would contain the same data plus the URIs from the vocabulary, the similarity score and, optionally, the vocabulary labels. The output file name could be a versioned form of the project_id or a name you define yourself. One output structure is reasonably easy to reuse:

For each suggested vocabulary URI, add one row to the data file containing the URI, its score and its label, and repeat the original input on each of these rows. Here id is the original identifier, text is the original text to index, vocab_uri is the URI from the vocabulary, vocab_label is the label of that URI, and score is the similarity or confidence score. The input would have consisted of just the first two columns, in a single row.

id;text;vocab_uri;vocab_label;score
D003018;prices of consumer products rising high;https://vocab/id/1010;prices;1.0
D003018;prices of consumer products rising high;https://vocab/id/1013;consumer products;1.0
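
A minimal Python sketch of how the long format above could be produced, offered as an illustration rather than as anything Annif provides. The function name `write_long_format` and the `suggest` callable are hypothetical: `suggest` stands in for whatever returns (vocab_uri, vocab_label, score) tuples for a text, and is not an Annif function. The semicolon delimiter and the column names are taken from the example rows above.

```python
import csv
from typing import Callable, List, TextIO, Tuple

# One suggestion: (vocab_uri, vocab_label, score). Hypothetical shape.
Suggestion = Tuple[str, str, float]

OUTPUT_FIELDS = ["id", "text", "vocab_uri", "vocab_label", "score"]


def write_long_format(
    infile: TextIO,
    outfile: TextIO,
    suggest: Callable[[str], List[Suggestion]],
    limit: int = 4,
    delimiter: str = ";",
) -> int:
    """Read id/text rows and write one output row per suggestion.

    `suggest` is an assumed callable supplied by the caller; it is not
    part of Annif. Returns the number of data rows written.
    When passing real files, open them with newline="" as the csv
    module recommends.
    """
    reader = csv.DictReader(infile, delimiter=delimiter)
    writer = csv.DictWriter(
        outfile,
        fieldnames=OUTPUT_FIELDS,
        delimiter=delimiter,
        lineterminator="\n",
    )
    writer.writeheader()
    written = 0
    for row in reader:
        # Keep at most `limit` suggestions and repeat id and text on each row.
        for uri, label, score in suggest(row["text"])[:limit]:
            writer.writerow(
                {
                    "id": row["id"],
                    "text": row["text"],
                    "vocab_uri": uri,
                    "vocab_label": label,
                    "score": score,
                }
            )
            written += 1
    return written
```

Because every row carries its own id, URI, label and score, the result can be loaded into a database table without any further parsing.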

Other output structures, such as adding one column for every URI and label, introduce problems with interpretation (which label belongs to which URI?) and with processing (you have to find out how many result columns there are and what they mean). In my view this structure is only adequate if you want just the URIs, without labels and scores.

You could also put all the URIs in one column, comma-separated, but this makes the result file harder to process. It might be a more acceptable alternative when you are satisfied with just the URIs. A further column could hold all the scores, in the same order as the URIs, but again this takes extra scripting effort to process.
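
To make the extra processing concrete, here is a rough Python sketch of what a consumer of the compact format would have to do to get back to one row per URI. The column names `vocab_uris` and `scores` and the function name `expand_compact_rows` are assumptions made for this illustration; nothing here is an existing Annif format.

```python
import csv
from typing import TextIO


def expand_compact_rows(
    infile: TextIO,
    outfile: TextIO,
    delimiter: str = ";",
    inner: str = ",",
) -> int:
    """Expand comma-separated URI and score columns into one row per URI.

    Assumes hypothetical input columns id, text, vocab_uris and scores,
    with scores listed in the same order as the URIs.
    Returns the number of data rows written.
    """
    reader = csv.DictReader(infile, delimiter=delimiter)
    writer = csv.DictWriter(
        outfile,
        fieldnames=["id", "text", "vocab_uri", "score"],
        delimiter=delimiter,
        lineterminator="\n",
    )
    writer.writeheader()
    written = 0
    for row in reader:
        # An empty cell means no suggestions, not one empty suggestion.
        uris = row["vocab_uris"].split(inner) if row["vocab_uris"] else []
        scores = row["scores"].split(inner) if row["scores"] else []
        # The pairing relies purely on position, so a length mismatch
        # cannot be repaired and is reported instead.
        if len(uris) != len(scores):
            raise ValueError(f"row {row['id']}: {len(uris)} URIs but {len(scores)} scores")
        for uri, score in zip(uris, scores):
            writer.writerow(
                {"id": row["id"], "text": row["text"], "vocab_uri": uri, "score": score}
            )
            written += 1
    return written
```

The splitting, the positional pairing and the mismatch check are all work that the one-row-per-URI structure avoids. The sketch also silently assumes that no URI contains a comma, which the compact format cannot guarantee.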

@osma
Member

osma commented Nov 2, 2022

Thanks for the suggestion, and especially for your thoughts on the input and output file formats, which seem very reasonable. Right now I cannot promise anything about implementation, but I don't see this as very complicated; it's just a question of priorities. Also, it would be helpful to know if others are in the same situation and would find this useful. We don't actually use the annif index command in its current form at all, as it's more straightforward to use the REST API for bulk indexing.
