
Support batch suggest in STWFSA backend #666

Open
osma opened this issue Feb 3, 2023 · 2 comments
osma commented Feb 3, 2023

PR #663 is going to bring support for batch suggest operations.

The STWFSA backend could benefit from implementing `_suggest_batch` instead of `_suggest`, so that a whole batch of texts can be processed with parallel and/or vectorized operations.
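A minimal sketch of the difference, using simplified stand-in classes rather than Annif's real backend API; the method names `_suggest` and `_suggest_batch` follow the naming above, but everything else (scoring, return types) is invented for illustration:

```python
# Stand-in illustration of per-document vs. batched suggest. The method
# names _suggest and _suggest_batch mirror the naming in this issue, but
# these classes are simplified toy versions, not Annif's real API.

class PerDocumentBackend:
    """Scores one document at a time; batching falls back to a loop."""

    def _suggest(self, text):
        # placeholder scoring: one dummy concept with a toy score
        return [("concept", len(text) % 10 / 10.0)]

    def suggest_batch(self, texts):
        # the framework-level fallback: a plain Python loop
        return [self._suggest(text) for text in texts]


class BatchedBackend:
    """Receives the whole batch at once and can vectorize over it."""

    def _suggest_batch(self, texts):
        # a real implementation could replace this loop with a single
        # vectorized operation, e.g. one sparse matrix product
        return [[("concept", len(text) % 10 / 10.0)] for text in texts]

    def suggest_batch(self, texts):
        return self._suggest_batch(texts)
```

The point of the batched entry point is that the backend sees all the texts at once, so it has the chance to vectorize instead of being called once per document.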

osma added this to the Short term milestone Feb 3, 2023
osma commented Mar 3, 2023

I tried this; the code is on the `issue666-suggest-batch-stwfsa` branch.

Unfortunately, the results were not very encouraging: batched suggest implemented this way turns out to be slower than the original per-document approach. Switching to a new representation for suggestion results (see #678) might help.

I also tried stwfsapy's `predict_proba` method, which returns the results as a sparse matrix. The problem there is that stwfsapy internally uses different numeric IDs for concepts than Annif does, so an ID mapping mechanism would be needed to convert the results into something Annif can use.
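The needed mapping could look roughly like this. The URIs, IDs, and the COO-style triple representation below are all made up for illustration; they are not stwfsapy's or Annif's actual data structures:

```python
# Hypothetical sketch of remapping stwfsapy's internal concept columns to
# Annif subject IDs. All URIs and numeric IDs here are invented examples;
# the sparse result is modeled as (row, column, score) triples.

# stwfsapy column index -> concept URI (as the backend might expose it)
backend_columns = {0: "http://example.org/c1", 1: "http://example.org/c2"}

# concept URI -> Annif's numeric subject ID (from Annif's subject index)
annif_subject_ids = {"http://example.org/c1": 42, "http://example.org/c2": 7}

# precompute a column -> subject-ID lookup once, e.g. at model load time
col_to_subject = {col: annif_subject_ids[uri]
                  for col, uri in backend_columns.items()}

# sparse predict_proba-style output: (document row, concept column, score)
sparse_scores = [(0, 0, 0.8), (0, 1, 0.3), (1, 1, 0.9)]

# remap each concept column to the corresponding Annif subject ID
remapped = [(row, col_to_subject[col], score)
            for row, col, score in sparse_scores]
print(remapped)  # → [(0, 42, 0.8), (0, 7, 0.3), (1, 7, 0.9)]
```

Since the column-to-subject lookup only depends on the vocabulary, it could be built once when the model is loaded and reused for every batch.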

osma commented Mar 8, 2023

Here are the main test results as a table. I'm evaluating a YSO STWFSA English model on jyu-theses/eng-test on my 4-core laptop.

| Branch | Jobs | User time (s) | Wall-clock time (m:ss) |
|---|---|---|---|
| Before (master) | 1 | 201.96 | 3:23.56 |
| Before (master) | 4 | 288.02 | 2:19.72 |
| After (issue666-suggest-batch-stwfsa) | 1 | 181.12 | 3:02.69 |
| After (issue666-suggest-batch-stwfsa) | 4 | 322.29 | 2:27.98 |

Summary

Evaluation was faster with just 1 job, but slower with 4 jobs.
I didn't include memory usage figures, but memory use was basically unchanged.
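As a quick sanity check on the summary, the relative wall-clock changes can be computed from the times reported above:

```python
# Relative wall-clock change computed from the timings in this comment
before = {"1 job": 3 * 60 + 23.56, "4 jobs": 2 * 60 + 19.72}  # master
after = {"1 job": 3 * 60 + 2.69, "4 jobs": 2 * 60 + 27.98}    # branch

for jobs in before:
    change = (after[jobs] - before[jobs]) / before[jobs] * 100
    print(f"{jobs}: {change:+.1f}% wall-clock time")
```

With these figures, that works out to roughly a 10% wall-clock improvement at 1 job and a 6% regression at 4 jobs.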
