Harvest DOIs from Zenodo Proof of Concept #5880
Ran it on all our workflows. Found 71 repos referenced by DOIs that only have 1 workflow in Dockstore. Found an additional 18 repos referenced by DOIs, that have more than 1 workflow.
Here is the result for repos with 1 workflow:
I think no too. I'm leaning toward @svonworl's (?) idea to have more database structure around how we track DOIs, and to track user-generated, auto-generated, and GitHub-harvested (?) DOIs separately.
Works for me, also ok with the "let the user choose"
Kinda feel like they should just be separately tracked/different classes.
Seems familiar @kathy-t
final PreviewApi previewApi = new PreviewApi(zenodoClient);
final String query = URLEncoder.encode('"' + gitHubRepo + '"');
final int pageSize = 100;
final SearchResult records = previewApi.listRecords(query, "bestmatch", 1, pageSize);
Any significance to 100?
Seems like we could get away with lower, especially if we ignore multiple DOIs for a repo as a first pass
(or is the search likely to get stuff from other repos?)
It was a guess to start with. Because the search is Elasticsearch and not an exact match (which I can't figure out how to do), I figured I'd cast a wider net and then narrow down on a specific field. It will probably require some tinkering.
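A rough sketch of the "wide net, then narrow down" step, in case it helps the discussion. It assumes each search hit exposes a list of related identifier URLs and that GitHub-sourced records point at `https://github.com/<org>/<repo>/...`. The `Hit` class and `exactMatches` helper are hypothetical stand-ins for illustration, not the Zenodo client's types.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

final class RepoMatchFilter {
    /** Hypothetical, simplified view of one search hit (not the Zenodo client's record type). */
    static final class Hit {
        final String doi;
        final List<String> relatedIdentifiers;

        Hit(String doi, List<String> relatedIdentifiers) {
            this.doi = doi;
            this.relatedIdentifiers = relatedIdentifiers;
        }
    }

    /** Keeps only hits with a related identifier that points at exactly the given "org/repo". */
    static List<Hit> exactMatches(List<Hit> hits, String gitHubRepo) {
        // Assumption: identifiers look like https://github.com/org/repo or https://github.com/org/repo/tree/<tag>
        final String prefix = ("https://github.com/" + gitHubRepo).toLowerCase(Locale.ROOT);
        return hits.stream()
            .filter(hit -> hit.relatedIdentifiers.stream()
                .map(id -> id.toLowerCase(Locale.ROOT))
                // Requiring "/" after the prefix avoids matching org/repo-other
                .anyMatch(id -> id.equals(prefix) || id.startsWith(prefix + "/")))
            .collect(Collectors.toList());
    }
}
```

With a filter like this, the page size mostly controls how many near-misses get fetched and thrown away, so it could be tuned down once the false-positive rate is known.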
I'm working on implementing this in my PR to allow users to generate their own DOIs for a workflow that already has Dockstore DOIs
Perhaps we can set up some type of AWS queue that has these tags that need to be checked and we can have a scheduled lambda that pulls from the queue (AWS doc)
That may be overkill. Just like with topics, it should be possible to compute which tags are eligible for DOIs and process them without needing to keep any extra state around.
The "GitHub" DOIs seem to reference the repo, rather than a particular entry within. So, maybe said DOIs reference all entries in the repo? [Postscript: just noticed you mentioned this possibility farther down in your description...]
Random idea (with its own set of pros and cons): An "updater" Java application that accesses the db via our Hibernate interfaces and runs alongside the webservice. It figures out what needs to be updated (via a queue or maybe a periodic db query that returns what's been recently changed, etc) and then updates it. Could update AI topics, collect DOIs, etc. Could be a single monolithic updater with plugins, or separate updaters specialized for each type of update. Good for asynchronous updates that might produce tardy responses if we tried to do the updates in the webservice request handlers. Not as scalable as lambdas, but probably easier to code.
A variant is that, instead of a separate application, the "updater" is a pool of background priority threads that runs in the webservice application itself. It pulls tasks from a pool and executes them (where a task is something like "update the AI topic for this entry" or "collect the GitHub DOIs for this entry"). The main request thread handler can queue tasks up before it returns, and they'll run asynchronously, later, in their own database session. And/or, tasks can be queued by a thread that inspects the db to determine what needs updates. Or periodically...
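To make the in-process variant a bit more concrete, here is a minimal sketch using only `java.util.concurrent`. The `BackgroundUpdater` class and its method names are hypothetical. Database session handling is deliberately left out; each task would be expected to open and close its own session.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

final class BackgroundUpdater {
    private final ExecutorService pool;

    BackgroundUpdater(int threads) {
        // Low-priority daemon threads, so background work yields to request handling
        // and does not keep the JVM alive on shutdown.
        this.pool = Executors.newFixedThreadPool(threads, runnable -> {
            Thread thread = new Thread(runnable, "background-updater");
            thread.setDaemon(true);
            thread.setPriority(Thread.MIN_PRIORITY);
            return thread;
        });
    }

    /** Queues a task (e.g. "collect the GitHub DOIs for this entry") and returns immediately. */
    Future<?> submit(Runnable task) {
        return pool.submit(task);
    }

    /** Stops accepting tasks and waits for queued ones; returns false on timeout or interrupt. */
    boolean shutdownAndWait(long timeoutSeconds) {
        pool.shutdown();
        try {
            return pool.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A request handler would call `submit(...)` just before returning. A periodic db inspection could feed the same pool, for instance from a `ScheduledExecutorService`.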
Description
This is very drafty; I'm looking for feedback on the overall concept, and whether we should go down this route. FWIW, I think we should.
The code looks for Zenodo DOIs against GitHub repos for published workflows. It found DOIs against 89 repos that have registered workflows in Dockstore. 19 of those repos have more than 1 workflow, so we would not be able to tell which workflow(s) the DOI applies to.
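The single-workflow versus multi-workflow split described above could be computed along these lines. This is only a sketch: `DoiAttribution` and the shape of the input map (repo name to the workflow paths registered for it) are assumptions for illustration, not code from this PR.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class DoiAttribution {
    /**
     * Splits repos that have a DOI into unambiguous ones (key true: exactly one workflow,
     * so the DOI can be attached to it) and ambiguous ones (key false: more than one workflow).
     * Repos with no workflows are dropped.
     */
    static Map<Boolean, List<String>> partitionByAmbiguity(Map<String, List<String>> workflowsByRepo) {
        return workflowsByRepo.entrySet().stream()
            .filter(entry -> !entry.getValue().isEmpty())
            .collect(Collectors.partitioningBy(
                entry -> entry.getValue().size() == 1,
                Collectors.mapping(Map.Entry::getKey, Collectors.toList())));
    }
}
```

The `false` bucket is where a policy decision is still needed, for example attaching the DOI to every workflow in the repo or letting the user choose.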
To complete this PR:
Things to figure out
Some of these also come up in #5879
Review Instructions
Issue
#5745
Security and Privacy