
Harvest DOIs from Zenodo Proof of Concept #5880

Draft: wants to merge 3 commits into develop

Conversation


@coverbeck coverbeck commented May 7, 2024

Description
This is very drafty; I'm looking for feedback on the overall concept, and whether we should go down this route. FWIW, I think we should.

The code searches Zenodo for DOIs that reference the GitHub repos of published workflows. It found DOIs for 89 repos that have registered workflows in Dockstore. 19 of those repos have more than one workflow; for those, we would not be able to tell which workflow(s) the DOI applies to.

To complete this PR:

  1. Fetch all the DOI versions related to a single DOI (another Zenodo call)
  2. Assign the DOIs to workflow versions
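As a rough sketch of what step 1 might look like: Zenodo groups all versions of a record under a "concept" record id, and the related versions can be pulled back with a records search. The helper below only builds the query URL; the class name, the `conceptrecid` field, and the `all_versions` parameter are assumptions about Zenodo's search API, not code from this PR.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class ZenodoVersionQuery {

    // Build a Zenodo records-search URL that should return every version
    // belonging to one concept record id. Field and parameter names follow
    // Zenodo's documented search schema; treat them as assumptions.
    public static String versionsQuery(String conceptRecId) {
        final String query = "conceptrecid:" + conceptRecId;
        // all_versions=true asks Zenodo to include every version, not just the latest
        return "https://zenodo.org/api/records?all_versions=true&q="
                + URLEncoder.encode(query, StandardCharsets.UTF_8);
    }
}
```

The response would then be parsed for the version DOIs to match against workflow version tags.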

Things to figure out

Some of these also come up in #5879

  • Do we snapshot a version before assigning the DOI? I would argue no.
  • How do we handle a DOI against a repo with multiple workflows?
    • Assign the DOI to all workflows in the repo?
    • Ignore the DOI, i.e., don't assign it. This is my preference at least for the first iteration of this.
    • Some UI that lets the user choose which workflow(s) to apply a DOI to.
  • Do we need to support multiple DOIs for a version?
    • If no, do we just silently fail if there is already a DOI assigned?
  • We probably need to track the "source" of the DOI. For example, Kathy is working on code to issue editing urls for DOIs created by Dockstore with the Dockstore Zenodo account; that won't make sense for these DOIs.
  • We can run the endpoint from this PR when we deploy 1.16 to capture existing DOIs, but how do we capture DOIs created after the deploy? In my few tests, creation of the DOI by the Zenodo/GitHub integration took between a few seconds and several minutes to complete, which means we can't assume the DOI already exists when GitHub Apps notifies us that a tag has been created. Some ideas:
    • We invoke the above endpoint on a schedule, e.g., nightly. How do we do that?
    • We could track creation of tags and invoke a variant of the endpoint for just the known new tags.
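One way to realize the "invoke on a schedule" idea without external cron machinery is an in-process `ScheduledExecutorService`; the sketch below shows only the initial-delay computation for a nightly run (class and method names are invented for illustration, not part of this PR).

```java
import java.time.Duration;
import java.time.LocalTime;
import java.time.ZonedDateTime;

public class NightlyHarvestSchedule {

    // Compute how long to wait from "now" until the next occurrence of runAt,
    // suitable as the initial delay for ScheduledExecutorService.scheduleAtFixedRate
    // with a 24-hour period.
    public static Duration untilNextRun(ZonedDateTime now, LocalTime runAt) {
        ZonedDateTime next = now.with(runAt);
        if (!next.isAfter(now)) {
            // runAt already passed today, so schedule for tomorrow
            next = next.plusDays(1);
        }
        return Duration.between(now, next);
    }
}
```

The delay would feed something like `executor.scheduleAtFixedRate(harvestTask, delay.toMinutes(), Duration.ofDays(1).toMinutes(), TimeUnit.MINUTES)`.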

Review Instructions

Issue
#5745

Security and Privacy

If there are any concerns that require extra attention from the security team, highlight them here and check the box when complete.

  • Security and Privacy assessed

e.g. Does this change...

  • Any user data we collect, or data location?
  • Access control, authentication or authorization?
  • Encryption features?

Please make sure that you've checked the following before submitting your pull request. Thanks!

  • Check that you pass the basic style checks and unit tests by running mvn clean install
  • Ensure that the PR targets the correct branch. Check the milestone or fix version of the ticket.
  • Follow the existing JPA patterns for queries, using named parameters, to avoid SQL injection
  • If you are changing dependencies, check the Snyk status check or the dashboard to ensure you are not introducing new high/critical vulnerabilities
  • Assume that inputs to the API can be malicious, and sanitize and/or check for Denial of Service type values, e.g., massive sizes
  • Do not serve user-uploaded binary images through the Dockstore API
  • Ensure that endpoints that only allow privileged access enforce that with the @RolesAllowed annotation
  • Do not create cookies, although this may change in the future
  • If this PR is for a user-facing feature, create and link a documentation ticket for this feature (usually in the same milestone as the linked issue). Style points if you create a documentation PR directly and link that instead.

@coverbeck coverbeck self-assigned this May 8, 2024
@coverbeck

Ran it on all our workflows. It found 71 DOI-referenced repos that have only one workflow in Dockstore, and an additional 18 DOI-referenced repos that have more than one workflow.

@coverbeck coverbeck marked this pull request as draft May 8, 2024 16:33
@coverbeck coverbeck changed the title from "[skip ci] Initial experiments" to "Harvest DOIs from Zenodo Proof of Concept" May 8, 2024
@coverbeck

Here is the result for repos with 1 workflow:

[
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10901674"
    ],
    "repo": "nf-core/airrflow"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.4208836"
    ],
    "repo": "h3abionet/TADA"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10846111"
    ],
    "repo": "nf-core/funcscan"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10687430"
    ],
    "repo": "nf-core/eager"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10986616"
    ],
    "repo": "nf-core/riboseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10463781"
    ],
    "repo": "nf-core/methylseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10643212"
    ],
    "repo": "nf-core/circdna"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10911752"
    ],
    "repo": "nf-core/metatdenovo"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.7643948"
    ],
    "repo": "nf-core/phyloplace"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10707294"
    ],
    "repo": "nf-core/epitopeprediction"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.8164980"
    ],
    "repo": "nf-core/viralintegration"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.5104871"
    ],
    "repo": "denis-yuen/galaxy-workflow-dockstore-example-2"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.7764938",
      "https://doi.org/10.5281/zenodo.3746584"
    ],
    "repo": "nf-core/viralrecon"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.4723017"
    ],
    "repo": "nf-core/clipseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10651816"
    ],
    "repo": "nf-core/mag"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.8427707"
    ],
    "repo": "nf-core/mhcquant"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.7220729"
    ],
    "repo": "nf-core/hlatyping"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.6515313"
    ],
    "repo": "nf-core/hicar"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10696391"
    ],
    "repo": "nf-core/smrnaseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.7139814"
    ],
    "repo": "nf-core/chipseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.4106005"
    ],
    "repo": "nf-core/proteomicslfq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10554425"
    ],
    "repo": "nf-core/scrnaseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10471647"
    ],
    "repo": "nf-core/rnaseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.11126488"
    ],
    "repo": "nf-core/sarek"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10650749"
    ],
    "repo": "nf-core/molkart"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10783110"
    ],
    "repo": "nf-core/nascent"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10952554"
    ],
    "repo": "nf-core/rnafusion"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.6669637"
    ],
    "repo": "nf-core/rnavar"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.4469317"
    ],
    "repo": "gatk-workflows/gatk4-data-processing"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.6403063"
    ],
    "repo": "kathy-t/workflow-dockstore-yml"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.5104898"
    ],
    "repo": "Richard-Hansen/dockstore-whalesay"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.7994878"
    ],
    "repo": "nf-core/hic"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10036158"
    ],
    "repo": "nf-core/bacass"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10209675"
    ],
    "repo": "nf-core/differentialabundance"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.7716033"
    ],
    "repo": "nf-core/nanoseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.7689178"
    ],
    "repo": "denis-yuen/test-workflows-and-tools"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10728509"
    ],
    "repo": "nf-core/fetchngs"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10124950"
    ],
    "repo": "kathy-t/SRANWRP"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.7941033"
    ],
    "repo": "nf-core/hgtseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.5039442"
    ],
    "repo": "dockstore-personal-testing/gatk4-exome-analysis-pipeline-flat"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.4304953"
    ],
    "repo": "Richard-Hansen/hello_world"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.3508160",
      "https://doi.org/10.5281/zenodo.3928817",
      "https://doi.org/10.5281/zenodo.3401699"
    ],
    "repo": "wshands/hmmer-docker"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.7080256"
    ],
    "repo": "david4096/autopotato-attack"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.8222875"
    ],
    "repo": "nf-core/atacseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.2628872"
    ],
    "repo": "ICGC-TCGA-PanCancer/Seqware-BWA-Workflow"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.6141389"
    ],
    "repo": "Richard-Hansen/dockstore-tool-helloworld"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.5104873"
    ],
    "repo": "garyluu/example_cwl_workflow"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.1491630"
    ],
    "repo": "nf-core/deepvariant"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.3571864"
    ],
    "repo": "nf-core/neutronstar"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.2582812"
    ],
    "repo": "SciLifeLab/Sarek"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.4536530"
    ],
    "repo": "Richard-Hansen/dockstore-whalesay2"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10668725"
    ],
    "repo": "ENCODE-DCC/atac-seq-pipeline"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10718449"
    ],
    "repo": "nf-core/demultiplex"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.8354570"
    ],
    "repo": "iwc-workflows/rnaseq-pe"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10067429"
    ],
    "repo": "nf-core/quantms"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.7629996"
    ],
    "repo": "nf-core/proteinfold"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10622411"
    ],
    "repo": "nf-core/readsimulator"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10868876"
    ],
    "repo": "nf-core/raredisease"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10912278"
    ],
    "repo": "nf-core/ampliseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10527467"
    ],
    "repo": "nf-core/nanostring"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.8159051"
    ],
    "repo": "nf-core/marsseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10634361"
    ],
    "repo": "nf-core/taxprofiler"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.8354480"
    ],
    "repo": "iwc-workflows/rnaseq-sr"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.6403310"
    ],
    "repo": "kathy-t/hello-wdl-workflow"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10895229"
    ],
    "repo": "nf-core/pixelator"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10606804"
    ],
    "repo": "nf-core/cutandrun"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10877148"
    ],
    "repo": "nf-core/detaxizer"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.4540719"
    ],
    "repo": "nf-core/dualrnaseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10406093"
    ],
    "repo": "nf-core/metaboigniter"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.8414663"
    ],
    "repo": "nf-core/bamtofastq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10696998"
    ],
    "repo": "nf-core/rnasplice"
  }
]


denis-yuen commented May 8, 2024

Do we snapshot a version before assigning the DOI? I would argue no.

I think no too. I think I'm leaning toward @svonworl 's (?) idea to have more database structure around how we track DOIs and track user-generated, auto-generated, and github harvested(?) DOIs separately.

This is my preference at least for the first iteration of this.

Works for me, also ok with the "let the user choose"
For most purposes, I'd think that "let the user choose from a list of likely suspects" would be ok as a first/second pass too, just not a final pass

We probably need to track the "source" of the DOI. For example, Kathy is working on code to issue editing urls for DOIs created by Dockstore with the Dockstore Zenodo account; that won't make sense for these DOIs.

Kinda feel like they should just be separately tracked/different classes.
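The "separately tracked/different classes" idea could be as small as tagging each DOI with its provenance. A minimal sketch, where the type and value names are invented for illustration and are not this PR's schema:

```java
public class Doi {

    // Provenance of a DOI: created by the user themselves, minted by
    // Dockstore's own Zenodo account, or harvested from the Zenodo/GitHub
    // integration. Illustrative names only.
    public enum Source { USER_CREATED, DOCKSTORE_MINTED, GITHUB_HARVESTED }

    private final String name;
    private final Source source;

    public Doi(String name, Source source) {
        this.name = name;
        this.source = source;
    }

    // Per the discussion above, only DOIs minted with the Dockstore Zenodo
    // account can sensibly get Zenodo edit links.
    public boolean supportsEditLinks() {
        return source == Source.DOCKSTORE_MINTED;
    }

    public String getName() { return name; }
    public Source getSource() { return source; }
}
```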

We invoke the above endpoint on a schedule, e.g., nightly. How do we do that?

Seems familiar @kathy-t
More for a periodic listener, a registration method for cron-like events, etc. (along with topic generation, metric aggregation, etc.)
Unless we do it all externally like with the ECS cron

final PreviewApi previewApi = new PreviewApi(zenodoClient);
// Quote the repo name so the Zenodo search treats it as a phrase;
// the two-arg encode avoids the deprecated platform-default-charset overload
final String query = URLEncoder.encode('"' + gitHubRepo + '"', StandardCharsets.UTF_8);
final int pageSize = 100;
final SearchResult records = previewApi.listRecords(query, "bestmatch", 1, pageSize);
Review comment (Member):

Any significance to 100?
Seems like we could get away with lower, especially if we ignore multiple DOIs for a repo as a first pass
(or is the search likely to get stuff from other repos?)

Reply (Contributor Author):

It was a guess to start with. Because it's an Elasticsearch query and not an exact match (which I can't figure out how to do), I figured I'd cast a wider net and narrow down on a specific field. It will probably require some tinkering.


kathy-t commented May 9, 2024

I think I'm leaning toward @svonworl 's (?) idea to have more database structure around how we track DOIs and track user-generated, auto-generated, and github harvested(?) DOIs separately.

I'm working on implementing this in my PR to allow users to generate their own DOIs for a workflow that already has Dockstore DOIs

We invoke the above endpoint on a schedule, e.g., nightly. How do we do that?
Seems familiar @kathy-t
More for a periodic listener, a registration method for cron-like events, etc. (along with topic generation, metric aggregation, etc.)
Unless we do it all externally like with the ECS cron

Perhaps we can set up some type of AWS queue that has these tags that need to be checked and we can have a scheduled lambda that pulls from the queue (AWS doc)

@denis-yuen

Perhaps we can set up some type of AWS queue that has these tags that need to be checked and we can have a scheduled lambda that pulls from the queue

May be overkill; just like with topics, it should be possible to compute which tags are eligible for DOIs and process them without needing to keep any extra state around.


svonworl commented May 9, 2024

19 of those repos have more than 1 workflow, we would not be able to tell which workflow(s) the DOI applies to.

The "GitHub" DOIs seem to reference the repo, rather than a particular entry within. So, maybe said DOIs reference all entries in the repo? [Postscript: just noticed you mentioned this possibility farther down in your description...]


svonworl commented May 9, 2024

Perhaps we can set up some type of AWS queue that has these tags that need to be checked and we can have a scheduled lambda that pulls from the queue (AWS doc)

Random idea (with its own set of pros and cons):

An "updater" Java application that accesses the db via our Hibernate interfaces and runs alongside the webservice. It figures out what needs to be updated (via a queue or maybe a periodic db query that returns what's been recently changed, etc) and then updates it. Could update AI topics, collect DOIs, etc. Could be a single monolithic updater with plugins, or separate updaters specialized for each type of update. Good for asynchronous updates that might produce tardy responses if we tried to do the updates in the webservice request handlers. Not as scalable as lambdas, but probably easier to code.


svonworl commented May 10, 2024

Perhaps we can set up some type of AWS queue that has these tags that need to be checked and we can have a scheduled lambda that pulls from the queue (AWS doc)

Random idea (with its own set of pros and cons):

An "updater" Java application that accesses the db via our Hibernate interfaces [...]

A variant is that, instead of a separate application, the "updater" is a pool of background priority threads that runs in the webservice application itself. It pulls tasks from a pool and executes them (where a task is something like "update the AI topic for this entry" or "collect the GitHub DOIs for this entry"). The main request thread handler can queue tasks up before it returns, and they'll run asynchronously, later, in their own database session. And/or, tasks can be queued by a thread that inspects the db to determine what needs updates. Or periodically...
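The background-thread variant can be sketched with a small fixed pool of low-priority daemon threads that request handlers enqueue tasks onto; everything below (class name, pool size) is an illustrative assumption, not webservice code.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BackgroundUpdater {

    // Small pool of background-priority daemon threads. Request handlers can
    // queue tasks here and return immediately; each task runs asynchronously,
    // later, in its own context (e.g. its own database session).
    private final ExecutorService pool = Executors.newFixedThreadPool(2, runnable -> {
        Thread t = new Thread(runnable);
        t.setPriority(Thread.MIN_PRIORITY);
        t.setDaemon(true);
        return t;
    });

    public void queue(Runnable task) {
        pool.submit(task);
    }

    // Stop accepting new tasks and let already-queued ones finish.
    public void shutdown() throws InterruptedException {
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```

A task here would be something like "update the AI topic for this entry" or "collect the GitHub DOIs for this entry", queued either by the request handler or by a periodic database inspection thread.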
