
Harvest DOIs from Zenodo Proof of Concept #5880

Draft: wants to merge 3 commits into develop

Conversation


@coverbeck coverbeck commented May 7, 2024

Description
This is very drafty; I'm looking for feedback on the overall concept, and whether we should go down this route. FWIW, I think we should.

The code searches Zenodo for DOIs that reference the GitHub repos of published workflows. It found DOIs for 89 repos that have registered workflows in Dockstore. 19 of those repos have more than one workflow; for those, we would not be able to tell which workflow(s) the DOI applies to.

To complete this PR:

  1. Fetch all the DOI versions related to a single DOI (another Zenodo call)
  2. Assign the DOIs to workflow versions
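As a rough sketch of what step 1 might look like: Zenodo groups all versions of a record under a "concept" record id, and the related versions can be pulled back with a records search. The helper below only builds the query URL; the class name, the `conceptrecid` field, and the `all_versions` parameter are assumptions about Zenodo's search API, not code from this PR.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class ZenodoVersionQuery {

    // Build a Zenodo records-search URL that should return every version
    // belonging to one concept record id. Field and parameter names follow
    // Zenodo's documented search schema; treat them as assumptions.
    public static String versionsQuery(String conceptRecId) {
        final String query = "conceptrecid:" + conceptRecId;
        // all_versions=true asks Zenodo to include every version, not just the latest
        return "https://zenodo.org/api/records?all_versions=true&q="
                + URLEncoder.encode(query, StandardCharsets.UTF_8);
    }
}
```

The response would then be parsed for the version DOIs to match against workflow version tags.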

Things to figure out

Some of these also come up in #5879

  • Do we snapshot a version before assigning the DOI? I would argue no.
  • How do we handle a DOI against a repo with multiple workflows?
    • Assign the DOI to all workflows in the repo?
    • Ignore the DOI, i.e., don't assign it. This is my preference at least for the first iteration of this.
    • Some UI that lets the user choose which workflow(s) to apply a DOI to.
  • Do we need to support multiple DOIs for a version?
    • If no, do we just silently fail if there is already a DOI assigned?
  • We probably need to track the "source" of the DOI. For example, Kathy is working on code to issue editing urls for DOIs created by Dockstore with the Dockstore Zenodo account; that won't make sense for these DOIs.
  • We can run the endpoint from this PR when we deploy 1.16 to capture existing DOIs, but how do we capture DOIs created after the deploy? In my few tests, creation of the DOI by the Zenodo/GitHub integration took between a few seconds and several minutes to complete, which means we can't assume the DOI already exists when GitHub Apps notifies us that a tag has been created. Some ideas:
    • We invoke the above endpoint on a schedule, e.g., nightly. How do we do that?
    • We could track creation of tags and invoke a variant of the endpoint for just the known new tags.
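One way to realize the "invoke on a schedule" idea without external cron machinery is an in-process `ScheduledExecutorService`; the sketch below shows only the initial-delay computation for a nightly run (class and method names are invented for illustration, not part of this PR).

```java
import java.time.Duration;
import java.time.LocalTime;
import java.time.ZonedDateTime;

public class NightlyHarvestSchedule {

    // Compute how long to wait from "now" until the next occurrence of runAt,
    // suitable as the initial delay for ScheduledExecutorService.scheduleAtFixedRate
    // with a 24-hour period.
    public static Duration untilNextRun(ZonedDateTime now, LocalTime runAt) {
        ZonedDateTime next = now.with(runAt);
        if (!next.isAfter(now)) {
            // runAt already passed today, so schedule for tomorrow
            next = next.plusDays(1);
        }
        return Duration.between(now, next);
    }
}
```

The delay would feed something like `executor.scheduleAtFixedRate(harvestTask, delay.toMinutes(), Duration.ofDays(1).toMinutes(), TimeUnit.MINUTES)`.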

Review Instructions

Issue
#5745

Security and Privacy

If there are any concerns that require extra attention from the security team, highlight them here and check the box when complete.

  • Security and Privacy assessed

e.g. Does this change...

  • Any user data we collect, or data location?
  • Access control, authentication or authorization?
  • Encryption features?

Please make sure that you've checked the following before submitting your pull request. Thanks!

  • Check that you pass the basic style checks and unit tests by running mvn clean install
  • Ensure that the PR targets the correct branch. Check the milestone or fix version of the ticket.
  • Follow the existing JPA patterns for queries, using named parameters, to avoid SQL injection
  • If you are changing dependencies, check the Snyk status check or the dashboard to ensure you are not introducing new high/critical vulnerabilities
  • Assume that inputs to the API can be malicious, and sanitize and/or check for Denial of Service type values, e.g., massive sizes
  • Do not serve user-uploaded binary images through the Dockstore API
  • Ensure that endpoints that only allow privileged access enforce that with the @RolesAllowed annotation
  • Do not create cookies, although this may change in the future
  • If this PR is for a user-facing feature, create and link a documentation ticket for this feature (usually in the same milestone as the linked issue). Style points if you create a documentation PR directly and link that instead.

@coverbeck coverbeck self-assigned this May 8, 2024
@coverbeck

Ran it on all our workflows. It found 71 DOI-referenced repos that have only one workflow in Dockstore, and an additional 18 DOI-referenced repos that have more than one workflow.

@coverbeck coverbeck marked this pull request as draft May 8, 2024 16:33
@coverbeck coverbeck changed the title from "[skip ci] Initial experiments" to "Harvest DOIs from Zenodo Proof of Concept" May 8, 2024
@coverbeck

Here is the result for repos with 1 workflow:

[
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10901674"
    ],
    "repo": "nf-core/airrflow"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.4208836"
    ],
    "repo": "h3abionet/TADA"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10846111"
    ],
    "repo": "nf-core/funcscan"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10687430"
    ],
    "repo": "nf-core/eager"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10986616"
    ],
    "repo": "nf-core/riboseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10463781"
    ],
    "repo": "nf-core/methylseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10643212"
    ],
    "repo": "nf-core/circdna"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10911752"
    ],
    "repo": "nf-core/metatdenovo"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.7643948"
    ],
    "repo": "nf-core/phyloplace"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10707294"
    ],
    "repo": "nf-core/epitopeprediction"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.8164980"
    ],
    "repo": "nf-core/viralintegration"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.5104871"
    ],
    "repo": "denis-yuen/galaxy-workflow-dockstore-example-2"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.7764938",
      "https://doi.org/10.5281/zenodo.3746584"
    ],
    "repo": "nf-core/viralrecon"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.4723017"
    ],
    "repo": "nf-core/clipseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10651816"
    ],
    "repo": "nf-core/mag"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.8427707"
    ],
    "repo": "nf-core/mhcquant"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.7220729"
    ],
    "repo": "nf-core/hlatyping"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.6515313"
    ],
    "repo": "nf-core/hicar"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10696391"
    ],
    "repo": "nf-core/smrnaseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.7139814"
    ],
    "repo": "nf-core/chipseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.4106005"
    ],
    "repo": "nf-core/proteomicslfq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10554425"
    ],
    "repo": "nf-core/scrnaseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10471647"
    ],
    "repo": "nf-core/rnaseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.11126488"
    ],
    "repo": "nf-core/sarek"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10650749"
    ],
    "repo": "nf-core/molkart"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10783110"
    ],
    "repo": "nf-core/nascent"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10952554"
    ],
    "repo": "nf-core/rnafusion"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.6669637"
    ],
    "repo": "nf-core/rnavar"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.4469317"
    ],
    "repo": "gatk-workflows/gatk4-data-processing"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.6403063"
    ],
    "repo": "kathy-t/workflow-dockstore-yml"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.5104898"
    ],
    "repo": "Richard-Hansen/dockstore-whalesay"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.7994878"
    ],
    "repo": "nf-core/hic"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10036158"
    ],
    "repo": "nf-core/bacass"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10209675"
    ],
    "repo": "nf-core/differentialabundance"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.7716033"
    ],
    "repo": "nf-core/nanoseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.7689178"
    ],
    "repo": "denis-yuen/test-workflows-and-tools"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10728509"
    ],
    "repo": "nf-core/fetchngs"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10124950"
    ],
    "repo": "kathy-t/SRANWRP"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.7941033"
    ],
    "repo": "nf-core/hgtseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.5039442"
    ],
    "repo": "dockstore-personal-testing/gatk4-exome-analysis-pipeline-flat"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.4304953"
    ],
    "repo": "Richard-Hansen/hello_world"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.3508160",
      "https://doi.org/10.5281/zenodo.3928817",
      "https://doi.org/10.5281/zenodo.3401699"
    ],
    "repo": "wshands/hmmer-docker"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.7080256"
    ],
    "repo": "david4096/autopotato-attack"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.8222875"
    ],
    "repo": "nf-core/atacseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.2628872"
    ],
    "repo": "ICGC-TCGA-PanCancer/Seqware-BWA-Workflow"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.6141389"
    ],
    "repo": "Richard-Hansen/dockstore-tool-helloworld"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.5104873"
    ],
    "repo": "garyluu/example_cwl_workflow"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.1491630"
    ],
    "repo": "nf-core/deepvariant"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.3571864"
    ],
    "repo": "nf-core/neutronstar"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.2582812"
    ],
    "repo": "SciLifeLab/Sarek"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.4536530"
    ],
    "repo": "Richard-Hansen/dockstore-whalesay2"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10668725"
    ],
    "repo": "ENCODE-DCC/atac-seq-pipeline"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10718449"
    ],
    "repo": "nf-core/demultiplex"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.8354570"
    ],
    "repo": "iwc-workflows/rnaseq-pe"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10067429"
    ],
    "repo": "nf-core/quantms"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.7629996"
    ],
    "repo": "nf-core/proteinfold"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10622411"
    ],
    "repo": "nf-core/readsimulator"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10868876"
    ],
    "repo": "nf-core/raredisease"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10912278"
    ],
    "repo": "nf-core/ampliseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10527467"
    ],
    "repo": "nf-core/nanostring"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.8159051"
    ],
    "repo": "nf-core/marsseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10634361"
    ],
    "repo": "nf-core/taxprofiler"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.8354480"
    ],
    "repo": "iwc-workflows/rnaseq-sr"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.6403310"
    ],
    "repo": "kathy-t/hello-wdl-workflow"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10895229"
    ],
    "repo": "nf-core/pixelator"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10606804"
    ],
    "repo": "nf-core/cutandrun"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10877148"
    ],
    "repo": "nf-core/detaxizer"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.4540719"
    ],
    "repo": "nf-core/dualrnaseq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10406093"
    ],
    "repo": "nf-core/metaboigniter"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.8414663"
    ],
    "repo": "nf-core/bamtofastq"
  },
  {
    "dois": [
      "https://doi.org/10.5281/zenodo.10696998"
    ],
    "repo": "nf-core/rnasplice"
  }
]


denis-yuen commented May 8, 2024

Do we snapshot a version before assigning the DOI? I would argue no.

I think no too. I think I'm leaning toward @svonworl 's (?) idea to have more database structure around how we track DOIs and track user-generated, auto-generated, and github harvested(?) DOIs separately.

This is my preference at least for the first iteration of this.

Works for me, also ok with the "let the user choose"
For most purposes, I'd think that "let the user choose from a list of likely suspects" would be ok as a first/second pass too, just not a final pass

We probably need to track the "source" of the DOI. For example, Kathy is working on code to issue editing urls for DOIs created by Dockstore with the Dockstore Zenodo account; that won't make sense for these DOIs.

Kinda feel like they should just be separately tracked/different classes.
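The "separately tracked/different classes" idea could be as small as tagging each DOI with its provenance. A minimal sketch, where the type and value names are invented for illustration and are not this PR's schema:

```java
public class Doi {

    // Provenance of a DOI: created by the user themselves, minted by
    // Dockstore's own Zenodo account, or harvested from the Zenodo/GitHub
    // integration. Illustrative names only.
    public enum Source { USER_CREATED, DOCKSTORE_MINTED, GITHUB_HARVESTED }

    private final String name;
    private final Source source;

    public Doi(String name, Source source) {
        this.name = name;
        this.source = source;
    }

    // Per the discussion above, only DOIs minted with the Dockstore Zenodo
    // account can sensibly get Zenodo edit links.
    public boolean supportsEditLinks() {
        return source == Source.DOCKSTORE_MINTED;
    }

    public String getName() { return name; }
    public Source getSource() { return source; }
}
```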

We invoke the above endpoint on a schedule, e.g., nightly. How do we do that?

Seems familiar @kathy-t
More for a periodic listener, a registration method for cron-like events, etc. (along with topic generation, metric aggregation, etc.)
Unless we do it all externally like with the ECS cron

final PreviewApi previewApi = new PreviewApi(zenodoClient);
// Quote the repo name so the Zenodo search treats it as a phrase;
// the two-arg encode avoids the deprecated platform-default-charset overload
final String query = URLEncoder.encode('"' + gitHubRepo + '"', StandardCharsets.UTF_8);
final int pageSize = 100;
final SearchResult records = previewApi.listRecords(query, "bestmatch", 1, pageSize);
Review comment (Member):

Any significance to 100?
Seems like we could get away with lower, especially if we ignore multiple DOIs for a repo as a first pass
(or is the search likely to get stuff from other repos?)

Reply (Contributor Author):

It was a guess to start with. Because it's an Elasticsearch query and not an exact match (which I can't figure out how to do), I figured I'd cast a wider net and narrow down on a specific field. It will probably require some tinkering.


kathy-t commented May 9, 2024

I think I'm leaning toward @svonworl 's (?) idea to have more database structure around how we track DOIs and track user-generated, auto-generated, and github harvested(?) DOIs separately.

I'm working on implementing this in my PR to allow users to generate their own DOIs for a workflow that already has Dockstore DOIs

We invoke the above endpoint on a schedule, e.g., nightly. How do we do that?
Seems familiar @kathy-t
More for a periodic listener, a registration method for cron-like events, etc. (along with topic generation, metric aggregation, etc.)
Unless we do it all externally like with the ECS cron

Perhaps we can set up some type of AWS queue that has these tags that need to be checked and we can have a scheduled lambda that pulls from the queue (AWS doc)

@denis-yuen

Perhaps we can set up some type of AWS queue that has these tags that need to be checked and we can have a scheduled lambda that pulls from the queue

May be overkill; just like with topics, it should be possible to compute which tags are eligible for DOIs and process them without needing to keep any extra state around.


svonworl commented May 9, 2024

19 of those repos have more than 1 workflow, we would not be able to tell which workflow(s) the DOI applies to.

The "GitHub" DOIs seem to reference the repo, rather than a particular entry within. So, maybe said DOIs reference all entries in the repo? [Postscript: just noticed you mentioned this possibility farther down in your description...]


svonworl commented May 9, 2024

Perhaps we can set up some type of AWS queue that has these tags that need to be checked and we can have a scheduled lambda that pulls from the queue (AWS doc)

Random idea (with its own set of pros and cons):

An "updater" Java application that accesses the db via our Hibernate interfaces and runs alongside the webservice. It figures out what needs to be updated (via a queue or maybe a periodic db query that returns what's been recently changed, etc) and then updates it. Could update AI topics, collect DOIs, etc. Could be a single monolithic updater with plugins, or separate updaters specialized for each type of update. Good for asynchronous updates that might produce tardy responses if we tried to do the updates in the webservice request handlers. Not as scalable as lambdas, but probably easier to code.


svonworl commented May 10, 2024

Perhaps we can set up some type of AWS queue that has these tags that need to be checked and we can have a scheduled lambda that pulls from the queue (AWS doc)

Random idea (with its own set of pros and cons):

An "updater" Java application that accesses the db via our Hibernate interfaces [...]

A variant is that, instead of a separate application, the "updater" is a pool of background priority threads that runs in the webservice application itself. It pulls tasks from a pool and executes them (where a task is something like "update the AI topic for this entry" or "collect the GitHub DOIs for this entry"). The main request thread handler can queue tasks up before it returns, and they'll run asynchronously, later, in their own database session. And/or, tasks can be queued by a thread that inspects the db to determine what needs updates. Or periodically...
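The background-thread variant can be sketched with a small fixed pool of low-priority daemon threads that request handlers enqueue tasks onto; everything below (class name, pool size) is an illustrative assumption, not webservice code.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BackgroundUpdater {

    // Small pool of background-priority daemon threads. Request handlers can
    // queue tasks here and return immediately; each task runs asynchronously,
    // later, in its own context (e.g. its own database session).
    private final ExecutorService pool = Executors.newFixedThreadPool(2, runnable -> {
        Thread t = new Thread(runnable);
        t.setPriority(Thread.MIN_PRIORITY);
        t.setDaemon(true);
        return t;
    });

    public void queue(Runnable task) {
        pool.submit(task);
    }

    // Stop accepting new tasks and let already-queued ones finish.
    public void shutdown() throws InterruptedException {
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```

A task here would be something like "update the AI topic for this entry" or "collect the GitHub DOIs for this entry", queued either by the request handler or by a periodic database inspection thread.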
