feat: Add configurable protocol support to ensembl reference download #2649

Open
wants to merge 5 commits into
base: master

@pettyalex pettyalex commented Feb 15, 2024

This PR makes the download protocol (ftp, http, or https) configurable for the ensembl reference data download and the vep cache download.

I was evaluating snakemake-wrapper and widely used workflows as a tool for my group to potentially use, and immediately ran into our firewall rules. By default we can make outgoing HTTP and HTTPS requests, but not FTP. I can request a change to our firewall, but it would also be helpful if the rules that download reference files could do so over HTTP, as ftp.ensembl.org has been available over HTTP for a very long time.

If you'd prefer, I can create an issue to track this. I also intend to update test cases on these wrappers, and if you are happy with this pattern I could apply it across all wrappers in this repository that are currently hard-coded to make ftp requests.
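
A minimal sketch of the idea, not the code in this PR: the wrapper takes a protocol setting, checks it against the three supported schemes, and prefixes the Ensembl host with it. The function name `build_ensembl_url`, the `protocol` argument, and the example path are assumptions made for illustration only.

```python
ALLOWED_PROTOCOLS = ("ftp", "http", "https")
ENSEMBL_HOST = "ftp.ensembl.org"  # host named in the PR description


def build_ensembl_url(path: str, protocol: str = "ftp") -> str:
    """Return a download URL for `path` on the Ensembl server (hypothetical helper)."""
    protocol = protocol.lower()
    if protocol not in ALLOWED_PROTOCOLS:
        raise ValueError(
            f"Unsupported protocol {protocol!r}; expected one of {ALLOWED_PROTOCOLS}"
        )
    # Strip a leading slash so the URL never contains a double slash after the host.
    return f"{protocol}://{ENSEMBL_HOST}/{path.lstrip('/')}"


# Example path; the release number and directory layout are illustrative only.
print(build_ensembl_url("pub/release-105/fasta/", protocol="https"))
```

Keeping ftp as the default in such a helper would leave existing workflows unchanged, while sites behind a firewall could switch to http or https with a single setting.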

QC

  • I confirm that:

For all wrappers added by this PR,

  • there is a test case which covers any introduced changes,
  • input: and output: file paths in the resulting rule can be changed arbitrarily,
  • either the wrapper can only use a single core, or the example rule contains a threads: x statement with x being a reasonable default,
  • rule names in the test case are in snake_case and somehow tell what the rule is about or match the tool's purpose or name (e.g., map_reads for a step that maps reads),
  • all environment.yaml specifications follow the respective best practices,
  • the environment.yaml pinning has been updated by running snakedeploy pin-conda-envs environment.yaml on a linux machine,
  • wherever possible, command line arguments are inferred and set automatically (e.g. based on file extensions in input: or output:),
  • all fields of the example rules in the Snakefiles and their entries are explained via comments (input:/output:/params: etc.),
  • stderr and/or stdout are logged correctly (log:), depending on the wrapped tool,
  • temporary files are either written to a unique hidden folder in the working directory, or (better) stored where the Python function tempfile.gettempdir() points to (see here; this also means that using any Python tempfile default behavior works),
  • the meta.yaml contains a link to the documentation of the respective tool or command,
  • Snakefiles pass the linting (snakemake --lint),
  • Snakefiles are formatted with snakefmt,
  • Python wrapper scripts are formatted with black.
  • Conda environments use a minimal amount of channels, in recommended ordering. E.g. for bioconda, use conda-forge, bioconda, nodefaults (conda-forge should have highest priority, and the defaults channels are usually not needed because most packages are in conda-forge nowadays).

…embl reference data download and vep cache download.
@fgvieira
Collaborator

Thanks for your contribution.
Can you add a test case for (e.g.) HTTP and HTTPS?

@pettyalex
Author

I've run snakefmt, black and the linter over these changes. I believe this to be ready to merge.

@fgvieira
Collaborator

@pettyalex can you fix the failing test?
And you need to add the tests to test.py.

@pettyalex
Author

@fgvieira I fixed all the tests, and now see that what I had previously done introduced multiple ways to generate the desired output files.

I also unintentionally ran black over test.py, which does not seem to be formatted with black at the moment. Should I revert that formatting, or should test.py be run through black as well?

def test_vep_cache_https_protocol():
    run(
        "bio/vep/cache",
        ["snakemake", "--cores", "1", "resources/vep/cache", "--use-conda", "-F"],
Collaborator

Suggested change
- ["snakemake", "--cores", "1", "resources/vep/cache", "--use-conda", "-F"],
+ ["snakemake", "--cores", "1", "resources/vep/cache", "--use-conda", "-F", "-s", "vep_cache_https.smk"],

Comment on lines +5865 to +5866
"-s",
"vep_cache_https.smk",
Collaborator

Suggested change
- "-s",
- "vep_cache_https.smk",

@@ -5428,17 +5461,26 @@ def test_ensembl_variation_with_contig_lengths():


@skip_if_not_modified
def test_ega_fetch():
def test_ensembl_variation_old_release():
Collaborator

Suggested change
- def test_ensembl_variation_old_release():
+ def test_ensembl_variation_old_release_https_protocol():
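
For context on the `-s` argument discussed in this review, here is a small sketch of how the argument list for a per-protocol test could be assembled. The helper `snakemake_cmd` is hypothetical and is not part of test.py, which uses its own `run` helper; `-s` is Snakemake's option for selecting a Snakefile other than the default.

```python
from typing import List, Optional


def snakemake_cmd(target: str, snakefile: Optional[str] = None) -> List[str]:
    """Build the argument list for one test run (hypothetical helper)."""
    cmd = ["snakemake", "--cores", "1", target, "--use-conda", "-F"]
    if snakefile is not None:
        # -s points Snakemake at a non-default Snakefile in the test folder
        cmd += ["-s", snakefile]
    return cmd


# Default Snakefile versus the HTTPS variant named in the review comments
print(snakemake_cmd("resources/vep/cache"))
print(snakemake_cmd("resources/vep/cache", snakefile="vep_cache_https.smk"))
```

With one Snakefile per protocol, each test would differ only in the Snakefile it selects, which is why the test name and the `-s` value need to match.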

Contributor

@johanneskoester johanneskoester left a comment

Fantastic, thanks! Can you please also add the respective protocol lines to each Snakefile in the test folder? Those are used to generate the wrapper docs, from which people can copy and paste into their workflows.

@fgvieira fgvieira mentioned this pull request May 13, 2024
1 task
johanneskoester pushed a commit that referenced this pull request May 28, 2024
Allow for custom URLs (fix issues #366 and #2649).
@fgvieira
Collaborator

PR #2928 allowed for custom URLs when downloading from VEP cache. Maybe something similar can be done here for ensembl annotation/variation/sequence?
