fix: Allow concurrent and large refgenie downloads #1172

Open · wants to merge 20 commits into master

Conversation

Austin-s-h (Contributor)

Description

After running into a few instances of the refgenie wrapper downloading individual assets concurrently, I was met with a lock-timeout error when attempting to pull two large assets at the same time (I'm looking at you, hg38_cdna/salmon_index at ~25 GB). The default timeout is 60 seconds, and I wanted to handle this error rather than let the rule fail.

So, I slightly modified the refgenie wrapper to catch the RefgenconfError that is raised when the lock cannot be obtained and, in that case, to skip the read lock requirement. This may or may not be desirable behavior across all pipelines, but it resolved the issue in mine and passes the testing requirements.

I added a rule that mimics obtaining a large asset, but I am not familiar enough with the wrapper system yet to know whether simply adding a rule means that it is tested.

Additionally, passing force_large=True was necessary to download assets larger than the 5 GB "large" threshold.
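
Roughly, the changed part of wrapper.py now looks like this (a simplified sketch rather than the exact diff; conf_path, genome, asset and tag are defined earlier in the wrapper, and the import location of RefgenconfError is my assumption):

import refgenconf
from refgenconf.exceptions import RefgenconfError  # assumed import path

try:
    rgc = refgenconf.RefGenConf(conf_path, writable=True)
except RefgenconfError:
    # if the read lock times out (e.g. another job is pulling a large asset),
    # retry without requiring the read lock
    rgc = refgenconf.RefGenConf(
        conf_path, writable=True, skip_read_lock=True, genome_exact=False
    )

# pull the asset if necessary; force_large=True avoids the interactive prompt
# for assets above the 5 GB threshold
gat, archive_data, server_url = rgc.pull(
    genome, asset, tag, force=False, force_large=True
)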

QC

  • I confirm that:

For all wrappers added by this PR,

  • there is a test case which covers any introduced changes,
  • input: and output: file paths in the resulting rule can be changed arbitrarily,
  • either the wrapper can only use a single core, or the example rule contains a threads: x statement with x being a reasonable default,
  • rule names in the test case are in snake_case and somehow tell what the rule is about or match the tool's purpose or name (e.g., map_reads for a step that maps reads),
  • all environment.yaml specifications follow the respective best practices,
  • wherever possible, command line arguments are inferred and set automatically (e.g. based on file extensions in input: or output:),
  • all fields of the example rules in the Snakefiles and their entries are explained via comments (input:/output:/params: etc.),
  • stderr and/or stdout are logged correctly (log:), depending on the wrapped tool,
  • temporary files are either written to a unique hidden folder in the working directory, or (better) stored where the Python function tempfile.gettempdir() points to (see here; this also means that using any Python tempfile default behavior works),
  • the meta.yaml contains a link to the documentation of the respective tool or command,
  • Snakefiles pass the linting (snakemake --lint),
  • Snakefiles are formatted with snakefmt,
  • Python wrapper scripts are formatted with black.
  • Conda environments use a minimal amount of channels, in recommended ordering. E.g. for bioconda, use conda-forge, bioconda, nodefaults (conda-forge should have highest priority, and the defaults channels are usually not needed because most packages are in conda-forge nowadays).

dlaehnemann (Contributor) left a comment:

Thanks for looking into this!

To get your new rule tested, have a look at the existing refgenie test (in test.py):

snakemake-wrappers/test.py, lines 4322 to 4330 in 0d2c92a:

@skip_if_not_modified
def test_refgenie():
    try:
        shutil.copytree("bio/refgenie/test/genome_folder", "/tmp/genome_folder")
    except FileExistsError:
        # no worries, the directory is already there
        pass
    os.environ["REFGENIE"] = "/tmp/genome_folder/genome_config.yaml"
    run("bio/refgenie", ["snakemake", "--cores", "1", "--use-conda", "-F"])

You would need to include another test that requires your new rule's output. However, before doing so, it might make sense to change the example to an asset that is only slightly over the 5 GB threshold, because otherwise the test would do an excessive download (for example the 25 GB you mentioned) and fill up the GitHub Actions VM's disk space every time it is run (even though it is only run if anything changes in this wrapper).
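
A new test could follow the same pattern, roughly like this (just a sketch; the function name and the pull_large_asset target are placeholders that would have to match the new rule in your test Snakefile):

@skip_if_not_modified
def test_refgenie_large_asset():
    try:
        shutil.copytree("bio/refgenie/test/genome_folder", "/tmp/genome_folder")
    except FileExistsError:
        # no worries, the directory is already there
        pass
    os.environ["REFGENIE"] = "/tmp/genome_folder/genome_config.yaml"
    # request the new rule's target explicitly, so the large-asset code path is exercised
    run(
        "bio/refgenie",
        ["snakemake", "--cores", "1", "--use-conda", "-F", "pull_large_asset"],
    )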

except RefgenconfError:
    # If read lock timeout, attempt to skip the read lock
    rgc = refgenconf.RefGenConf(
        conf_path, writable=True, skip_read_lock=True, genome_exact=False
    )
Contributor:

I couldn't really find out what the exact implications of skip_read_lock=True are, but to me it seems dangerous to use. Have you also tried increasing wait_max= as an alternative?

Contributor (Author):

I didn't attempt to, but I suspect that this might not be a great choice either. If someone is downloading an asset over a slow connection, even raising wait_max from its default of 60 to 600 might not make a difference and would still result in a hard-to-diagnose timeout error.

I'm also not sure whether this was some sort of conflict with the Snakemake locking system. If we rely on that to protect other files, then the wrapper either produces the output file, or the rule fails with a RefgenconfError and recommends setting the skip_read_lock=True param to try to fix the issue.

Contributor:

From what I gathered by poking around a little, I think that the lock only happens while something is written to the conf file. So I would assume that this lock is not in place for the whole time you are doing the download, and that wait_max= should already help. But the documentation on this is not very clear and I didn't immediately find the mechanism in the code, so I might be misunderstanding this lock.

Do you have the possibility to try wait_max= in your use case and test whether this actually helps?
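
For reference, what I have in mind is something along these lines (a sketch only; I am assuming the RefGenConf constructor accepts wait_max=, and 600 is an arbitrary example value, not a recommendation):

# wait up to 10 minutes for the lock instead of the default 60 seconds
rgc = refgenconf.RefGenConf(conf_path, writable=True, wait_max=600)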

  # pull asset if necessary
- gat, archive_data, server_url = rgc.pull(genome, asset, tag, force=False)
+ gat, archive_data, server_url = rgc.pull(
+     genome, asset, tag, force=False, force_large=True
+ )
Contributor:

Is force_large=True a good general default, or would it make more sense to make this settable via the params: keyword in the rule definition? I am assuming their default of prompting has a reason, to avoid accidental downloads of huge reference data, and having to explicitly specify this via params: would at least be a minimal sanity check that the user knows what they are doing.

Contributor (Author):

I think that is a good alternative to implement. As is, there is no way to override this while using the wrapper.

Contributor:

Do you feel comfortable implementing this?

I'd introduce an (optional) params: force_large=True in one of the examples, and parse it here in wrapper.py with force_large=snakemake.params.get("force_large", None), so that it defaults to whatever the original function's default is and only changes when this is a deliberate choice by the user.
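
Something along these lines (a sketch; the remaining fields of the example rule and the rest of wrapper.py stay as they are):

# in one of the example rules in the test Snakefile:
#     params:
#         force_large=True  # deliberate opt-in to downloads above the 5 GB threshold

# in wrapper.py, fall back to refgenconf's own default when the param is not set:
force_large = snakemake.params.get("force_large", None)
gat, archive_data, server_url = rgc.pull(
    genome, asset, tag, force=False, force_large=force_large
)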

mergify bot mentioned this pull request on Apr 17, 2023.
github-actions bot commented on Nov 1, 2023:

This PR was marked as stale because it has been open for 6 months with no activity.

github-actions bot added the Stale label on Nov 1, 2023.