
using directory() issues #31

Open · jjjermiah opened this issue Mar 18, 2024 · 10 comments

@jjjermiah commented Mar 18, 2024

Hi there,

I'm unsure whether I'm missing something here or whether the directory() functionality is simply not usable with the new storage plugins.
It looks like there is an issue with how the mtime is registered.

Snakefile

# Target rule that requires the three directories.
rule get_directories:
    input:
        expand(
            "data/{directory}",
            directory=["a", "b", "c"]
        )

# Produces a directory() output, which the storage plugin uploads to GCS.
rule make_a_directory:
    output:
        directory("data/{directory}")
    shell:
        """
        mkdir -p {output}
        """

Output

$ snakemake -c4  --default-storage-provider gcs --default-storage-prefix gs://my_test_bucket/test_snakemake get_directories
Assuming unrestricted shared filesystem usage.
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 4
Rules claiming more threads will be scaled down.
Job stats:
job                 count
----------------  -------
get_directories         1
make_a_directory        3
total                   4

Select jobs to execute...
Execute 3 jobs...

[Mon Mar 18 16:03:46 2024]
localrule make_a_directory:
    output: gs://my_test_bucket/test_snakemake/data/b (send to storage)
    jobid: 2
    reason: Missing output files: gs://my_test_bucket/test_snakemake/data/b (send to storage)
    wildcards: directory=b
    resources: tmpdir=/tmp


[Mon Mar 18 16:03:46 2024]
localrule make_a_directory:
    output: gs://my_test_bucket/test_snakemake/data/a (send to storage)
    jobid: 1
    reason: Missing output files: gs://my_test_bucket/test_snakemake/data/a (send to storage)
    wildcards: directory=a
    resources: tmpdir=/tmp


[Mon Mar 18 16:03:46 2024]
localrule make_a_directory:
    output: gs://my_test_bucket/test_snakemake/data/c (send to storage)
    jobid: 3
    reason: Missing output files: gs://my_test_bucket/test_snakemake/data/c (send to storage)
    wildcards: directory=c
    resources: tmpdir=/tmp

Storing in storage: gs://my_test_bucket/test_snakemake/data/b
Finished upload.
WorkflowError:
Failed to get mtime of gs://my_test_bucket/test_snakemake/data/b
ValueError: max() iterable argument is empty
Removing output files of failed job make_a_directory since they might be corrupted:
.snakemake/storage/gcs/my_test_bucket/test_snakemake/data/b, .snakemake/storage/gcs/my_test_bucket/test_snakemake/data/b
Storing in storage: gs://my_test_bucket/test_snakemake/data/a
Finished upload.
WorkflowError:
Failed to get mtime of gs://my_test_bucket/test_snakemake/data/a
ValueError: max() iterable argument is empty
Removing output files of failed job make_a_directory since they might be corrupted:
.snakemake/storage/gcs/my_test_bucket/test_snakemake/data/a, .snakemake/storage/gcs/my_test_bucket/test_snakemake/data/a
Storing in storage: gs://my_test_bucket/test_snakemake/data/c
Finished upload.
WorkflowError:
Failed to get mtime of gs://my_test_bucket/test_snakemake/data/c
ValueError: max() iterable argument is empty
Removing output files of failed job make_a_directory since they might be corrupted:
.snakemake/storage/gcs/my_test_bucket/test_snakemake/data/c, .snakemake/storage/gcs/my_test_bucket/test_snakemake/data/c
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-03-18T160345.381549.snakemake.log
WorkflowError:
At least one job did not complete successfully.

However, I can confirm that creating files explicitly, instead of using directory(), works:

Snakefile

# Target rule that requires the three files.
rule get_files:
    input:
        expand(
            "data2/{directory}.txt",
            directory=["a", "b", "c"]
        )

# Produces plain file outputs, which upload to GCS without issues.
rule make_a_file:
    output:
        "data2/{directory}.txt"
    shell:
        """
        mkdir -p $(dirname {output})
        touch {output}
        """

Output

$ snakemake -c4  --default-storage-provider gcs --default-storage-prefix gs://my_test_bucket/test_snakemake get_files 
Assuming unrestricted shared filesystem usage.
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 4
Rules claiming more threads will be scaled down.
Job stats:
job            count
-----------  -------
get_files          1
make_a_file        3
total              4

Select jobs to execute...
Execute 3 jobs...

[Mon Mar 18 16:05:36 2024]
localrule make_a_file:
    output: gs://my_test_bucket/test_snakemake/data2/b.txt (send to storage)
    jobid: 2
    reason: Missing output files: gs://my_test_bucket/test_snakemake/data2/b.txt (send to storage)
    wildcards: directory=b
    resources: tmpdir=/tmp


[Mon Mar 18 16:05:37 2024]
localrule make_a_file:
    output: gs://my_test_bucket/test_snakemake/data2/a.txt (send to storage)
    jobid: 1
    reason: Missing output files: gs://my_test_bucket/test_snakemake/data2/a.txt (send to storage)
    wildcards: directory=a
    resources: tmpdir=/tmp


[Mon Mar 18 16:05:37 2024]
localrule make_a_file:
    output: gs://my_test_bucket/test_snakemake/data2/c.txt (send to storage)
    jobid: 3
    reason: Missing output files: gs://my_test_bucket/test_snakemake/data2/c.txt (send to storage)
    wildcards: directory=c
    resources: tmpdir=/tmp

Storing in storage: gs://my_test_bucket/test_snakemake/data2/b.txt
Finished upload.
[Mon Mar 18 16:05:37 2024]
Finished job 2.
1 of 4 steps (25%) done
Storing in storage: gs://my_test_bucket/test_snakemake/data2/a.txt
Finished upload.
[Mon Mar 18 16:05:38 2024]
Finished job 1.
2 of 4 steps (50%) done
Storing in storage: gs://my_test_bucket/test_snakemake/data2/c.txt
Finished upload.
[Mon Mar 18 16:05:39 2024]
Finished job 3.
3 of 4 steps (75%) done
Select jobs to execute...
Execute 1 jobs...

[Mon Mar 18 16:05:39 2024]
localrule get_files:
    input: gs://my_test_bucket/test_snakemake/data2/a.txt (retrieve from storage), gs://my_test_bucket/test_snakemake/data2/b.txt (retrieve from storage), gs://my_test_bucket/test_snakemake/data2/c.txt (retrieve from storage)
    jobid: 0
    reason: Input files updated by another job: gs://my_test_bucket/test_snakemake/data2/a.txt (retrieve from storage), gs://my_test_bucket/test_snakemake/data2/b.txt (retrieve from storage), gs://my_test_bucket/test_snakemake/data2/c.txt (retrieve from storage)
    resources: tmpdir=/tmp

Removing local copy of storage file: .snakemake/storage/gcs/my_test_bucket/test_snakemake/data2/a.txt
Removing local copy of storage file: .snakemake/storage/gcs/my_test_bucket/test_snakemake/data2/b.txt
Removing local copy of storage file: .snakemake/storage/gcs/my_test_bucket/test_snakemake/data2/c.txt
[Mon Mar 18 16:05:39 2024]
Finished job 0.
4 of 4 steps (100%) done
Complete log: .snakemake/log/2024-03-18T160534.827057.snakemake.log

snakemake-storage-plugin-gcs version:

$ python3 -m pip show snakemake-storage-plugin-gcs
Name: snakemake-storage-plugin-gcs
Version: 0.1.4
Summary: A Snakemake storage plugin for Google Cloud Storage
Home-page: https://github.com/snakemake/snakemake-storage-plugin-gcs
Author: Vanessa Sochat
Author-email: sochat1@llnl.gov
License: MIT
Location: /home/bioinf/miniconda3/envs/sreadii-snakemake/lib/python3.12/site-packages
Requires: google-cloud-storage, google-crc32c, snakemake-interface-common, snakemake-interface-storage-plugins
Required-by: 

@jjjermiah (Author)

It also looks like running get_directories creates empty file representations of the directories themselves instead of actual directories, which might be causing the issue: the files are created at 12:03:47 PM (EST), right after the 16:03:46 timestamp in the Snakemake output above.

(screenshots of the bucket showing the empty file objects and their creation times)

@mbrenner-arbor

I have had the same issue. It looks like the actual directories end up being published to gs://bucket/.snakemake/storage/gcs/{given-default-storage-prefix}/{specified-directory}

@jjjermiah (Author)

Yeah, locally the directories are created under .snakemake, but there seems to be an issue with how the plugin tries to get the mtime on the remote.

This makes sense, since blob storage doesn't necessarily share the concept of "directories", but I'm not sure why the new update has this effect when it worked before.
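
For reference, a minimal sketch of how a "directory" mtime could be derived on object storage, assuming the google-cloud-storage client; this is illustrative only, not the plugin's actual code, but it would explain the empty max() error when nothing is listed under the prefix:

from google.cloud import storage


def directory_mtime(bucket_name: str, prefix: str) -> float:
    # Take the newest update time across all objects under the prefix.
    client = storage.Client()
    timestamps = [
        blob.updated.timestamp()
        for blob in client.list_blobs(bucket_name, prefix=prefix)
    ]
    # If nothing is listed under the prefix (e.g. only a zero-byte placeholder
    # object was uploaded under a different key), this list is empty and max()
    # raises "ValueError: max() iterable argument is empty", as in the log above.
    return max(timestamps)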

@johanneskoester (Contributor)

The test suite has been extended to also contain a directory test case. I have activated it here, where we can also apply any fixes that turn out to be necessary: #38

@johanneskoester (Contributor)

Seems to work fine now. Is this still an issue for you with the latest releases of the plugin, Snakemake, and the interface packages?

@jjjermiah (Author)

Thanks @johanneskoester for looking into this.

It doesn't look like it's working yet, but I only just noticed what @mbrenner-arbor mentioned, since the .snakemake prefix is hidden in the GCS bucket view.

I think I have a fix.

@jjjermiah (Author)

Created a PR with #41. @johanneskoester, let me know if this doesn't follow the plugin guidelines and I can amend it.

Thanks!

@jeffhsu3 commented May 7, 2024

Somewhat related, but on the input side: what about downloading when a "directory" prefix is passed as input? Several of the snakemake-wrappers point to a folder in the input (i.e. alignment index folders); should this plugin default to downloading the whole prefix so that it is compatible with those wrappers?
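
For illustration, this is the kind of pattern in question: an index directory passed as a plain input path to a rule. The paths, rule name, and shell command below are hypothetical, not taken from the wrapper:

# Hypothetical rule: an alignment index directory passed as a plain input path,
# similar to what wrappers such as salmon/quant expect.
rule quantify:
    input:
        index="ref/GRCh38/salmon_index",      # a directory, not a single file
        reads="reads/{sample}.fastq.gz",
    output:
        "quant/{sample}/quant.sf",
    shell:
        "salmon quant -i {input.index} -l A -r {input.reads} -o quant/{wildcards.sample}"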

@vsoch (Collaborator) commented May 7, 2024

I don't understand the question - I think there is already support for downloading a directory:

if self.is_directory():
    self._download_directory()

Is there a specific example or use case not working?
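
For context, a rough, illustrative sketch of how retrieving a whole "directory" from GCS could look with the google-cloud-storage client; the function name and layout are assumptions, not the plugin's actual implementation:

from pathlib import Path

from google.cloud import storage


def download_directory(bucket_name: str, prefix: str, local_dir: str) -> None:
    # Download every object under `prefix` into `local_dir`.
    client = storage.Client()
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        # Skip zero-byte "folder marker" objects that some tools create.
        if blob.name.endswith("/"):
            continue
        target = Path(local_dir) / Path(blob.name).relative_to(prefix)
        target.parent.mkdir(parents=True, exist_ok=True)
        blob.download_to_filename(str(target))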

@jeffhsu3 commented May 7, 2024

Sorry, I missed that change, thanks!

I encountered an issue while using the Salmon Quant wrapper (https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/salmon/quant.html) with a Kubernetes executor. The logs only show the retrieval of the main index directory:

Retrieving .snakemake/storage/gcs/ivynatal-tpu-life-sciences/ref/GRCh38/GCF_000001405.40_GRCh38.p14_salmon_index from storage.

The salmon index is passed with a trailing slash.

When each individual file from the index is explicitly listed instead, the logs show

Retrieving .snakemake/storage/gcs/ivynatal-tpu-life-sciences/ref/GRCh38/GCF_000001405.40_GRCh38.p14_salmon_index/refseq.bin from storage.
Retrieving from storage: gcs://ivynatal-tpu-life-sciences/ref/GRCh38/GCF_000001405.40_GRCh38.p14_salmon_index/refseq.bin
Finished retrieval.

It seems to work this way.
