Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wrappers for hap.py and dipcall #294

Open
nate-d-olson opened this issue Feb 27, 2021 · 11 comments
Open

wrappers for hap.py and dipcall #294

nate-d-olson opened this issue Feb 27, 2021 · 11 comments
Labels
enhancement New feature or request

Comments

@nate-d-olson
Copy link
Contributor

Is your feature request related to a problem? Please describe.
Difficulty developing an assembly benchmarking pipeline due to the dependencies associated with two key bioinformatic tools use in the pipeline hap.py and dipcall. I know there is a snakemake wrapper for the hap.py pre.py command but we are running into a regex issue when running the hap.py bioconda package in the snakemake/snakemake docker container.

Describe the solution you'd like
We would like to have snakemake wrappers for hap.py, https://github.com/Illumina/hap.py, and dipcall, https://github.com/lh3/dipcall.

Describe alternatives you've considered
Tried using docker containers to handle dependencies but we were unable to get the pipeline to work on both macOS and Linux. Tried using the hap.py bioconda package but ran into a regex error when run in the snakemake/snakemake container. There is an issue on the hap.py github repo related to the error, Illumina/hap.py#66, which seems to be related to specific gcc versions.

Additional context
As part of the Genome in a Bottle project, https://www.nist.gov/programs-projects/genome-bottle we are working to improve the usability and interpretability of our methods for benchmarking variant calls. Hap.py is used for small variant benchmarking and is the reference implementation of the GA4GH best practices for small variant benchmarking. dipcall is used to generate variant calls from a diploid assembly. We developed a proof of concept pipeline using snakemake, https://github.com/usnistgov/giab-asm-benchmarking, but the pipeline only works on macOS due to a hack we used to handle the hap.py and dipcall dependencies.

@nate-d-olson nate-d-olson added the enhancement New feature or request label Feb 27, 2021
@mbargull
Copy link

mbargull commented Mar 1, 2021

Hi @nate-d-olson,

Tried using the hap.py bioconda package but ran into a regex error when run in the snakemake/snakemake container. There is an issue on the hap.py github repo related to the error, Illumina/hap.py#66, which seems to be related to specific gcc versions.

It looks like version 0.3.12 of hap.py fixed the regex issue but we only got that version into Bioconda a couple of months after its release and Snakemake wrapper creation.
I'll relay the need for the updated wrapper right away.
And please feel free to let us know of any further issue you might encounter after we get the updated wrapper up!

Cheers,
Marcel

@johanneskoester
Copy link
Contributor

There is another fix needed upstream, see Illumina/hap.py#113

@johanneskoester
Copy link
Contributor

#295

@johanneskoester
Copy link
Contributor

Ok, first PR is merged, hap.py should now work fine for you.

@johanneskoester
Copy link
Contributor

@mbargull has a look at dipcall now.

@nate-d-olson
Copy link
Contributor Author

Thanks for your responsiveness and for working to update the hap.py bioconda package and hap.py/pre.py wrapper. We use the main hap.py function, which does the comparison and calculates performance metrics, in our pipeline. I tried to run the main hap.py command, hap.py {TRUTH}.vcf.gz {QUERY}.vcf.gz -f {TRUTH REGIONS}.bed -r {REF}.fa --engine=vcfeval --stratification {STRATIFICATIONS}.tsv -o {OUTDIR}, with the new conda package and the snakemake/snakemake docker container and get the following error, likely introduced with the recent update to hap.py (Illumina/hap.py#113).

2021-03-03 16:37:26,577 WARNING  No reference file found at default locations. You can set the environment variable 'HGREF' or 'HG19' to point to a suitable Fasta file.
Hap.py 
2021-03-03 16:37:26,700 ERROR    'Namespace' object has no attribute 'convert_gvcf_truth'
2021-03-03 16:37:26,700 ERROR    Traceback (most recent call last):
2021-03-03 16:37:26,701 ERROR      File "/opt/conda/envs/happy/bin/hap.py", line 529, in <module>
2021-03-03 16:37:26,701 ERROR        main()
2021-03-03 16:37:26,701 ERROR      File "/opt/conda/envs/happy/bin/hap.py", line 286, in main
2021-03-03 16:37:26,701 ERROR        if args.convert_gvcf_truth:
2021-03-03 16:37:26,701 ERROR    AttributeError: 'Namespace' object has no attribute 'convert_gvcf_truth'

@johanneskoester
Copy link
Contributor

They should really have a test framework. Yes, the main has to be adapted as well.

@nate-d-olson
Copy link
Contributor Author

Here is my first attempt at making a wrapper for the main hap.py command. nate-d-olson@34e7de2

I had a few questions regarding recommendations/ best practices for creating wrappers.

  1. Defining default values for a variable
  2. Recommendations for when the output is a directory, or should all expected output files be explicitly defined.
  3. Recommendation for when input is a directory (here a directory of bed files is used to define the stratified inputs)
  4. Suggested file size for test data.

@johanneskoester
Copy link
Contributor

Thanks for asking, I've added comments on your commit.

@nate-d-olson
Copy link
Contributor Author

nate-d-olson commented Mar 30, 2021

@jafors and @johanneskoester thank you for your help with the hap.py bioconda packages and snakemake-wrapper. Now that we have a dipcall bioconda package I wanted to start working on the dipcall snakemake-wrapper. Running dipcall is a two-step process; first run-dipcall generates a makefile with the dipcall pipeline, then make is used to execute the pipeline. See example command from the dipcall README below, https://github.com/lh3/dipcall. Should the dipcall wrapper perform both steps, have separate wrappers for both step with a meta-wrapper combining the two, or something else?

dipcall.kit/run-dipcall prefix hs38.fa pat.fa.gz mat.fa.gz > prefix.mak
# or for male, requiring PAR regions in BED
# dipcall.kit/run-dipcall -x dipcall.kit/hs38.PAR.bed prefix hs38.fa pat.fa.gz mat.fa.gz > prefix.mak
make -j2 -f prefix.mak

@johanneskoester
Copy link
Contributor

So, I'd say, if the makefile generation is super quick, just put both in one wrapper. If it takes time, while being less parallel than the second step, use two wrappers, and ideally add a meta-wrapper for their combination.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants