archaic_match

Calculate archaic match percent for haplotype windows

Installation

Latest development version

    pip install git+https://github.com/lparsons/archaic_match

Other (non-python) requirements

sqlite3
tabix

Usage

Calculate match percent counts to create null distribution database

    archaic_match max-match-pct
        --vcf data/simulated_test/null/vcfs/Tenn_nonAfr_*_n1_0.1_n2_0.0.mod.vcf.gz
        --populations data/simulated_test/null/Tenn.popfile  # file with columns for sample and population
        --archaic-populations Neand1  # archaic population to match against
        --modern-populations EUR AFR  # populations to generate match percents for
        --chrom-sizes 1000000  # or file with chrom and size columns
        [--window-size [BP] (default: 50000)]
        [--step-size [BP] (default: 10000)]
        [--informative_site_method {derived_in_archaic,derived_in_archaic_or_modern} [default: derived_in_archaic]
        > output/simulated_test/null_tables/afr_eur-vs-neand1/tsv/max_match_pct_counts_all.tsv

Build database from match percent count table(s)

    archaic_match build-db
        --match-pct-count output/simulated_test/null_tables/afr_eur-vs-neand1/tsv/max_match_pct_counts_all.tsv
        --db output/simulated_test/null_tables/afr_eur-vs-neand1/max_match_pct_test.db

Calculate match percents and pvalues

Compares each modern haplotype to each archaic haplotype.

For each window (as defined by window_size and step_size), calculates the frequency of informative sites and the maximum archaic match percent.

Additionally, a empirical pvalue is computed using the database of precomputed match percents using windows that have an informative site frequency within frequency_threshold of the query window.

    archaic_match max-match-pct
        --vcf data/simulated_test/0.1pct/vcfs/Tenn_nonAfr_*_n1_0.1_n2_0.0.mod.vcf.gz
        --populations data/simulated_test/null/Tenn.popfile  # file with columns for sample and population
        --archaic-populations Neand1  # archaic population to match against
        --modern-populations EUR AFR  # populations to generate match percents for
        --chrom-sizes=1000000  # or file with chrom and size columns
        [--window-size [BP] (default: 50000)]
        [--step-size [BP] (default: 10000)]
        [--informative_site_method {derived_in_archaic,derived_in_archaic_or_modern} [default: derived_in_archaic]
        --match-pct-database output/simulated_test/null_tables/afr_eur-vs-neand1/max_match_pct.db
        [--informative-site-range [FLOAT] (default: 0)]
        [--overlap-regions [BED_FILE]]
        > output/simulated_test/0.1_pct_pvalues.txt

Optional region overlap

Specifying a bed file with regions will generate two additional output columns:

region_overlap_bp: The number of basepairs of the window that overlap an introgressed region.
region_informative_sites: The number of informative sites in the window that overlap an introgressed region.

This is helpful for assessing the amount of a window that in known to be introgressed in simulated data.

The format of the region bedfile should be a tab delimited file with the following columns:

Chromsome (sequence id where the region is located)
Start (zero based start position of the region)
End (one-based end position of the region)
Sample [Optional] (if specified, only windows associated with this sample will be reported as having overlap)

Note for TreeCalls bedfiles

The TreeCalls bedfiles output by the ??? simulator should have the haplotype id converted into the associated sample name (e.g. 220 => msp_110). They should also be combined into a single bedfile. This can be accomplished using the included column_replace command (e.g.):

    column_replace \
        data/simulated_test/0.05_pct/TreeCalls/Tenn_nonAfr_*_n1_0.05_n2_0.0.bed.merged.gz \
        -d data/simulated_test/Tenn_haplotype_to_sample.txt \
        -c 4 \
        | sort -k 1,1 -k 2,2n \
        > data/simulated_test/0/05_pct/TreeCalls/combined_introgressed_regions.bed

Name		Name	Last commit message	Last commit date
Latest commit History 30 Commits
.circleci		.circleci
archaic_match		archaic_match
tests		tests
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
README.md		README.md
requirements-test.txt		requirements-test.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.circleci

.circleci

archaic_match

archaic_match

tests

tests

.gitignore

.gitignore

CHANGELOG.md

CHANGELOG.md

LICENSE

LICENSE

README.md

README.md

requirements-test.txt

requirements-test.txt

setup.py

setup.py

Repository files navigation

archaic_match

Installation

Usage

Calculate match percent counts to create null distribution database

Build database from match percent count table(s)

Calculate match percents and pvalues

Optional region overlap

About

Releases

Packages

Languages

License

lparsons/archaic_match

Folders and files

Latest commit

History

Repository files navigation

archaic_match

Installation

Usage

Calculate match percent counts to create null distribution database

Build database from match percent count table(s)

Calculate match percents and pvalues

Optional region overlap

About

Resources

License

Stars

Watchers

Forks

Languages