ENH(nextalign cli): show default values in --help usage statement #1253

Open
AngieHinrichs opened this issue Sep 11, 2023 · 4 comments
Labels
t:feat Type: request of a new feature, functionality, enchancement

Comments

@AngieHinrichs

Hi! I'm trying out nextalign on norovirus genomes (small ssRNA, ~7.5kb, but highly diverged), and most sequences are unalignable with nextalign's default settings (Unable to align: low seed matching rate. Details: number of seeds: 73, number of seed matches: 2, matching rate: 0.027, required matching rate: 0.300. Note that this sequence will not be included in the results.).

I'd like to try playing with the seed parameters. nextalign's --help statement describes the params but not their default values:

        --seed-length <SEED_LENGTH>
            k-mer length to determine approximate alignments between query and reference and
            determine the bandwidth of the banded alignment

        --mismatches-allowed <MISMATCHES_ALLOWED>
            Maximum number of mismatching nucleotides allowed for a seed to be considered a match

        --min-seeds <MIN_SEEDS>
            Minimum number of seeds to search for during nucleotide alignment. Relevant for short
            sequences. In long sequences, the number of seeds is determined by `--seed-spacing`

        --min-match-rate <MIN_MATCH_RATE>
            Minimum seed mathing rate (a ratio of seed matches to total number of attempted seeds)

        --seed-spacing <SEED_SPACING>
            Spacing between seeds during nucleotide alignment

It would be nice to know the default values as a starting point for exploring the parameter space. I guess I could figure out the seed length from the 'Unable to align' messages 🙂 but it would be very nice if the --help told them all. Thanks!
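The numbers in the 'Unable to align' message quoted above can be checked by hand. The following is a minimal sketch, not Nextalign's actual code: the function names are hypothetical, and it assumes the check is a plain "matches divided by attempted seeds, compared against the required rate" test, with `>=` as the comparison.

```rust
/// Ratio of seed matches to attempted seeds, as reported in the error message.
/// Returns 0.0 when no seeds were attempted, to avoid dividing by zero.
fn seed_match_rate(num_seeds: usize, num_matches: usize) -> f64 {
    if num_seeds == 0 {
        return 0.0;
    }
    num_matches as f64 / num_seeds as f64
}

/// Hypothetical helper: true when the rate reaches the required minimum.
/// Whether the real check uses `>=` or `>` is an assumption here.
fn passes_seed_check(num_seeds: usize, num_matches: usize, min_match_rate: f64) -> bool {
    seed_match_rate(num_seeds, num_matches) >= min_match_rate
}
```

For the sequence in the message, 2 matches out of 73 seeds gives a rate of about 0.027, far below the required 0.300. Under the assumptions above, at least 22 of the 73 seeds would have to match to pass.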

@AngieHinrichs added the labels "good first issue" (Good for newcomers), "help wanted" (Extra attention is needed), "needs triage" (Mark for review and label assignment), and "t:feat" (Type: request of a new feature, functionality, enchancement) on Sep 11, 2023
@corneliusroemer
Member

corneliusroemer commented Sep 11, 2023 via email

@ivan-aksamentov
Member

ivan-aksamentov commented Sep 12, 2023

The hardcoded defaults for v2 are here (branch v2):

    impl Default for AlignPairwiseParams {
      fn default() -> Self {
        Self {
          min_length: 100,
          penalty_gap_extend: 0,
          penalty_gap_open: 6,
          penalty_gap_open_in_frame: 7,
          penalty_gap_open_out_of_frame: 8,
          penalty_mismatch: 1,
          score_match: 3,
          max_indel: 400,
          seed_length: 21,
          min_seeds: 10,
          min_match_rate: 0.3,
          seed_spacing: 100,
          mismatches_allowed: 3,
          retry_reverse_complement: false,
          no_translate_past_stop: false,
          left_terminal_gaps_free: true,
          right_terminal_gaps_free: true,
          excess_bandwidth: 9,
          terminal_bandwidth: 50,
          gap_alignment_side: GapAlignmentSide::Right,
        }
      }
    }
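As a starting point for exploring the seed parameters, the v2 seed defaults listed above can be treated as a baseline from which individual values are changed. The sketch below uses a hypothetical, trimmed-down `SeedParams` struct (not the real `AlignPairwiseParams`), with field types chosen for illustration; the relaxed values are arbitrary examples, not recommendations.

```rust
/// Hypothetical, trimmed-down stand-in for the seed-related fields of
/// `AlignPairwiseParams`. The field types are assumptions.
#[derive(Debug, Clone, PartialEq)]
struct SeedParams {
    seed_length: usize,
    min_seeds: usize,
    min_match_rate: f64,
    seed_spacing: usize,
    mismatches_allowed: usize,
}

impl Default for SeedParams {
    /// Values copied from the v2 hardcoded defaults quoted in this thread.
    fn default() -> Self {
        Self {
            seed_length: 21,
            min_seeds: 10,
            min_match_rate: 0.3,
            seed_spacing: 100,
            mismatches_allowed: 3,
        }
    }
}

/// Illustrative only: loosen two seed parameters and keep every other
/// field at its default via struct update syntax.
fn relaxed_for_divergent_genomes() -> SeedParams {
    SeedParams {
        min_match_rate: 0.1,
        mismatches_allowed: 5,
        ..SeedParams::default()
    }
}
```

On the command line, the corresponding knobs are the `--min-match-rate` and `--mismatches-allowed` options shown in the help text quoted earlier.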

For v3 (not stable, branch master) the hardcoded defaults are here:

    impl Default for AlignPairwiseParams {
      fn default() -> Self {
        Self {
          min_length: 100,
          penalty_gap_extend: 0,
          penalty_gap_open: 6,
          penalty_gap_open_in_frame: 7,
          penalty_gap_open_out_of_frame: 8,
          penalty_mismatch: 1,
          score_match: 3,
          max_band_area: 500_000_000, // requires around 500Mb for paths, 2GB for the scores
          max_indel: 400, // obsolete
          seed_length: 21, // obsolete
          min_seeds: 10, // obsolete
          min_match_rate: 0.3, // obsolete
          seed_spacing: 100, // obsolete
          mismatches_allowed: 3, // obsolete
          retry_reverse_complement: false,
          no_translate_past_stop: false,
          left_terminal_gaps_free: true,
          right_terminal_gaps_free: true,
          gap_alignment_side: GapAlignmentSide::Right,
          excess_bandwidth: 9,
          terminal_bandwidth: 50,
          min_seed_cover: 0.33,
          kmer_length: 10, // Should not be much larger than 1/divergence of amino acids
          kmer_distance: 50, // Distance between successive kmers
          min_match_length: 40, // Experimentally determined, to keep off-target matches reasonably low
          allowed_mismatches: 8, // Ns count as mismatches
          window_size: 30,
          max_alignment_attempts: 3,
        }
      }
    }

There are 2 important changes to consider in the upcoming Nextclade v3:

  • The alignment algorithm has changed quite a bit, so the params will change.
  • The Nextalign executable is removed; Nextclade will take over the same job. In the new dataset format most files will be optional (and the dataset itself is optional too, so individual input args can be used). All this is meant to emulate the interface of Nextalign and to facilitate incremental development of datasets.

Because we are removing Nextalign and are not planning any more releases of it, it does not make sense to add default values to its help text anymore.

Regarding Nextclade: datasets can (and do) override parameters (using the virus_properties.json file in v2 and pathogen.json in v3), because different viruses sometimes need different tuning. So I think the displayed hardcoded numbers might be inaccurate and misleading, depending on which dataset you plan to run. But let me know if you think it makes sense to show the hardcoded defaults in Nextclade v3 anyway.

In the meantime, one thing you can try is to add the -v (--verbose) flag to the run command. The program should then print the final values for this particular run, already taking into account values (in this order) in:

  • dataset (if using Nextclade and if they are defined)
  • CLI args (if an arg is provided)
  • hardcoded defaults

UPD:

This statement is incorrect for v2:

already taking into account values (in this order) in

Nextclade and Nextalign v2 print only the CLI args, before the defaults are merged in, which is probably not very useful. This will change in v3.
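The layering of the three sources listed above (dataset, CLI args, hardcoded defaults) can be sketched as a merge of optional values. This is an illustration only, not Nextclade's implementation: the struct and function names are hypothetical, and the precedence shown (a CLI arg wins over a dataset value, which wins over the hardcoded default) is an assumption, since the thread does not spell out which source takes priority.

```rust
/// Hypothetical partial parameter set: `None` means "not specified here".
#[derive(Debug, Clone, Copy, Default)]
struct ParamsOptional {
    seed_length: Option<usize>,
    min_match_rate: Option<f64>,
}

/// Hypothetical fully resolved parameter set.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Params {
    seed_length: usize,
    min_match_rate: f64,
}

/// Resolve each field independently.
/// ASSUMED precedence: CLI arg, then dataset value, then hardcoded default.
fn resolve(defaults: Params, dataset: ParamsOptional, cli: ParamsOptional) -> Params {
    Params {
        seed_length: cli
            .seed_length
            .or(dataset.seed_length)
            .unwrap_or(defaults.seed_length),
        min_match_rate: cli
            .min_match_rate
            .or(dataset.min_match_rate)
            .unwrap_or(defaults.min_match_rate),
    }
}
```

With this shape, printing the result of `resolve` is what would show the "final values" for a run, as opposed to printing only the CLI layer.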

@ivan-aksamentov removed the labels "needs triage" (Mark for review and label assignment), "good first issue" (Good for newcomers), and "help wanted" (Extra attention is needed) on Sep 12, 2023
@ivan-aksamentov
Member

ivan-aksamentov commented Sep 12, 2023

If you want to try Nextclade v3:

You can download prebuilt binaries on GitHub Actions:

Or you can build it from source, from the master branch, using our dev guide:
https://github.com/nextstrain/nextclade/blob/master/docs/dev/developer-guide.md

But v3 is not released and not stable yet. It's still a bit of a crazy land, and things might break, in which case you can try a slightly earlier version from the list of GitHub Actions. When things calm down a bit, we'll probably release an alpha version, or a few.

We appreciate early testing and feedback!

@AngieHinrichs
Author

Thanks @ivan-aksamentov! I will give both a try. I see v3 can be run without a dataset if --input-ref is provided, great. 🚀


3 participants