feat: mutations relative to arbitrary node #1454

Open. Wants to merge 14 commits into base: master

ivan-aksamentov (Member) commented May 14, 2024

This extends the concept of private mutations (mutations relative to the parent node on the reference tree) to the more general concept of mutations relative to an arbitrary node of interest.

The reference nodes of interest are described by the user in .meta.extensions.nextclade.reference_nodes of the input Auspice JSON. The description can also contain constraints: a node can be restricted so that it is matched only against query samples belonging to a certain clade or lineage.

Private mutations functionality is unchanged. New functionality, inputs and outputs are added on top, though the implementation algorithm is largely reused.

Test

PR in data for testing: nextstrain/nextclade_data#198 (branch with the same name). The dataset nextstrain/sars-cov-2/wuhan-hu-1/proteins there has the reference_nodes config added to tree.json. It can be used like this:

https://nextclade-git-feat-mutations-relative-to-node-nextstrain.vercel.app/?dataset-server=gh&dataset-name=nextstrain/sars-cov-2/wuhan-hu-1/proteins

Work items

  • read input config from Auspice JSON
  • calculate relative nuc mutations
  • calculate relative aa mutations
  • filter by clade and clade-like attributes
  • output to Nextclade JSON
  • output to Nextclade NDJSON
  • output to TSV and CSV
  • pass required data between js and wasm
  • display in web app

For consideration:

  • ? use regex for matching clade-like values, rather than string equality
  • ? filter by gene

Inputs

Example configuration object. Put it into .meta of the Auspice JSON, so that it becomes .meta.extensions.nextclade.reference_nodes.

{
  "extensions": {
    "nextclade": {
      "reference_nodes": [
        {
          "name": "NODE_0000659",
          "displayName": "BA.2.86 (23I)",
          "description": "Ancestral BA.2.86 sequence"
        },
        {
          "name": "XBB.1.5",
          "displayName": "XBB.1.5 (23A)",
          "description": "Ancestral XBB.1.5 sequence. Vaccine strain 2023/2024",
          "include": {
            "clade": ["23A"]
          }
        },
        {
          "name": "NODE_0000862",
          "displayName": "BA.5 (22B)",
          "description": "Ancestral BA.5 sequence. Vaccine strain 2022/2023",
          "include": {
            "clade": ["22B"]
          }
        }
      ]
    }
  }
}
  • The name field should match the name field of one of the nodes on the tree.

  • The displayName and description are optional arbitrary strings used for display purposes.

  • The include field should be an object, which contains:

    • keys: names from the .meta.extensions.nextclade.clade_node_attrs (for clade-like attributes) or string "clade" (for the built-in clades).
    • values: a list of values of the clade-like attribute or a list of built-in clades. Only query sequences which match these attributes are considered for calculation of mutations relative to that node.

    If the include field is not present, then no constraints are applied (all query sequences are considered).
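
The rules above can be sketched in Rust as a plain data structure plus a check that every configured name exists on the tree. This is only a minimal sketch under stated assumptions: `RefNodeConfig` and `missing_ref_nodes` are hypothetical names chosen for illustration, not Nextclade's actual types, and the tree is reduced to a set of node names.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical in-memory shape of one entry of `reference_nodes`.
#[derive(Debug, Clone, Default)]
struct RefNodeConfig {
    /// Must match the `name` of a node on the tree.
    name: String,
    /// Optional, display only.
    display_name: Option<String>,
    /// Optional, display only.
    description: Option<String>,
    /// Attribute key ("clade" or a clade-like attribute) -> accepted values.
    /// An empty map stands for "no constraints".
    include: BTreeMap<String, Vec<String>>,
}

/// Returns the names of configured reference nodes that are absent from the tree.
fn missing_ref_nodes<'a>(
    configs: &'a [RefNodeConfig],
    tree_node_names: &BTreeSet<String>,
) -> Vec<&'a str> {
    configs
        .iter()
        .map(|config| config.name.as_str())
        .filter(|name| !tree_node_names.contains(*name))
        .collect()
}
```

An empty result means every configured node can be looked up on the tree. A non-empty result corresponds to the "Unable to find reference node on the tree" error path.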

Outputs

Output JSON and NDJSON

Example fragment of an output JSON entry (an entry in the .results[] array). Mutation lists are truncated for demonstration purposes.

{
  "relativeNucMutations": [
    {
      "refNode": {
        "name": "NODE_0000659",
        "displayName": "BA.2.86 (23I)",
        "description": "Ancestral BA.2.86 sequence"
      },
      "muts": {
        "privateSubstitutions": [
          {"pos": 404, "refNuc": "A", "qryNuc": "G"},
          {"pos": 896, "refNuc": "A", "qryNuc": "C"}
        ],
        "privateDeletions": [],
        "reversionSubstitutions": [
          {"pos": 896, "refNuc": "A", "qryNuc": "C"},
          {"pos": 3430, "refNuc": "T", "qryNuc": "G"}
        ],
        "labeledSubstitutions": [
          {
            "substitution": {"pos": 404, "refNuc": "A", "qryNuc": "G"},
            "labels": ["23A", "23D", "23F", "23B", "22F", "23E", "23H", "23G"]
          },
          {
            "substitution": {"pos": 2333, "refNuc": "C", "qryNuc": "T"},
            "labels": ["23F", "23H"]
          }
        ],
        "unlabeledSubstitutions": [
          {"pos": 4089, "refNuc": "C", "qryNuc": "T"},
          {"pos": 11344, "refNuc": "C", "qryNuc": "T"}
        ],
        "totalPrivateSubstitutions": 75,
        "totalPrivateDeletions": 0,
        "totalReversionSubstitutions": 37,
        "totalLabeledSubstitutions": 34,
        "totalUnlabeledSubstitutions": 4
      }
    }
  ],
  "relativeAaMutations": [
    {
      "refNode": {
        "name": "NODE_0000659",
        "displayName": "BA.2.86 (23I)",
        "description": "Ancestral BA.2.86 sequence"
      },
      "muts": {
        "E": {
          "privateSubstitutions": [
            {"cdsName": "E", "pos": 10, "refAa": "T", "qryAa": "A"}
          ],
          "privateDeletions": [],
          "reversionSubstitutions": [],
          "totalPrivateSubstitutions": 1,
          "totalPrivateDeletions": 0,
          "totalReversionSubstitutions": 0
        },
        "S": {
          "privateSubstitutions": [
            {"cdsName": "S", "pos": 20, "refAa": "T", "qryAa": "R"},
            {"cdsName": "S", "pos": 26, "refAa": "-", "qryAa": "S"}
          ],
          "privateDeletions": [
            {"cdsName": "S", "pos": 23, "refAa": "S"},
            {"cdsName": "S", "pos": 143, "refAa": "Y"}
          ],
          "reversionSubstitutions": [
            {"cdsName": "S", "pos": 20, "refAa": "T", "qryAa": "R"},
            {"cdsName": "S", "pos": 49, "refAa": "L", "qryAa": "S"}
          ],
          "totalPrivateSubstitutions": 39,
          "totalPrivateDeletions": 2,
          "totalReversionSubstitutions": 25
        }
      }
    }
  ]
}

Output TSV and CSV

TODO

Visualization in Nextclade Web

TODO

This extends the concept of private mutations (relative to the parent node on the reference tree) to mutations relative to an arbitrary node of interest.

The reference nodes of interest are described by the user in `.meta.extensions.nextclade.reference_nodes` of the input Auspice JSON. The description can also contain constraints: a node can be restricted so that it is matched only against query samples belonging to a certain clade or lineage.

Private mutations functionality is unchanged; this is only an addition, though the implementation algorithm is largely reused.

In this commit only nuc mutations are added.
Similarly to b537132, add relative amino acid mutations
This just passes through from js to wasm the data that is now required to output relative mutations

vercel bot commented May 14, 2024

Name Status Preview Updated (UTC)
nextclade ✅ Ready (Inspect) Visit Preview Jun 7, 2024 5:37pm

Comment on lines +40 to +63
ref_nodes
  .iter()
  .map(|&ref_node| -> Result<_, Report> {
    let node = graph
      .iter_nodes()
      .find(|node| node.payload().name == ref_node.name)
      .ok_or_else(|| eyre!("Unable to find reference node on the tree: '{}'", &ref_node.name))?;

    let muts = find_private_nuc_mutations(
      node.payload(),
      substitutions,
      deletions,
      missing,
      alignment_range,
      ref_seq,
      non_acgtns,
      virus_properties,
    );

    Ok(RelativeNucMutations {
      ref_node: ref_node.to_owned(),
      muts,
    })
  })

There is very little new logic here, mostly bookkeeping. The find_private_*_mutations() functions for nucs and aa are reused as is. The only difference compared to private mutations is that the code now runs once for each requested node.

This code fragment is for nucs. The sibling function for aa is just below it.
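
To make the "same search, repeated per node" idea concrete, here is a heavily simplified, self-contained sketch. It is not the PR's code: `Subs`, `Sub`, `subs_relative_to_node` and `relative_to_all` are hypothetical names, positions are assumed to be 0-based and within the reference, and deletions, missing ranges, the alignment range and mutation labels are all ignored. Each node and the query are represented only by their substitutions relative to the reference sequence.

```rust
use std::collections::BTreeMap;

/// Substitutions relative to the reference sequence: position -> nucleotide.
type Subs = BTreeMap<usize, char>;

/// One difference between a node of interest and the query.
#[derive(Debug, PartialEq)]
struct Sub {
    pos: usize,
    node_nuc: char,
    qry_nuc: char,
}

/// Simplified stand-in for the private mutation search (assumption, see lead-in).
fn subs_relative_to_node(node: &Subs, query: &Subs, ref_seq: &[char]) -> Vec<Sub> {
    let mut out = Vec::new();
    // Positions where the query is mutated: report them unless the node
    // already carries the same nucleotide.
    for (&pos, &qry_nuc) in query {
        let node_nuc = node.get(&pos).copied().unwrap_or(ref_seq[pos]);
        if node_nuc != qry_nuc {
            out.push(Sub { pos, node_nuc, qry_nuc });
        }
    }
    // Reversions: the node is mutated, but the query matches the reference.
    for (&pos, &node_nuc) in node {
        if !query.contains_key(&pos) {
            out.push(Sub { pos, node_nuc, qry_nuc: ref_seq[pos] });
        }
    }
    out.sort_by_key(|sub| sub.pos);
    out
}

/// Bookkeeping only: run the same search once per requested node.
fn relative_to_all<'a>(
    nodes: &'a [(String, Subs)],
    query: &Subs,
    ref_seq: &[char],
) -> Vec<(&'a str, Vec<Sub>)> {
    nodes
        .iter()
        .map(|(name, subs)| (name.as_str(), subs_relative_to_node(subs, query, ref_seq)))
        .collect()
}
```

The outer function adds no mutation logic of its own, which mirrors the point made in the comment: the per-node search is reused and only the loop over nodes is new.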

Comment on lines 126 to 135
.reference_nodes
.iter()
.filter(|node| {
  // For each attribute key in includes, check that the attribute value of this sample match
  // at least one item in the include list
  node.include.iter().all(|(key, includes)| {
    let curr_value = if key == "clade" { clade } else { &clade_node_attrs[key] };
    includes.iter().any(|include_value| include_value == curr_value) // TODO: consider regex match rather than equality
  })
})

This is the logic for constraining the mutation calculation by clades and clade-like attributes. If the include field is present, we look up the constrained attribute on the query sample and consider this node only if the query's attribute value matches at least one of the values in the include list.

For example, if the config has a node of interest which is only relevant for clades 23A and 23B:

{
  "...": "...",
  "include": { "clade": ["23A", "23B"] }
}

then mutations relative to this node will be calculated only for query samples of clade 23A or 23B.

Same for pango lineages:

{
  "...": "...",
  "include": { "Nextclade_pango": ["A.1.2.3", "A.1.2.3.4"] }
}

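It is up for discussion how multiple filters (multiple keys in the include object) should be combined: using boolean OR or boolean AND. The code fragment above currently combines keys with AND (all) and the values of each key with OR (any).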
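
The filtering rule can be tried in isolation with the sketch below. It is a standalone illustration, not the PR's code: `matches_include` is a hypothetical helper, and the query sample is reduced to its built-in clade plus a map of clade-like attributes. Unlike the indexing in the fragment above, this sketch treats an attribute that is absent from the query as a non-match rather than panicking, which is an assumption on my part.

```rust
use std::collections::BTreeMap;

/// Returns true if the query sample satisfies every key of `include`
/// (AND across keys, OR across the accepted values of each key).
/// An empty `include` map means "no constraints" and always matches.
fn matches_include(
    include: &BTreeMap<String, Vec<String>>,
    clade: &str,
    clade_node_attrs: &BTreeMap<String, String>,
) -> bool {
    include.iter().all(|(key, accepted)| {
        // "clade" refers to the built-in clade; any other key is looked up
        // among the clade-like attributes of the query sample.
        let current = if key.as_str() == "clade" {
            Some(clade)
        } else {
            clade_node_attrs.get(key).map(String::as_str)
        };
        match current {
            Some(value) => accepted.iter().any(|a| a.as_str() == value),
            None => false, // assumption: a missing attribute never matches
        }
    })
}
```

Switching the outer `all` to `any` would give the OR variant mentioned above, but note that `any` over an empty map returns false, so the "no include field" case would then need separate handling.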

ivan-aksamentov added a commit that referenced this pull request Jun 6, 2024
In preparation for #1454

Among other things, in order to render private and relative aa mutations we need to group them similarly to how absolute aa mutations are grouped. This involves finding adjacent mutations and nuc context for these mutations.

Sadly, the code for grouping is quite complex and is not immediately reusable (it combines aa mutation search and grouping at the same time), so let's do some refactoring.

Let's start by factoring out the structs and functions that will remain unchanged, to clear up some space for the action, to minimize diffs and to reduce scrolling.
This should allow us to reduce nesting and to allow fallible operations in the map
After refactoring, aa_changes_group() is generic enough to be able to reuse it for private aa mutations too.

This is a first working sketch. The inputs are likely wrong and fixes will follow.