Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: use Auspice JSON as a dataset #1455

Merged
merged 20 commits into from
May 29, 2024

Conversation

ivan-aksamentov
Copy link
Member

@ivan-aksamentov ivan-aksamentov commented May 16, 2024

Slack threads:

This allows to use Auspice JSON v2 as input dataset. In this case we attempt to read not only tree, but also ref sequence, genome annotation and pathogen properties from that file, rather than from a conventional dataset.

Work items

  • parse genome annotation, reference and pathogen info from Auspice JSON in CLI, when passing Auspice JSON filepath to --input-dataset
  • parse genome annotation, reference and pathogen info from Auspice JSON in Web, when passing URL to Auspice JSON file into ?dataset-json-url URL param. Note that the URL passed as an argument to the URL param might need to be percent-encoded (urlencoded).
  • accept Auspice JSON in read-annotation command. This allows to visualize genome annotation from Auspice JSON the way Nextclade sees it. Might be useful for debugging.

Requirements

  • JSONs must contain ref nuc sequence .root_sequence.nuc field and clade_membership node attributes. The clade_membership requirement will be lifted in the near future #1457. (clade_membership is no longer required)
  • Scientifically, root node of the tree should either correspond to reference sequence or to contain all mutations between reference sequence and sequence corresponding to the root node. This is ensured by authors of Nextclade datasets, but this is not generally true for Auspice JSONs. Nextclade has no possibility to verify this correspondence. If this assumption is violated, it will produce incorrect results.

Data sources within Auspice JSON

required? what path in Auspice JSON Notes
yes reference sequence .root_sequence.nuc 1, 2
no genome annotation .meta.genome_annotations 3, 4
no pathogen info .meta.extensions.nextclade.pathogen 5
no examples ???
no readme ???
no changelog ???

Notes:

  1. If .root_sequence.nuc is missing and reference sequence is not provided otherwise (e.g. using individual args/params), then an error is thrown.

  2. Auspice JSON does not contain name of the reference in .root_sequence.nuc. When writing reference sequence to outputs (with --include-reference in CLI and always in Web), the name is taken from pathogen info at .meta.extensions.nextclade.pathogen.attributes["reference name"] if present. Otherwise a hardcoded value "reference" is used.

  3. Genome annotation in the Auspice format (Gff annotations augur#354, Entropy panel mk2 auspice#1684)

  4. If .meta.genome_annotations is missing, similarly to when genome_annotation.gff3 is missing, an empty annotation will be used, meaning translation, aa mutation calling and anything else related to amino acids will not run.

  5. Object of the same format as pathogen.json. Just paste contents of pathogen.json into a new field: .meta.extensions.nextclade.pathogen. If pathogen info is missing, then QC and other features configurable in pathogen.json will be disabled. There will be no pretty dataset name and ref sequence name.

Examples

Future work

  • .meta.extensions.nextclade.pathogen.files or another similar field could contain URLs to the (1) data which cannot be in JSON (e.g. example sequences, readme, changelog) or (2) you could have a separate ref sequence, annotation and pathogen.json files if for some reason you decide that Auspice fields do not suite you.

I had to derive a bunch of Eq and PartialEq traits to satisfy parent type requirements
This allows to pass a path to Auspice JSON v2 to `--input-dataset` CLI argument. In this case we attempt to read not only tree, but also ref sequence, genome annotation and pathogen properties from that file, rather than from a conventional dataset.
Copy link

vercel bot commented May 16, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Updated (UTC)
nextclade ✅ Ready (Inspect) Visit Preview May 24, 2024 2:50pm

Let's add an explicit `Accept` HTTP header when fetching Auspice JSON. This is required for nextstrain.org links to work - the server sends different content depending on `Accept` header.
When using Auspice JSON as an input dataset, if pathogen info is not present, let's also attempt to read `.meta.title` or `.meta.description` and use as a dataset name. And let's try to read `.meta.updated` as the updated date time of the dataset.

This allows for a prettier and more informative dataset info section when using Auspice JSON as an input dataset.
@ivan-aksamentov ivan-aksamentov force-pushed the feat/ref-and-ann-from-tree-json branch from e326b5d to 44fb8a5 Compare May 24, 2024 14:40
@ivan-aksamentov
Copy link
Member Author

ivan-aksamentov commented May 24, 2024

Added a few more changes:

  • you can now correctly override Auspice dataset components using --input-* args and &input-* params. That's probably a bit too much of a Frankenstein customization, but I added it for consistency with dir-based dataset and maybe someone will find it useful.
  • read .meta.title, .meta.description and .meta.updated for pretty display in Nextclade Web

@rneher
Copy link
Member

rneher commented May 29, 2024

This H5 dataset works well (though I have to remove the non-existing files from the pathogen.json):

https://nextclade-git-feat-ref-and-ann-from-tree-json-nextstrain.vercel.app/?dataset-json-url=https://nextstrain.org/groups/neherlab/avian-flu/h5n1-cattle-outbreak/genome

I there anything stopping this from being merged?

@ivan-aksamentov
Copy link
Member Author

@rneher

I there anything stopping this from being merged?

No, other than crash on that large RSV tree. I am working on removing recursion.

@ivan-aksamentov ivan-aksamentov merged commit 9a172d1 into master May 29, 2024
20 checks passed
@ivan-aksamentov ivan-aksamentov deleted the feat/ref-and-ann-from-tree-json branch May 29, 2024 18:54
@ivan-aksamentov
Copy link
Member Author

ivan-aksamentov commented May 29, 2024

We decided to merge it to master.

We found a tree which crashes Nextclade Web. But it seems unrelated to this feature. Rather, the tree is too deep and we overflow the call stack in a recursive call when converting internal representation into tree json for output. We will try to remove recursion and see how it goes.

The dataset trees are rather small, but this is not generally true for the auspice trees out there. In the meantime, the excessively big trees should be avoided.

We would also need to followup with CLI and Web docs - for args and URL parameters, as well as some basic explanations. Though the feature will only be used internally for some time and we might need to figure out how exactly to document it first, from experience.

@jameshadfield
Copy link
Member

We decided to merge it to master.

Awesome! I'll work on adding a link-out within auspice-on-nextstrain.org to kick the dataset over to nextclade (if there's a root-sequence and if the root-sequence is actually the root sequence and not a reference, although that's hard to tell)

The dataset trees are rather small, but this is not generally true for the auspice trees out there. In the meantime, the excessively big trees should be avoided.

Any back-of-the envelope numbers here which I could use to disable the link / add a warning?

@ivan-aksamentov
Copy link
Member Author

ivan-aksamentov commented May 29, 2024

@jameshadfield

Any back-of-the envelope numbers here which I could use to disable the link / add a warning?

Nope, and I just wanted to link the known problematic tree here, but the problem is no longer reproducible. Here is the link:

https://nextclade-git-feat-ref-and-ann-from-tree-json-nextstrain.vercel.app/?dataset-json-url=https://nextstrain.org/charon/getDataset?prefix=rsv/a/F

I think the tree might have changed at that address. But I have a copy of the old tree that crashes Nextclade:

tree.json.gz

Too big to upload to GitHub plaintext (71 MB out of 25 MB), so sadly no Nextclade link possible to try it.

P.S. Interesting that the new tree is twice as small: 34 MB.

tree_new.json.gz

P.P.S. I formatted both trees with prettier for readability and comparison.

@jameshadfield
Copy link
Member

P.P.P.S you don't have to use the charon API - because you (Nextclade) use the appropriate Accept headers you can hit the dataset address directly:

https://nextclade-git-feat-ref-and-ann-from-tree-json-nextstrain.vercel.app/?dataset-json-url=https://nextstrain.org/rsv/a/F

And you can make use of the snapshot/version identifier to access past trees, which should expose the broken tree. Something like this:

https://nextclade-git-feat-ref-and-ann-from-tree-json-nextstrain.vercel.app/?dataset-json-url=https://nextstrain.org/rsv/a/F@2024-05-15

@tsibley
Copy link
Member

tsibley commented May 29, 2024

I'm just catching up on activity here (cool!) so if any of this is not applicable any more, my apologies!

Scientifically, root node of the tree should either correspond to reference sequence or to contain all mutations between reference sequence and sequence corresponding to the root node. This is ensured by authors of Nextclade datasets, but this is not generally true for Auspice JSONs. Nextclade has no possibility to verify this correspondence. If this assumption is violated, it will produce incorrect results.

This seems like a huge hazard, yeah? Especially once this feature is documented or there's greater awareness/usage of it outside ourselves. (As soon as we start linking to Nextclade from Auspice, I'm sure folks will notice and try it themselves.)

Instead of Nextclade accepting any Auspice JSON and requiring this implicit assertion to hold true without having any way to verify it, could we take a different approach? For example, we could have a flag in the JSON (either in .meta somewhere or dangling off .root_sequence somewhere) that allows a Nextstrain dataset to opt-into Nextclade compatibility, e.g. declares "yes, my .root_sequence.nuc was correctly generated for use with Nextclade". This would require co-operation from Augur, but that's very doable. I can see still having a way to force/override this flag (e.g. via a query param/CLI option) for older Nextstrain datasets without it, but it should be obvious that you must know what you're doing.

JSONs must contain ref nuc sequence .root_sequence.nuc field

Is there a plan to support fetching the root sequence from the sidecar file, e.g. with Accept: application/vnd.nextstrain.dataset.root-sequence+json? If this support existed, for example, then pointing Nextclade at https://nextstrain.org/ncov/gisaid/global/all-time would work.

If this won't be supported, it seems to me that the decision to inline the root sequence in the main JSON or relegate it to a sidecar will no longer be determined by Auspice performance and genome size but instead by "do folks want this Nextstrain dataset to work with Nextclade?" And the answer to that will always hew towards "yes, of course", so every root sequence will end up inlined. If we're ok with that (???), can we get ahead of it and ditch the sidecar entirely then? (or rather, recommend always inlining going forward and change Augur's default)

@jameshadfield
Copy link
Member

This would require co-operation from Augur, but that's very doable.

I have confused myself in the past with the differences here, and our language in augur is itself confusing / inconsistent, largely because this subtlety was added to accommodate nextclade dataset construction in the first place. So I was thinking of starting by writing up good docs here (along the lines of these ones) - there's already commentary on this, but it's scattered across slack / issues.

Totally support the aim here which is to not jump into nextclade if the dataset's not valid (but still "works")

ivan-aksamentov added a commit that referenced this pull request May 30, 2024
Followup of #1455

If `.root_sequence` is not available on Auspice JSON, let's attempt to fetch ref sequence from sidecar Auspice JSON. For that let's GET from the same URL, but with `Accept: application/vnd.nextstrain.dataset.root-sequence+json` header.
@ivan-aksamentov
Copy link
Member Author

@tsibley @jameshadfield my attempt to add sidecar JSON is here: #1460

@ivan-aksamentov
Copy link
Member Author

ivan-aksamentov commented May 30, 2024

Regarding the "yes, I understand what I am doing" flag - it could be useful. This rule is universal for all trees, not just trees which happen to also be full datasets. For official datasets this flag can be added easily. I'd ask our scientists what they think.

@rneher
Copy link
Member

rneher commented May 30, 2024

Nextclade does check for consistency of the provided root sequence with the mutations in the tree (we didn't initially, but we do in v3). If there is a mutation on the tree like A321G and the root sequence is not A at this position, it will error. But it can't do this consistency check for positions when there is no mutation anywhere on the tree.

I don't think there is a reason not to inline the root sequence for most of the viruses. The root sequence is 100x smaller than the rest of the tree. For bacteria this is a different consideration.

@tsibley
Copy link
Member

tsibley commented May 30, 2024

I don't think there is a reason not to inline the root sequence for most of the viruses. The root sequence is 100x smaller than the rest of the tree. For bacteria this is a different consideration.

Nod. Size-wise, sure, but I think the other reasons boiled down to an Auspice load-time optimization. @jameshadfield probably has the most historical context here on hand without further digging.

@tsibley
Copy link
Member

tsibley commented May 30, 2024

A couple of UI things I noted.

External datasets like this are tagged "community", e.g.

image

and while I get why this is, it feels potentially pretty confusing to users. In particular because a) this is an official nextstrain.org core dataset and b) nextstrain.org has its own meaning for "community dataset". Should we rethink the UI here a little? Mark https://nextstrain.org core datasets as "official"? Or with some other label than "community"? (These are unification/integration pains, which feel hard, but also worth it.)

For me, the "Suggest automatically" feature of Nextclade was enabled by default (though it seems like a sticky preference?). With it enabled:

  1. Load a Nextstrain dataset via the URL, e.g. https://nextclade-git-feat-fetch-auspice-sidecar-json-nextstrain.vercel.app/?dataset-json-url=https://nextstrain.org/ncov/gisaid/global/all-time
  2. Load a file of sequences
  3. Notice suggestions are made for an official dataset, but the one manually loaded via the URL remains selected.
  4. Run the analysis.
  5. Flip back to "Start".
  6. Notice that the dataset has been switched out for one of the suggested ones.

This seems confusing?

@ivan-aksamentov
Copy link
Member Author

ivan-aksamentov commented May 30, 2024

@tsibley The second one sounds like a serious bug. I made a new issue: #1462

And for first one as well: #1463

Thanks for feedback!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants