feat: use Auspice JSON as a dataset #1455

ivan-aksamentov · 2024-05-16T12:41:35Z

Slack threads:

This allows using Auspice JSON v2 as an input dataset. In this case we attempt to read not only the tree, but also the reference sequence, genome annotation and pathogen properties from that file, rather than from a conventional dataset.

Work items

parse genome annotation, reference and pathogen info from Auspice JSON in CLI, when an Auspice JSON file path is passed to --input-dataset
parse genome annotation, reference and pathogen info from Auspice JSON in Web, when a URL to an Auspice JSON file is passed in the ?dataset-json-url URL param. Note that the URL passed as the value of this param might need to be percent-encoded (urlencoded).
accept Auspice JSON in the read-annotation command. This makes it possible to visualize the genome annotation from Auspice JSON the way Nextclade sees it. Might be useful for debugging.
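
The percent-encoding note above can be illustrated with a small sketch. The helper name `buildDatasetJsonLink` and the base URL `https://nextclade.example.org` are placeholders invented for this example, only the `dataset-json-url` param name comes from this PR.

```typescript
// Hypothetical helper: builds a Nextclade Web link that loads an Auspice JSON as a dataset.
// `baseUrl` stands for whichever Nextclade Web deployment is being targeted.
function buildDatasetJsonLink(baseUrl: string, auspiceJsonUrl: string): string {
  // encodeURIComponent escapes ':', '/', '?', '=', '&' and '@', so a query string
  // nested inside auspiceJsonUrl cannot be mistaken for Nextclade's own params.
  return `${baseUrl}?dataset-json-url=${encodeURIComponent(auspiceJsonUrl)}`
}

const link = buildDatasetJsonLink(
  'https://nextclade.example.org',
  'https://nextstrain.org/charon/getDataset?prefix=rsv/a/F',
)
```

Encoding matters most when the Auspice URL itself carries a query string, as the charon URL here does.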

Requirements

JSONs must contain the reference nucleotide sequence in the .root_sequence.nuc field ~~and clade_membership node attributes. The clade_membership requirement will be lifted in #1457.~~ (clade_membership is no longer required)
Scientifically, the root node of the tree should either correspond to the reference sequence or contain all mutations between the reference sequence and the sequence corresponding to the root node. This is ensured by authors of Nextclade datasets, but this is not generally true for Auspice JSONs. Nextclade has no way to verify this correspondence. If this assumption is violated, it will produce incorrect results.

Data sources within Auspice JSON

| Required? | What | Path in Auspice JSON | Notes |
| --- | --- | --- | --- |
| yes | reference sequence | `.root_sequence.nuc` | 1, 2 |
| no | genome annotation | `.meta.genome_annotations` | 3, 4 |
| no | pathogen info | `.meta.extensions.nextclade.pathogen` | 5 |
| no | examples | ??? | |
| no | readme | ??? | |
| no | changelog | ??? | |

Notes:

1. If .root_sequence.nuc is missing and the reference sequence is not provided otherwise (e.g. using individual args/params), then an error is thrown.
2. Auspice JSON does not contain the name of the reference in .root_sequence.nuc. When writing the reference sequence to outputs (with --include-reference in CLI and always in Web), the name is taken from pathogen info at .meta.extensions.nextclade.pathogen.attributes["reference name"] if present. Otherwise a hardcoded value "reference" is used.
3. Genome annotation is in the Auspice format (see "Gff annotations" augur#354 and "Entropy panel mk2" auspice#1684).
4. If .meta.genome_annotations is missing, similarly to when genome_annotation.gff3 is missing, an empty annotation will be used, meaning translation, aa mutation calling and anything else related to amino acids will not run.
5. An object of the same format as pathogen.json. Just paste the contents of pathogen.json into a new field: .meta.extensions.nextclade.pathogen. If pathogen info is missing, then QC and other features configurable in pathogen.json will be disabled. There will be no pretty dataset name and no ref sequence name.
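
The lookup rules in the table and notes above can be summarized in a minimal sketch. This is not Nextclade's actual code: the type names, `extractDatasetParts` and the `refSeqOverride` parameter are made up for illustration, and only the JSON paths and the "reference" fallback come from the notes.

```typescript
// Minimal shape of the fields discussed above; all other Auspice JSON fields are ignored.
type PathogenLike = { attributes?: Record<string, unknown> }

interface AuspiceJsonLike {
  root_sequence?: { nuc?: string }
  meta?: {
    genome_annotations?: Record<string, unknown>
    extensions?: { nextclade?: { pathogen?: PathogenLike } }
  }
}

interface DatasetParts {
  refName: string
  refSeq: string
  annotation: Record<string, unknown>
  pathogen: PathogenLike | undefined
}

// Hypothetical helper. `refSeqOverride` stands for a reference provided "otherwise",
// e.g. through an individual arg or URL param.
function extractDatasetParts(json: AuspiceJsonLike, refSeqOverride?: string): DatasetParts {
  const refSeq = refSeqOverride ?? json.root_sequence?.nuc
  if (refSeq === undefined) {
    // Note 1: the reference sequence is the only required component.
    throw new Error('No .root_sequence.nuc in Auspice JSON and no reference provided otherwise')
  }
  const pathogen = json.meta?.extensions?.nextclade?.pathogen
  const name = pathogen?.attributes?.['reference name']
  return {
    // Note 2: fall back to the hardcoded name "reference".
    refName: typeof name === 'string' ? name : 'reference',
    refSeq,
    // Note 4: a missing annotation becomes an empty one.
    annotation: json.meta?.genome_annotations ?? {},
    // Note 5: a missing pathogen stays undefined, so dependent features are off.
    pathogen,
  }
}
```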

Examples

~~From nextstrain.org: https://nextclade-git-feat-ref-and-ann-from-tree-json-nextstrain.vercel.app?dataset-json-url=https://nextstrain.org/charon/getDataset?prefix=ncov/gisaid/global/all-time~~ (does not contain required .root_sequence.nuc. Please suggest a json that has it)
✔️ SC2 from GitHub (the tree originally contains .root_sequence.nuc, as well as .meta.genome_annotations and I added .meta.extensions.nextclade.pathogen for pretty dataset name display): https://nextclade-git-feat-ref-and-ann-from-tree-json-nextstrain.vercel.app?dataset-json-url=https://github.com/nextstrain/nextclade_data/blob/feat/ref-and-ann-from-tree-json/data/nextstrain/sars-cov-2/wuhan-hu-1/orfs/tree.json
💥 crashing H5 from GitHub: https://nextclade-git-feat-ref-and-ann-from-tree-json-nextstrain.vercel.app/?dataset-json-url=https://github.com/rneher/h5_cattle_genome/blob/master/tree.json

Future work

.meta.extensions.nextclade.pathogen.files or another similar field could contain URLs to (1) the data which cannot be in JSON (e.g. example sequences, readme, changelog) or (2) separate ref sequence, annotation and pathogen.json files, if for some reason you decide that the Auspice fields do not suit you.

vercel · 2024-05-16T12:41:38Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Updated (UTC)
nextclade	✅ Ready (Inspect)	Visit Preview	May 24, 2024 2:50pm

When using Auspice JSON as an input dataset, if pathogen info is not present, let's also attempt to read `.meta.title` or `.meta.description` and use as a dataset name. And let's try to read `.meta.updated` as the updated date time of the dataset. This allows for a prettier and more informative dataset info section when using Auspice JSON as an input dataset.

ivan-aksamentov · 2024-05-24T14:42:15Z

Added a few more changes:

you can now correctly override Auspice dataset components using --input-* args and &input-* params. That's probably a bit too much of a Frankenstein customization, but I added it for consistency with dir-based dataset and maybe someone will find it useful.
read .meta.title, .meta.description and .meta.updated for pretty display in Nextclade Web
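
A rough sketch of the display fallback described above. The function and type names are hypothetical, and the precedence (pathogen name, then .meta.title, then .meta.description) is an assumption made for this example rather than something stated in the PR.

```typescript
interface AuspiceMetaLike {
  title?: string
  description?: string
  updated?: string
}

interface DatasetDisplayInfo {
  name: string | undefined
  updatedAt: string | undefined
}

// Hypothetical helper: picks what to show in the dataset info section.
// `pathogenName` stands for a name coming from pathogen info, when present.
function datasetDisplayInfo(meta: AuspiceMetaLike | undefined, pathogenName?: string): DatasetDisplayInfo {
  return {
    // Assumed precedence: pathogen info first, then title, then description.
    name: pathogenName ?? meta?.title ?? meta?.description,
    updatedAt: meta?.updated,
  }
}
```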

rneher · 2024-05-29T15:22:54Z

This H5 dataset works well (though I have to remove the non-existing files from the pathogen.json):

https://nextclade-git-feat-ref-and-ann-from-tree-json-nextstrain.vercel.app/?dataset-json-url=https://nextstrain.org/groups/neherlab/avian-flu/h5n1-cattle-outbreak/genome

Is there anything stopping this from being merged?

ivan-aksamentov · 2024-05-29T15:42:30Z

@rneher

Is there anything stopping this from being merged?

No, other than the crash on that large RSV tree. I am working on removing recursion.

ivan-aksamentov · 2024-05-29T18:57:51Z

We decided to merge it to master.

We found a tree which crashes Nextclade Web. But it seems unrelated to this feature. Rather, the tree is too deep and we overflow the call stack in a recursive call when converting internal representation into tree json for output. We will try to remove recursion and see how it goes.
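
For illustration, here is a minimal sketch of what "removing recursion" can look like for a tree walk. It is not the actual Nextclade conversion code: `TreeNode` and `forEachNodeIterative` are invented names, and the real code converts nodes rather than merely visiting them.

```typescript
interface TreeNode {
  name: string
  children?: TreeNode[]
}

// Visits every node in pre-order using an explicit stack instead of recursive calls,
// so the maximum tree depth is limited by heap memory rather than by the call stack.
function forEachNodeIterative(root: TreeNode, visit: (node: TreeNode, depth: number) => void): void {
  const stack: Array<{ node: TreeNode; depth: number }> = [{ node: root, depth: 0 }]
  while (stack.length > 0) {
    const { node, depth } = stack.pop()!
    visit(node, depth)
    const children = node.children ?? []
    // Push in reverse order so that the leftmost child is popped and visited first.
    for (let i = children.length - 1; i >= 0; i--) {
      stack.push({ node: children[i], depth: depth + 1 })
    }
  }
}
```

The same pattern extends to conversions that need results from children, at the cost of a second pass or a post-order stack.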

The dataset trees are rather small, but this is not generally true for the auspice trees out there. In the meantime, the excessively big trees should be avoided.

We would also need to follow up with CLI and Web docs - for args and URL parameters, as well as some basic explanations. Though the feature will only be used internally for some time, and we might first need to figure out, from experience, how exactly to document it.

jameshadfield · 2024-05-29T19:36:40Z

We decided to merge it to master.

Awesome! I'll work on adding a link-out within auspice-on-nextstrain.org to kick the dataset over to nextclade (if there's a root-sequence and if the root-sequence is actually the root sequence and not a reference, although that's hard to tell)

The dataset trees are rather small, but this is not generally true for the auspice trees out there. In the meantime, the excessively big trees should be avoided.

Any back-of-the-envelope numbers here which I could use to disable the link / add a warning?

ivan-aksamentov · 2024-05-29T19:45:19Z

@jameshadfield

Any back-of-the-envelope numbers here which I could use to disable the link / add a warning?

Nope, and I just wanted to link the known problematic tree here, but the problem is no longer reproducible. Here is the link:

https://nextclade-git-feat-ref-and-ann-from-tree-json-nextstrain.vercel.app/?dataset-json-url=https://nextstrain.org/charon/getDataset?prefix=rsv/a/F

I think the tree might have changed at that address. But I have a copy of the old tree that crashes Nextclade:

tree.json.gz

Too big to upload to GitHub as plaintext (71 MB, over the 25 MB limit), so sadly no Nextclade link is possible to try it.

P.S. Interesting that the new tree is half the size: 34 MB.

tree_new.json.gz

P.P.S. I formatted both trees with prettier for readability and comparison.

jameshadfield · 2024-05-29T21:53:00Z

P.P.P.S you don't have to use the charon API - because you (Nextclade) use the appropriate Accept headers you can hit the dataset address directly:

https://nextclade-git-feat-ref-and-ann-from-tree-json-nextstrain.vercel.app/?dataset-json-url=https://nextstrain.org/rsv/a/F

And you can make use of the snapshot/version identifier to access past trees, which should expose the broken tree. Something like this:

https://nextclade-git-feat-ref-and-ann-from-tree-json-nextstrain.vercel.app/?dataset-json-url=https://nextstrain.org/rsv/a/F@2024-05-15

tsibley · 2024-05-29T22:51:54Z

I'm just catching up on activity here (cool!) so if any of this is not applicable any more, my apologies!

Scientifically, the root node of the tree should either correspond to the reference sequence or contain all mutations between the reference sequence and the sequence corresponding to the root node. This is ensured by authors of Nextclade datasets, but this is not generally true for Auspice JSONs. Nextclade has no way to verify this correspondence. If this assumption is violated, it will produce incorrect results.

This seems like a huge hazard, yeah? Especially once this feature is documented or there's greater awareness/usage of it outside ourselves. (As soon as we start linking to Nextclade from Auspice, I'm sure folks will notice and try it themselves.)

Instead of Nextclade accepting any Auspice JSON and requiring this implicit assertion to hold true without having any way to verify it, could we take a different approach? For example, we could have a flag in the JSON (either in .meta somewhere or dangling off .root_sequence somewhere) that allows a Nextstrain dataset to opt into Nextclade compatibility, e.g. declares "yes, my .root_sequence.nuc was correctly generated for use with Nextclade". This would require co-operation from Augur, but that's very doable. I can see still having a way to force/override this flag (e.g. via a query param/CLI option) for older Nextstrain datasets without it, but it should be obvious that you must know what you're doing.

JSONs must contain ref nuc sequence .root_sequence.nuc field

Is there a plan to support fetching the root sequence from the sidecar file, e.g. with Accept: application/vnd.nextstrain.dataset.root-sequence+json? If this support existed, for example, then pointing Nextclade at https://nextstrain.org/ncov/gisaid/global/all-time would work.

If this won't be supported, it seems to me that the decision to inline the root sequence in the main JSON or relegate it to a sidecar will no longer be determined by Auspice performance and genome size but instead by "do folks want this Nextstrain dataset to work with Nextclade?" And the answer to that will always hew towards "yes, of course", so every root sequence will end up inlined. If we're ok with that (???), can we get ahead of it and ditch the sidecar entirely then? (or rather, recommend always inlining going forward and change Augur's default)

jameshadfield · 2024-05-29T23:00:24Z

This would require co-operation from Augur, but that's very doable.

I have confused myself in the past with the differences here, and our language in augur is itself confusing / inconsistent, largely because this subtlety was added to accommodate nextclade dataset construction in the first place. So I was thinking of starting by writing up good docs here (along the lines of these ones) - there's already commentary on this, but it's scattered across slack / issues.

Totally support the aim here which is to not jump into nextclade if the dataset's not valid (but still "works")

Followup of #1455. If `.root_sequence` is not available in the Auspice JSON, let's attempt to fetch the ref sequence from the sidecar Auspice JSON. For that, let's GET from the same URL, but with the `Accept: application/vnd.nextstrain.dataset.root-sequence+json` header.
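
The fallback described here could be sketched roughly as follows. This is not the code from #1460: `fetchRootSequenceNuc` and `FetchLike` are invented names, `application/json` for the main request is an assumption, and so is the sidecar body having the same shape as `.root_sequence`. The fetch function is passed in so the sketch can be exercised without network access.

```typescript
// Smallest slice of the Fetch API this sketch needs; the global `fetch` fits this shape.
type FetchLike = (
  url: string,
  init: { headers: Record<string, string> },
) => Promise<{ ok: boolean; json(): Promise<unknown> }>

const MAIN_ACCEPT = 'application/json' // assumption, not taken from the PR
const SIDECAR_ACCEPT = 'application/vnd.nextstrain.dataset.root-sequence+json' // quoted in the text above

// Hypothetical helper: returns the reference nucleotide sequence, or undefined if
// neither the main JSON nor the sidecar provides one.
async function fetchRootSequenceNuc(url: string, fetchFn: FetchLike): Promise<string | undefined> {
  const main = await fetchFn(url, { headers: { Accept: MAIN_ACCEPT } })
  if (!main.ok) throw new Error(`Failed to fetch Auspice JSON from ${url}`)
  const tree = (await main.json()) as { root_sequence?: { nuc?: string } }
  if (tree.root_sequence?.nuc !== undefined) return tree.root_sequence.nuc

  // Same URL, different Accept header: ask the server for the sidecar instead.
  const sidecar = await fetchFn(url, { headers: { Accept: SIDECAR_ACCEPT } })
  if (!sidecar.ok) return undefined
  // Assumed sidecar shape: the same object that would otherwise sit at .root_sequence.
  const rootSeq = (await sidecar.json()) as { nuc?: string }
  return rootSeq.nuc
}
```

In a browser or a recent Node.js, the global `fetch` can be passed as `fetchFn`.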

ivan-aksamentov · 2024-05-30T00:20:11Z

@tsibley @jameshadfield my attempt to add sidecar JSON is here: #1460

ivan-aksamentov · 2024-05-30T00:46:04Z

Regarding the "yes, I understand what I am doing" flag - it could be useful. This rule is universal for all trees, not just trees which happen to also be full datasets. For official datasets this flag can be added easily. I'd ask our scientists what they think.

rneher · 2024-05-30T06:58:24Z

Nextclade does check for consistency of the provided root sequence with the mutations in the tree (we didn't initially, but we do in v3). If there is a mutation on the tree like A321G and the root sequence is not A at this position, it will error. But it can't do this consistency check for positions when there is no mutation anywhere on the tree.
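
To make the idea concrete, here is a simplified sketch of such a consistency check. It is not Nextclade's implementation: the names are invented, mutations are assumed to be strings like "A321G" with 1-based positions stored under branch_attrs.mutations.nuc, and only the first mutation at each position along a root-to-tip path is compared with the root sequence. As noted above, positions that are never mutated anywhere on the tree remain unchecked.

```typescript
interface MutNode {
  branch_attrs?: { mutations?: { nuc?: string[] } }
  children?: MutNode[]
}

// Returns human-readable inconsistencies; an empty list means none were found.
function findRootInconsistencies(rootSeq: string, tree: MutNode): string[] {
  const problems: string[] = []
  // Each entry carries the positions already mutated on the path from the root,
  // because only the first mutation at a position starts from the root state.
  const stack: Array<{ node: MutNode; seen: Set<number> }> = [{ node: tree, seen: new Set<number>() }]
  while (stack.length > 0) {
    const { node, seen } = stack.pop()!
    const seenHere = new Set<number>(seen)
    for (const mut of node.branch_attrs?.mutations?.nuc ?? []) {
      const m = /^([A-Z-])(\d+)([A-Z-])$/.exec(mut)
      if (m === null) continue // skip strings this sketch does not understand
      const from = m[1]
      const pos = Number(m[2]) // 1-based position
      if (!seenHere.has(pos) && rootSeq[pos - 1] !== from) {
        problems.push(`${mut}: root sequence has '${rootSeq[pos - 1]}' at position ${pos}`)
      }
      seenHere.add(pos)
    }
    for (const child of node.children ?? []) {
      stack.push({ node: child, seen: seenHere })
    }
  }
  return problems
}
```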

I don't think there is a reason not to inline the root sequence for most of the viruses. The root sequence is 100x smaller than the rest of the tree. For bacteria this is a different consideration.

tsibley · 2024-05-30T17:55:07Z

I don't think there is a reason not to inline the root sequence for most of the viruses. The root sequence is 100x smaller than the rest of the tree. For bacteria this is a different consideration.

Nod. Size-wise, sure, but I think the other reasons boiled down to an Auspice load-time optimization. @jameshadfield probably has the most historical context here on hand without further digging.

tsibley · 2024-05-30T19:01:02Z

A couple of UI things I noted.

External datasets like this are tagged "community", e.g.

and while I get why this is, it feels potentially pretty confusing to users. In particular because a) this is an official nextstrain.org core dataset and b) nextstrain.org has its own meaning for "community dataset". Should we rethink the UI here a little? Mark https://nextstrain.org core datasets as "official"? Or with some other label than "community"? (These are unification/integration pains, which feel hard, but also worth it.)

For me, the "Suggest automatically" feature of Nextclade was enabled by default (though it seems like a sticky preference?). With it enabled:

Load a Nextstrain dataset via the URL, e.g. https://nextclade-git-feat-fetch-auspice-sidecar-json-nextstrain.vercel.app/?dataset-json-url=https://nextstrain.org/ncov/gisaid/global/all-time
Load a file of sequences
Notice suggestions are made for an official dataset, but the one manually loaded via the URL remains selected.
Run the analysis.
Flip back to "Start".
Notice that the dataset has been switched out for one of the suggested ones.

This seems confusing?

ivan-aksamentov · 2024-05-30T19:18:59Z

@tsibley The second one sounds like a serious bug. I made a new issue: #1462

And for the first one as well: #1463

Thanks for the feedback!

ivan-aksamentov added 3 commits May 13, 2024 12:50

feat: add ref and annotation data to Auspice tree types

0ec7adc

refactor: add pathogen nextclade extension to auspice tree type

1043b98

I had to derive a bunch of Eq and PartialEq traits to satisfy parent type requirements

feat: use Auspice JSON as dataset

4334f32

This allows to pass a path to Auspice JSON v2 to `--input-dataset` CLI argument. In this case we attempt to read not only tree, but also ref sequence, genome annotation and pathogen properties from that file, rather than from a conventional dataset.

vercel bot deployed to Preview May 16, 2024 12:48 View deployment

fix: parsing auspice genome annotations

b843ada

vercel bot deployed to Preview May 16, 2024 15:11 View deployment

jameshadfield mentioned this pull request May 17, 2024

Rn/use root of tree as reference nextstrain/avian-flu#33

Merged

ivan-aksamentov added 2 commits May 17, 2024 09:35

fix: off-by-one in landmark range

ff7e887

fix: duplicated start and end fields in the annotation of output tree

9b952bf

vercel bot deployed to Preview May 17, 2024 08:50 View deployment

ivan-aksamentov added 2 commits May 17, 2024 12:32

feat: accept Auspice JSON genome annotation in read-annotation command

48d163c

refactor: aggregate inputs loading

1fc4936

vercel bot deployed to Preview May 23, 2024 10:44 View deployment

ivan-aksamentov added 3 commits May 23, 2024 15:45

feat(web): add url parameter `dataset-json-url`

a27ee66

This allows to input Auspice JSON as Nextclade dataset to the web app.

Merge remote-tracking branch 'origin/master' into feat/ref-and-ann-fr…

fb029d5

…om-tree-json

fix(web): prevent crash when an auspice dataset was used in prev session

b1b3f5f

ivan-aksamentov marked this pull request as ready for review May 23, 2024 13:59

ivan-aksamentov requested a review from a team May 23, 2024 13:59

ivan-aksamentov mentioned this pull request May 23, 2024

add .meta.extensions.nextclade.pathogen nextstrain/nextclade_data#201

Draft

vercel bot deployed to Preview May 23, 2024 14:09 View deployment

ivan-aksamentov added 2 commits May 23, 2024 16:32

fix(web): prevent crash when auspice json has no .root_sequence

e5ee068

refactor: lint

883a0d6

vercel bot deployed to Preview May 23, 2024 14:40 View deployment

j23414 mentioned this pull request May 23, 2024

Set root reference in phylogenetic builds nextstrain/dengue#56

Closed

jameshadfield reviewed May 23, 2024

packages/nextclade-web/src/io/fetchSingleDatasetAuspice.ts (outdated)

j23414 mentioned this pull request May 23, 2024

Add workflow for producing the Nextclade dengue dataset nextstrain/dengue#21

Open

jameshadfield reviewed May 24, 2024

packages/nextclade-web/src/io/fetchSingleDatasetAuspice.ts (outdated)

jameshadfield reviewed May 24, 2024

packages/nextclade-web/src/io/fetchSingleDatasetAuspice.ts (outdated)

fix(web): specifically accept json

9f3c1e0

Let's add an explicit `Accept` HTTP header when fetching Auspice JSON. This is required for nextstrain.org links to work - the server sends different content depending on `Accept` header.

vercel bot deployed to Preview May 24, 2024 13:13 View deployment

ivan-aksamentov added 2 commits May 24, 2024 16:07

fix(web): don't error when ref missing from auspice json but is provi…

82e69a1

…ded otherwise

ivan-aksamentov force-pushed the feat/ref-and-ann-from-tree-json branch from e326b5d to 44fb8a5 Compare May 24, 2024 14:40

vercel bot deployed to Preview May 24, 2024 14:50 View deployment

ivan-aksamentov merged commit 9a172d1 into master May 29, 2024
20 checks passed

ivan-aksamentov deleted the feat/ref-and-ann-from-tree-json branch May 29, 2024 18:54

ivan-aksamentov mentioned this pull request May 30, 2024

feat: fetch sidecar Auspice JSON if .root_sequence is not on the tree #1460

Merged

ivan-aksamentov added a commit that referenced this pull request May 31, 2024

feat(cli): add Auspice JSON to CLI help and docs for --input-dataset

1255932

Followup of: #1455

ivan-aksamentov mentioned this pull request May 31, 2024

feat(cli): add Auspice JSON to CLI help and docs for --input-dataset #1468

Merged
