
Incorporation of enterovirus dataset into Nextclade Docker container #1238

Open
laura-bankers opened this issue Aug 15, 2023 · 1 comment
Labels
t:feat Type: request of a new feature, functionality, enhancement

Comments

@laura-bankers

Hello,

We are developing a bioinformatics workflow for EV-D68 WGS for public health surveillance, to be run on Terra.bio. There appears to be an enterovirus Nextclade dataset in the GitHub repo; however, it is not available in the most recent Docker container. We would love to be able to use Nextclade for clade assignment. Would it be possible to get this dataset added to the container available on Dockstore?

Thanks,
Laura

laura-bankers added the labels good first issue, help wanted, needs triage, and t:feat on Aug 15, 2023
@ivan-aksamentov
Member

ivan-aksamentov commented Aug 15, 2023

Hi Laura @laura-bankers,

Thanks for your interest! We are very happy that people are reaching out and asking about new pathogens.

Do you mean these files?

https://github.com/nextstrain/nextclade/tree/master/data/enterovirus/d68

Sadly, these are only a genome annotation, a reference sequence and a few example sequences, so they are not enough to run Nextclade (which also currently requires a reference tree, QC config and virus properties config). These files are historically only there to provide some examples to run Nextalign (which is like Nextclade, but only does alignment and translation).

Or maybe you've seen other files somewhere else? Could you please send me a link?

I don't exclude the possibility that there are datasets on the internet, created by the community, that we don't know about.
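
As a rough illustration of the point above about which files a dataset needs, here is a minimal Python sketch that reports which of those files are missing from a local dataset directory. The file names are an assumption based on the usual layout of Nextclade v2 datasets and are not stated in this thread, and `missing_dataset_files` is a hypothetical helper, not part of Nextclade.

```python
from pathlib import Path

# Assumed file names (typical Nextclade v2 dataset layout); adjust if yours differ.
REQUIRED_FILES = {
    "reference.fasta": "reference sequence",
    "genemap.gff": "genome annotation",
    "tree.json": "reference tree",
    "qc.json": "QC config",
    "virus_properties.json": "virus properties config",
}


def missing_dataset_files(dataset_dir):
    """Return {file name: description} for required files absent from dataset_dir."""
    root = Path(dataset_dir)
    return {
        name: description
        for name, description in REQUIRED_FILES.items()
        if not (root / name).is_file()
    }


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python check_dataset.py path/to/dataset
    directory = sys.argv[1] if len(sys.argv) > 1 else "."
    for name, description in missing_dataset_files(directory).items():
        print(f"missing {description}: {name}")
```

Run against a directory that holds only a reference sequence and a genome annotation, this would list the reference tree, QC config and virus properties config as missing.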


A few notes which may help you in your work with Nextclade:

Dockstore containers are not something the Nextclade team is aware of. This is not an official source; it is probably a community effort, which we are happy to hear about but don't have the bandwidth to support officially.

Official Docker containers (on DockerHub) and all other official means of distribution of Nextclade CLI (listed in the docs) deliberately don't contain datasets. Nextclade is pathogen-agnostic by design. It only reads an index.json file hosted on our servers, which contains a list of known datasets, and can then download datasets from this list using the nextclade dataset get command. This is purely for convenience: you can also load any dataset you want from your computer. So, if you find a dataset you like, or create one, you can pass it to Nextclade just as you would an officially downloaded one.

You can try to build your own dataset to support a new pathogen. It's quite a challenging adventure at this time, but I gathered some of the information in response to this issue in the hope that it helps people: #1225

We are working on the next major version of Nextclade, version 3. In the new version there will be significant changes to datasets. Nextalign will be removed and all dataset files previously required for Nextclade will become optional, so you could build a dataset gradually, starting small and adding new features later as needed. We are also hoping to document the creation of new datasets better and to provide tools to make the process easier. It's all coming soon. Stay tuned!
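
To make the note above about official versus local datasets more concrete, here is a small Python sketch that only assembles the two command lines involved: one downloads a listed dataset, the other runs Nextclade against a dataset directory on disk. The flags (`--name`, `--output-dir`, `--input-dataset`, `--output-all`) are the Nextclade v2 CLI options as I understand them and are not given in this thread, so check `nextclade --help` for your version. The helper functions and all paths are hypothetical placeholders.

```python
import shlex


def dataset_get_command(name, output_dir):
    """Command that downloads an officially listed dataset into output_dir."""
    return ["nextclade", "dataset", "get", "--name", name, "--output-dir", output_dir]


def run_command(dataset_dir, sequences, output_dir):
    """Command that runs Nextclade on sequences using a dataset directory on disk."""
    return [
        "nextclade", "run",
        "--input-dataset", dataset_dir,
        "--output-all", output_dir,
        sequences,
    ]


if __name__ == "__main__":
    # Placeholder names and paths; 'sars-cov-2' is only an example of a listed dataset.
    print(shlex.join(dataset_get_command("sars-cov-2", "data/sars-cov-2")))
    # A locally built or community dataset is passed exactly like a downloaded one.
    print(shlex.join(run_command("data/my-ev-d68", "sequences.fasta", "output")))
```

The resulting lists can be handed to `subprocess.run` once Nextclade is installed. The only difference between the official and the local case is where the dataset directory came from.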

ivan-aksamentov removed the labels good first issue, help wanted, and needs triage on Aug 15, 2023