searching database for any duplicates genomes #3094

Open
SAMtoBAM opened this issue Mar 21, 2024 · 3 comments

@SAMtoBAM

Hi there

I have received genomes from numerous sources. Some were previously public and some were not, but I don't know which.
So I have a set of genomes and want to check whether any of them are identical to genomes in the larger, complete public set.

First, do you think sourmash would be a suitable and fast option to determine this?
Second, are there appropriately stringent k-mer size and scaled options for building the signature databases?

Thanks a lot

@ctb
Contributor

ctb commented Mar 21, 2024

First, do you think sourmash would be a suitable and fast option to determine this?

Yes, I think so. Using sourmash, you could find genomes that are 99.9% identical (or so) to entries in a database.

Second, are there appropriately stringent k-mer size and scaled options for building the signature databases?

The default options (scaled=1000, k=31) should be good for identifying candidates up to about 99.9% ANI similarity, and you would be able to use our public databases (GTDB or NCBI) with those parameters. You might need to think about how to investigate further after you find near-identical matches, though; if you want to identify only perfect matches, you should do post-processing of the sourmash results.
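One possible form of that post-processing, sketched here as an assumption rather than a sourmash feature: once sourmash has narrowed things down to a few near-identical candidates, compare the sequences themselves with a checksum. The helper names `read_fasta` and `genome_digest` are hypothetical, and the check treats two genomes as identical only if they contain the same set of contig sequences, regardless of contig names, order, line wrapping, or letter case.

```python
import hashlib
import io


def read_fasta(handle):
    """Yield one uppercase sequence string per FASTA record, ignoring headers."""
    chunks = []
    in_record = False
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if in_record:
                yield "".join(chunks).upper()
            chunks = []
            in_record = True
        elif line and in_record:
            chunks.append(line)
    if in_record:
        yield "".join(chunks).upper()


def genome_digest(handle):
    """SHA-256 over the sorted contig sequences of one FASTA file.

    Sorting makes the digest independent of contig order; headers are
    never hashed, so renamed contigs still match.
    """
    digest = hashlib.sha256()
    for seq in sorted(read_fasta(handle)):
        digest.update(seq.encode("ascii"))
        digest.update(b"\n")  # separator so contig boundaries matter
    return digest.hexdigest()


# Toy in-memory "files" (made-up sequences, for illustration only).
genome_a = ">c1\nACGT\nAC\n>c2\nGGG\n"
genome_b = ">y\nggg\n>x\nACGTAC\n"      # same contigs, renamed and reordered
genome_c = ">c1\nACGTAT\n>c2\nGGG\n"    # one base differs

same = genome_digest(io.StringIO(genome_a)) == genome_digest(io.StringIO(genome_b))
different = genome_digest(io.StringIO(genome_a)) != genome_digest(io.StringIO(genome_c))
print(same, different)
```

This sketch does not handle reverse-complemented or split/merged contigs, so a mismatch here does not rule out the same underlying genome; it only confirms byte-level sequence identity for the candidates sourmash reports.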

Let me know if that doesn't make sense or you have more questions ;)

@SAMtoBAM
Author

Thanks for the quick response
Would increasing the k-mer size to 51 or above help, considering that I just want identical matches?
I would create my own smaller signature database for the public genomes, so I could modify the k-mer size there too.

@ctb
Contributor

ctb commented Mar 22, 2024

Oh, yes! Then k=51 and/or a lower scaled value (scaled=100, for example) would ensure perfect identity.

If only exact matches are needed, you can compare the md5sums of the signatures directly to find matches, without needing to run a search. Either run sourmash sig describe <sketchfiles>, or sketch everything into a zip file with sourmash sketch dna -p k=51,scaled=100 -o out.zip *.fa and then run sourmash sig manifest out.zip -o out.mf.csv. Either way, you'll find md5 entries that will be the same if two sketches are the same.
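The md5 comparison can be scripted. Below is a minimal sketch that groups manifest rows by md5 and reports any md5 shared by more than one sketch. It assumes the manifest CSV starts with a `#` comment line followed by a header row containing `md5` and `name` columns; check your own out.mf.csv before relying on that. The function name `duplicate_sketches` and the sample rows (including the md5 values) are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def duplicate_sketches(manifest_handle):
    """Return {md5: [names]} for every md5 that occurs more than once."""
    # Skip comment lines such as the manifest version line (assumed to start with '#').
    rows = (line for line in manifest_handle if not line.startswith("#"))
    groups = defaultdict(list)
    for row in csv.DictReader(rows):
        groups[row["md5"]].append(row["name"])
    return {md5: names for md5, names in groups.items() if len(names) > 1}


# Made-up manifest content standing in for out.mf.csv.
sample_manifest = """\
# SOURMASH-MANIFEST-VERSION: 1.0
internal_location,md5,md5short,ksize,moltype,num,scaled,n_hashes,with_abundance,name,filename
signatures/a.sig.gz,aaa111,aaa111,51,DNA,0,100,5000,0,my_genome_1,g1.fa
signatures/b.sig.gz,bbb222,bbb222,51,DNA,0,100,4800,0,my_genome_2,g2.fa
signatures/c.sig.gz,aaa111,aaa111,51,DNA,0,100,5000,0,public_genome_X,px.fa
"""

dupes = duplicate_sketches(io.StringIO(sample_manifest))
for md5, names in dupes.items():
    print(md5, names)
```

To run it on a real manifest, pass `open("out.mf.csv", newline="")` instead of the `io.StringIO` sample. A shared md5 means the two sketches are the same; since a sketch samples only a fraction of the k-mers, treat such pairs as very strong candidates and confirm at the sequence level if exact identity matters.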

HTH!
