Searching database for any duplicate genomes #3094
Comments
Yes, I think so. Using sourmash, you could find genomes that are about 99.9% identical to genomes in a database.
The default options (scaled=1000, k=31) should be good for identifying candidates up to about 99.9% ANI similarity, and you would be able to use our public databases (GTDB or NCBI) with those parameters. You might need to think about how to investigate further after you find near-identical matches, though; if you want to identify only perfect matches, you will need to post-process the sourmash results. Let me know if that doesn't make sense or you have more questions ;)
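To give a feel for why near-identical genomes still need follow-up, here is a minimal sketch of the usual point estimate relating ANI to k-mer containment, containment ≈ ANI^k. It assumes a simple model of independent point mutations; the function names are made up for illustration and are not part of the sourmash API.

```python
def expected_containment(ani: float, ksize: int) -> float:
    """Approximate fraction of shared k-mers between two genomes at a given ANI.

    Assumption: mutations are independent, so a k-mer survives unchanged
    with probability ani ** ksize.
    """
    return ani ** ksize


def ani_from_containment(containment: float, ksize: int) -> float:
    """Invert the estimate above: ANI is roughly containment ** (1 / ksize)."""
    return containment ** (1.0 / ksize)


if __name__ == "__main__":
    # Compare how quickly shared k-mers drop off at different k sizes.
    for ksize in (21, 31, 51):
        shared = expected_containment(0.999, ksize)
        print(f"k={ksize}: expected containment at 99.9% ANI = {shared:.4f}")
```

Under this model, two genomes at 99.9% ANI still share roughly 97% of their 31-mers and roughly 95% of their 51-mers, so a high similarity score on its own does not distinguish "nearly identical" from "identical".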
Thanks for the quick response
Oh, yes! Then k=51 and/or lower scaled values (scaled=100, for example) would be more stringent and get you closer to requiring perfect identity. If only exact matches are needed, you can compare the md5sum of the signatures directly to find matches, without needing to do the search - if you do HTH!
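A rough sketch of the md5sum comparison, using only the Python standard library. It assumes signatures are stored as sourmash JSON files (optionally gzipped) in which each record has a `signatures` list whose entries carry `ksize` and `md5sum` fields; the helper names and file paths are hypothetical. Note that a matching md5sum means the sketches are identical, which is strong evidence but not proof that the full genomes are identical.

```python
import gzip
import json


def load_md5s(path, ksize=31):
    """Map sketch md5sum -> signature name for one signature file.

    Assumption: the file is sourmash-style JSON, i.e. a list of records,
    each with a "signatures" list containing "ksize" and "md5sum".
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as fh:
        records = json.load(fh)

    md5_to_name = {}
    for record in records:
        name = record.get("name") or record.get("filename") or str(path)
        for sketch in record.get("signatures", []):
            if sketch.get("ksize") == ksize:
                md5_to_name[sketch["md5sum"]] = name
    return md5_to_name


def exact_sketch_matches(query_paths, reference_paths, ksize=31):
    """Return (query_name, reference_name) pairs whose sketches share an md5sum."""
    reference = {}
    for path in reference_paths:
        reference.update(load_md5s(path, ksize))

    matches = []
    for path in query_paths:
        for md5, name in load_md5s(path, ksize).items():
            if md5 in reference:
                matches.append((name, reference[md5]))
    return matches
```

Usage would look something like `exact_sketch_matches(["my_genome.sig"], ["public_genome.sig"], ksize=51)`, with both sets sketched using the same k and scaled values so that the md5sums are comparable.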
Hi there
I have received genomes from numerous sources. Some were previously public and some were not, but I don't know which.
So I have a set of genomes and want to see if any of them are identical to genomes in the larger, complete public set.
First, do you think sourmash would be a suitable and fast option to determine this?
Second, are there appropriately stringent k-mer size and scaled options for building the signature databases?
Thanks a lot