searching database for any duplicate genomes #3094

SAMtoBAM · 2024-03-21T19:20:40Z

Hi there

I have received genomes from numerous sources. Some were previously public and some were not, but I don't know which.
So I have a set of genomes and want to see whether any of them are identical to genomes in the larger, complete public set.

First, do you think sourmash would be a suitable and fast option to determine this?
Second, are there appropriately stringent k-mer size and scaled options for building the signature databases?

Thanks a lot

ctb · 2024-03-21T19:25:12Z

First, do you think sourmash would be a suitable and fast option to determine this?

Yes, I think so. Using sourmash you could find genomes that are 99.9% identical (or so) to genomes in a database.

Second, are there appropriately stringent k-mer size and scaled options for building the signature databases?

The default options (scaled=1000, k=31) should be good for identifying candidates up to about 99.9% ANI similarity, and you would be able to use our public databases (GTDB or NCBI) with those parameters. You might need to think about how to investigate further after you find near-identical matches, though; if you want to identify only perfect matches, you should do post-processing of the sourmash results.
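As a rough illustration of that post-processing step, here is a minimal Python sketch that keeps only the rows of a sourmash search results CSV whose similarity reaches a chosen threshold. It assumes the CSV has `similarity` and `name` columns; the function name `perfect_matches` and the threshold parameter are hypothetical, so check the column names against your own output. Note that a similarity of 1.0 only means the sampled k-mers agree, so matches kept here are still candidates to verify by other means.

```python
import csv


def perfect_matches(result_lines, min_similarity=1.0):
    """Return result rows whose similarity is at least min_similarity.

    result_lines is any iterable of CSV text lines, e.g. an open file.
    Assumes a header row containing a 'similarity' column (an assumption
    about the search output layout, not something stated in this thread).
    """
    rows = csv.DictReader(result_lines)
    return [row for row in rows if float(row["similarity"]) >= min_similarity]


# Hypothetical usage with a results file written by a search run:
# with open("results.csv", newline="") as handle:
#     for row in perfect_matches(handle):
#         print(row["name"], row["similarity"])
```

Lowering `min_similarity` slightly (for example to 0.999) would also keep near-identical candidates for closer inspection.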

Let me know if that doesn't make sense or you have more questions ;)

SAMtoBAM · 2024-03-21T19:32:17Z

Thanks for the quick response
Would increasing the k-mer size to 51 or above help, considering that I just want identical matches?
I would create my own smaller signature database for the public genomes so I could modify the kmer size there too

ctb · 2024-03-22T03:53:18Z

oh, yes! then k=51, and/or lower scaled values (scaled=100, for example), would ensure perfect identity.

If only exact matches are needed, you can compare the md5sum of the signatures directly to find matches, without needing to do the search. Either run sourmash sig describe <sketchfiles>, or sketch everything to a zip file with sourmash sketch dna -p k=51,scaled=100 -o out.zip *.fa and then run sourmash sig manifest out.zip -o out.mf.csv. The md5 entries will be the same if two sketches are the same.
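The md5 comparison can be sketched in a few lines of Python. This is a minimal example, assuming the manifest CSV (out.mf.csv above) has `md5` and `name` columns and may start with a `#` comment line before the header; the function name `find_duplicate_md5s` is hypothetical. It groups rows by md5 and reports any md5 shared by more than one sketch.

```python
import csv
from collections import defaultdict


def find_duplicate_md5s(manifest_lines):
    """Map each md5 seen more than once to the list of sketch names sharing it.

    manifest_lines is any iterable of CSV text lines, e.g. an open file.
    Lines starting with '#' are skipped (assumed to be a version comment).
    """
    data_lines = (line for line in manifest_lines if not line.startswith("#"))
    groups = defaultdict(list)
    for row in csv.DictReader(data_lines):
        groups[row["md5"]].append(row["name"])
    return {md5: names for md5, names in groups.items() if len(names) > 1}


# Hypothetical usage with the manifest written by `sourmash sig manifest`:
# with open("out.mf.csv", newline="") as handle:
#     for md5, names in find_duplicate_md5s(handle).items():
#         print(md5, *names, sep="\t")
```

Sketches only share an md5 when they were built with the same parameters, so sketch your own genomes and the public set with identical k and scaled values before comparing.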

HTH!
