Application of `f_unique_to_query` and `threshold_bp` #3154

Amanda-Biocortex · 2024-05-13T10:14:14Z

Hi,

I am using Sourmash to profile bacterial composition and abundance in shotgun WGS stool samples, and have two questions:

Could you expand on what you mean by this statement with regards to the f_unique_to_query column?:
'This column should be used in any analysis that needs to avoid double-counting matches.'
Currently, I am using all the rows in the output table, am I double counting by not 'using' f_unique_to_query?

My current parameters are k=31, s=1000, threshold_bp=2000
In your experience will this low threshold return a very high number of false positives?

Many thanks

ctb · 2024-05-13T15:08:04Z

> Could you expand on what you mean by this statement with regards to the f_unique_to_query column?:
> 'This column should be used in any analysis that needs to avoid double-counting matches.'
> Currently, I am using all the rows in the output table, am I double counting by not 'using' f_unique_to_query?

great question!

The statement is in contrast to f_orig_query.

Suppose that you have a metagenome, and you find two matches, with the second match being to GCA_019057995.1 Candidatus Lokiarchaeota. f_orig_query is 0.10, while f_unique_to_query is 0.05.

This means that if you were only considering the Lokiarchaeota match, it would account for 10% of the unique k-mers in the metagenome. But since it's the second match, the first match may have also matched to some of those k-mers - and in this case, it did: the first match "consumed" half of the k-mers that would have matched to Lokiarchaeota, resulting in f_unique_to_query being 0.05 instead of 0.10.

This is analogous to saying, "how many reads would map to this genome?" (f_orig_query) vs "if I map all the reads to the first match, and then only map unmapped reads to the second match, how many reads would map to the second match?" (f_unique_to_query). If you used the former number, you would potentially count reads twice. If you use the latter number, you would only count each read once.

Coming back around to your original question: it depends on the analysis, but f_unique_to_query is the fraction of the original metagenome k-mers that you can assign to that specific matching genome. We use it in (for example) tax metagenome to assign estimates of how much of the metagenome belongs to a specific species.
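To make the difference between the two columns concrete, here is a minimal sketch using plain Python sets in place of real sketches. It is not sourmash's implementation: the function name `greedy_fractions`, the toy hash values, and the genome names are all made up for illustration, and it assumes matches are picked greedily by largest remaining overlap.

```python
def greedy_fractions(query, genomes):
    """Assign query hashes to genomes greedily, largest remaining overlap first.

    query   -- set of hashes from the metagenome (toy integers here)
    genomes -- dict mapping a genome name to its set of hashes
    """
    remaining = set(query)      # query hashes not yet claimed by any match
    pending = dict(genomes)     # genomes not yet reported
    results = []
    while pending:
        # Pick the genome that covers the most still-unclaimed hashes.
        name = max(pending, key=lambda g: len(pending[g] & remaining))
        unique = pending.pop(name) & remaining
        if not unique:
            break               # nothing left to assign
        results.append({
            "name": name,
            # Overlap with the whole query, ignoring earlier matches.
            "f_orig_query": len(genomes[name] & query) / len(query),
            # Overlap with only the hashes earlier matches did not claim.
            "f_unique_to_query": len(unique) / len(query),
        })
        remaining -= unique
    return results


# Toy data: integers stand in for hashed k-mers.
query = set(range(100))
genomes = {
    "first_match": set(range(0, 30)),     # 30 hashes in the query
    "lokiarchaeota": set(range(25, 35)),  # 10 hashes, 5 shared with first_match
}

results = greedy_fractions(query, genomes)
for row in results:
    print(row)
```

In this toy case the second match has f_orig_query of 0.10 but f_unique_to_query of 0.05, because the first match already claimed half of its hashes. Summing f_unique_to_query over all rows counts each query hash at most once; summing f_orig_query would count the five shared hashes twice.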

I hope that's useful. It's complicated to explain and I'm afraid I didn't do a great job!

(I've also noticed a mistake in the docs - it says, "The fraction of matching hashes (unweighted) that are unique to this query; rank dependent." It's actually the fraction unique to this match. I'll fix.)

> My current parameters are k=31, s=1000, threshold_bp=2000
> In your experience will this low threshold return a very high number of false positives?

this is complicated - there's lots of discussion elsewhere, see #2360 for example.

My rule of thumb is that k=31, s=1000, and threshold-bp=3*s is good. So I'd recommend using 3000.
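As a rough illustration of what the 3*s rule of thumb means in terms of hashes: assuming each retained hash stands in for about `scaled` k-mers, dividing threshold_bp by the scaled value approximates how many shared hashes a match needs before it is reported. The helper name below is hypothetical and is not part of sourmash.

```python
def approx_min_hashes(threshold_bp, scaled):
    """Approximate number of shared hashes implied by a threshold_bp value.

    Assumption: one hash represents roughly `scaled` k-mers, so the
    threshold in bp divided by scaled gives the hash count.
    """
    return threshold_bp / scaled


scaled = 1000
for threshold_bp in (2000, 3000):
    print(threshold_bp, approx_min_hashes(threshold_bp, scaled))
```

Under that assumption, threshold_bp=2000 with s=1000 lets a match through on about two shared hashes, while 3000 requires about three.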

There's lots more to say here, but rather than thinking too hard about it, I'd suggest following up with mapping-based validation of a subset of your matches. That is, take your top 10 matches, and map reads to them. You should see good correspondence between what sourmash reports and what read mapping shows.
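One way to pull out the matches to validate is sketched below. It assumes the gather CSV is written in rank order and has `name` and `f_unique_weighted` columns; the function name and the inline example rows are invented for illustration only.

```python
import csv
import io


def top_matches(gather_csv, n=10):
    """Return (name, f_unique_weighted) for the first n rows of a gather CSV.

    gather_csv -- an open text file (or any iterable of CSV lines)
    """
    rows = list(csv.DictReader(gather_csv))
    return [(row["name"], float(row["f_unique_weighted"])) for row in rows[:n]]


# Invented example rows; a real run would pass open("gather.csv", newline="").
example = io.StringIO(
    "name,f_unique_weighted\n"
    "genome_A,0.40\n"
    "genome_B,0.25\n"
    "genome_C,0.05\n"
)

picked = top_matches(example, n=2)
for name, fraction in picked:
    print(name, fraction)
```

The fractions returned here can then be set side by side with the fraction of reads that map to each genome.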

Please feel free to ask for more details! I have lots! I just don't want to overwhelm ;)

Amanda-Biocortex · 2024-05-13T15:35:28Z

Such a clear explanation of f_unique_to_query! Thank you!

Is the abundance metric calculated such that matches aren't double counted (i.e. abundance relates to f_unique_to_query as opposed to f_orig_query)? I'm solely relying on Sourmash for abundance calculation (i.e. I'm not using Sourmash to inform a 'minimum metagenome cover' for further alignment).

On a slightly different note, can you expand on the potential_false_positive column? I've read in the documentation that this is to do with having a small sketch size, but I'm struggling to understand this given k and threshold_bp are set to establish a minimum similarity? Should I be ignoring matches where potential_false_positive=true?

Thanks again @ctb :)

ctb · 2024-05-13T17:21:52Z

> Such a clear explanation of f_unique_to_query! Thank you!
>
> Is the abundance metric calculated such that matches aren't double counted (i.e. abundance relates to f_unique_to_query as opposed to f_orig_query)? I'm solely relying on Sourmash for abundance calculation (i.e. I'm not using Sourmash to inform a 'minimum metagenome cover' for further alignment).

hah! You picked out an issue I didn't want to cover for fear of confusing you more ;).

f_unique_weighted is the abundance-weighted version that actually corresponds to how many reads will map, while f_unique_to_query is the overlap between the distinct k-mers in the query (metagenome) & the match; it is hard to explain its utility except as internal details :).
average_abund and median_abund are calculated using just the unique matches.

tl;dr use f_unique_weighted and average_abund/median_abund.
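A small sketch of how the weighted and unweighted fractions can diverge, again using toy data rather than sourmash itself. The function name, the hash values, and the abundances are invented; the formulas are a simplified reading of the column descriptions above, not the library's code.

```python
from statistics import mean, median


def weighted_summary(query_abunds, unique_hashes):
    """Summarize one match from the hashes uniquely assigned to it.

    query_abunds  -- dict mapping each query hash to its abundance
    unique_hashes -- hashes assigned to this match and to no earlier match
                     (must share at least one hash with the query)
    """
    abunds = [query_abunds[h] for h in unique_hashes if h in query_abunds]
    total_abund = sum(query_abunds.values())
    return {
        # Fraction of distinct query hashes assigned to this match.
        "f_unique_to_query": len(abunds) / len(query_abunds),
        # Same hashes, but each one counted as many times as it was seen.
        "f_unique_weighted": sum(abunds) / total_abund,
        "average_abund": mean(abunds),
        "median_abund": median(abunds),
    }


# 90 hashes seen once each, plus 10 hashes seen 11 times each.
query_abunds = {h: 1 for h in range(90)}
query_abunds.update({h: 11 for h in range(90, 100)})

# Suppose the match is uniquely assigned the 10 high-abundance hashes.
summary = weighted_summary(query_abunds, set(range(90, 100)))
print(summary)
```

Here the match covers only 10% of the distinct hashes but 55% of the abundance-weighted total, which is why the weighted column is the one that tracks how many reads would map.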

> On a slightly different note, can you expand on the potential_false_positive column? I've read in the documentation that this is to do with having a small sketch size, but I'm struggling to understand this given k and threshold_bp are set to establish a minimum similarity? Should I be ignoring matches where potential_false_positive=true?

Ahh! This is just about ANI estimates - from these docs,

> True if the sketch size(s) were too small to give a reliable ANI estimate. False otherwise.

So it's about the internal estimation of ANI, not anything else about the match, if that makes sense.

Although, intuitively, if the match is too small to estimate ANI robustly, maybe that suggests the match itself isn't that robust... hmm. @bluegenes @dkoslicki thoughts?

Amanda-Biocortex · 2024-05-14T11:40:12Z

thanks!

@bluegenes @dkoslicki it would be good to get your thoughts on dropping matches that may be false positives.

Also, the column name is actually potential_false_negative, is it supposed to be potential_false_positive?

ctb · 2024-05-20T16:49:05Z

> Also, the column name is actually potential_false_negative, is it supposed to be potential_false_positive?

hmm, I ... don't know. Are we doing double negatives here? 😭

@bluegenes your thoughts welcome!

ctb mentioned this issue May 20, 2024

MRG: fix description of f_unique_weighted #3164

Merged
