Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Application of f_unique_to_query and threshold_bp #3154

Open
Amanda-Biocortex opened this issue May 13, 2024 · 5 comments
Open

Application of f_unique_to_query and threshold_bp #3154

Amanda-Biocortex opened this issue May 13, 2024 · 5 comments

Comments

@Amanda-Biocortex
Copy link

Hi,

I am using Sourmash to profile bacteria composition and abundance in shotgun WGS stool samples, and have two questions:

Could you expand on what you mean by this statement with regards to the f_unique_to_query column?:
'This column should be used in any analysis that needs to avoid double-counting matches.'
Currently, I am using all the rows in the output table, am I double counting by not 'using' f_unique_to_query?

My current parameters are k=31, s=1000, threshold_bp=2000
In your experience will this low threshold return a very high number of false positives?

Many thanks

@ctb ctb changed the title Application of f_unique_to_query and threshold_bp Application of f_unique_to_query and threshold_bp May 13, 2024
@ctb
Copy link
Contributor

ctb commented May 13, 2024

Could you expand on what you mean by this statement with regards to the f_unique_to_query column?:
'This column should be used in any analysis that needs to avoid double-counting matches.'
Currently, I am using all the rows in the output table, am I double counting by not 'using' f_unique_to_query?

great question!

The statement is in contrast to f_orig_query.

Suppose that you have a metagenome, and you find two matches, with the second match being to GCA_019057995.1 Candidatus Lokiarchaeota. f_orig_query is 0.10, while f_unique_to_query is 0.05.

This means that if you were only considering the Lokiarchaeota match, it would account for 10% of the unique k-mers in the metagenome. But since it's the second match, the first match may have also matched to some of those k-mers - and in this case, it did: the first match "consumed" half of the k-mers that would have matched to Lokiarchaeota, resulting in f_unique_to_query being 0.05 instead of 0.10.

This is analogous to saying, "how many reads would map to this genome?" (f_orig_query) vs "if I map all the reads to the first match, and then only map unmapped reads to the second match, how many reads would map to the second match?" (f_unique_to_query). If you used the former number, you would potentially count reads twice. If you use the latter number, you would only count each read once.

Coming back around to your original question: it depends on the analysis, butf_unique_to_query is the fraction of the original metagenome k-mers that you can assign to that specific matching genome. We use it in (for example) tax metagenome to assign estimates of how much of the metagenome belongs to a specific species.

I hope that's useful. It's complicated to explain and I'm afraid I didn't do a great job!

(I've also noticed a mistake in the docs - it says, "The fraction of matching hashes (unweighted) that are unique to this query; rank dependent." It's actually the fraction unique to this match. I'll fix.)

My current parameters are k=31, s=1000, threshold_bp=2000
In your experience will this low threshold return a very high number of false positives?

this is complicated - there's lots of discussion elsewhere, see #2360 for example.

My rule of thumb is that k=31, s=1000, and threshold-bp=3*s is good. So I'd recommend using 3000.

There's lots more to say here, but rather than thinking too hard about it, I'd suggest following up with mapping-based validation of a subset of your matches. That is, take your top 10 matches, and map reads to them. You should see good correspondence between what sourmash reports and what read mapping shows.

Please feel free to ask for more details! I have lots! I just don't want to overwhelm ;)

@Amanda-Biocortex
Copy link
Author

Such a clear explanation of f_unique_query ! thank you!

Is the abundance metric calculated such that matches aren't doubled counted (ie abundance relates to f_unique_to_query as opposed to f_orig_query)? I'm solely relying on Sourmash for abundance calculation (ie I'm not using Sourmash to inform a 'minimum metagenome cover' for further alignment)

On a slightly different note, can you expand on the potential_false_positive column? Ive read in the documentation that this is to do with having a small sketch size, but I'm struggling to understand this given k and threshold_bp are set to establish a minimum similarity? Should I be ignoring matches where potential_false_positive=true?

Thanks again @ctb :)

@ctb
Copy link
Contributor

ctb commented May 13, 2024

Such a clear explanation of f_unique_query ! thank you!

Is the abundance metric calculated such that matches aren't doubled counted (ie abundance relates to f_unique_to_query as opposed to f_orig_query)? I'm solely relying on Sourmash for abundance calculation (ie I'm not using Sourmash to inform a 'minimum metagenome cover' for further alignment)

hah! You picked out an issue I didn't want to cover for fear of confusing you more ;).

f_unique_weighted is the abundance-weighted version that actually corresponds to how many reads will map, while f_unique_to_query is the overlap between the distinct k-mers in the query (metagenome) & the match; it is hard to explain its utility except as internal details :).
average_abund and median_abund are calculated using just the unique matches.

tl;dr use f_unique_weighted and average_abund/median_abund.

On a slightly different note, can you expand on the potential_false_positive column? Ive read in the documentation that this is to do with having a small sketch size, but I'm struggling to understand this given k and threshold_bp are set to establish a minimum similarity? Should I be ignoring matches where potential_false_positive=true?

Ahh! This is just about ANI estimates - from these docs,

True if the sketch size(s) were too small to give a reliable ANI estimate. False otherwise.

So it's about the internal estimation of ANI, not anything else about the match, if that makes sense.

Although, intuitively, if the match is too small to estimate ANI robustly, maybe that suggests the match itself isn't that robust... hmm. @bluegenes @dkoslicki thoughts?

@Amanda-Biocortex
Copy link
Author

thanks!

@bluegenes @dkoslicki would be good to get youre thoughts on dropping matches that may be false positives.

Also, the column name is actually potential_false_negative, is it supposed to be potential_false_positive?

@ctb
Copy link
Contributor

ctb commented May 20, 2024

Also, the column name is actually potential_false_negative, is it supposed to be potential_false_positive?

hmm, I ... don't know. Are we doing double negatives here? 😭

@bluegenes your thoughts welcome!

ctb added a commit that referenced this issue May 28, 2024
Addresses
#3154 (comment)
-

>(I've also noticed a mistake in the docs - it says, "The fraction of
matching hashes (unweighted) that are unique to this query; rank
dependent." It's actually the fraction unique to this match. I'll fix.)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants