Allow qblast to accept lists of organisms to include/exclude #4516

Open: wants to merge 13 commits into base: master

netogallo
Contributor

  • [ X] I hereby agree to dual licence this and any previous contributions under both
    the Biopython License Agreement AND the BSD 3-Clause License.

  • [ X] I have read the CONTRIBUTING.rst file, have run pre-commit
    locally, and understand that continuous integration checks will be used to
    confirm the Biopython unit tests and style checks pass with these changes.

  • [ X] I have added my name to the alphabetical contributors listings in the files
    NEWS.rst and CONTRIB.rst as part of this pull request, am listed
    already, or do not wish to be listed. (This acknowledgement is optional.)

Allow qblast to accept an explicit list of organisms to include/exclude. Note that this is already possible to a limited degree by using the Entrez query. However, my use case requires building a complex list of organisms to include/exclude and querying qblast multiple times with similar but not identical lists. This feature makes that use case easier to implement and saves BLAST from having to run the Entrez query, since I already provide the correct names and taxids of the organisms.
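For context, the existing Entrez-based route mentioned above could be sketched roughly as follows. The helper name `build_entrez_organism_query` is hypothetical and not part of Biopython; it only assembles a query string using the `txid<N>[ORGN]` Entrez syntax, which would then be passed to qblast through its existing `entrez_query` argument.

```python
def build_entrez_organism_query(include_taxids, exclude_taxids=()):
    """Build an Entrez query limiting hits to some taxa and excluding others.

    Hypothetical helper for illustration only; it is not part of Biopython.
    """
    if not include_taxids:
        raise ValueError("At least one taxid to include is required")
    included = " OR ".join(f"txid{taxid}[ORGN]" for taxid in include_taxids)
    query = f"({included})"
    if exclude_taxids:
        excluded = " OR ".join(f"txid{taxid}[ORGN]" for taxid in exclude_taxids)
        query += f" NOT ({excluded})"
    return query


# Bacteria (taxid 2) but not E. coli (taxid 562).
query = build_entrez_organism_query([2], [562])
print(query)

# The string would then be handed to qblast, for example:
#   from Bio.Blast import NCBIWWW
#   NCBIWWW.qblast("blastn", "nt", sequence, entrez_query=query)
```

With many organisms and repeated searches, this string has to be rebuilt and re-evaluated by the server each time, which is the overhead the proposed `organisms` argument aims to avoid.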

Let me know if you are open to including this in Biopython; if so, I am happy to write a test.

- organisms A dictionary that defines the organisms that will be
included/excluded in the search. The key is the name
of the organism, following the taxonomy convention
ie. "Bacteria (taxid:2)" and the value is a boolean
Member

ie -> eg

I assume (please confirm) this works with any taxonomy node like "Bacteria" and not just leaf nodes (like a specific species)?

Contributor Author

netogallo commented Nov 28, 2023

In my use cases, that is indeed the case. It also performs a set difference between taxonomy nodes, e.g.

{ "Bacteria (taxid:2)": False, "E. coli (taxid:562)": True }

will give you results that are bacteria but not E. coli. This particular behavior is what I am relying on for my use case.

If you decide to go forward with these changes, I will write a few tests to validate these properties.
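The set-difference behavior described in this comment can be illustrated locally with a toy model. This is only a sketch under stated assumptions: it assumes the boolean value means "exclude" (inferred from the example above, not confirmed by the PR), the lineage sets are abridged illustrative data, and `split_organisms` and `filter_hits` are hypothetical names. It does not contact NCBI or reproduce how BLAST applies the filter.

```python
import re

# Hypothetical pattern for pulling the taxid out of "Name (taxid:N)".
TAXID_PATTERN = re.compile(r"\(taxid:(\d+)\)")


def split_organisms(organisms):
    """Split the dict into included and excluded taxid sets.

    Assumption: a True value marks the organism as excluded.
    """
    included, excluded = set(), set()
    for name, exclude in organisms.items():
        taxid = int(TAXID_PATTERN.search(name).group(1))
        (excluded if exclude else included).add(taxid)
    return included, excluded


def filter_hits(hits, organisms):
    """Keep hits under an included node and under no excluded node."""
    included, excluded = split_organisms(organisms)
    return [
        name
        for name, lineage in hits
        if (lineage & included) and not (lineage & excluded)
    ]


# Toy hits: (organism name, abridged set of taxids in its lineage).
hits = [
    ("Escherichia coli", {2, 562}),
    ("Bacillus subtilis", {2, 1423}),
    ("Homo sapiens", {2759, 9606}),
]
organisms = {"Bacteria (taxid:2)": False, "E. coli (taxid:562)": True}
print(filter_hits(hits, organisms))  # ['Bacillus subtilis']
```

Tests for the real feature would need to check the same property against actual qblast results rather than a local table.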


if ORGANISM_REGEX.match(organism) is None:
raise ValueError(
"Organisms must be specified following the taxonomy convention. ie. 'Bacteria (taxid:2)'"
Member

ie -> eg
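For readers following the thread, a validation check of the kind quoted above might look like the sketch below, with the suggested wording change applied. The PR's actual `ORGANISM_REGEX` is not shown in this excerpt, so the pattern here is an assumption, and `validate_organisms` is a hypothetical wrapper name.

```python
import re

# Assumed pattern: any non-empty name followed by " (taxid:<digits>)".
# The real ORGANISM_REGEX from the PR is not visible in this excerpt.
ORGANISM_REGEX = re.compile(r"^.+ \(taxid:\d+\)$")


def validate_organisms(organisms):
    """Raise ValueError if any key does not follow the taxonomy convention."""
    for organism in organisms:
        if ORGANISM_REGEX.match(organism) is None:
            raise ValueError(
                "Organisms must be specified following the taxonomy "
                f"convention, e.g. 'Bacteria (taxid:2)'; got {organism!r}"
            )


validate_organisms({"Bacteria (taxid:2)": False, "E. coli (taxid:562)": True})
```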

Member

peterjc commented Nov 28, 2023

Is this explained in the latest qblast documentation from the NCBI?

@netogallo
Contributor Author

Unfortunately, I could not find any official documentation for this functionality. I worked it out by inspecting the web requests made by the "NCBI Blast" web page and reverse-engineering the functionality. For this reason, I would understand if you do not want to include the changes.
