-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve CE Align #4695
base: master
Are you sure you want to change the base?
Improve CE Align #4695
Conversation
… A is more similar than B
Given this is C code, @mdehoon could you take a look at this too please? |
Not forgotten, trying to find a nice way to benchmark/test this on a set of curated alignments. |
I was reading this review article of structure alignment today. Section 3.5 includes an evaluation of different pairwise structure alignment methods using different databases:
I will see if I can follow a similar approach to verify that my changes are beneficial. |
👍👍 That's what I was trying to do. I got a local copy of HOMSTRAD and
evaluators and was going to run there.
A sexta, 17/05/2024, 15:41, Will Tyler ***@***.***> escreveu:
… I was reading this review article
<https://www.sciencedirect.com/science/article/abs/pii/B9780128001684000056>
of structure alignment today. Section 3.5 includes an evaluation of
different pairwise structure alignment methods using different databases:
- conserved domain database <https://www.ncbi.nlm.nih.gov/cdd/>,
- Manual ALIgnments of DUPlicated domains (MALIDUP)
<http://prodata.swmed.edu/malidup/>,
- Manual ALIgnments of Structurally Analogous Motifs (MALISAM)
<http://prodata.swmed.edu/malisam/>,
- Sequence Alignment Benchmark (SABmark)
<https://academic.oup.com/bioinformatics/article/21/7/1267/268759?login=false>
.
I will see if I can follow a similar approach to verify that my changes
are beneficial.
—
Reply to this email directly, view it on GitHub
<#4695 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AABZ6EH3SUYW4T5NYUBYVZLZCZMNJAVCNFSM6AAAAABGBH655GVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMJYGI3DENRWGA>
.
You are receiving this because your review was requested.Message ID:
***@***.***>
|
I did some alignments using the MALIDUP database. Results"Improved" CE Align
Original CE Align
DiscussionThe RMSDs in the improved CE Align are smaller, but the lengths of the alignments are also smaller. As I understand it, it is better to find a longer alignment that satisfies the distance criteria even if the RMSD is worse. I will reexamine the implementation to see how to improve this. Currently, the implementation uses a circular buffer to store twenty long paths (not necessarily the longest). I will see if I can perhaps improve this situation by using a heap to store the twenty longest paths. |
Indeed - we want longer alignments, unless the difference is small and the RMSD decreases substantially. The drop in length is unfortunately quite substantial (average of 23 residues per alignment). Seems like the smaller alignments suffer the most, so maybe the changes you made to the scoring had some unfortunate side-effect? |
Thanks for the guidance—I updated my implementation to store the twenty longest paths that the algorithm encounters during its search in descending order. (I use a simple bubble sort instead of a heap since the maximum size of the buffer is bounded to 20 paths.) Then of the paths with maximum length, the Python code selects the one with the smallest RMSD. ResultsMALIDUPImproved CE Align
Original CE Align
MALISAMImproved CE Align
Original CE Align
DiscussionThe improved implementation and the original implementation give alignments that are the same lengths. I believe this is because of a bug in the original implementation that results in long paths being duplicated in the circular buffer. The original implementation uses the condition The RMSDs in the improved implementation are slightly smaller. This is most likely because my implementation is returning more unique paths for the Python code to choose from. As for trading off between alignment length and RMSD, I am not sure exactly how to do this. The CE Align paper says that "the twenty best paths at the end of the search are evaluated based upon RMSD and the best one is selected" (under Optimization of the final path). I interpret "best" as meaning longest. Guidance would be appreciated on this. |
Thanks for checking all this @Will-Tyler and thank you for catching that path duplication! I had seen it but thought it was a bug I had introduced in the code myself, and since it was still giving good results I didn't really care much about fixing it. Picking the longest alignment seems interesting. I wonder what was the criteria used by the authors? I cannot remember the paper anymore.. They also mention a ZScore but that strategy seems to have been abandoned in this implementation? |
During optimization of the final path (which is only done for alignments with a z-score above 3.5), the paper says the algorithm should select the path with the best RMSD from the "best" twenty paths. I think the authors mean the longest paths when they say "best". The algorithm then tries to change the location of the gaps and iteratively improve the alignment using DP to try to increase the alignment length while keeping the RMSD at about the same level. I think I need to read the referenced papers in the CE Align paper to fully understand the final optimization. For this PR, I am wondering if we can stick to the longest alignment with the best RMSD (which should be a small improvement over the existing implementation) and implement the final optimization in another PR. |
That makes sense! The original C code was doing only the rmsd optimization
(pick the 20 best paths, select the one with lowest rmsd). I carved that
out to do the rmsd part in Python since it was simpler to maintain at a
very small cost in performance.
So, if you fixed that bug with repeated paths and get lower rmsds with same
length this is great!! Happy to accept this and we can think of other
optimizations later on.
Thank you a lot Will!
A sexta, 24/05/2024, 22:02, Will Tyler ***@***.***> escreveu:
… During optimization of the final path (which is only done for alignments
with a z-score above 3.5), the paper says the algorithm should select the
path with the best RMSD from the "best" twenty paths. I think the authors
mean the longest paths when they say "best". The algorithm then tries to
change the location of the gaps and iteratively improve the alignment using
DP to try to increase the alignment length while keeping the RMSD at about
the same level.
I think I need to read the referenced papers in the CE Align paper to
fully understand the final optimization. For this PR, I am wondering if we
can stick to the longest alignment with the best RMSD (which should be a
small improvement over the existing implementation) and implement the final
optimization in another PR.
—
Reply to this email directly, view it on GitHub
<#4695 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AABZ6EA3FVAQ76DBNCXRK33ZD7WLNAVCNFSM6AAAAABGBH655GVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMZQGY3DENZTGM>
.
You are receiving this because your review was requested.Message ID:
***@***.***>
|
Thanks—in this pull request, I'd like to focus on code quality, correctness, and efficiency in a way that makes the CE Align code hopefully easier to maintain in the future. For code quality, which can be more subjective, I tried to make the code more understandable to someone who has the paper at hand and also improve adherence to PEP 7. Feel free to comment if there are any confusing variable/function names, and I can try to improve the code quality further. Also let me know if there are other tests/verification you would like to see. |
@@ -113,7 +113,10 @@ def align(self, structure, transform=True): | |||
# CEAlign returns the best N paths, where each path is a pair of lists | |||
# with aligned atom indices. Paths are not guaranteed to be unique. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might want to update the docstring here since you fixed this bug :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And remove the set comprehension below, which is no longer necessary since all paths are indeed unique coming from the C code.
Sorry, got confused by the git diff. I see you removed the set comprehension. I'd instead remove the check to keep the paths only matching the longest length and simply try on all 20. This assume that the CEAlign code did a good job finding those 20 paths and matches the published algorithm (which did not really care for length at this point).
Here are the results of my changes after the latest update. (The results for the original CE Align implementation are in the earlier comments on this PR.) MALIDUP
MALISAM
|
Hm, seems like selecting the paths for length does have a significant effect. While the average (and median) stay relatively similar, we lose quite substantially on longer alignments. Sorry @Will-Tyler , would you mind reverting that and put what you had before? After that I'm more than happy to merge this. |
876afcf
to
a049e55
Compare
Should be reverted now! |
Acknowledgements
I hereby agree to dual licence this and any previous contributions under both
the Biopython License Agreement AND the BSD 3-Clause License.
I have read the
CONTRIBUTING.rst
file, have runpre-commit
locally, and understand that continuous integration checks will be used to
confirm the Biopython unit tests and style checks pass with these changes.
I have added my name to the alphabetical contributors listings in the files
NEWS.rst
andCONTRIB.rst
as part of this pull request, am listedalready, or do not wish to be listed. (This acknowledgement is optional.)
Description
In this pull request, I refactor, optimize, and simplify the implementation of CE Align in Biopython. I committed my changes in small increments in case one would like to see how my changes evolved.
Some notable changes are:
Testing
The original unit tests still pass.
I tested my changes by aligning various human protein kinases with Aurora A kinase as the reference structure. I found a list of human kinases in this paper.
Here are the kinase PDB codes:
Testing Script
Original CE Align
Improved CE Align
The execution time is similar with better RMSD scores in the improved implementation of CE Align.
Math
Note that equation (11), which is the average of all$D_{ij}$ where $i$ and $j$ are AFPs in the path, can be simplified if we assume that $D_{ij}$ is symmetric:
To get the recursive form of equation (11) in the CE Align paper:
To get the closed form of the number of terms in the summation, we calculate:
References