Improve CE Align #4695

Will-Tyler · 2024-04-10T22:36:40Z

Acknowledgements

I hereby agree to dual licence this and any previous contributions under both
the Biopython License Agreement AND the BSD 3-Clause License.
I have read the CONTRIBUTING.rst file, have run pre-commit
locally, and understand that continuous integration checks will be used to
confirm the Biopython unit tests and style checks pass with these changes.
I have added my name to the alphabetical contributors listings in the files
NEWS.rst and CONTRIB.rst as part of this pull request, am listed
already, or do not wish to be listed. (This acknowledgement is optional.)

Description

In this pull request, I refactor, optimize, and simplify the implementation of CE Align in Biopython. I committed my changes in small increments in case one would like to see how my changes evolved.

Some notable changes are:

the removal of extra space and unnecessary buffers,
negation of "similarity" or "scores" so that a higher score indicates more similarity,
changing how the total path similarity is calculated to match the CE align paper.

Testing

The original unit tests still pass.

I tested my changes by aligning various human protein kinases with Aurora A kinase as the reference structure. I found a list of human kinases in this paper.

Here are the kinase PDB codes:

['6NPZB', '3E8DB', '5OTFA', '2VD5A', '5UVCA', '4YHJB', '4TNDA', '2ACXA', '5LOHA', '2BIYA', '4OTGA', '4CRSA', '6C0UA', '4RA4A', '2I0EB', '3TXOA', '3A8XA', '5F9EB', '6C0TA', '5WNED', '5U7RB', '2Z7RA', '4NW6A', '1VZOA', '6G78A', '3WF9A', '3HDNA', '4FR4F', '6BXIB', '6R4DA', '4AF3A', '6GR9A', '2JC6C', '2JAMB', '4FG7A', '2VZ6B', '3BHHB', '6AYWB', '2V7OA', '2W4OA', '6CD6A', '6CMJB', '3TACA', '6FCKA', '4A9UA', '6FHBA', '2CKED', '5A6NA', '5JZNA', '3M42A', '3R1NA', '2HAKF', '3IECC', '3FE3B', '5ES1A', '4UMPB', '2HW6A', '2AC3A', '2X4FB', '3DLSA', '2Y7JC', '6NO9A', '2IWIA', '5TA8A', '4I6FA', '4B6LA', '3COKB', '6C9JA', '6B2EA', '4NIFD', '4JG8A', '3KN6B', '5YKSB', '2WTKC', '3LM5A', '5L2QC', '5CEMA', '4JNWB', '6GZDA', '6GZMB', '4HOKM', '2CMWA', '2C47D', '6GROA', '4NFNA', '6CNXA', '6NCGA', '2JIIB', '5ACBC', '5EFQC', '5G6VA', '6GU4A', '6GUFC', '3G33C', '3O0GA', '1XO2B', '6O9L8', '5XS2A', '4OR5F', '4AGUC', '4AAAA', '3ZDUA', '4BGQA', '6RAAA', '6FYLA', '6RCTB', '6FYVA', '6Q4QB', '6QY9A', '4YU2D', '5LXDB', '5Y86A', '6H0UB', '4H39A', '3GC8B', '1CM8B', '4MYGB', '1WBWA', '6QAWA', '6GESB', '5O7IA', '3O2MB', '3E7OA', '4IJPA', '1WAKA', '5MYVA', '4B9DB', '5M53A', '4WSQB', '4W9WA', '6F7BA', '4F9BA', '5EBZL', '2A1AB', '4X7LA', '4U6RA', '4Y8DB', '6G3AA', '4KIKB', '4MWIA', '6BHCA', '5VD3A', '5VE6A', '4OAUC', '2BUJA', '6O8BB', '5O0YA', '5AP1A', '5CI7A', '6QAVC', '6FDYU', '5VD7A', '5VDKA', '5WE8B', '5O21B', '4ANBA', '1S9IA', '3ALOA', '3VN9A', '6QHRA', '4IDVD', '5VIOD', '5IU2B', '6CQDB', '5J5TA', '4ZK5A', '3DAKD', '5KBQA', '6FD3A', '5ZJWA', '2F57B', '4KS8A', '2JFMA', '6HXFC', '4W8EA', '2XIKA', '4FZAB', '5DH3A', '3COMB', '2WTKB', '6BDNA', '5AX9C', '6GIPA', '4ASXB', '2QLUA', '3MY0O', '3MDYC', '3G2FA', '4MNEB', '6MIBA', '6BFNB', '2NRUD', '5KKRB', '5HVKA', '4TPTA', '5CENA', '5X5OA', '4UYAA', '5JGDA', '4UY9B', '3OMVB', '4ITIA', '5NG0B', '6B8YA', '5QINA', '6B5JC', '4TWPB', '2XYNC', '3LCTA', '5U6BD', '3SXSA', '1K2PA', '3LCDA', '3D7UC', '6FIOA', '6FERF', '5XGNA', '4P2KA', '3FXXA', '2R2PA', '2REIA', '3KULB', '5MJBB', '3ZFMA', '5L6PA', '3ZEWA', '3PP0B', '3KEXA', '3BCEC', '3CBLA', '3KXXB', '5UI0B', '4K33A', '6JPJA', '3HNGA', '4XUFB', '2DQ7X', '2HK5A', '3QQUC', '4XLVA', '3QGYB', '4L00B', '6C7YA', '6M9HA', '5TQ8A', '6GL9B', '3CJGA', '1PKGA', '3LCKA', '3A4OX', '3BRBA', '4IWDA', '3PLSA', '5KVTA', '4AT4A', '3V5QB', '5GRNA', '4H1MA', '2ETMA', '6CZ4A', '6FEKA', '4GT4A', '4UXLA', '1YI6B', '5CXHA', '2OSCA', '3EQPB', '3ZONA', '4GVJA', '1U59A']

Testing Script

import os.path
import time
import pandas as pd
from Bio.PDB import PDBList, MMCIFParser, CEAligner


df = pd.read_excel('data/41598_2019_56499_MOESM5_ESM.xlsx')
pdb_codes = df['10_PDB_validation'].dropna()
pdb_list = PDBList()

pdb_list.download_pdb_files([pdb_code[:4] for pdb_code in pdb_codes], pdir='data/pdb')
pdb_codes = [pdb_code for pdb_code in pdb_codes if os.path.exists(f'data/pdb/{pdb_code[:4].lower()}.cif')]

parser = MMCIFParser()
# I downloaded 3E5A manually from RCSB.
aurora_a = parser.get_structure('3E5A', 'data/pdb/3e5a.cif')

aligner = CEAligner()
aligner.set_reference(aurora_a)
alignments = []

start_time = time.time()

for pdb_code in pdb_codes:
    structure = parser.get_structure(pdb_code[:4].upper(), f'data/pdb/{pdb_code[:4].lower()}.cif')
    aligner.align(structure)
    alignments.append({
        'PDB Code': pdb_code[:4],
        'RMSD': aligner.rms,
    })

print(f'Time: {time.time() - start_time}')

alignments = pd.DataFrame(alignments)
print(alignments['RMSD'].describe())

Original CE Align

Time: 64.08160495758057
count    270.000000
mean       3.205020
std        0.748432
min        1.043961
25%        2.639412
50%        3.146602
75%        3.759536
max        5.427452
Name: RMSD, dtype: float64

Improved CE Align

Time: 62.18356680870056
count    270.000000
mean       2.951104
std        0.618955
min        1.041904
25%        2.546669
50%        2.946635
75%        3.419845
max        4.352589
Name: RMSD, dtype: float64

The execution time is similar with better RMSD scores in the improved implementation of CE Align.

Math

Note that equation (11), which is the average of all $D_{ij}$ where $i$ and $j$ are AFPs in the path, can be simplified if we assume that $D_{ij}$ is symmetric:

$$\frac{1}{(n+1)^2} \sum_{i=0}^n \sum_{j=0}^n D_{ij} = \frac{1}{n + 1 + \frac{n(n+1)}{2}} \sum_{i=0}^n \sum_{j=i}^n D_{ij}.$$

To get the recursive form of equation (11) in the CE Align paper:

$$\begin{align*} \sum_{i=0}^n \sum_{j=i}^n D_{ij} &= \sum_{i=0}^n \left( \sum_{j=i}^{n-1} D_{ij} + D_{in} \right) \\\ &= \sum_{i=0}^n \sum_{j=i}^{n-1} D_{ij} + \sum_{i=0}^n D_{in} \\\ &= \sum_{i=0}^{n-1} \sum_{j=i}^{n-1} D_{ij} + \sum_{i=0}^n D_{in}. \end{align*}$$

To get the closed form of the number of terms in the summation, we calculate:

$$\begin{align*} \sum_{i=0}^n \sum_{j=i}^n 1 &= \sum_{i=0}^n (n - i + 1) \\\ &= (n+1)^2 - \sum_{i=0}^n i \\\ &= n^2 + 2n + 1 - \frac{n(n+1)}{2} \\\ &= n^2 + 2n + 1 - \frac{n^2}{2} - \frac{n}{2} \\\ &= \frac{n^2}{2} + \frac{3n}{2} + 1 \\\ &= n + 1 + \frac{n(n+1)}{2}. \end{align*}$$

References

Protein structure alignment by incremental combinatorial extension (CE) of the optimal path
- Main paper describing the CE Align algorithm.
A Structurally-Validated Multiple Sequence Alignment of 497 Human Protein Kinase Domains
- Source of PDB codes for human protein kinases.

… A is more similar than B

peterjc · 2024-04-11T12:29:44Z

Given this is C code, @mdehoon could you take a look at this too please?

JoaoRodrigues · 2024-04-28T20:04:29Z

Not forgotten, trying to find a nice way to benchmark/test this on a set of curated alignments.

Will-Tyler · 2024-05-17T19:40:46Z

I was reading this review article of structure alignment today. Section 3.5 includes an evaluation of different pairwise structure alignment methods using different databases:

I will see if I can follow a similar approach to verify that my changes are beneficial.

JoaoRodrigues · 2024-05-17T23:03:45Z

👍👍 That's what I was trying to do. I got a local copy of HOMSTRAD and evaluators and was going to run there. A sexta, 17/05/2024, 15:41, Will Tyler ***@***.***> escreveu:

…

I was reading this review article <https://www.sciencedirect.com/science/article/abs/pii/B9780128001684000056> of structure alignment today. Section 3.5 includes an evaluation of different pairwise structure alignment methods using different databases: - conserved domain database <https://www.ncbi.nlm.nih.gov/cdd/>, - Manual ALIgnments of DUPlicated domains (MALIDUP) <http://prodata.swmed.edu/malidup/>, - Manual ALIgnments of Structurally Analogous Motifs (MALISAM) <http://prodata.swmed.edu/malisam/>, - Sequence Alignment Benchmark (SABmark) <https://academic.oup.com/bioinformatics/article/21/7/1267/268759?login=false> . I will see if I can follow a similar approach to verify that my changes are beneficial. — Reply to this email directly, view it on GitHub <#4695 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AABZ6EH3SUYW4T5NYUBYVZLZCZMNJAVCNFSM6AAAAABGBH655GVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMJYGI3DENRWGA> . You are receiving this because your review was requested.Message ID: ***@***.***>

Will-Tyler · 2024-05-22T00:30:07Z

I did some alignments using the MALIDUP database.

Results

"Improved" CE Align

count    241.000000
mean       2.939407
std        1.067218
min        0.628612
25%        2.122899
50%        2.926898
75%        3.757896
max        5.769939
Name: RMSD, dtype: float64

count    241.000000
mean      60.680498
std       45.806859
min        8.000000
25%       32.000000
50%       48.000000
75%       80.000000
max      256.000000
Name: Length, dtype: float64

Original CE Align

count    241.000000
mean       3.525818
std        1.311128
min        0.628612
25%        2.593499
50%        3.489225
75%        4.562064
max        7.300840
Name: RMSD, dtype: float64

count    241.000000
mean      83.352697
std       41.113411
min       24.000000
25%       56.000000
50%       72.000000
75%      104.000000
max      256.000000
Name: Length, dtype: float64

Discussion

The RMSDs in the improved CE Align are smaller, but the lengths of the alignments are also smaller. As I understand it, it is better to find a longer alignment that satisfies the distance criteria even if the RMSD is worse. I will reexamine the implementation to see how to improve this.

Currently, the implementation uses a circular buffer to store twenty long paths (not necessarily the longest). I will see if I can perhaps improve this situation by using a heap to store the twenty longest paths.

JoaoRodrigues · 2024-05-23T23:18:43Z

Indeed - we want longer alignments, unless the difference is small and the RMSD decreases substantially. The drop in length is unfortunately quite substantial (average of 23 residues per alignment). Seems like the smaller alignments suffer the most, so maybe the changes you made to the scoring had some unfortunate side-effect?

Will-Tyler · 2024-05-24T01:19:08Z

Thanks for the guidance—I updated my implementation to store the twenty longest paths that the algorithm encounters during its search in descending order. (I use a simple bubble sort instead of a heap since the maximum size of the buffer is bounded to 20 paths.)

Then of the paths with maximum length, the Python code selects the one with the smallest RMSD.

Results

MALIDUP

Improved CE Align

count    241.000000
mean       3.449998
std        1.290120
min        0.628612
25%        2.570221
50%        3.416172
75%        4.406737
max        7.300840
Name: RMSD, dtype: float64

count    241.000000
mean      83.352697
std       41.113411
min       24.000000
25%       56.000000
50%       72.000000
75%      104.000000
max      256.000000
Name: Length, dtype: float64

Original CE Align

count    241.000000
mean       3.525818
std        1.311128
min        0.628612
25%        2.593499
50%        3.489225
75%        4.562064
max        7.300840
Name: RMSD, dtype: float64

count    241.000000
mean      83.352697
std       41.113411
min       24.000000
25%       56.000000
50%       72.000000
75%      104.000000
max      256.000000
Name: Length, dtype: float64

MALISAM

Improved CE Align

count    130.000000
mean       3.946192
std        0.699545
min        2.480055
25%        3.409361
50%        3.872238
75%        4.476527
max        5.590935
Name: RMSD, dtype: float64

count    130.000000
mean      61.292308
std       12.932410
min       24.000000
25%       56.000000
50%       56.000000
75%       72.000000
max      112.000000
Name: Length, dtype: float64

Original CE Align

count    130.000000
mean       4.070752
std        0.797337
min        2.480055
25%        3.447943
50%        4.055692
75%        4.568788
max        7.136314
Name: RMSD, dtype: float64

count    130.000000
mean      61.292308
std       12.932410
min       24.000000
25%       56.000000
50%       56.000000
75%       72.000000
max      112.000000
Name: Length, dtype: float64

Discussion

The improved implementation and the original implementation give alignments that are the same lengths. I believe this is because of a bug in the original implementation that results in long paths being duplicated in the circular buffer. The original implementation uses the condition bestPathLength > lenBuffer[bufferIndex] || (bestPathLength == lenBuffer[bufferIndex] && bestPathScore < scoreBuffer[bufferIndex]) to decide whether to insert the best path into the buffer. The "best path" variables are not necessarily updated in subsequent iterations, and the implementation stores the best path again. This results in the buffer often only having one unique path, one of the longest ones encountered, at the end of the algorithm. My changes at the time of writing improves this by storing more unique paths to choose from.

The RMSDs in the improved implementation are slightly smaller. This is most likely because my implementation is returning more unique paths for the Python code to choose from.

As for trading off between alignment length and RMSD, I am not sure exactly how to do this. The CE Align paper says that "the twenty best paths at the end of the search are evaluated based upon RMSD and the best one is selected" (under Optimization of the final path). I interpret "best" as meaning longest. Guidance would be appreciated on this.

JoaoRodrigues · 2024-05-24T22:30:33Z

Thanks for checking all this @Will-Tyler and thank you for catching that path duplication! I had seen it but thought it was a bug I had introduced in the code myself, and since it was still giving good results I didn't really care much about fixing it.

Picking the longest alignment seems interesting. I wonder what was the criteria used by the authors? I cannot remember the paper anymore.. They also mention a ZScore but that strategy seems to have been abandoned in this implementation?

Will-Tyler · 2024-05-25T02:02:08Z

During optimization of the final path (which is only done for alignments with a z-score above 3.5), the paper says the algorithm should select the path with the best RMSD from the "best" twenty paths. I think the authors mean the longest paths when they say "best". The algorithm then tries to change the location of the gaps and iteratively improve the alignment using DP to try to increase the alignment length while keeping the RMSD at about the same level.

I think I need to read the referenced papers in the CE Align paper to fully understand the final optimization. For this PR, I am wondering if we can stick to the longest alignment with the best RMSD (which should be a small improvement over the existing implementation) and implement the final optimization in another PR.

JoaoRodrigues · 2024-05-28T22:06:34Z

That makes sense! The original C code was doing only the rmsd optimization (pick the 20 best paths, select the one with lowest rmsd). I carved that out to do the rmsd part in Python since it was simpler to maintain at a very small cost in performance. So, if you fixed that bug with repeated paths and get lower rmsds with same length this is great!! Happy to accept this and we can think of other optimizations later on. Thank you a lot Will! A sexta, 24/05/2024, 22:02, Will Tyler ***@***.***> escreveu:

…

During optimization of the final path (which is only done for alignments with a z-score above 3.5), the paper says the algorithm should select the path with the best RMSD from the "best" twenty paths. I think the authors mean the longest paths when they say "best". The algorithm then tries to change the location of the gaps and iteratively improve the alignment using DP to try to increase the alignment length while keeping the RMSD at about the same level. I think I need to read the referenced papers in the CE Align paper to fully understand the final optimization. For this PR, I am wondering if we can stick to the longest alignment with the best RMSD (which should be a small improvement over the existing implementation) and implement the final optimization in another PR. — Reply to this email directly, view it on GitHub <#4695 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AABZ6EA3FVAQ76DBNCXRK33ZD7WLNAVCNFSM6AAAAABGBH655GVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMZQGY3DENZTGM> . You are receiving this because your review was requested.Message ID: ***@***.***>

Will-Tyler · 2024-05-29T00:29:04Z

Thanks—in this pull request, I'd like to focus on code quality, correctness, and efficiency in a way that makes the CE Align code hopefully easier to maintain in the future. For code quality, which can be more subjective, I tried to make the code more understandable to someone who has the paper at hand and also improve adherence to PEP 7. Feel free to comment if there are any confusing variable/function names, and I can try to improve the code quality further. Also let me know if there are other tests/verification you would like to see.

JoaoRodrigues · 2024-05-29T03:01:26Z

Bio/PDB/cealign.py

@@ -113,7 +113,10 @@ def align(self, structure, transform=True):
        # CEAlign returns the best N paths, where each path is a pair of lists
        # with aligned atom indices. Paths are not guaranteed to be unique.


You might want to update the docstring here since you fixed this bug :)

~~And remove the set comprehension below, which is no longer necessary since all paths are indeed unique coming from the C code.~~

Sorry, got confused by the git diff. I see you removed the set comprehension. I'd instead remove the check to keep the paths only matching the longest length and simply try on all 20. This assume that the CEAlign code did a good job finding those 20 paths and matches the published algorithm (which did not really care for length at this point).

Will-Tyler · 2024-05-30T01:27:32Z

Here are the results of my changes after the latest update. (The results for the original CE Align implementation are in the earlier comments on this PR.)

MALIDUP

count    241.000000
mean       2.970778
std        1.048138
min        0.592707
25%        2.276167
50%        2.999706
75%        3.721987
max        5.715258
Name: RMSD, dtype: float64

count    241.000000
mean      76.149378
std       40.431559
min       16.000000
25%       48.000000
50%       64.000000
75%       96.000000
max      256.000000
Name: Length, dtype: float64

MALISAM

count    130.000000
mean       3.474840
std        0.589693
min        2.140568
25%        3.017552
50%        3.448652
75%        3.878683
max        5.110265
Name: RMSD, dtype: float64

count    130.000000
mean      54.400000
std       11.928469
min       24.000000
25%       48.000000
50%       56.000000
75%       62.000000
max       88.000000
Name: Length, dtype: float64

JoaoRodrigues · 2024-06-03T14:58:19Z

Hm, seems like selecting the paths for length does have a significant effect. While the average (and median) stay relatively similar, we lose quite substantially on longer alignments.

Sorry @Will-Tyler , would you mind reverting that and put what you had before? After that I'm more than happy to merge this.

Will-Tyler · 2024-06-04T18:55:30Z

Should be reverted now!

Will-Tyler added 22 commits April 5, 2024 03:00

Make C functions static, some reformatting

31fd607

Refactor calcDM, calcS

cda494b

Conserve space in similarity matrix

a92042e

Introduce similarityI

9267203

Misc changes

8b188e2

Refactor path total score calculation

9f6f638

Remove done variable

918f5a5

Remove redundant best path variables

55c80e1

Rename window size variables to fragment size

f7ce0d5

Rename score to similarity

d002f3d

Remove redundant bounds check

620d577

Make similarity non-positive so that similarity(A) > similarity(B) if…

5df2e94

… A is more similar than B

Simplify all similarity buffer

3716145

Remove all similarity buffer

f731200

Minor simplifications

1759998

Use designated initializers for structs

53df93e

Use AFP struct in similarity parameters

26763b5

Refactor AFP struct

8c8eb31

Minor optimization

e8ad80f

Add todo for using heap

be8e37e

Simplify current path similarity calculation

a0c7ea3

Format lines within 79 characters

097559c

Will-Tyler marked this pull request as ready for review April 11, 2024 02:40

Will-Tyler requested a review from JoaoRodrigues as a code owner April 11, 2024 02:40

Choose a path with the longest length

a049e55

JoaoRodrigues approved these changes May 29, 2024

View reviewed changes

Will-Tyler force-pushed the improve-ce-align branch from 876afcf to a049e55 Compare June 4, 2024 18:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve CE Align #4695

Improve CE Align #4695

Will-Tyler commented Apr 10, 2024 •

edited

peterjc commented Apr 11, 2024

JoaoRodrigues commented Apr 28, 2024

Will-Tyler commented May 17, 2024

JoaoRodrigues commented May 17, 2024 via email

Will-Tyler commented May 22, 2024

JoaoRodrigues commented May 23, 2024

Will-Tyler commented May 24, 2024

JoaoRodrigues commented May 24, 2024

Will-Tyler commented May 25, 2024

JoaoRodrigues commented May 28, 2024 via email

Will-Tyler commented May 29, 2024

JoaoRodrigues May 29, 2024

JoaoRodrigues May 29, 2024 •

edited

Will-Tyler commented May 30, 2024

JoaoRodrigues commented Jun 3, 2024

Will-Tyler commented Jun 4, 2024

		@@ -113,7 +113,10 @@ def align(self, structure, transform=True):
		# CEAlign returns the best N paths, where each path is a pair of lists
		# with aligned atom indices. Paths are not guaranteed to be unique.

Improve CE Align #4695

Are you sure you want to change the base?

Improve CE Align #4695

Conversation

Will-Tyler commented Apr 10, 2024 • edited

Acknowledgements

Description

Testing

Testing Script

Original CE Align

Improved CE Align

Math

References

peterjc commented Apr 11, 2024

JoaoRodrigues commented Apr 28, 2024

Will-Tyler commented May 17, 2024

JoaoRodrigues commented May 17, 2024 via email

Will-Tyler commented May 22, 2024

Results

"Improved" CE Align

Original CE Align

Discussion

JoaoRodrigues commented May 23, 2024

Will-Tyler commented May 24, 2024

Results

MALIDUP

Improved CE Align

Original CE Align

MALISAM

Improved CE Align

Original CE Align

Discussion

JoaoRodrigues commented May 24, 2024

Will-Tyler commented May 25, 2024

JoaoRodrigues commented May 28, 2024 via email

Will-Tyler commented May 29, 2024

JoaoRodrigues May 29, 2024

Choose a reason for hiding this comment

JoaoRodrigues May 29, 2024 • edited

Choose a reason for hiding this comment

Will-Tyler commented May 30, 2024

MALIDUP

MALISAM

JoaoRodrigues commented Jun 3, 2024

Will-Tyler commented Jun 4, 2024

Will-Tyler commented Apr 10, 2024 •

edited

JoaoRodrigues May 29, 2024 •

edited