Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve CE Align #4695

Open
wants to merge 23 commits into
base: master
Choose a base branch
from
Open

Conversation

Will-Tyler
Copy link
Contributor

@Will-Tyler Will-Tyler commented Apr 10, 2024

Acknowledgements

  • I hereby agree to dual licence this and any previous contributions under both
    the Biopython License Agreement AND the BSD 3-Clause License.

  • I have read the CONTRIBUTING.rst file, have run pre-commit
    locally, and understand that continuous integration checks will be used to
    confirm the Biopython unit tests and style checks pass with these changes.

  • I have added my name to the alphabetical contributors listings in the files
    NEWS.rst and CONTRIB.rst as part of this pull request, am listed
    already, or do not wish to be listed. (This acknowledgement is optional.)

Description

In this pull request, I refactor, optimize, and simplify the implementation of CE Align in Biopython. I committed my changes in small increments in case one would like to see how my changes evolved.

Some notable changes are:

  • the removal of extra space and unnecessary buffers,
  • negation of "similarity" or "scores" so that a higher score indicates more similarity,
  • changing how the total path similarity is calculated to match the CE align paper.

Testing

The original unit tests still pass.

I tested my changes by aligning various human protein kinases with Aurora A kinase as the reference structure. I found a list of human kinases in this paper.

Here are the kinase PDB codes:

['6NPZB', '3E8DB', '5OTFA', '2VD5A', '5UVCA', '4YHJB', '4TNDA', '2ACXA', '5LOHA', '2BIYA', '4OTGA', '4CRSA', '6C0UA', '4RA4A', '2I0EB', '3TXOA', '3A8XA', '5F9EB', '6C0TA', '5WNED', '5U7RB', '2Z7RA', '4NW6A', '1VZOA', '6G78A', '3WF9A', '3HDNA', '4FR4F', '6BXIB', '6R4DA', '4AF3A', '6GR9A', '2JC6C', '2JAMB', '4FG7A', '2VZ6B', '3BHHB', '6AYWB', '2V7OA', '2W4OA', '6CD6A', '6CMJB', '3TACA', '6FCKA', '4A9UA', '6FHBA', '2CKED', '5A6NA', '5JZNA', '3M42A', '3R1NA', '2HAKF', '3IECC', '3FE3B', '5ES1A', '4UMPB', '2HW6A', '2AC3A', '2X4FB', '3DLSA', '2Y7JC', '6NO9A', '2IWIA', '5TA8A', '4I6FA', '4B6LA', '3COKB', '6C9JA', '6B2EA', '4NIFD', '4JG8A', '3KN6B', '5YKSB', '2WTKC', '3LM5A', '5L2QC', '5CEMA', '4JNWB', '6GZDA', '6GZMB', '4HOKM', '2CMWA', '2C47D', '6GROA', '4NFNA', '6CNXA', '6NCGA', '2JIIB', '5ACBC', '5EFQC', '5G6VA', '6GU4A', '6GUFC', '3G33C', '3O0GA', '1XO2B', '6O9L8', '5XS2A', '4OR5F', '4AGUC', '4AAAA', '3ZDUA', '4BGQA', '6RAAA', '6FYLA', '6RCTB', '6FYVA', '6Q4QB', '6QY9A', '4YU2D', '5LXDB', '5Y86A', '6H0UB', '4H39A', '3GC8B', '1CM8B', '4MYGB', '1WBWA', '6QAWA', '6GESB', '5O7IA', '3O2MB', '3E7OA', '4IJPA', '1WAKA', '5MYVA', '4B9DB', '5M53A', '4WSQB', '4W9WA', '6F7BA', '4F9BA', '5EBZL', '2A1AB', '4X7LA', '4U6RA', '4Y8DB', '6G3AA', '4KIKB', '4MWIA', '6BHCA', '5VD3A', '5VE6A', '4OAUC', '2BUJA', '6O8BB', '5O0YA', '5AP1A', '5CI7A', '6QAVC', '6FDYU', '5VD7A', '5VDKA', '5WE8B', '5O21B', '4ANBA', '1S9IA', '3ALOA', '3VN9A', '6QHRA', '4IDVD', '5VIOD', '5IU2B', '6CQDB', '5J5TA', '4ZK5A', '3DAKD', '5KBQA', '6FD3A', '5ZJWA', '2F57B', '4KS8A', '2JFMA', '6HXFC', '4W8EA', '2XIKA', '4FZAB', '5DH3A', '3COMB', '2WTKB', '6BDNA', '5AX9C', '6GIPA', '4ASXB', '2QLUA', '3MY0O', '3MDYC', '3G2FA', '4MNEB', '6MIBA', '6BFNB', '2NRUD', '5KKRB', '5HVKA', '4TPTA', '5CENA', '5X5OA', '4UYAA', '5JGDA', '4UY9B', '3OMVB', '4ITIA', '5NG0B', '6B8YA', '5QINA', '6B5JC', '4TWPB', '2XYNC', '3LCTA', '5U6BD', '3SXSA', '1K2PA', '3LCDA', '3D7UC', '6FIOA', '6FERF', '5XGNA', '4P2KA', '3FXXA', '2R2PA', '2REIA', '3KULB', '5MJBB', '3ZFMA', '5L6PA', '3ZEWA', '3PP0B', '3KEXA', '3BCEC', '3CBLA', '3KXXB', '5UI0B', '4K33A', '6JPJA', '3HNGA', '4XUFB', '2DQ7X', '2HK5A', '3QQUC', '4XLVA', '3QGYB', '4L00B', '6C7YA', '6M9HA', '5TQ8A', '6GL9B', '3CJGA', '1PKGA', '3LCKA', '3A4OX', '3BRBA', '4IWDA', '3PLSA', '5KVTA', '4AT4A', '3V5QB', '5GRNA', '4H1MA', '2ETMA', '6CZ4A', '6FEKA', '4GT4A', '4UXLA', '1YI6B', '5CXHA', '2OSCA', '3EQPB', '3ZONA', '4GVJA', '1U59A']

Testing Script

import os.path
import time
import pandas as pd
from Bio.PDB import PDBList, MMCIFParser, CEAligner


df = pd.read_excel('data/41598_2019_56499_MOESM5_ESM.xlsx')
pdb_codes = df['10_PDB_validation'].dropna()
pdb_list = PDBList()

pdb_list.download_pdb_files([pdb_code[:4] for pdb_code in pdb_codes], pdir='data/pdb')
pdb_codes = [pdb_code for pdb_code in pdb_codes if os.path.exists(f'data/pdb/{pdb_code[:4].lower()}.cif')]

parser = MMCIFParser()
# I downloaded 3E5A manually from RCSB.
aurora_a = parser.get_structure('3E5A', 'data/pdb/3e5a.cif')

aligner = CEAligner()
aligner.set_reference(aurora_a)
alignments = []

start_time = time.time()

for pdb_code in pdb_codes:
    structure = parser.get_structure(pdb_code[:4].upper(), f'data/pdb/{pdb_code[:4].lower()}.cif')
    aligner.align(structure)
    alignments.append({
        'PDB Code': pdb_code[:4],
        'RMSD': aligner.rms,
    })

print(f'Time: {time.time() - start_time}')

alignments = pd.DataFrame(alignments)
print(alignments['RMSD'].describe())

Original CE Align

Time: 64.08160495758057
count    270.000000
mean       3.205020
std        0.748432
min        1.043961
25%        2.639412
50%        3.146602
75%        3.759536
max        5.427452
Name: RMSD, dtype: float64

Improved CE Align

Time: 62.18356680870056
count    270.000000
mean       2.951104
std        0.618955
min        1.041904
25%        2.546669
50%        2.946635
75%        3.419845
max        4.352589
Name: RMSD, dtype: float64

The execution time is similar with better RMSD scores in the improved implementation of CE Align.

Math

Note that equation (11), which is the average of all $D_{ij}$ where $i$ and $j$ are AFPs in the path, can be simplified if we assume that $D_{ij}$ is symmetric:

$$\frac{1}{(n+1)^2} \sum_{i=0}^n \sum_{j=0}^n D_{ij} = \frac{1}{n + 1 + \frac{n(n+1)}{2}} \sum_{i=0}^n \sum_{j=i}^n D_{ij}.$$

To get the recursive form of equation (11) in the CE Align paper:

$$\begin{align*} \sum_{i=0}^n \sum_{j=i}^n D_{ij} &= \sum_{i=0}^n \left( \sum_{j=i}^{n-1} D_{ij} + D_{in} \right) \\\ &= \sum_{i=0}^n \sum_{j=i}^{n-1} D_{ij} + \sum_{i=0}^n D_{in} \\\ &= \sum_{i=0}^{n-1} \sum_{j=i}^{n-1} D_{ij} + \sum_{i=0}^n D_{in}. \end{align*}$$

To get the closed form of the number of terms in the summation, we calculate:

$$\begin{align*} \sum_{i=0}^n \sum_{j=i}^n 1 &= \sum_{i=0}^n (n - i + 1) \\\ &= (n+1)^2 - \sum_{i=0}^n i \\\ &= n^2 + 2n + 1 - \frac{n(n+1)}{2} \\\ &= n^2 + 2n + 1 - \frac{n^2}{2} - \frac{n}{2} \\\ &= \frac{n^2}{2} + \frac{3n}{2} + 1 \\\ &= n + 1 + \frac{n(n+1)}{2}. \end{align*}$$

References

@Will-Tyler Will-Tyler marked this pull request as ready for review April 11, 2024 02:40
@peterjc
Copy link
Member

peterjc commented Apr 11, 2024

Given this is C code, @mdehoon could you take a look at this too please?

@JoaoRodrigues
Copy link
Member

Not forgotten, trying to find a nice way to benchmark/test this on a set of curated alignments.

@Will-Tyler
Copy link
Contributor Author

I was reading this review article of structure alignment today. Section 3.5 includes an evaluation of different pairwise structure alignment methods using different databases:

I will see if I can follow a similar approach to verify that my changes are beneficial.

@JoaoRodrigues
Copy link
Member

JoaoRodrigues commented May 17, 2024 via email

@Will-Tyler
Copy link
Contributor Author

I did some alignments using the MALIDUP database.

Results

"Improved" CE Align

count    241.000000
mean       2.939407
std        1.067218
min        0.628612
25%        2.122899
50%        2.926898
75%        3.757896
max        5.769939
Name: RMSD, dtype: float64
count    241.000000
mean      60.680498
std       45.806859
min        8.000000
25%       32.000000
50%       48.000000
75%       80.000000
max      256.000000
Name: Length, dtype: float64

Original CE Align

count    241.000000
mean       3.525818
std        1.311128
min        0.628612
25%        2.593499
50%        3.489225
75%        4.562064
max        7.300840
Name: RMSD, dtype: float64
count    241.000000
mean      83.352697
std       41.113411
min       24.000000
25%       56.000000
50%       72.000000
75%      104.000000
max      256.000000
Name: Length, dtype: float64

Discussion

The RMSDs in the improved CE Align are smaller, but the lengths of the alignments are also smaller. As I understand it, it is better to find a longer alignment that satisfies the distance criteria even if the RMSD is worse. I will reexamine the implementation to see how to improve this.

Currently, the implementation uses a circular buffer to store twenty long paths (not necessarily the longest). I will see if I can perhaps improve this situation by using a heap to store the twenty longest paths.

@JoaoRodrigues
Copy link
Member

Indeed - we want longer alignments, unless the difference is small and the RMSD decreases substantially. The drop in length is unfortunately quite substantial (average of 23 residues per alignment). Seems like the smaller alignments suffer the most, so maybe the changes you made to the scoring had some unfortunate side-effect?

@Will-Tyler
Copy link
Contributor Author

Thanks for the guidance—I updated my implementation to store the twenty longest paths that the algorithm encounters during its search in descending order. (I use a simple bubble sort instead of a heap since the maximum size of the buffer is bounded to 20 paths.)

Then of the paths with maximum length, the Python code selects the one with the smallest RMSD.

Results

MALIDUP

Improved CE Align

count    241.000000
mean       3.449998
std        1.290120
min        0.628612
25%        2.570221
50%        3.416172
75%        4.406737
max        7.300840
Name: RMSD, dtype: float64
count    241.000000
mean      83.352697
std       41.113411
min       24.000000
25%       56.000000
50%       72.000000
75%      104.000000
max      256.000000
Name: Length, dtype: float64

Original CE Align

count    241.000000
mean       3.525818
std        1.311128
min        0.628612
25%        2.593499
50%        3.489225
75%        4.562064
max        7.300840
Name: RMSD, dtype: float64
count    241.000000
mean      83.352697
std       41.113411
min       24.000000
25%       56.000000
50%       72.000000
75%      104.000000
max      256.000000
Name: Length, dtype: float64

MALISAM

Improved CE Align

count    130.000000
mean       3.946192
std        0.699545
min        2.480055
25%        3.409361
50%        3.872238
75%        4.476527
max        5.590935
Name: RMSD, dtype: float64
count    130.000000
mean      61.292308
std       12.932410
min       24.000000
25%       56.000000
50%       56.000000
75%       72.000000
max      112.000000
Name: Length, dtype: float64

Original CE Align

count    130.000000
mean       4.070752
std        0.797337
min        2.480055
25%        3.447943
50%        4.055692
75%        4.568788
max        7.136314
Name: RMSD, dtype: float64
count    130.000000
mean      61.292308
std       12.932410
min       24.000000
25%       56.000000
50%       56.000000
75%       72.000000
max      112.000000
Name: Length, dtype: float64

Discussion

The improved implementation and the original implementation give alignments that are the same lengths. I believe this is because of a bug in the original implementation that results in long paths being duplicated in the circular buffer. The original implementation uses the condition bestPathLength > lenBuffer[bufferIndex] || (bestPathLength == lenBuffer[bufferIndex] && bestPathScore < scoreBuffer[bufferIndex]) to decide whether to insert the best path into the buffer. The "best path" variables are not necessarily updated in subsequent iterations, and the implementation stores the best path again. This results in the buffer often only having one unique path, one of the longest ones encountered, at the end of the algorithm. My changes at the time of writing improves this by storing more unique paths to choose from.

The RMSDs in the improved implementation are slightly smaller. This is most likely because my implementation is returning more unique paths for the Python code to choose from.

As for trading off between alignment length and RMSD, I am not sure exactly how to do this. The CE Align paper says that "the twenty best paths at the end of the search are evaluated based upon RMSD and the best one is selected" (under Optimization of the final path). I interpret "best" as meaning longest. Guidance would be appreciated on this.

@JoaoRodrigues
Copy link
Member

Thanks for checking all this @Will-Tyler and thank you for catching that path duplication! I had seen it but thought it was a bug I had introduced in the code myself, and since it was still giving good results I didn't really care much about fixing it.

Picking the longest alignment seems interesting. I wonder what was the criteria used by the authors? I cannot remember the paper anymore.. They also mention a ZScore but that strategy seems to have been abandoned in this implementation?

@Will-Tyler
Copy link
Contributor Author

During optimization of the final path (which is only done for alignments with a z-score above 3.5), the paper says the algorithm should select the path with the best RMSD from the "best" twenty paths. I think the authors mean the longest paths when they say "best". The algorithm then tries to change the location of the gaps and iteratively improve the alignment using DP to try to increase the alignment length while keeping the RMSD at about the same level.

I think I need to read the referenced papers in the CE Align paper to fully understand the final optimization. For this PR, I am wondering if we can stick to the longest alignment with the best RMSD (which should be a small improvement over the existing implementation) and implement the final optimization in another PR.

@JoaoRodrigues
Copy link
Member

JoaoRodrigues commented May 28, 2024 via email

@Will-Tyler
Copy link
Contributor Author

Thanks—in this pull request, I'd like to focus on code quality, correctness, and efficiency in a way that makes the CE Align code hopefully easier to maintain in the future. For code quality, which can be more subjective, I tried to make the code more understandable to someone who has the paper at hand and also improve adherence to PEP 7. Feel free to comment if there are any confusing variable/function names, and I can try to improve the code quality further. Also let me know if there are other tests/verification you would like to see.

@@ -113,7 +113,10 @@ def align(self, structure, transform=True):
# CEAlign returns the best N paths, where each path is a pair of lists
# with aligned atom indices. Paths are not guaranteed to be unique.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want to update the docstring here since you fixed this bug :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And remove the set comprehension below, which is no longer necessary since all paths are indeed unique coming from the C code.

Sorry, got confused by the git diff. I see you removed the set comprehension. I'd instead remove the check to keep the paths only matching the longest length and simply try on all 20. This assume that the CEAlign code did a good job finding those 20 paths and matches the published algorithm (which did not really care for length at this point).

@Will-Tyler
Copy link
Contributor Author

Here are the results of my changes after the latest update. (The results for the original CE Align implementation are in the earlier comments on this PR.)

MALIDUP

count    241.000000
mean       2.970778
std        1.048138
min        0.592707
25%        2.276167
50%        2.999706
75%        3.721987
max        5.715258
Name: RMSD, dtype: float64
count    241.000000
mean      76.149378
std       40.431559
min       16.000000
25%       48.000000
50%       64.000000
75%       96.000000
max      256.000000
Name: Length, dtype: float64

MALISAM

count    130.000000
mean       3.474840
std        0.589693
min        2.140568
25%        3.017552
50%        3.448652
75%        3.878683
max        5.110265
Name: RMSD, dtype: float64
count    130.000000
mean      54.400000
std       11.928469
min       24.000000
25%       48.000000
50%       56.000000
75%       62.000000
max       88.000000
Name: Length, dtype: float64

@JoaoRodrigues
Copy link
Member

Hm, seems like selecting the paths for length does have a significant effect. While the average (and median) stay relatively similar, we lose quite substantially on longer alignments.

Sorry @Will-Tyler , would you mind reverting that and put what you had before? After that I'm more than happy to merge this.

@Will-Tyler
Copy link
Contributor Author

Should be reverted now!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants