Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inefficient indexing into Bio.Align.substitution_matrices.Array during pairwise2 alignment #4701

Open
kamurani opened this issue Apr 12, 2024 · 3 comments
Assignees

Comments

@kamurani
Copy link
Contributor

kamurani commented Apr 12, 2024

System info

import sys; print(sys.version)
import platform; print(platform.python_implementation()); print(platform.platform())
import Bio; print(Bio.__version__)
3.8.19 (default, Mar 20 2024, 19:58:24) 
[GCC 11.2.0]
CPython
Linux-6.5.0-26-generic-x86_64-with-glibc2.17
1.79
from Bio import pairwise2
from Bio.Align import substitution_matrices

seq1 = "MKKCTILVVASLLLVNSLLPGYGQNKIIQAQRNLNELCYNEGNDNKLYHVLNSKNGKIYNRNTVNRLLPMLRRKKNEKKNEKIERNNKLKQPPPPPNPNDPPPPNPNDPPPPNPNDPPPPNPNDPPPPNANDPPPPNANDPAPPNANDPAPPNANDPAPPNANDPAPPNANDPAPPNANDPAPPNANDPPPPNPNDPAPPQGNNNPQPQPRPQPQPQPQPQPQPQPQPQPRPQPQPQPGGNNNNKNNNNDDSYIPSAEKILEFVKQIRDSITEEWSQCNVTCGSGIRVRKRKGSNKKAEDLTLEDIDTEICKMDKCSSIFNIVSNSLGFVILLVLVFFN"
seq2 = "MNYCKTTFHIFFFVLFFITIYEIKCQLRFASLGDWGKDTKGQILNAKYFKQFIKNERVTFIVSPGSNFIDGVKGLNDPAWKNLYEDVYSEEKGDMYMPFFTVLGTRDWTGNYNAQLLKGQGIYIEKNGETSIEKDADATNYPKWIMPNYWYHYFTHFTVSSGPSIVKTGHKDLAAAFIFIDTWVLSSNFPYKKIHEKAWNDLKSQLSVAKKIADFIIVVGDQPIYSSGYSRGSSYLAYYLLPLLKDAEVDLYISGHDNNMEVIEDNDMAHITCGSGSMSQGKSGMKNSKSLFFSSDIGFCVHELSNNGIVTKFVSSKKGEVIYTHKLNIKKKKTLDKVNALQHFAALPNVELTDVPSSGPMGNKDTFVRVVGTIGILIGSVIVFIGASSFLSKNMK"

len(seq1), len(seq2) 
# (339, 396) 

gap_open = -11
gap_extend = -1
sm = substitution_matrices.load("BLOSUM62")

# This runs in 1.4 seconds 
alignments = pairwise2.align.globalds(
            seq1, seq2, dict(sm), gap_open, gap_extend)

# This hangs for several minutes until kernel is killed
alignments = pairwise2.align.globalds(
            seq1, seq2, sm, gap_open, gap_extend)

When using the globalds sequence alignment function, I notice that providing a Bio.Align.substitution_matrices.Array object results in a much larger time to get an output than when providing it as a python dictionary (i.e. same format as the now-deprecated from Bio.SubsMat import MatrixInfo; MatrixInfo.blosum62).

For short sequences (i.e. < 20 AAs), there is no noticeable difference.

However, when running global alignment on sequences of length (339, 396), the dict version runs within 2 seconds but the non-dict version indefinitely runs (several minutes before I Ctrl+D).

Not sure if this is expected behaviour or not; I figured I would post an issue in case there's something going on that shouldn't be.

@kamurani kamurani changed the title Inefficient indexing into `Align Inefficient indexing into Bio.Align.substitution_matrices.Array during `pairwise2' alignment Apr 12, 2024
@kamurani kamurani changed the title Inefficient indexing into Bio.Align.substitution_matrices.Array during `pairwise2' alignment Inefficient indexing into Bio.Align.substitution_matrices.Array during pairwise2 alignment Apr 12, 2024
@mdehoon
Copy link
Contributor

mdehoon commented Apr 12, 2024

@kamurani Can you use the newer PairwiseAligner in Bio.Align instead?

@mdehoon
Copy link
Contributor

mdehoon commented Apr 12, 2024

@kamurani

The following runs in 0.12 seconds:

>>> from Bio import Align
>>> aligner = Align.PairwiseAligner()
>>> aligner.open_gap_score = -11
>>> aligner.extend_gap_score = -1
>>> from Bio.Align import substitution_matrices
>>> sm = substitution_matrices.load("BLOSUM62")
>>> seq1 = "MKKCTILVVASLLLVNSLLPGYGQNKIIQAQRNLNELCYNEGNDNKLYHVLNSKNGKIYNRNTVNRLLPMLRRKKNEKKNEKIERNNKLKQPPPPPNPNDPPPPNPNDPPPPNPNDPPPPNPNDPPPPNANDPPPPNANDPAPPNANDPAPPNANDPAPPNANDPAPPNANDPAPPNANDPAPPNANDPPPPNPNDPAPPQGNNNPQPQPRPQPQPQPQPQPQPQPQPQPRPQPQPQPGGNNNNKNNNNDDSYIPSAEKILEFVKQIRDSITEEWSQCNVTCGSGIRVRKRKGSNKKAEDLTLEDIDTEICKMDKCSSIFNIVSNSLGFVILLVLVFFN"
>>> seq2 = "MNYCKTTFHIFFFVLFFITIYEIKCQLRFASLGDWGKDTKGQILNAKYFKQFIKNERVTFIVSPGSNFIDGVKGLNDPAWKNLYEDVYSEEKGDMYMPFFTVLGTRDWTGNYNAQLLKGQGIYIEKNGETSIEKDADATNYPKWIMPNYWYHYFTHFTVSSGPSIVKTGHKDLAAAFIFIDTWVLSSNFPYKKIHEKAWNDLKSQLSVAKKIADFIIVVGDQPIYSSGYSRGSSYLAYYLLPLLKDAEVDLYISGHDNNMEVIEDNDMAHITCGSGSMSQGKSGMKNSKSLFFSSDIGFCVHELSNNGIVTKFVSSKKGEVIYTHKLNIKKKKTLDKVNALQHFAALPNVELTDVPSSGPMGNKDTFVRVVGTIGILIGSVIVFIGASSFLSKNMK"
>>> aligner.substitution_matrix = sm
>>> alignments = aligner.align(seq1, seq2)
>>> alignment = next(alignments)
>>> print(alignment)

@MarkusPiotrowski
Copy link
Contributor

Could be that providing a Bio.Align.substitution_matrices.Array forces pairwise2 to use the Python alignment code instead of the much faster C code.
Since pairwise2 is deprecated, I don't think that we will address this in the pairwise2 code. Moving to PairwiseAligner is the recommended step.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants