Inefficient indexing into `Bio.Align.substitution_matrices.Array` during `pairwise2` alignment #4701

kamurani · 2024-04-12T04:14:14Z

System info

import sys; print(sys.version)
import platform; print(platform.python_implementation()); print(platform.platform())
import Bio; print(Bio.__version__)

3.8.19 (default, Mar 20 2024, 19:58:24) 
[GCC 11.2.0]
CPython
Linux-6.5.0-26-generic-x86_64-with-glibc2.17
1.79

from Bio import pairwise2
from Bio.Align import substitution_matrices

seq1 = "MKKCTILVVASLLLVNSLLPGYGQNKIIQAQRNLNELCYNEGNDNKLYHVLNSKNGKIYNRNTVNRLLPMLRRKKNEKKNEKIERNNKLKQPPPPPNPNDPPPPNPNDPPPPNPNDPPPPNPNDPPPPNANDPPPPNANDPAPPNANDPAPPNANDPAPPNANDPAPPNANDPAPPNANDPAPPNANDPPPPNPNDPAPPQGNNNPQPQPRPQPQPQPQPQPQPQPQPQPRPQPQPQPGGNNNNKNNNNDDSYIPSAEKILEFVKQIRDSITEEWSQCNVTCGSGIRVRKRKGSNKKAEDLTLEDIDTEICKMDKCSSIFNIVSNSLGFVILLVLVFFN"
seq2 = "MNYCKTTFHIFFFVLFFITIYEIKCQLRFASLGDWGKDTKGQILNAKYFKQFIKNERVTFIVSPGSNFIDGVKGLNDPAWKNLYEDVYSEEKGDMYMPFFTVLGTRDWTGNYNAQLLKGQGIYIEKNGETSIEKDADATNYPKWIMPNYWYHYFTHFTVSSGPSIVKTGHKDLAAAFIFIDTWVLSSNFPYKKIHEKAWNDLKSQLSVAKKIADFIIVVGDQPIYSSGYSRGSSYLAYYLLPLLKDAEVDLYISGHDNNMEVIEDNDMAHITCGSGSMSQGKSGMKNSKSLFFSSDIGFCVHELSNNGIVTKFVSSKKGEVIYTHKLNIKKKKTLDKVNALQHFAALPNVELTDVPSSGPMGNKDTFVRVVGTIGILIGSVIVFIGASSFLSKNMK"

len(seq1), len(seq2) 
# (339, 396) 

gap_open = -11
gap_extend = -1
sm = substitution_matrices.load("BLOSUM62")

# This runs in 1.4 seconds 
alignments = pairwise2.align.globalds(
            seq1, seq2, dict(sm), gap_open, gap_extend)

# This hangs for several minutes until kernel is killed
alignments = pairwise2.align.globalds(
            seq1, seq2, sm, gap_open, gap_extend)

When using the globalds sequence alignment function, I notice that providing a Bio.Align.substitution_matrices.Array object results in a much larger time to get an output than when providing it as a python dictionary (i.e. same format as the now-deprecated from Bio.SubsMat import MatrixInfo; MatrixInfo.blosum62).

For short sequences (i.e. < 20 AAs), there is no noticeable difference.

However, when running global alignment on sequences of length (339, 396), the dict version runs within 2 seconds but the non-dict version indefinitely runs (several minutes before I Ctrl+D).

Not sure if this is expected behaviour or not; I figured I would post an issue in case there's something going on that shouldn't be.

The text was updated successfully, but these errors were encountered:

mdehoon · 2024-04-12T09:35:40Z

@kamurani Can you use the newer PairwiseAligner in Bio.Align instead?

mdehoon · 2024-04-12T09:41:31Z

@kamurani

The following runs in 0.12 seconds:

>>> from Bio import Align
>>> aligner = Align.PairwiseAligner()
>>> aligner.open_gap_score = -11
>>> aligner.extend_gap_score = -1
>>> from Bio.Align import substitution_matrices
>>> sm = substitution_matrices.load("BLOSUM62")
>>> seq1 = "MKKCTILVVASLLLVNSLLPGYGQNKIIQAQRNLNELCYNEGNDNKLYHVLNSKNGKIYNRNTVNRLLPMLRRKKNEKKNEKIERNNKLKQPPPPPNPNDPPPPNPNDPPPPNPNDPPPPNPNDPPPPNANDPPPPNANDPAPPNANDPAPPNANDPAPPNANDPAPPNANDPAPPNANDPAPPNANDPPPPNPNDPAPPQGNNNPQPQPRPQPQPQPQPQPQPQPQPQPRPQPQPQPGGNNNNKNNNNDDSYIPSAEKILEFVKQIRDSITEEWSQCNVTCGSGIRVRKRKGSNKKAEDLTLEDIDTEICKMDKCSSIFNIVSNSLGFVILLVLVFFN"
>>> seq2 = "MNYCKTTFHIFFFVLFFITIYEIKCQLRFASLGDWGKDTKGQILNAKYFKQFIKNERVTFIVSPGSNFIDGVKGLNDPAWKNLYEDVYSEEKGDMYMPFFTVLGTRDWTGNYNAQLLKGQGIYIEKNGETSIEKDADATNYPKWIMPNYWYHYFTHFTVSSGPSIVKTGHKDLAAAFIFIDTWVLSSNFPYKKIHEKAWNDLKSQLSVAKKIADFIIVVGDQPIYSSGYSRGSSYLAYYLLPLLKDAEVDLYISGHDNNMEVIEDNDMAHITCGSGSMSQGKSGMKNSKSLFFSSDIGFCVHELSNNGIVTKFVSSKKGEVIYTHKLNIKKKKTLDKVNALQHFAALPNVELTDVPSSGPMGNKDTFVRVVGTIGILIGSVIVFIGASSFLSKNMK"
>>> aligner.substitution_matrix = sm
>>> alignments = aligner.align(seq1, seq2)
>>> alignment = next(alignments)
>>> print(alignment)

MarkusPiotrowski · 2024-04-12T09:44:26Z

Could be that providing a Bio.Align.substitution_matrices.Array forces pairwise2 to use the Python alignment code instead of the much faster C code.
Since pairwise2 is deprecated, I don't think that we will address this in the pairwise2 code. Moving to PairwiseAligner is the recommended step.

kamurani changed the title ~~Inefficient indexing into `Align~~ Inefficient indexing into Bio.Align.substitution_matrices.Array during `pairwise2' alignment Apr 12, 2024

kamurani changed the title ~~Inefficient indexing into Bio.Align.substitution_matrices.Array during `pairwise2' alignment~~ Inefficient indexing into Bio.Align.substitution_matrices.Array during pairwise2 alignment Apr 12, 2024

kamurani mentioned this issue Apr 12, 2024

Fix BiopythonDeprecationWarning during sequence alignment phbradley/TCRdock#16

Open

peterjc assigned mdehoon Apr 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inefficient indexing into `Bio.Align.substitution_matrices.Array` during `pairwise2` alignment #4701

Inefficient indexing into `Bio.Align.substitution_matrices.Array` during `pairwise2` alignment #4701

kamurani commented Apr 12, 2024 •

edited

mdehoon commented Apr 12, 2024

mdehoon commented Apr 12, 2024

MarkusPiotrowski commented Apr 12, 2024

Inefficient indexing into Bio.Align.substitution_matrices.Array during pairwise2 alignment #4701

Inefficient indexing into Bio.Align.substitution_matrices.Array during pairwise2 alignment #4701

Comments

kamurani commented Apr 12, 2024 • edited

System info

mdehoon commented Apr 12, 2024

mdehoon commented Apr 12, 2024

MarkusPiotrowski commented Apr 12, 2024

Inefficient indexing into `Bio.Align.substitution_matrices.Array` during `pairwise2` alignment #4701

Inefficient indexing into `Bio.Align.substitution_matrices.Array` during `pairwise2` alignment #4701

kamurani commented Apr 12, 2024 •

edited