`MultipleSeqAlignment` in `Bio.Align` should have a method for translating positions between sequences and the alignment #4248

kamurani · 2023-03-01T00:32:34Z

Setup

>>> import sys; print(sys.version)
3.10.9 (main, Jan 11 2023, 09:18:20) [Clang 14.0.6 ]
>>> import platform; print(platform.python_implementation()); print(platform.platform())
CPython
macOS-10.16-x86_64-i386-64bit
>>> import Bio; print(Bio.__version__)
1.82.dev0

Expected behaviour

It would be very useful to have an inbuilt way for a MultipleSeqAlignment to calculate the positions in the original sequences (i.e. the original letter indexes, without the gap - characters); given a position in the alignment (or another sequence).

As an example use case, I am aligning protein sequences from different PDB files and wish to compare homologous residues in 3D space. I need to know the original position of a particular residue that has aligned to another, in order to index into the PDB files separately.

>>> from Bio import AlignIO
>>> align = AlignIO.read("Clustalw/opuntia.aln", "clustal")
>>> print(align)
Alignment with 3 rows and 9 columns
KTLK-E-ME Alpha
KTLK---ME Beta
KT-----LE Gamma
>>> align.map_position(3, "Gamma") # if no second sequence provided, returns the alignment column
8 
>>> align.map_position(pos=4, from_="Alpha", to_="Beta") # 'E' in Alpha is aligned to '-' in Beta; returns the previous position
3
>>> align.map_letter(2, "Gamma", "Alpha") # returns tuple of letter and position in Alpha; 'L' -> 'M'
('M', 6)

I've started making a PR for this, would love to have some feedback.

I also think it would be good if you could supply an index slice (for either pos, the sequence index, or both) so that you could get several results back as a list. So i'll work on that next.

Thanks in advance! Cheers

The text was updated successfully, but these errors were encountered:

mdehoon · 2023-03-01T00:42:15Z

Note that the new Alignment objects in Bio.Align support this kind of functionality:

>>> from Bio import Align
>>> alignment = Align.read("Clustalw/opuntia.aln", "clustal")
>>> alignment.indices
array([[  0,   1,   2, ..., 143, 144, 145],
       [  0,   1,   2, ..., 145, 146, 147],
       [  0,   1,   2, ..., 143, 144, 145],
       ...,
       [  0,   1,   2, ..., 147, 148, 149],
       [  0,   1,   2, ..., 147, 148, 149],
       [  0,   1,   2, ..., 153, 154, 155]])
>>> alignment.inverse_indices
[array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  66,  67,  68,  69,  70,  71,  72,  73,  74,
        75,  76,  77,  78,  79,  80,  81,  82,  83,  84,  85,  86,  87,
        88,  89,  90,  91,  92,  93,  94,  95,  96,  97,  98,  99, 100,
       101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113,
       114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126,
       127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139,
       140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152,
       153, 154, 155]), array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  66,  67,  68,  69,  70,  71,  72,
        73,  74,  75,  76,  77,  78,  79,  80,  81,  82,  83,  84,  85,
        86,  87,  88,  89,  90,  91,  92,  93,  94,  95,  96,  97,  98,
        99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111,
       112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124,
       125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137,
       138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150,
       151, 152, 153, 154, 155]), array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  66,  67,  68,  69,  70,  71,  72,  73,  74,
        75,  76,  77,  78,  79,  80,  81,  82,  83,  84,  85,  86,  87,
        88,  89,  90,  91,  92,  93,  94,  95,  96,  97,  98,  99, 100,
       101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113,
       114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126,
       127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139,
       140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152,
       153, 154, 155]), array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  66,  67,  68,  69,  70,  71,  72,  73,  74,
        75,  76,  77,  78,  79,  80,  81,  82,  83,  84,  85,  86,  87,
        88,  89,  90,  91,  92,  93,  94,  95,  96,  97,  98,  99, 100,
       101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113,
       114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126,
       127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139,
       140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152,
       153, 154, 155]), array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  58,  59,  66,  67,  68,  69,  70,
        71,  72,  73,  74,  75,  76,  77,  78,  79,  80,  81,  82,  83,
        84,  85,  86,  87,  88,  89,  90,  91,  92,  93,  94,  95,  96,
        97,  98,  99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109,
       110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122,
       123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135,
       136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148,
       149, 150, 151, 152, 153, 154, 155]), array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  58,  59,  66,  67,  68,  69,  70,
        71,  72,  73,  74,  75,  76,  77,  78,  79,  80,  81,  82,  83,
        84,  85,  86,  87,  88,  89,  90,  91,  92,  93,  94,  95,  96,
        97,  98,  99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109,
       110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122,
       123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135,
       136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148,
       149, 150, 151, 152, 153, 154, 155]), array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
        65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
        78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
        91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103,
       104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
       117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
       130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,
       143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155])]
>>>

kamurani · 2023-03-01T01:19:00Z

I've looked at the alignment.indices property and am confused as to how that could be used to solve my initial problem. I want to map a specific position in one sequence to another, based on where they line up in the overall alignment. I'm loading this alignment in from an external file that includes gaps in the sequences.

When I use the Alignment object:

>>> from Bio.Align import Alignment
>>> a = SeqRecord(Seq("KTLK-E-ME"), id="Alpha")
>>> b = SeqRecord(Seq("KTLK---ME"), id="Beta")
>>> c = SeqRecord(Seq("KT-----LE"), id="Gamma")
>>> aln = Alignment([a, b, c])
>>> aln.indices
array([[0, 1, 2, 3, 4, 5, 6, 7, 8],
       [0, 1, 2, 3, 4, 5, 6, 7, 8],
       [0, 1, 2, 3, 4, 5, 6, 7, 8]])

Not sure if I'm missing something. The code i've added to the referenced PR might more clearly show what i'm trying to do.

Cheers

mdehoon · 2023-03-01T02:43:26Z

Initialization of the Alignment class works differently than for MultipleSeqAlignment. Alignments are either created by parsing a file, or by the pairwise sequence aligner.

If you have e.g. a Fasta file with as contents

>Alpha
KTLK-E-ME
>Beta
KTLK---ME
>Gamma
KT-----LE

then you can do

>>> aln = Align.read("myfastafile.fa", "fasta")
>>> aln.indices
array([[ 0,  1,  2,  3,  4,  5,  6],
       [ 0,  1,  2,  3, -1,  4,  5],
       [ 0,  1, -1, -1, -1,  2,  3]])

Then you see that in column 5, a[5] is aligned to b[4] and c[2].

It is also possible to create an alignment from scratch, but it is more cumbersome:

>>> from Bio.Seq import Seq
>>> from Bio.SeqRecord import SeqRecord
>>> from Bio.Align import Alignment

>>> a = "KTLK-E-ME"
>>> b = "KTLK---ME"
>>> c = "KT-----LE"
>>> coordinates = Alignment.infer_coordinates([a, b, c])

>>> a = SeqRecord(Seq("KTLKEME"), id="Alpha")
>>> b = SeqRecord(Seq("KTLKME"), id="Beta")
>>> c = SeqRecord(Seq("KTLE"), id="Gamma")
>>> aln = Alignment([a, b, c], coordinates=coordinates)

>>> aln.indices
array([[ 0,  1,  2,  3,  4,  5,  6],
       [ 0,  1,  2,  3, -1,  4,  5],
       [ 0,  1, -1, -1, -1,  2,  3]])

mdehoon · 2024-02-08T04:19:28Z

@kamurani Can we close this issue?

kamurani mentioned this issue Mar 1, 2023

MultipleSeqAlignment position mapping #4249

Draft

3 tasks

kamurani closed this as completed Mar 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`MultipleSeqAlignment` in `Bio.Align` should have a method for translating positions between sequences and the alignment #4248

`MultipleSeqAlignment` in `Bio.Align` should have a method for translating positions between sequences and the alignment #4248

kamurani commented Mar 1, 2023

mdehoon commented Mar 1, 2023

kamurani commented Mar 1, 2023

mdehoon commented Mar 1, 2023

mdehoon commented Feb 8, 2024

MultipleSeqAlignment in Bio.Align should have a method for translating positions between sequences and the alignment #4248

MultipleSeqAlignment in Bio.Align should have a method for translating positions between sequences and the alignment #4248

Comments

kamurani commented Mar 1, 2023

Setup

Expected behaviour

mdehoon commented Mar 1, 2023

kamurani commented Mar 1, 2023

mdehoon commented Mar 1, 2023

mdehoon commented Feb 8, 2024

`MultipleSeqAlignment` in `Bio.Align` should have a method for translating positions between sequences and the alignment #4248

`MultipleSeqAlignment` in `Bio.Align` should have a method for translating positions between sequences and the alignment #4248