Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MultipleSeqAlignment in Bio.Align should have a method for translating positions between sequences and the alignment #4248

Closed
kamurani opened this issue Mar 1, 2023 · 4 comments · May be fixed by #4249

Comments

@kamurani
Copy link
Contributor

kamurani commented Mar 1, 2023

Setup

>>> import sys; print(sys.version)
3.10.9 (main, Jan 11 2023, 09:18:20) [Clang 14.0.6 ]
>>> import platform; print(platform.python_implementation()); print(platform.platform())
CPython
macOS-10.16-x86_64-i386-64bit
>>> import Bio; print(Bio.__version__)
1.82.dev0

Expected behaviour

It would be very useful to have an inbuilt way for a MultipleSeqAlignment to calculate the positions in the original sequences (i.e. the original letter indexes, without the gap - characters); given a position in the alignment (or another sequence).

As an example use case, I am aligning protein sequences from different PDB files and wish to compare homologous residues in 3D space. I need to know the original position of a particular residue that has aligned to another, in order to index into the PDB files separately.

>>> from Bio import AlignIO
>>> align = AlignIO.read("Clustalw/opuntia.aln", "clustal")
>>> print(align)
Alignment with 3 rows and 9 columns
KTLK-E-ME Alpha
KTLK---ME Beta
KT-----LE Gamma
>>> align.map_position(3, "Gamma") # if no second sequence provided, returns the alignment column
8 
>>> align.map_position(pos=4, from_="Alpha", to_="Beta") # 'E' in Alpha is aligned to '-' in Beta; returns the previous position
3
>>> align.map_letter(2, "Gamma", "Alpha") # returns tuple of letter and position in Alpha; 'L' -> 'M'
('M', 6)

I've started making a PR for this, would love to have some feedback.

I also think it would be good if you could supply an index slice (for either pos, the sequence index, or both) so that you could get several results back as a list. So i'll work on that next.

Thanks in advance! Cheers

@mdehoon
Copy link
Contributor

mdehoon commented Mar 1, 2023

Note that the new Alignment objects in Bio.Align support this kind of functionality:

>>> from Bio import Align
>>> alignment = Align.read("Clustalw/opuntia.aln", "clustal")
>>> alignment.indices
array([[  0,   1,   2, ..., 143, 144, 145],
       [  0,   1,   2, ..., 145, 146, 147],
       [  0,   1,   2, ..., 143, 144, 145],
       ...,
       [  0,   1,   2, ..., 147, 148, 149],
       [  0,   1,   2, ..., 147, 148, 149],
       [  0,   1,   2, ..., 153, 154, 155]])
>>> alignment.inverse_indices
[array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  66,  67,  68,  69,  70,  71,  72,  73,  74,
        75,  76,  77,  78,  79,  80,  81,  82,  83,  84,  85,  86,  87,
        88,  89,  90,  91,  92,  93,  94,  95,  96,  97,  98,  99, 100,
       101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113,
       114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126,
       127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139,
       140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152,
       153, 154, 155]), array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  66,  67,  68,  69,  70,  71,  72,
        73,  74,  75,  76,  77,  78,  79,  80,  81,  82,  83,  84,  85,
        86,  87,  88,  89,  90,  91,  92,  93,  94,  95,  96,  97,  98,
        99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111,
       112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124,
       125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137,
       138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150,
       151, 152, 153, 154, 155]), array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  66,  67,  68,  69,  70,  71,  72,  73,  74,
        75,  76,  77,  78,  79,  80,  81,  82,  83,  84,  85,  86,  87,
        88,  89,  90,  91,  92,  93,  94,  95,  96,  97,  98,  99, 100,
       101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113,
       114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126,
       127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139,
       140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152,
       153, 154, 155]), array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  66,  67,  68,  69,  70,  71,  72,  73,  74,
        75,  76,  77,  78,  79,  80,  81,  82,  83,  84,  85,  86,  87,
        88,  89,  90,  91,  92,  93,  94,  95,  96,  97,  98,  99, 100,
       101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113,
       114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126,
       127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139,
       140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152,
       153, 154, 155]), array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  58,  59,  66,  67,  68,  69,  70,
        71,  72,  73,  74,  75,  76,  77,  78,  79,  80,  81,  82,  83,
        84,  85,  86,  87,  88,  89,  90,  91,  92,  93,  94,  95,  96,
        97,  98,  99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109,
       110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122,
       123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135,
       136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148,
       149, 150, 151, 152, 153, 154, 155]), array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  58,  59,  66,  67,  68,  69,  70,
        71,  72,  73,  74,  75,  76,  77,  78,  79,  80,  81,  82,  83,
        84,  85,  86,  87,  88,  89,  90,  91,  92,  93,  94,  95,  96,
        97,  98,  99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109,
       110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122,
       123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135,
       136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148,
       149, 150, 151, 152, 153, 154, 155]), array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
        65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
        78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
        91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103,
       104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
       117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
       130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,
       143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155])]
>>> 

@kamurani
Copy link
Contributor Author

kamurani commented Mar 1, 2023

I've looked at the alignment.indices property and am confused as to how that could be used to solve my initial problem. I want to map a specific position in one sequence to another, based on where they line up in the overall alignment. I'm loading this alignment in from an external file that includes gaps in the sequences.

When I use the Alignment object:

>>> from Bio.Align import Alignment
>>> a = SeqRecord(Seq("KTLK-E-ME"), id="Alpha")
>>> b = SeqRecord(Seq("KTLK---ME"), id="Beta")
>>> c = SeqRecord(Seq("KT-----LE"), id="Gamma")
>>> aln = Alignment([a, b, c])
>>> aln.indices
array([[0, 1, 2, 3, 4, 5, 6, 7, 8],
       [0, 1, 2, 3, 4, 5, 6, 7, 8],
       [0, 1, 2, 3, 4, 5, 6, 7, 8]])

Not sure if I'm missing something. The code i've added to the referenced PR might more clearly show what i'm trying to do.

Cheers

@mdehoon
Copy link
Contributor

mdehoon commented Mar 1, 2023

Initialization of the Alignment class works differently than for MultipleSeqAlignment. Alignments are either created by parsing a file, or by the pairwise sequence aligner.

If you have e.g. a Fasta file with as contents

>Alpha
KTLK-E-ME
>Beta
KTLK---ME
>Gamma
KT-----LE

then you can do

>>> aln = Align.read("myfastafile.fa", "fasta")
>>> aln.indices
array([[ 0,  1,  2,  3,  4,  5,  6],
       [ 0,  1,  2,  3, -1,  4,  5],
       [ 0,  1, -1, -1, -1,  2,  3]])

Then you see that in column 5, a[5] is aligned to b[4] and c[2].

It is also possible to create an alignment from scratch, but it is more cumbersome:

>>> from Bio.Seq import Seq
>>> from Bio.SeqRecord import SeqRecord
>>> from Bio.Align import Alignment

>>> a = "KTLK-E-ME"
>>> b = "KTLK---ME"
>>> c = "KT-----LE"
>>> coordinates = Alignment.infer_coordinates([a, b, c])

>>> a = SeqRecord(Seq("KTLKEME"), id="Alpha")
>>> b = SeqRecord(Seq("KTLKME"), id="Beta")
>>> c = SeqRecord(Seq("KTLE"), id="Gamma")
>>> aln = Alignment([a, b, c], coordinates=coordinates)

>>> aln.indices
array([[ 0,  1,  2,  3,  4,  5,  6],
       [ 0,  1,  2,  3, -1,  4,  5],
       [ 0,  1, -1, -1, -1,  2,  3]])

@mdehoon
Copy link
Contributor

mdehoon commented Feb 8, 2024

@kamurani Can we close this issue?

@kamurani kamurani closed this as completed Mar 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants