Skip to content

Latest commit

 

History

History
230 lines (184 loc) · 9 KB

File metadata and controls

230 lines (184 loc) · 9 KB

Reading Multiple-sequence Alignment Files

The previous sections looked at Biopython's SeqIO module for sequence file input and output (reading and writing sequence files).

Now we come to the AlignIO module which as the name suggests is for alignment input and ouput. Note that this is focused on dealing with multiple sequence alignments of the kind typically used in phylogenetics - a separate SearchIO module targets pairwise alignments generated by search tools like BLAST.

These examples use a number of real example sequence alignment files, see the sample data instructions in the introduction for how to download them.

We're going to look at a small seed alignment for one of the PFAM domains, the A2L zinc ribbon domain (A2L_zn_ribbon; PF08792). This was picked almost at random - it is small enough to see the entire alignment on screen, and has some obvious gap-rich columns.

From the alignments tab on the Pfam webpage, you can download the raw alignment in several different formats (Selex, Stockholm, FASTA, and MSF). Biopython is able to work with FASTA (very simple) and Stockholm format (richly annotated).

Loading a single Alignment

As in SeqIO, under AlignIO we have both AlignIO.parse(...) for looping over multiple separate alignments, and AlignIO.read(...) for loading a file containing a single alignment.

Most of the time you will be working with alignment files which contain a single alignment, so normally you will use AlignIO.read(..).

Here is an example loading the Pfam seed alignment for the A2L zinc ribbon domain (A2L_zn_ribbon; PF08792):

pycon

>>> from Bio import AlignIO >>> alignment = AlignIO.read("PF08792_seed.sth", "stockholm") >>> print(alignment) SingleLetterAlphabet() alignment with 14 rows and 37 columns SIPVVCT---CGNDKDFY--KDDDIYICQLCNAETVK VF282_IIV6/150-181 DIIENCKY--CGSFDIE---KVKDIYTCGDCTQTYTT Q9YW27_MSEPV/2-33 SDNIKCKY--CNSFNII---KNKDIYSCCDCSNCYTT Q9EMK1_AMEPV/2-33 AQDWRCDD--CNATLVYV--KKDAQRVCLECGKSTFF Q6XM16_9PHYC/83-115 SKEWICEV--CNKELVYI--RKDAERVCPDCGLSHPY Q8QNH7_ESV1K/101-133 NDDSKCIK--CGGPVLMQ--AARSLLNCQECGYSAAV Q4A276_EHV8U/148-180 KSQNVCSVPDCDGEKILN--QNDGYMVCKKCGFSEPI YR429_MIMIV/213-247 LKYKECKY--CHTDMVFN--TTQFGLQCPNCGCIQEL VF385_ASFB7/145-177 RNLKSCSN--CKHNGLI---TEYNHEFCIFCQSVFQL Q6VZA9_CNPV/2-33 MNLRMCGG--CRRNGLV---SDADYEFCLFCETVFPM Q6TVP3_ORFSA/1-32 MNLRLCSG--CRHNGIV---SEQGYEYCIFCESVFQK VLTF3_VACCC/1-32 MNLKMCSG--CSHNGIV---SEHGYEFCIFCESIFQS Q8V3K7_SWPV1/1-32 NALRHCHG--CKHNGLV---LEQGYEFCIFCQAVFQH O11357_MCV1/5-36 DQIYTCT---CGGQMELWVNSTQSDLVCNECGATQPY Y494R_PBCV1/148-181

Printing a Biopython alignment object will give you a display as above (but truncated for larger alignments).

In many ways, the alignment acts like a list of SeqRecord objects (just like you would get from SeqIO). The length of the alignment is the number of rows for example, and you can loop over the rows as individual SeqRecord objects:

pycon

>>> print(len(alignment)) 14 >>> for record in >>> for record in alignment: ... print(record.id + " has " + str(record.seq.count("-")) + " gaps") ... VF282_IIV6/150-181 has 5 gaps Q9YW27_MSEPV/2-33 has 5 gaps Q9EMK1_AMEPV/2-33 has 5 gaps Q6XM16_9PHYC/83-115 has 4 gaps Q8QNH7_ESV1K/101-133 has 4 gaps Q4A276_EHV8U/148-180 has 4 gaps YR429_MIMIV/213-247 has 2 gaps VF385_ASFB7/145-177 has 4 gaps Q6VZA9_CNPV/2-33 has 5 gaps Q6TVP3_ORFSA/1-32 has 5 gaps VLTF3_VACCC/1-32 has 5 gaps Q8V3K7_SWPV1/1-32 has 5 gaps O11357_MCV1/5-36 has 5 gaps Y494R_PBCV1/148-181 has 3 gaps

Exercise: Write a python script called count_gaps.py which reports the number of records, the total number of gaps, and the mean (average) number of gaps per record:

console

$ python count_gaps.py PF08792_seed.sth had 14 records, Total gaps 61, average per record 4.35714285714

Tip: If you get zero as the average, and are using Python 2, add the following special import line to the start of your Python file. This will give natural division (as used in Python 3) rather than integer division (used by default in Python 2):

.. sourcecode:: python

from __future__ import division

Writing Multiple-sequence Alignment Files

As you might guess from using SeqIO.convert(...) and SeqIO.write(...), there are matching AlignIO.convert() and AlignIO.write(...) functions.

For example, this will convert the Stockholm formatted alignment into a relaxed PHYLIP format file:

python

from Bio import AlignIO input_filename = "PF08792_seed.sth" output_filename = "PF08792_seed_converted.phy" AlignIO.convert(input_filename, "stockholm", output_filename, "phylip-relaxed")

Exercise: Modify this example to convert the Stockholm file into a FASTA alignment file.

This AlignIO.convert(...) example is equivalent to using AlignIO.read(...) and AlignIO.write(...) explicitly:

python

from Bio import AlignIO input_filename = "PF08792_seed.sth" output_filename = "PF08792_seed_converted.phy" alignment = AlignIO.read(input_filename, "stockholm") AlignIO.write(alignment, output_filename, "phylip-relaxed")

This form is most useful if you wish to modify the alignment in some way, which we will do next.

Sorting the rows

Downloading from Pfam gives you the option of picking the order the rows appear in - by default this is according to the tree order (clustering similar sequences together), but it can also be alphabetical (using the identifiers).

We downloaded the file using the tree order, but here is how you can sort the rows by identifier within Biopython:

pycon

>>> from Bio import AlignIO >>> alignment = AlignIO.read("PF08792_seed.sth", "stockholm") >>> alignment.sort() >>> print(alignment) SingleLetterAlphabet() alignment with 14 rows and 37 columns NALRHCHG--CKHNGLV---LEQGYEFCIFCQAVFQH O11357_MCV1/5-36 NDDSKCIK--CGGPVLMQ--AARSLLNCQECGYSAAV Q4A276_EHV8U/148-180 MNLRMCGG--CRRNGLV---SDADYEFCLFCETVFPM Q6TVP3_ORFSA/1-32 RNLKSCSN--CKHNGLI---TEYNHEFCIFCQSVFQL Q6VZA9_CNPV/2-33 AQDWRCDD--CNATLVYV--KKDAQRVCLECGKSTFF Q6XM16_9PHYC/83-115 SKEWICEV--CNKELVYI--RKDAERVCPDCGLSHPY Q8QNH7_ESV1K/101-133 MNLKMCSG--CSHNGIV---SEHGYEFCIFCESIFQS Q8V3K7_SWPV1/1-32 SDNIKCKY--CNSFNII---KNKDIYSCCDCSNCYTT Q9EMK1_AMEPV/2-33 DIIENCKY--CGSFDIE---KVKDIYTCGDCTQTYTT Q9YW27_MSEPV/2-33 SIPVVCT---CGNDKDFY--KDDDIYICQLCNAETVK VF282_IIV6/150-181 LKYKECKY--CHTDMVFN--TTQFGLQCPNCGCIQEL VF385_ASFB7/145-177 MNLRLCSG--CRHNGIV---SEQGYEYCIFCESVFQK VLTF3_VACCC/1-32 DQIYTCT---CGGQMELWVNSTQSDLVCNECGATQPY Y494R_PBCV1/148-181 KSQNVCSVPDCDGEKILN--QNDGYMVCKKCGFSEPI YR429_MIMIV/213-247

Exercise: Write a Python script sort_alignment_by_id.py which uses AlignIO.read(..) and AlignIO.write(..) to convert PF08792_seed.sth into a sorted FASTA file.

By default the alignment's sort method uses the identifers as the sort key, but much like how sorting a Python list works, you can override this.

Advanced Exercise: Define your own function taking a single argument (a SeqRecord) which returns the number of gaps in the sequence. Use this to sort the alignment and print it to screen (or save it as a new file):

python

from Bio import AlignIO

def count_gaps(record):

"""Counts number of gaps in record's sequence.""" return 0 # Fill in code

filename = "PF08792_seed.sth" alignment = AlignIO.read(filename, "stockholm") alignment.sort(key=count_gaps) print(alignment)

Expected output:

console

$ python sort_gaps.py SingleLetterAlphabet() alignment with 14 rows and 37 columns KSQNVCSVPDCDGEKILN--QNDGYMVCKKCGFSEPI YR429_MIMIV/213-247 DQIYTCT---CGGQMELWVNSTQSDLVCNECGATQPY Y494R_PBCV1/148-181 AQDWRCDD--CNATLVYV--KKDAQRVCLECGKSTFF Q6XM16_9PHYC/83-115 SKEWICEV--CNKELVYI--RKDAERVCPDCGLSHPY Q8QNH7_ESV1K/101-133 NDDSKCIK--CGGPVLMQ--AARSLLNCQECGYSAAV Q4A276_EHV8U/148-180 LKYKECKY--CHTDMVFN--TTQFGLQCPNCGCIQEL VF385_ASFB7/145-177 SIPVVCT---CGNDKDFY--KDDDIYICQLCNAETVK VF282_IIV6/150-181 DIIENCKY--CGSFDIE---KVKDIYTCGDCTQTYTT Q9YW27_MSEPV/2-33 SDNIKCKY--CNSFNII---KNKDIYSCCDCSNCYTT Q9EMK1_AMEPV/2-33 RNLKSCSN--CKHNGLI---TEYNHEFCIFCQSVFQL Q6VZA9_CNPV/2-33 MNLRMCGG--CRRNGLV---SDADYEFCLFCETVFPM Q6TVP3_ORFSA/1-32 MNLRLCSG--CRHNGIV---SEQGYEYCIFCESVFQK VLTF3_VACCC/1-32 MNLKMCSG--CSHNGIV---SEHGYEFCIFCESIFQS Q8V3K7_SWPV1/1-32 NALRHCHG--CKHNGLV---LEQGYEFCIFCQAVFQH O11357_MCV1/5-36