Inconsistent enzyme.search behaviour when cutting sequence falls outside sequence #4604

manulera · 2024-01-24T18:28:26Z

Setup

I am reporting a problem with Biopython version, Python version, and operating
system as follows:

import sys; print(sys.version)
import platform; print(platform.python_implementation()); print(platform.platform())
import Bio; print(Bio.__version__)

3.11.6 (main, Oct  6 2023, 11:55:02) [Clang 13.0.0 (clang-1300.0.29.30)]
CPython
macOS-12.7-x86_64-i386-64bit
1.83

Steps to reproduce

cc: @BjornFJohansson

Let's start with the digestion of the following sequence TAAAAAAAAAAAAGCCGGCAAAAAAATAAAAA with the restriction enzyme that makes two cuts NmeDI. I am using this example to show what happens at both ends, but the same applies if you use one-cutters only. After digestion we would get 3 fragments:

T      AAAAAAAAAAAAGCCGGCAAAAAAA       TAAAAA
ATTTTT      TTTTTTTCGGCCGTTTTTTTATTTT       T

If we use the enzyme search, we get the positions (one-based) where these cuts will occur:

from Bio.Seq import Seq
from Bio.Restriction import NmeDI

seq = Seq('TAAAAAAAAAAAAGCCGGCAAAAAAATAAAAA')
print(NmeDI.search(seq))

# Prints `[2, 27]`

If we start trimming the molecule from the sides, the sequence where the enzyme would make the cut and leave and overhang will disappear, and the behaviour is different when we trim the left and right side:

for i in range(1,7):
    print(i, NmeDI.search(seq[i:]), NmeDI.search(seq[:-i]))

# Prints:
# 1 [26] [2, 27] < As soon as the initial T is removed, the left site is not returned anymore
# 2 [25] [2, 27]
# 3 [24] [2, 27]
# 4 [23] [2, 27]
# 5 [22] [2, 27]
# 6 [21] [2]     < Until the final T is removed, the right site is still returned

I don't know what the right thing to do is. My feeling is that probably the behaviour of the first trimming on either side should work, and the rest not? This would mean accepting a cut like this, where the cutting sites lie exactly at the end, and single-strand DNA is generated (see below). I am not sure if biologically a cut would happen with further trimming.

      AAAAAAAAAAAAGCCGGCAAAAAAA       TAAAA
TTTTT      TTTTTTTCGGCCGTTTTTTTATTTT

The text was updated successfully, but these errors were encountered:

peterjc · 2024-01-24T18:50:42Z

Do you have a one-cutter example to hand?

>>> NmeDI.elucidate()
'cut twice, not yet implemented sorry.'

It should be easier for me to understand what is puzzling you with a single cutter ;)

MarkusPiotrowski · 2024-01-25T08:23:53Z

Without having looked at the code, it seems quite obvious what's happening here. Only the "upper" strand is analyzed and to have a cut before the 2nd position (1-based), you need of course a first position. To have a cut before position 27, you need to have position 27 (but not 28). The "lower" strand is obviously not taken into account.

While this in not a problem for type IIp REs, which cut inside their recognition sequence, it's a problem for type IIs REs, which cut outside. A possible solution would be to check for these enzymes if the reverse complement sequence gives the same result (if the recognition sequence is palindromic), or to check if the sequence is long enough to contain the cuttings for both strands. This could possibly be incorporated into the RE module.

From a biological view, it's most likely unknown, if such enzymes will cut the DNA if one strand cannot be cut at all (because the DNA just ends). (Even some typ IIp REs don't cut (effectively) when their sites are directly at the end of a DNA molecule) So I wouldn't recommend to allow "air cuts", e.g. in your example, I wouldn't return a site if one strand is missing the neighboring base. I think the actual behavior for the "left site" is correct and for the "right site" is incorrect.

manulera · 2024-01-25T11:22:00Z

As @MarkusPiotrowski points out, this can only happen with enzymes that cut outside their recognition site. It seems that only the positions on the watson/upper strand are returned. It's not so much whether the cut is on the left or on the right, instead what matters is whether the cut occurs on the top or bottom strand. Let me show you with three examples with BsaI, BsaXI, BspCNI. Recognition sites are shown in blue, and the cuts are marked in black.

Note that there is a T among the As where the cut will occur. When that T is removed, the cut is no longer returned, as you can see from the code below:

from Bio.Seq import Seq
from Bio.Restriction import BsaI, BsaXI, BspCNI

print(BsaI.search(Seq('GGTCTCATAAAA')))
print(BsaI.search(Seq('GGTCTCATAAA')))
print(BsaI.search(Seq('GGTCTCATAA')))
print(BsaI.search(Seq('GGTCTCATA')))
print(BsaI.search(Seq('GGTCTCAT')))
print(BsaI.search(Seq('GGTCTCA')))
print()

# Prints
# [8]
# [8]
# [8]
# [8]
# [8]
# []

print(BsaXI.search(Seq('AAATAAAAAAAAAACAAAAACTCC')))
print(BsaXI.search(Seq('AATAAAAAAAAAACAAAAACTCC')))
print(BsaXI.search(Seq('ATAAAAAAAAAACAAAAACTCC')))
print(BsaXI.search(Seq('TAAAAAAAAAACAAAAACTCC')))
print(BsaXI.search(Seq('AAAAAAAAAACAAAAACTCC')))
print()

# prints
# [5]
# [4]
# [3]
# [2]
# []

print(BspCNI.search(Seq('CTCAGAAAAAAAAAT')))
print(BspCNI.search(Seq('CTCAGAAAAAAAAA')))

# prints
# [15]
# []

@MarkusPiotrowski proposes that the default behaviour should be like in the BspCNI case (cutsite is returned only if the resulting cut will leave at least one base on each strand). Conventional wisdom for primer design is to add some extra bases after a cutsite, so this is probably correct.

If you agree, I can implement that change and write the tests. If you know where to look immediately, I would appreciate it, but I can find my way.

manulera · 2024-01-29T09:57:41Z

Hi @MarkusPiotrowski @peterjc do you agree with the proposed change? Should I go ahead?

manulera · 2024-01-31T11:32:05Z

Hello, just following up on this in case you missed the last comment.

peterjc · 2024-01-31T11:37:30Z

I'm deferring to @MarkusPiotrowski on this one.

manulera · 2024-02-13T10:29:14Z

Pinging @MarkusPiotrowski

MarkusPiotrowski · 2024-02-15T12:11:58Z

@manulera The situation is not clear-cut.

Of course, recognition site prediction should work on both strands the same way. So it shouldn't matter which strand is given, because always a dsDNA is assumed as bait. Thus: BsaI.search(Seq('GGTCTCGT') should give the same result as BsaI.search(Seq('GGTCTCGT').reverse_complement()). This is actually not the case.
Now to the question of what limits the detection of a RE site regarding overhangs. I'm pretty convinced that at least the length of the longest overhang must be present in the target sequence. Otherwise, the resulting overhangs would not be defined regarding their length, which would make a tool like this unsuitable for cloning predictions. So, BsaI.search(Seq('GGTCTCNNNN')) should never give a positive result, since the lower strand would be cut after base 11, but the sequence only has a length of 10.
What is finally debatable is whether it is necessary that (in this example) the sequence must have a length of 12. From a bioinformatician's view, 11 should be OK: The recognition site is present, and the overhangs are well-defined regarding their length. As a wet-bench biologist, I would argue that we don't know if the enzyme really works if only one strand can actually be cut. However, also some type IIP enzymes are known to work poorly or not at all, when their recognition sequence is directly at the end of the DNA molecule, but we don't take this into account when doing RE site predictions. Contrary, NEB's online tool NEBcutter and the popular sequence editor SnapGene only recognize BsaI sites if at least one base is cleaved off from the bottom strand. So I tend to be consistent with these.

So please, go on to implement this change. Maybe you come across some references which would strengthen one or the other alternative. In general, it's always difficult for users, if Biopython acts different to some popular tool without a good reason. :-)

manulera added a commit to manulera/biopython that referenced this issue Feb 27, 2024

closes biopython#4604

c58e2f2

manulera linked a pull request Feb 27, 2024 that will close this issue

Fix inconsistent restriction digest cutting behaviour #4638

Open

3 tasks

manulera added a commit to manulera/biopython that referenced this issue Feb 28, 2024

closes biopython#4604

bb13298

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inconsistent enzyme.search behaviour when cutting sequence falls outside sequence #4604

Inconsistent enzyme.search behaviour when cutting sequence falls outside sequence #4604

manulera commented Jan 24, 2024

peterjc commented Jan 24, 2024

MarkusPiotrowski commented Jan 25, 2024

manulera commented Jan 25, 2024 •

edited

manulera commented Jan 29, 2024

manulera commented Jan 31, 2024

peterjc commented Jan 31, 2024

manulera commented Feb 13, 2024

MarkusPiotrowski commented Feb 15, 2024 •

edited

Inconsistent enzyme.search behaviour when cutting sequence falls outside sequence #4604

Inconsistent enzyme.search behaviour when cutting sequence falls outside sequence #4604

Comments

manulera commented Jan 24, 2024

Setup

Steps to reproduce

peterjc commented Jan 24, 2024

MarkusPiotrowski commented Jan 25, 2024

manulera commented Jan 25, 2024 • edited

manulera commented Jan 29, 2024

manulera commented Jan 31, 2024

peterjc commented Jan 31, 2024

manulera commented Feb 13, 2024

MarkusPiotrowski commented Feb 15, 2024 • edited

manulera commented Jan 25, 2024 •

edited

MarkusPiotrowski commented Feb 15, 2024 •

edited