Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent enzyme.search behaviour when cutting sequence falls outside sequence #4604

Open
manulera opened this issue Jan 24, 2024 · 8 comments · May be fixed by #4638
Open

Inconsistent enzyme.search behaviour when cutting sequence falls outside sequence #4604

manulera opened this issue Jan 24, 2024 · 8 comments · May be fixed by #4638

Comments

@manulera
Copy link
Contributor

Setup

I am reporting a problem with Biopython version, Python version, and operating
system as follows:

import sys; print(sys.version)
import platform; print(platform.python_implementation()); print(platform.platform())
import Bio; print(Bio.__version__)
3.11.6 (main, Oct  6 2023, 11:55:02) [Clang 13.0.0 (clang-1300.0.29.30)]
CPython
macOS-12.7-x86_64-i386-64bit
1.83

Steps to reproduce

cc: @BjornFJohansson

Let's start with the digestion of the following sequence TAAAAAAAAAAAAGCCGGCAAAAAAATAAAAA with the restriction enzyme that makes two cuts NmeDI. I am using this example to show what happens at both ends, but the same applies if you use one-cutters only. After digestion we would get 3 fragments:

T      AAAAAAAAAAAAGCCGGCAAAAAAA       TAAAAA
ATTTTT      TTTTTTTCGGCCGTTTTTTTATTTT       T

If we use the enzyme search, we get the positions (one-based) where these cuts will occur:

from Bio.Seq import Seq
from Bio.Restriction import NmeDI

seq = Seq('TAAAAAAAAAAAAGCCGGCAAAAAAATAAAAA')
print(NmeDI.search(seq))

# Prints `[2, 27]`

If we start trimming the molecule from the sides, the sequence where the enzyme would make the cut and leave and overhang will disappear, and the behaviour is different when we trim the left and right side:

for i in range(1,7):
    print(i, NmeDI.search(seq[i:]), NmeDI.search(seq[:-i]))

# Prints:
# 1 [26] [2, 27] < As soon as the initial T is removed, the left site is not returned anymore
# 2 [25] [2, 27]
# 3 [24] [2, 27]
# 4 [23] [2, 27]
# 5 [22] [2, 27]
# 6 [21] [2]     < Until the final T is removed, the right site is still returned

I don't know what the right thing to do is. My feeling is that probably the behaviour of the first trimming on either side should work, and the rest not? This would mean accepting a cut like this, where the cutting sites lie exactly at the end, and single-strand DNA is generated (see below). I am not sure if biologically a cut would happen with further trimming.

      AAAAAAAAAAAAGCCGGCAAAAAAA       TAAAA
TTTTT      TTTTTTTCGGCCGTTTTTTTATTTT
@peterjc
Copy link
Member

peterjc commented Jan 24, 2024

Do you have a one-cutter example to hand?

>>> NmeDI.elucidate()
'cut twice, not yet implemented sorry.'

It should be easier for me to understand what is puzzling you with a single cutter ;)

@MarkusPiotrowski
Copy link
Contributor

Without having looked at the code, it seems quite obvious what's happening here. Only the "upper" strand is analyzed and to have a cut before the 2nd position (1-based), you need of course a first position. To have a cut before position 27, you need to have position 27 (but not 28). The "lower" strand is obviously not taken into account.

While this in not a problem for type IIp REs, which cut inside their recognition sequence, it's a problem for type IIs REs, which cut outside. A possible solution would be to check for these enzymes if the reverse complement sequence gives the same result (if the recognition sequence is palindromic), or to check if the sequence is long enough to contain the cuttings for both strands. This could possibly be incorporated into the RE module.

From a biological view, it's most likely unknown, if such enzymes will cut the DNA if one strand cannot be cut at all (because the DNA just ends). (Even some typ IIp REs don't cut (effectively) when their sites are directly at the end of a DNA molecule) So I wouldn't recommend to allow "air cuts", e.g. in your example, I wouldn't return a site if one strand is missing the neighboring base. I think the actual behavior for the "left site" is correct and for the "right site" is incorrect.

@manulera
Copy link
Contributor Author

manulera commented Jan 25, 2024

As @MarkusPiotrowski points out, this can only happen with enzymes that cut outside their recognition site. It seems that only the positions on the watson/upper strand are returned. It's not so much whether the cut is on the left or on the right, instead what matters is whether the cut occurs on the top or bottom strand. Let me show you with three examples with BsaI, BsaXI, BspCNI. Recognition sites are shown in blue, and the cuts are marked in black.

Screenshot 2024-01-25 at 11 15 17 Screenshot 2024-01-25 at 10 35 21 Screenshot 2024-01-25 at 10 38 06

Note that there is a T among the As where the cut will occur. When that T is removed, the cut is no longer returned, as you can see from the code below:

from Bio.Seq import Seq
from Bio.Restriction import BsaI, BsaXI, BspCNI

print(BsaI.search(Seq('GGTCTCATAAAA')))
print(BsaI.search(Seq('GGTCTCATAAA')))
print(BsaI.search(Seq('GGTCTCATAA')))
print(BsaI.search(Seq('GGTCTCATA')))
print(BsaI.search(Seq('GGTCTCAT')))
print(BsaI.search(Seq('GGTCTCA')))
print()

# Prints
# [8]
# [8]
# [8]
# [8]
# [8]
# []

print(BsaXI.search(Seq('AAATAAAAAAAAAACAAAAACTCC')))
print(BsaXI.search(Seq('AATAAAAAAAAAACAAAAACTCC')))
print(BsaXI.search(Seq('ATAAAAAAAAAACAAAAACTCC')))
print(BsaXI.search(Seq('TAAAAAAAAAACAAAAACTCC')))
print(BsaXI.search(Seq('AAAAAAAAAACAAAAACTCC')))
print()

# prints
# [5]
# [4]
# [3]
# [2]
# []

print(BspCNI.search(Seq('CTCAGAAAAAAAAAT')))
print(BspCNI.search(Seq('CTCAGAAAAAAAAA')))

# prints
# [15]
# []

@MarkusPiotrowski proposes that the default behaviour should be like in the BspCNI case (cutsite is returned only if the resulting cut will leave at least one base on each strand). Conventional wisdom for primer design is to add some extra bases after a cutsite, so this is probably correct.

If you agree, I can implement that change and write the tests. If you know where to look immediately, I would appreciate it, but I can find my way.

@manulera
Copy link
Contributor Author

Hi @MarkusPiotrowski @peterjc do you agree with the proposed change? Should I go ahead?

@manulera
Copy link
Contributor Author

Hello, just following up on this in case you missed the last comment.

@peterjc
Copy link
Member

peterjc commented Jan 31, 2024

I'm deferring to @MarkusPiotrowski on this one.

@manulera
Copy link
Contributor Author

Pinging @MarkusPiotrowski

@MarkusPiotrowski
Copy link
Contributor

MarkusPiotrowski commented Feb 15, 2024

@manulera The situation is not clear-cut.

  1. Of course, recognition site prediction should work on both strands the same way. So it shouldn't matter which strand is given, because always a dsDNA is assumed as bait. Thus: BsaI.search(Seq('GGTCTCGT') should give the same result as BsaI.search(Seq('GGTCTCGT').reverse_complement()). This is actually not the case.
  2. Now to the question of what limits the detection of a RE site regarding overhangs. I'm pretty convinced that at least the length of the longest overhang must be present in the target sequence. Otherwise, the resulting overhangs would not be defined regarding their length, which would make a tool like this unsuitable for cloning predictions. So, BsaI.search(Seq('GGTCTCNNNN')) should never give a positive result, since the lower strand would be cut after base 11, but the sequence only has a length of 10.
    What is finally debatable is whether it is necessary that (in this example) the sequence must have a length of 12. From a bioinformatician's view, 11 should be OK: The recognition site is present, and the overhangs are well-defined regarding their length. As a wet-bench biologist, I would argue that we don't know if the enzyme really works if only one strand can actually be cut. However, also some type IIP enzymes are known to work poorly or not at all, when their recognition sequence is directly at the end of the DNA molecule, but we don't take this into account when doing RE site predictions. Contrary, NEB's online tool NEBcutter and the popular sequence editor SnapGene only recognize BsaI sites if at least one base is cleaved off from the bottom strand. So I tend to be consistent with these.

So please, go on to implement this change. Maybe you come across some references which would strengthen one or the other alternative. In general, it's always difficult for users, if Biopython acts different to some popular tool without a good reason. :-)

manulera added a commit to manulera/biopython that referenced this issue Feb 27, 2024
@manulera manulera linked a pull request Feb 27, 2024 that will close this issue
3 tasks
manulera added a commit to manulera/biopython that referenced this issue Feb 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants