-
Notifications
You must be signed in to change notification settings - Fork 638
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/dssp #4304
base: develop
Are you sure you want to change the base?
Feature/dssp #4304
Conversation
Hello @marinegor! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2024-05-29 07:17:22 UTC |
Linter Bot Results:Hi @marinegor! Thanks for making this PR. We linted your code and found the following: Some issues were found with the formatting of your code.
Please have a look at the Please note: The |
There's a problem I don't know the workaround for yet: prolines. In the original
However, in my implementation all prolines (and N-terminal residue) are assigned '-' (=unstructured loop). This is likely less correct than original implementation, but I'm not sure how to fix that.
|
…cture in trajectory
The prolines problem is now dealt with, exactly as I described above -- if |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks already very promising.
For an initial quick review please see inline comments.
Additional requests
- Fix the tests: AttributeError: module 'MDAnalysisTests.datafiles' has no attribute 'DSSP'
- We also need a test with a trajectory.
- Compress the PDF files with bz2 or gz.
- Update CHANGELOG.
- Add a
.. versionadded:: 2.7.0
to the docs. - Fix PEP8 complaints.
Initially distinguishing H,E,- is good but eventually it would be good to distinguish other secondary structures, too, in particular 310 and π helices.
package/MDAnalysis/analysis/dssp.py
Outdated
CONST_Q1Q2 = 0.084 | ||
CONST_F = 332 | ||
DEFAULT_CUTOFF = -0.5 | ||
DEFAULT_MARGIN = 1.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do these constants from from the original DSSP algorithm?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes -- see "Hydrogen-Bonded Structure" part of the paper.
package/MDAnalysis/analysis/dssp.py
Outdated
h3 = h3 * ~np.roll(helix4, -1, 1) * ~helix4 # helix4 is higher prioritized | ||
h5 = h5 * ~np.roll(helix4, -1, 1) * ~helix4 # helix4 is higher prioritized |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we get 3_10 and π helix, too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not certain, but seems like yes (used this paper's Fig 1):
from MDAnalysis.analysis.dssp import DSSP
import MDAnalysis as mda
u = mda.Universe('./1FUR.pdb')
r = DSSP(u).run()
r.results.resids[151:158]
# array([155, 156, 157, 158, 159, 160, 161])
''.join(r.results.dssp[0])[151:158]
# 'HHHH-HH'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it can detect 3-10 and pi helixes indeed -- I generated idealized structures with mmtbx
from here and ran DSSP on them:
dssp/o_beta_seq.pdb --------------------
dssp/o_helix310_seq.pdb -HHHHHHHHHHHHHHHHHH-
dssp/o_helix_seq.pdb -HHHHHHHHHHHHHHHHHH-
o_helix_pi_seq.pdb -HHHHHHHHHHHHHHHHHH-
Hi @yuxuanzhuang and @orbeckst , thanks for your review! Main changes:
Hope that works! UPD: also again, CI/CD failures are codecov-related, as far as I can see. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've corrected a missing reference bracket; everything else seems fine to me! It's a great addition to MDAnalysis!
'--EEEE----------HHHH' | ||
|
||
|
||
Functions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Oliver means moving the section of function (i.e. this block) in the documentation after Analysis classes
@yuxuanzhuang I don't get which section you mean. If this one: https://mdanalysis--4304.org.readthedocs.build/en/4304/documentation_pages/analysis_modules.html |
modified :) |
ah yeah I see now, my bad :) thanks a lot! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all the work here @marinegor.
Please consider anything covered by [enh] as enhancements - i.e. nice to haves.
There are a few small blockers here that need addressing.
A note: this adds roughly ~ 1 MB of extra compressed data to the testsuite. I don't consider this a blocker, but something to keep in mind (currently we are at 56 MB compressed).
@@ -39,6 +39,7 @@ Fixes | |||
* Fix groups.py doctests using sphinx directives (Issue #3925, PR #4374) | |||
|
|||
Enhancements | |||
* Add `analysis.DSSP` module for protein secondary structure assignment, based on [pydssp](https://github.com/ShintaroMinami/PyDSSP) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not going to block on this, but if you can wrap around at 79 chars here that would be grand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
mean_secondary_structure = translate( | ||
long_run.results.dssp_ndarray.mean(axis=0) | ||
) | ||
print(''.join(mean_secondary_structure)[:20]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm re-opening this, I had the exact same question as @orbeckst reading through this. I'm not sure if the 20 residues clarification got removed later for other reasons, so I'm not going to block over this.
|
||
Running this code produces :: | ||
|
||
'-EEEEEE------HHHHHHH' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there are better ways of doing this, especially if we want doctest to run well over this.
Something like this:
mdanalysis/package/MDAnalysis/analysis/align.py
Lines 73 to 79 in 73acc9b
In the simplest case, we can simply calculate the C-alpha RMSD between | |
two structures, using :func:`rmsd`:: | |
>>> ref = mda.Universe(PDB_small) | |
>>> mobile = mda.Universe(PSF, DCD) | |
>>> rmsd(mobile.select_atoms('name CA').positions, ref.select_atoms('name CA').positions) | |
28.20178579474479 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my understanding was the opposite, and I recall that code-block:: python
actually gives some errors in doctest. I'll rewrite if necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kind of dislike the >>>
form for longer code examples, at least if I as a human have to read it.
@ljwoods2 can I pick your brain? I think you recently wrote doctests that did not use >>>
. How did you do that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dislike >>>
too, perhaps this was the reason for me thinking that code-block
syntax was correct.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
In this example, we will find residue groups that maintain their secondary structure | ||
along the simulation, and have some meaningful ('E' or 'H') secondary structure | ||
during more than set `threshold` share of frames. We will call these residues |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
during more than set `threshold` share of frames. We will call these residues | |
during more than set `threshold` fraction of the total frames. We will call these residues |
Feel free to reject, it's just a nitpick - reading this I wasn't sure if the code below was going to expect a set number of frames or a fraction of the total number of frames.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, thanks.
testsuite/pyproject.toml
Outdated
@@ -72,6 +72,7 @@ include=[ | |||
"MDAnalysisTests.data.gromacs", | |||
"MDAnalysisTests.data.lammps", | |||
"MDAnalysisTests.data.mol2", | |||
"MDAnalysisTests.data.dssp", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This specific folder location is missing an __init__.py
file - could you add one please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some comments, will fix the rest (selections, hydrogens, and documentation) later.
@@ -39,6 +39,7 @@ Fixes | |||
* Fix groups.py doctests using sphinx directives (Issue #3925, PR #4374) | |||
|
|||
Enhancements | |||
* Add `analysis.DSSP` module for protein secondary structure assignment, based on [pydssp](https://github.com/ShintaroMinami/PyDSSP) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
mean_secondary_structure = translate( | ||
long_run.results.dssp_ndarray.mean(axis=0) | ||
) | ||
print(''.join(mean_secondary_structure)[:20]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a little comment about that in the code snippet.
My reasoning behind [:20]
was that the full output is this:
--EEEE----------HHHHHHH----EE----HHHHH------HHHHHHHHHH------HHHHHHHHHHH---------EEEE-----HHHHHHHHH------EEEEEE--HHHHHH----EE--------EE---E----------------------HHHHHHHHHHHHHHHHHHHHHHHHHHHH----EEEEE------HHHHHHHHH--
which is kinda lengthy for the documentation🤷♂️
|
||
Running this code produces :: | ||
|
||
'-EEEEEE------HHHHHHH' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my understanding was the opposite, and I recall that code-block:: python
actually gives some errors in doctest. I'll rewrite if necessary.
self._guess_hydrogens = guess_hydrogens | ||
|
||
ag: AtomGroup = ( | ||
atoms.select_atoms("protein") if isinstance(atoms, Universe) else atoms |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens here if you pick up a amino acid residue with more protons than is "normally" expected? E.g. ASH or GLUH
I have requested the files of this type on discord, but if you have any in mind, please let me know.
Given you end up doing a lot of protein residue selection below, would it make sense to just do it here once?
Perhaps, yes. I'll try refactoring it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for addressing my last comments. I found minor things and added a whole bunch of code changes — please add those together with the ones from @IAlibay 's review. Thanks!
mean_secondary_structure = translate( | ||
long_run.results.dssp_ndarray.mean(axis=0) | ||
) | ||
print(''.join(mean_secondary_structure)[:20]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just show everything to make clearer what's happening, especially without explanation? Please remove the [:20]
and add the full output. It's not too long.
|
||
Running this code produces :: | ||
|
||
'-EEEEEE------HHHHHHH' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kind of dislike the >>>
form for longer code examples, at least if I as a human have to read it.
@ljwoods2 can I pick your brain? I think you recently wrote doctests that did not use >>>
. How did you do that?
|
||
|
||
class DSSP(AnalysisBase): | ||
"""Assign secondary structure using DSSP algorithm. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
include please
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are minor doc issues to fix, please see comments (incorrect reST syntax, missing print()
, ...)
Please also address other reviews.
I am going to approve because in principle I am satisfied. Good work!
manual test a top level, should not be in source
@IAlibay would you mind taking a quick final glance at the DSSP PR? I think your comments were addressed but would prefer you confirm instead of me dismissing a totally valid review. Would be great if we could finally get that one in. |
@IAlibay any comments on this one? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies for the long review delay.
I just have the one docstring confusion, once addressed then it's good to go.
Tagging @orbeckst here - please do dismiss my review if this gets addressed and I don't get a chance to re-review.
for calpha, hydrogen in zip( | ||
self._heavy_atoms["CA"][1:], self._hydrogens[1:] | ||
): | ||
if not hydrogen and calpha.resname != "PRO": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding the warning, however I'm still a bit confused with this https://github.com/MDAnalysis/mdanalysis/pull/4304/files#diff-cad8095a75b51cba3d6ef256c3c05073b457f309406ae4a94f45ac53257f3e81R202
It claims we're checking for "H" but we only expect one "H" according to the warning? (sorry I'm probably misreading something).
Hi @IAlibay -- couldn't follow your link, maybe you explain in more detail? I updated class's documentation slightly to match the description of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@marinegor and @IAlibay I think I understand the point of your discussion: As far as I can tell, I agree with @IAlibay that the current ValueError
check does not exactly check that there's exactly one HN. If we could change the code then that would be safer, please see comments.
I think I also found a potential issue with how _heavy_atoms
and _hydrogens
are constructed. Please check.
.. Warning:: | ||
For the DSSP to work properly, your atoms (either as ``Universe`` | ||
or ``AtomGroup``) must have only one hydrogen atom "H" per residue, | ||
that corresponds to N-bound hydrogen in a peptide bond (or zero hydrogen | ||
atoms, if the residue is proline). It is selected using `hydrogen_atom` | ||
parameter. It is designed to hande most of the modern syntaxes, | ||
but if you encounter an error or unusual results | ||
during your run, please report an issue on MDAnalysi | ||
`github <https://github.com/MDAnalysis/mdanalysis/issues>`_. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe the following is clearer?
.. Warning:: | |
For the DSSP to work properly, your atoms (either as ``Universe`` | |
or ``AtomGroup``) must have only one hydrogen atom "H" per residue, | |
that corresponds to N-bound hydrogen in a peptide bond (or zero hydrogen | |
atoms, if the residue is proline). It is selected using `hydrogen_atom` | |
parameter. It is designed to hande most of the modern syntaxes, | |
but if you encounter an error or unusual results | |
during your run, please report an issue on MDAnalysi | |
`github <https://github.com/MDAnalysis/mdanalysis/issues>`_. | |
.. Warning:: | |
For DSSP to work properly, your atoms must represent a protein. The | |
hydrogen atom bound to the backbone nitrogen atom is matched by name | |
as given by the keyword argument `hydrogen_atom`. There may only be | |
a single backbone nitrogen hydrogen atom per residue; the one exception | |
is proline, for which there should not exist any such hydrogens. | |
The default value of `hydrogen_atom` should to handle the common naming | |
conventions in the PDB and in force fields but if you encounter an error | |
or unusual results during your run, try to figure out how to select the | |
correct hydrogen atoms and report an issue in the MDAnalysis | |
`issue tracker <https://github.com/MDAnalysis/mdanalysis/issues>`_. |
.. Note:: | ||
To work with different hydrogen-naming conventions by default, the default | ||
selection is broad but if hydrogens are incorrectly selected (e.g., a | ||
:exc:`ValueError` is raised) you must customize ``hydrogen_name`` for your |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reference kwargs by single backticks
:exc:`ValueError` is raised) you must customize ``hydrogen_name`` for your | |
:exc:`ValueError` is raised) you must customize `hydrogen_name` for your |
for calpha, hydrogen in zip( | ||
self._heavy_atoms["CA"][1:], self._hydrogens[1:] | ||
): | ||
if not hydrogen and calpha.resname != "PRO": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The warning says that we expect exactly one H for non-PRO. Can you write the if
with the following logic
if calpha.resname == "PRO":
# special case PRO
if hydrogen:
raise ValueError(f"PRO {calpha.resid} has a hydrogen {hydrogen}")
# else:
# pass # expected: PRO has no hydrogen
elif not hydrogen or len(hydrogen) > 1:
# not a PRO and not exactly one HN: that's bad
raise ValueError(f"{calpha.resname} {calpha.resid} has not exactly one HN, instead it has {hydrogen}")
At least that's my understanding what we should be seeing, @marinegor ?
(There's better code to be written than what I wrote but for following the logic I inserted a pass
etc...)
self._heavy_atoms: dict[str, "AtomGroup"] = { | ||
t: ag.atoms[ | ||
np.isin( | ||
ag.names, t.split() | ||
) # need split() since `np.isin` takes an iterable as second argument | ||
# and "N".split() -> ["N"] | ||
] | ||
for t in heavyatom_names | ||
} | ||
self._hydrogens: list["AtomGroup"] = [ | ||
res.atoms.select_atoms(f"name {hydrogen_name}") for res in ag.residues | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the following code I see that there's the assumption that the order of atoms in _heavy_atoms["CA"]
is the same as in _hydrogens
.
However, that may not be true: _hydrogens
is created from a select_atoms()
which always sorts the atoms by index in increasing order whereas ag's in _heavy_atoms
are created by look up and slicing. If the original input ag
is NOT ORDERED then _hydrogens
will be ORDERED whereas _heavy_atoms
remain in the CUSTOM ORDER of the original ag. At least that's what it looks like to me.
It may be safer to just sort the original ag
first.
Fixes #1612
Changes made in this Pull Request:
MDAnalysis.analysis.dssp.DSSP
class for secondary structure analysis, using code implemented inpydssp
package available for secondary structure annotationPR Checklist
Developers certificate of origin
📚 Documentation preview 📚: https://mdanalysis--4304.org.readthedocs.build/en/4304/