Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

modernize HBA to use AG as objects internally instead of selection strings #3933

Open
jaclark5 opened this issue Nov 21, 2022 · 29 comments · May be fixed by #4533
Open

modernize HBA to use AG as objects internally instead of selection strings #3933

jaclark5 opened this issue Nov 21, 2022 · 29 comments · May be fixed by #4533

Comments

@jaclark5
Copy link
Contributor

jaclark5 commented Nov 21, 2022

Is your feature request related to a problem?

modernize HBA (HydrogenBondAnalysis) to use AG (AtomGroups) as objects internally instead of selection strings to avoid the need for HBA analysis to discern the relevant atoms.

Describe the solution you'd like

AG as objects internally instead of selection strings

Additional context

See PR Hbond update #3848 (in particular #3848 (comment))

Guidance for new contributors

This is a reasonably complicated issue to solve for a beginner. We recommend you first learn about the problem. Ask questions here. The more specific your question, then better we can help.

As the first steps you should:

EDITS

  • made issue clearer for new contributors (@orbeckst )
@orbeckst
Copy link
Member

orbeckst commented Feb 9, 2023

Can we offer this issue to GSOC applicants? Opinion @hmacdope ?

@hmacdope
Copy link
Member

hmacdope commented Feb 9, 2023

Why not, would be a good fix.

@orbeckst
Copy link
Member

orbeckst commented Feb 9, 2023

Context: came up with #3848 (comment)

@utkarsh147-del
Copy link

utkarsh147-del commented Feb 12, 2023

Hi so i read the HydrogenBondAnalysis code and find this piece of code using selection string

import MDAnalysis
from MDAnalysis.analysis.hydrogenbonds.hbond_analysis import HydrogenBondAnalysis as HBA
u = MDAnalysis.Universe(psf, trajectory)
hbonds = HBA(universe=u)
hbonds.hydrogens_sel = hbonds.guess_hydrogens("protein")
hbonds.acceptors_sel = hbonds.guess_acceptors("protein")
hbonds.run()

To convert the selected string into AtomGroup Object,we can change this code like this
import MDAnalysis
from MDAnalysis.analysis.hydrogenbonds.hbond_analysis import HydrogenBondAnalysis as HBA
u = MDAnalysis.Universe(psf, trajectory)
hbonds = HBA(universe=u)
hydrogens = u.select_atoms("protein")
acceptors = u.select_atoms("protein*")
hbonds.hydrogens_sel = hydrogens
hbonds.acceptors_sel = acceptors
hbonds.run()

So here i use select_atoms() function to convert selection string to atom group object.

So i m going to give pull request by converting selection string to atomgroup object in hbond_analysis.py,So if i am approaching correct. please merge my pull request so i have contribution for GSOC

utkarsh147-del added a commit to utkarsh147-del/mdanalysis that referenced this issue Feb 12, 2023
Here I convert hbond code from selection strings to atom group object issue MDAnalysis#3933.If you think my approach is correct then please let me know.I will update this full code.Please merge this pull request so i can have pull request for my gsoc
@RMeli
Copy link
Member

RMeli commented Feb 12, 2023

@utkarsh147-del the select_atoms function selects the specified atoms, while the hbonds.guess_hydrogens and hbonds.guess_acceptors functions guess which hydrogen atoms and which acceptor atoms should be used in the hydrogen bond analysis. In your suggested change, you are completely changing the meaning of the code, since you are using select_atoms("protein") which selects all protein atoms. Have you tried to run the original code and your suggested change, and see what happens?

Additionally, you are suggesting to change code in the docstring. However, this issue concerns the changes to the code (not the documentation, which will nonetheless need to reflect the changes in the code) needed to use atom groups internally instead of selection strings. The fact that you identified hbonds.guess_hydrogens and hbonds.guess_acceptors as returning a selection string instead of an AtomGroup object should give you a hint on where to look further.

In addition to the first steps described above, please familiarise yourself with the HydrogenBondAnalysis class and its use (you can have a look at the user guide), to make sure you understand what it does, before trying to identify pieces of code that use selection strings that will need to be changed to use AtomGroups.

Finally, please use Code and Syntax Highlighting when writing code on GitHub, which makes it far more readable.

Do not hesitate to ask questions and clarifications, before jumping into opening a PR.

@utkarsh147-del
Copy link

@utkarsh147-del the select_atoms function selects the specified atoms, while the hbonds.guess_hydrogens and hbonds.guess_acceptors functions guess which hydrogen atoms and which acceptor atoms should be used in the hydrogen bond analysis. In your suggested change, you are completely changing the meaning of the code, since you are using select_atoms("protein") which selects all protein atoms. Have you tried to run the original code and your suggested change, and see what happens?

Additionally, you are suggesting to change code in the docstring. However, this issue concerns the changes to the code (not the documentation, which will nonetheless need to reflect the changes in the code) needed to use atom groups internally instead of selection strings. The fact that you identified hbonds.guess_hydrogens and hbonds.guess_acceptors as returning a selection string instead of an AtomGroup object should give you a hint on where to look further.

In addition to the first steps described above, please familiarise yourself with the HydrogenBondAnalysis class and its use (you can have a look at the user guide), to make sure you understand what it does, before trying to identify pieces of code that use selection strings that will need to be changed to use AtomGroups.

Finally, please use Code and Syntax Highlighting when writing code on GitHub, which makes it far more readable.

Do not hesitate to ask questions and clarifications, before jumping into opening a PR.

Hi,I understand what you want to convey to me except one thing what do you mean by docstring here,I try to make changes in code which the hbond.py link given>Do we need to make this change in that file or some another python fie.Please clarify me this

@RMeli
Copy link
Member

RMeli commented Feb 12, 2023

@utkarsh147-del by docstring I mean the Python docstring of the hbond_analysis.py module. This is what you are suggesting to change in your comment. Therefore, the commit on your fork of MDAnalysis only changes such doctoring, which is what constitutes the documentation of the hbond_analysis.py module.

hbond_analysis.py is indeed the file that needs to be changed, but the required changes consist in using AtomGroups internally (within the classes and methods of this module) instead of selection strings, not just changing the doctoring/documentation.

I hope this clarifies things.

@SophiaRuan
Copy link
Contributor

Hi @hmacdope @orbeckst, I would like to try this issue. Is it possible for me to work on this one?

@orbeckst
Copy link
Member

@SophiaRuan I don't see a linked PR so if you submit a PR that references this issue, we'll review it.

@HunterZ17
Copy link

Hi @orbeckst @hmacdope, I have been working on this issue recently. Do we need to change these parameters (donors_sel, hydrogens_sel, acceptors_sel, between) which have selection strings as input to Atomgroups as input and change the code accordingly so that it will take ag without any bugs?

@orbeckst
Copy link
Member

No, for right now, input selection strings are ok. If you follow #3848 (comment) then you see it's about internal use of selection strings.

I have been working on this issue recently

I don't see a linked PR. When you create a PR, make sure that you include issue number #3933 in the Fixes #3933 field of the template PR comment.

@yickysan
Copy link

Can we offer this issue to GSOC applicants? Opinion @hmacdope ?

No, for right now, input selection strings are ok. If you follow #3848 (comment) then you see it's about internal use of selection strings.

I have been working on this issue recently

I don't see a linked PR. When you create a PR, make sure that you include issue number #3933 in the Fixes #3933 field of the template PR comment.

For my solution I made the HydrogenBondAnalysis class also accept a list of atom indices and positions as an alternative to selecting atoms via string. I don't know if this solves the issue

@orbeckst
Copy link
Member

orbeckst commented Mar 30, 2023

If you have a pull request with code up then make sure that the first line of your comment reads

Fixes #3933

i.e., fill in the template that is shown to you when you create the PR.

This will link your PR to this issue. And unless we see this link, we will probably not see or review a PR.

@orbeckst
Copy link
Member

@yickysan

For my solution I made the HydrogenBondAnalysis class also accept a list of atom indices and positions as an alternative to selecting atoms via string. I don't know if this solves the issue

That's not required and in fact changes the user interface. We do not change the user interface lightly because it can create incompatibilities. It's better to focus a PR on the smallest amount of changes necessary to solve the issue. If you feel that the UI should be changed, raise a specific issue for that case and then it can be discussed and addressed separately.

In order to keep work (and reviewing) manageable, we very much try to keep PRs as small and as self-contained as possible.

@yickysan
Copy link

@yickysan

For my solution I made the HydrogenBondAnalysis class also accept a list of atom indices and positions as an alternative to selecting atoms via string. I don't know if this solves the issue

That's not required and in fact changes the user interface. We do not change the user interface lightly because it can create incompatibilities. It's better to focus a PR on the smallest amount of changes necessary to solve the issue. If you feel that the UI should be changed, raise a specific issue for that case and then it can be discussed and addressed separately.

In order to keep work (and reviewing) manageable, we very much try to keep PRs as small and as self-contained as possible.

Thank you for reaching back. The new changes I've made, if there isn't a specific select string, the methods should use ag.atoms instead of `ag.select_atoms("all"). This will only be used when we have soemthing like "resname .... and name ..."

@orbeckst
Copy link
Member

@yickysan , if you show your code in a PR we can comment on it; discussions with code present are more productive

@yickysan yickysan mentioned this issue Mar 31, 2023
4 tasks
@lunamorrow
Copy link

I would like to work on this issue for my GSoC application @hmacdope @orbeckst. I am thankfully familiar with both selections strings and AtomGroups from being an MDAnalysis user. I will start tackling the internal use of selection strings and converting them over to AtomGroups. I'll update this issue thread with any questions I have as they rise up.

@hmacdope
Copy link
Member

Sounds good,

@orbeckst @IAlibay we will need to deprecate the old API as well yes?

@lunamorrow
Copy link

I am going to jot down my rough plan here before I get started on digging around in the code, to make sure I have an appropriate plan. If I disregard guess_acceptors, guess_donors, and guess_hydrogens, I can't see an awful lot of selection string usage, but from what I can find here are my ideas.

  • Change return types of guess_acceptors, guess_donors, and guess_hydrogens to AtomGroups instead of selection strings (doctoring examples will need to be updated to reflect this), as well as their internal code that uses selection strings more than once. As part of this (or for a later issue), should we create some kind of setter methods for changing these properties (rather than allowing users to just access and change the stored selection strings, as per some of the docstring examples)?
  • Update _prepare, as will no longer need to use a selection string to get the acceptors AtomGroup
  • Update _get_dh_pairs to use acceptors and hydrogens AtomGroups defined in _prepare, instead of selection strings. Only donor's will still need a selection string as they must be matched to the hydrogen atom groups, so bonding can still be seen by "zipping" atom groups

Please let me know if I am missing anything crucial, otherwise I will get started tomorrow 😊

@IAlibay
Copy link
Member

IAlibay commented Mar 25, 2024

Sounds good,

@orbeckst @IAlibay we will need to deprecate the old API as well yes?

Yeah we should keep it around until 3.0, it's a highly used method and folks won't be too happy if the API changes on them.

@talagayev
Copy link
Contributor

I would also have a question regarding this issue.

Since as @lunamorrow the main cases where string selection is used are guess_acceptors, guess_donors and guess_hydrogens, if we return them as AtomGroups it would allow us to use them later on internally as AtomGroups in _prepare.

As far as i understood you would want to retain the user input, so that he could string selection as an input for HBA and don't change the input of the user to AtomGroups, which means _prepare gets the information either through guess_acceptors as an AtomGroup or from the user as a selection string, which could lead to a problem in the case of the selection, where the user would access the guess_acceptors to get the acceptors and then add manually additional cases for example:

import MDAnalysis as mda
u = mda.Universe(PSF,DCD)
hbonds = HydrogenBondAnalysis(universe=u)

protein_acceptors_sel = hbonds.guess_acceptors("protein")

water_acceptors_sel = "resname TIP3 and name OH2"

hbonds.acceptors_sel = f"({protein_acceptors_sel}) or ({water_acceptors_sel} and around 10 not resname TIP3)"

which would deliver you hbonds.acceptors_sel as a combination of the AtomGroup from guess_acceptors and a string from water_acceptors_sel.

It should be possible to put a condition that would recognize such cases and split up then the input of the user into parts and them treat them separately i think.

@hmacdope
Copy link
Member

hmacdope commented Mar 25, 2024

I am going to jot down my rough plan here before I get started on digging around in the code, to make sure I have an appropriate plan. If I disregard guess_acceptors, guess_donors, and guess_hydrogens, I can't see an awful lot of selection string usage, but from what I can find here are my ideas.

  • Change return types of guess_acceptors, guess_donors, and guess_hydrogens to AtomGroups instead of selection strings (doctoring examples will need to be updated to reflect this), as well as their internal code that uses selection strings more than once.

Yep makes sense to me.

As part of this (or for a later issue), should we create some kind of setter methods for changing these properties (rather than allowing users to just access and change the stored selection strings, as per some of the docstring examples)?

The stored selection strings will be replaced with stored atom groups, but the user should still be able to set them IMO (to an atom group)

  • Update _prepare, as will no longer need to use a selection string to get the acceptors AtomGroup
    Yep
  • Update _get_dh_pairs to use acceptors and hydrogens AtomGroups defined in _prepare, instead of selection strings. Only donor's will still need a selection string as they must be matched to the hydrogen atom groups, so bonding can still be seen by "zipping" atom groups

Should be able to check the indices on two atom groups, but lets see how we go.

Please let me know if I am missing anything crucial, otherwise I will get started tomorrow 😊

You will need to do 2x PRs, 1 to deprecate current API, adding a deprecation warning, and the other to update the functionality. The update to AGs will then need to be held until the next major release cycle. This all still counts for GSOC etc even if PR not merged.

@hmacdope
Copy link
Member

I would also have a question regarding this issue.

Since as @lunamorrow the main cases where string selection is used are guess_acceptors, guess_donors and guess_hydrogens, if we return them as AtomGroups it would allow us to use them later on internally as AtomGroups in _prepare.

Correct

As far as i understood you would want to retain the user input, so that he could string selection as an input for HBA and don't change the input of the user to AtomGroups, which means _prepare gets the information either through guess_acceptors as an AtomGroup or from the user as a selection string, which could lead to a problem in the case of the selection, where the user would access the guess_acceptors to get the acceptors and then add manually additional cases for example:

import MDAnalysis as mda u = mda.Universe(PSF,DCD) hbonds = HydrogenBondAnalysis(universe=u)

protein_acceptors_sel = hbonds.guess_acceptors("protein")

water_acceptors_sel = "resname TIP3 and name OH2"

hbonds.acceptors_sel = f"({protein_acceptors_sel}) or ({water_acceptors_sel} and around 10 not resname TIP3)"

which would deliver you hbonds.acceptors_sel as a combination of the AtomGroup from guess_acceptors and a string from water_acceptors_sel.

It should be possible to put a condition that would recognize such cases and split up then the input of the user into parts and them treat them separately i think.

Yes, either we will need to insist on all AtomGroups for all selections or have functionality that uses and isinstance check to handle the overload of the two. In general we are trying to move towards AtomGroups as the functionality that we pass around, so that would be my preference, but we can explore options.

@talagayev
Copy link
Contributor

@hmacdope Thanks for the reply.

I was tinkering with the guess_acceptors to see if i can convert them to AtomGroups and also then adjust the user input, so that it wouldn't lead to problems regarding the user input.
For the cases where the user is only using either a selection string or an AtomGroup as input it is fine and possible to differentiate between them with isinstance, although i did use type for it, but the difficulty is if they are combined. I managed to get them through specific conditions, but it is too condition heavy i would say and definitely not the optimal way.
I only did it for the acceptors and since me and @lunamorrow are both interested in addressing this issue and both applied for GSOC if it would be fine that i purpose a partial enhancement with the change of acceptors to AG and she does the other parts of the code? although i would say that this issue will anyway not be merged till the deadline, since it could quite delicate with the adjustment of the user input.

@hmacdope
Copy link
Member

@talagayev, I appreciate that GSOC starter issues are in short supply, however given that the work you propose is within the scope of that already proposed by @lunamorrow, I would say they have first chance at making the changes (unless you two can come to an agreement, but I would advise against this as two split PRs will be more compilicated).

For other issues to tackle I would suggest #3937 or #3811

@talagayev
Copy link
Contributor

@hmacdope i agree, since the code would be in the same direction as proposed by @lunamorrow and splitting up the PRs could be tricky, it would be only fair that she works on this issue, while i would focus on the issues that you suggested. I would probably start with #3811.

@lunamorrow
Copy link

lunamorrow commented Mar 26, 2024

I've just finished a big day of uni, so I'll get started on these changes tomorrow now that @talagayev and I have a project each. I will make sure my PR addresses both my previous suggestions and the ones made by Valerij. To clarify/confirm, am I good to change selections across to AtomGroups (as the selection string API will be depreciated)? Therefore, we need to make sure the API reflects users being able to update selections by essentially adding a new AtomGroup (rather than passing a selection string) and combining them with no duplications. Yes?

Thanks for the clarification and direction @hmacdope

The stored selection strings will be replaced with stored atom groups, but the user should still be able to set them IMO (to an atom group)

Ok this makes sense, as from my usage of MDAnalysis thus far it's quite easy and user-friendly to grab and manipulate stored variables.

Should be able to check the indices on two atom groups, but lets see how we go.

Oh yes true! I'll give that a go to really eliminate usage of selection strings :)

You will need to do 2x PRs, 1 to deprecate current API, adding a deprecation warning, and the other to update the functionality. The update to AGs will then need to be held until the next major release cycle. This all still counts for GSOC etc even if PR not merged.

I may need some help with the fiddly aspects of this, as I am very much figuring out how to do things, like making PRs, as I go. Are there any resources that you can point me towards, so that I can ensure I do this correctly from the get go?

@hmacdope
Copy link
Member

@lunamorrow we the feature-branch model detailed here: https://www.atlassian.com/git/tutorials/comparing-workflows/feature-branch-workflow

Basically,

  • Make a fork of MDAnalysis
  • Make a branch off develop in your fork (git checkout develop && git checkout -b my_cool_branch)
  • Add and commit your changes (git add hbonds.py, `git commit -m 'changed hbonds')
  • Push your changes to your fork (git push origin)
  • Repeat as nessecary
  • Make a pull request on the MDAnalysis pull-requests web interface.

@lunamorrow
Copy link

lunamorrow commented Mar 27, 2024

@hmacdope Great, I have setup a fork on my GitHub repository already and created a new branch for the changes. I will work on the actual code change PR first and then push that, before I start on the PR for the API updates. I expect I will need some guidance with the API PR to ensure I match the format used by MDAnalysis for API depreciation warnings, but I will reach out when I get to this stage.

Just checking, will I need to make a PR with extra test cases in MDAnalysisTests? I expect we may need more for user interactions with guess_acceptors, guess_donors, and guess_hydrogens, but I am not sure if this would already be covered by existing test cases?

Some more questions as I go:

Is there a reason that self.donors_sel is not set in _prepare the same way that self.hydrogens_sel and self.acceptors_sel are? self.donors_sel is never set internally, so it will always be None unless it is passed in by the user during creation of a HydrogenBondAnalysis Object. Is that what we want, or can I enable it to be set in _prepare if it is not already defined during initialisation? I have noticed this same check if not (hasattr(self.u._topology, 'bonds') and len(self.u._topology.bonds.values) != 0): is done in both guess_donors and _guess_dh_pairs, so to me it makes sense for me to condense this and simply set self.donors_sel in _prepare so the extra checks are avoided.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment