
WIP, ENH: faster TPR topology building for large systems #4098

Open · tylerjereddy wants to merge 3 commits into `develop`
Conversation

@tylerjereddy (Member) commented Mar 28, 2023

Related to gh-4058

  • for the same scenario reported there, this drops the read-in time from 76 seconds to 54 seconds; there are some pretty serious caveats though:
    • I'm not convinced the testsuite, which passes locally, is actually sufficient to catch mistakes here--consider, for example, that I don't add atom indices to impropers, dihedrals, and angles, since the testsuite only caught me not doing it for bonds (if this is a real issue, this should be useful information for the team re: test coverage)
    • some data types may have changed; in general I wasn't particularly sympathetic to preserving types or data structures if the test suite didn't catch me
    • I didn't check whether any of the standard asv benchmarks are sensitive to these changes, nor did I investigate whether the changes tailored to large systems add a large constant overhead for smaller systems, or for systems with different distributions of topological entities (e.g., a larger ratio of molecules to total atoms)

Still a lot of work to do before we're IO-bound though:
[image: profiling output]


📚 Documentation preview 📚: https://readthedocs-preview--4098.org.readthedocs.build/en/4098/

* use an LRU cache to avoid some costly looping
in TPR topology construction for large systems

* this seems to drop the time required to read
in the data for gh-4058 by 10%, though reviewers
may want to verify on their own massive GROMACS
systems; I don't know if we have suitable `asv`
benchmarks that are sensitive to this
* hoist some unnecessarily repeated operations out of the TPR
topology construction loops
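
For context, a minimal sketch of the caching pattern the first bullet describes (the helper name and arguments here are hypothetical, not the actual patch):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def atoms_in_moltype(moltype_key, residue_sizes):
    # hypothetical expensive per-molecule-type computation; arguments
    # must be hashable, so sequences are passed as tuples
    return sum(residue_sizes)

atoms_in_moltype("SOL", (3,))  # computed on the first call
atoms_in_moltype("SOL", (3,))  # served from the cache thereafter
```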

* `impropers`, `dihedrals`, `angles`, and `bonds` data
structures in the TPR construction now partially leverage
NumPy data structures and operations--danger--the testsuite
only appears to be sensitive to `bonds`, and I intentionally
did not carry over the arithmetic from the `remap_` functions
where omitting it didn't cause problems in the testsuite. It
therefore seems likely that we have some testing issues to
address before we can be confident in perf enhancements here:
more tests for proper atom indices in those other data structures
might be needed, unless they get handled as a backup somewhere
else, I suppose.

* I also didn't really pay much attention to anything else
the testsuite wasn't sensitive to--if a comment says that a topology
data structure is a bunch of tuples in a list, I felt free to
ignore that and use a NumPy data structure if the testsuite couldn't
detect a change

* large TPR local read-in time drops from 71 seconds to 65 seconds
* large TPR read-in time drops from 65 seconds
to 54 seconds with the full test suite passing
@github-actions

Linter Bot Results:

Hi @tylerjereddy! Thanks for making this PR. We linted your code and found the following:

Some issues were found with the formatting of your code.

| Code Location | Outcome |
| --- | --- |
| main package | ⚠️ Possible failure |
| testsuite | ⚠️ Possible failure |

Please have a look at the darker-main-code and darker-test-code steps here for more details: https://github.com/MDAnalysis/mdanalysis/actions/runs/4548555943/jobs/8019703680


Please note: the black linter is purely informational; you can safely ignore these outcomes if there are no flake8 failures!

```python
dihedrals.extend(curr_dihedrals)
if mt.impr is not None:
    curr_impropers = np.tile(mt.impr, mb.molb_nmol).reshape(
        (mb.molb_nmol * len(mt.impr), 4))
    impropers.extend(curr_impropers)
```
tylerjereddy (Member Author):

If you look at how, e.g., mt.remap_impr works, I think it may be surprising that I can get away with this re: test coverage.
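
To make the tiling concrete, here is what that expression produces for toy inputs (illustrative values only). Note that the replicated rows come out grouped per improper, and no per-molecule atom offset has been applied here--presumably the kind of arithmetic the `remap_` functions would normally contribute:

```python
import numpy as np

impr = np.array([[0, 1, 2, 3],
                 [1, 2, 3, 4]])  # two impropers within one molecule type
nmol = 3                         # molecule type replicated three times

print(np.tile(impr, nmol).reshape((nmol * len(impr), 4)))
# [[0 1 2 3]
#  [0 1 2 3]
#  [0 1 2 3]
#  [1 2 3 4]
#  [1 2 3 4]
#  [1 2 3 4]]
```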

```python
# but the test suite says otherwise, or at least is not sensitive...
if mt.bonds is not None:
    curr_bonds = np.tile(mt.bonds, mb.molb_nmol).reshape(
        (mb.molb_nmol * len(mt.bonds), 2))
    adder = np.repeat(
        np.arange(0, mt.number_of_atoms() * mb.molb_nmol,
                  mt.number_of_atoms()),
        2 * len(mt.bonds)).reshape(-1, 2) + atom_start_ndx
```
tylerjereddy (Member Author):

good that the testsuite forced me to think carefully here, but apparently not for the quantities below?
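
Since the index arithmetic above is dense, a standalone toy illustration of the remapping idea may help (the numbers and the exact tile/repeat spelling here are illustrative and deliberately simplified relative to the snippet): replicate the molecule's bonds as whole copies, then add a per-molecule atom offset plus the block's starting index:

```python
import numpy as np

bonds = np.array([[0, 1], [1, 2]])  # bonds within one 3-atom molecule
natoms, nmol, start = 3, 2, 10      # two copies; block starts at atom 10

curr = np.tile(bonds, (nmol, 1))    # whole-molecule copies, stacked
offsets = np.repeat(np.arange(0, natoms * nmol, natoms), len(bonds))
print(curr + offsets[:, None] + start)
# [[10 11]
#  [11 12]
#  [13 14]
#  [14 15]]
```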

```python
moltypes.extend([molblock] * (mt.number_of_atoms() * mb.molb_nmol))
charges.extend(np.tile(charges_mol, mb.molb_nmol))
masses.extend(np.tile(masses_mol, mb.molb_nmol))
elements.extend(np.tile(elements_mol, mb.molb_nmol))
```
tylerjereddy (Member Author):

also, as a general comment--this is kind of partial adoption of NumPy, since we could try to embrace NumPy data structures more completely at the higher level rather than extending, but this is where I landed based on the initial bottlenecks from line_profiler
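
As a sketch of that fuller adoption (illustrative only, not part of this PR): gather one array per molecule block and concatenate once at the end, instead of extending a Python list element by element:

```python
import numpy as np

# toy per-block data: (per-molecule charges, number of molecules)
blocks = [(np.array([0.5, -0.5]), 3), (np.array([1.0]), 2)]

per_block = [np.tile(charges_mol, nmol) for charges_mol, nmol in blocks]
charges = np.concatenate(per_block)  # one allocation instead of many appends
print(charges)  # [ 0.5 -0.5  0.5 -0.5  0.5 -0.5  1.   1. ]
```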

A project member commented:

I think ultimately we need to move TPR parsing to compiled code. I was holding off on this in the hope that the TNG format (which was meant to hold topology info, etc.) might be a good replacement, but I don't think that's going to happen. So I'd probably take these nice improvements, and ultimately we might have to cythonize this module.

codecov bot commented Mar 28, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: -0.03 ⚠️

Comparison is base (c4f998a) 93.57% compared to head (232c3cf) 93.54%.

Additional details and impacted files
```
@@             Coverage Diff             @@
##           develop    #4098      +/-   ##
===========================================
- Coverage    93.57%   93.54%   -0.03%
===========================================
  Files          191      192       +1
  Lines        25086    25165      +79
  Branches      4051     4057       +6
===========================================
+ Hits         23473    23540      +67
- Misses        1093     1105      +12
  Partials       520      520
```
| Impacted Files | Coverage Δ |
| --- | --- |
| package/MDAnalysis/topology/tpr/obj.py | 81.53% <100.00%> (-18.47%) ⬇️ |
| package/MDAnalysis/topology/tpr/utils.py | 95.72% <100.00%> (+0.24%) ⬆️ |

... and 3 files with indirect coverage changes



@jbarnoud jbarnoud self-assigned this Mar 30, 2023
@jbarnoud (Contributor) commented:

Do you have tests to suggest that would reassure you?

@tylerjereddy (Member Author) commented:

> Do you have tests to suggest that would reassure you?

I guess the easiest thing might be to assume that we're correct at the moment and add regression tests to preserve the current data structures that don't look fully tested, like dihedrals? I'm happy to be wrong here, though, if the feeling is that we're already well covered.

One other thing I should note for Richard before I forget: I noticed that something like 20 seconds was spent remapping string indices in `_StringInternerMixin`. I didn't get anywhere performance-wise with just minor changes, but I'll paste the idea I was playing around with below the fold. I think there were 93 million pure-Python loop hits on some of the lines for my ~25 M particle system.

```diff
--- a/package/MDAnalysis/core/topologyattrs.py
+++ b/package/MDAnalysis/core/topologyattrs.py
@@ -696,6 +696,7 @@ class _StringInternerMixin:
        Mashed together the different implementations to keep it DRY.
     """
 
+    #@profile
     def __init__(self, vals, guessed=False):
         self._guessed = guessed
 
@@ -705,16 +706,14 @@ class _StringInternerMixin:
 
         self.nmidx = np.zeros_like(vals, dtype=int)  # the lookup for each atom
         # eg Atom 5 is 'C', so nmidx[5] = 7, where name_lookup[7] = 'C'
-
-        for i, val in enumerate(vals):
-            try:
-                self.nmidx[i] = self.namedict[val]
-            except KeyError:
-                nextidx = len(self.namedict)
-                self.namedict[val] = nextidx
-                name_lookup.append(val)
-
-                self.nmidx[i] = nextidx
+        unique_val_strings = set(vals)
+        for idx, val in enumerate(unique_val_strings):
+            self.namedict[val] = idx
+            name_lookup.append(val)
+
+        for unique_val in unique_val_strings:
+            mask = (vals == unique_val)
+            self.nmidx[mask] = self.namedict[unique_val]
 
         self.name_lookup = np.array(name_lookup, dtype=object)
         self.values = self.name_lookup[self.nmidx]
diff --git a/package/MDAnalysis/topology/tpr/utils.py b/package/MDAnalysis/topology/tpr/utils.py
index a72fb5dce..90d27b939 100644
--- a/package/MDAnalysis/topology/tpr/utils.py
+++ b/package/MDAnalysis/topology/tpr/utils.py
@@ -354,8 +354,8 @@ def do_mtop(data, fver, tpr_resid_from_one=False):
             res_start_ndx += mt.number_of_residues()
 
     atomids = Atomids(np.array(atomids, dtype=np.int32))
-    atomnames = Atomnames(np.array(atomnames, dtype=object))
-    atomtypes = Atomtypes(np.array(atomtypes, dtype=object))
+    atomnames = Atomnames(np.array(atomnames, dtype="U"))
+    atomtypes = Atomtypes(np.array(atomtypes, dtype="U"))
     charges = Charges(np.array(charges, dtype=np.float32))
     masses = Masses(np.array(masses, dtype=np.float32))
```

There are some comments around the module like "woe betide anyone" who changes this, so probably some messy choices to think about. I wonder if we'd be freer to choose much faster data structure options if we were allowed to break backward compatibility--maybe something to think about longer-term.

@richardjgowers (Member) commented:

@tylerjereddy thanks for the tip on the string slowness. I think the string interning makes selection much faster, but I'll admit that the current construction is probably lacking. I'll have a think about whether there's a way to have the best of both worlds here; I think your example code is in the right direction.
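
One candidate for that best-of-both-worlds construction (a sketch, not part of this PR): np.unique with return_inverse builds both the lookup table and the per-atom index array in a single vectorized call. One caveat: the resulting indices follow sorted order rather than the first-seen order of the current interning loop, which matters if anything relies on index stability:

```python
import numpy as np

vals = np.array(["C", "H", "H", "O", "C"], dtype=object)
name_lookup, nmidx = np.unique(vals, return_inverse=True)
print(name_lookup)         # ['C' 'H' 'O']
print(nmidx)               # [0 1 1 2 0]
print(name_lookup[nmidx])  # reconstructs vals
```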

@orbeckst (Member) commented:

@richardjgowers @jbarnoud could one of you review this PR and just be generally responsible for it?

@orbeckst (Member) commented:

@jbarnoud are you still able to look after this PR?

@jbarnoud (Contributor) commented Mar 29, 2024 via email

@orbeckst orbeckst assigned richardjgowers and unassigned jbarnoud Mar 30, 2024
@orbeckst (Member) commented:

@richardjgowers might you be able to take over from @jbarnoud to look after the PR, please?
