
WIP, ENH: faster TPR topology building for large systems #4098

Open · tylerjereddy wants to merge 3 commits into `develop`
Conversation

@tylerjereddy (Member) commented Mar 28, 2023

Related to gh-4058

  • for the same scenario reported there, this drops the read-in time from 76 seconds to 54 seconds; there are some pretty serious caveats though:
    • I'm not convinced the testsuite, which passes locally, is actually sufficient to catch mistakes here--consider, for example, that I don't add atom indices to impropers, dihedrals, and angles, since the testsuite only caught me not doing it for bonds (if this is a real issue, this should be useful information for the team re: test coverage)
    • some data types may have changed; in general I wasn't particularly sympathetic to preserving types or data structures if the test suite didn't catch me
    • I didn't check whether any of the standard asv benchmarks are sensitive to these changes, nor did I investigate whether the changes tailored to large systems add a large constant overhead for smaller systems, or for systems with different distributions of topological entities (e.g., a larger ratio of molecules to total atoms)

Still a lot of work to do before we're IO-bound though:
[image: profiling output]


📚 Documentation preview 📚: https://readthedocs-preview--4098.org.readthedocs.build/en/4098/

* use an LRU cache to avoid some costly looping
in TPR topology construction for large systems

* this seems to drop the time required to read
in the data for gh-4058 by 10%, though reviewers
may want to verify on their own massive GROMACS
systems; I don't know if we have suitable `asv`
benchmarks that are sensitive to this
* hoist some unnecessarily repeated operations out of the TPR
topology construction loops
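
For context, a minimal sketch of the caching pattern the first bullet describes (the helper name and arguments here are hypothetical, not the actual patch):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def atoms_in_moltype(moltype_key, residue_sizes):
    # hypothetical expensive per-molecule-type computation; arguments
    # must be hashable, so sequences are passed as tuples
    return sum(residue_sizes)

atoms_in_moltype("SOL", (3,))  # computed on the first call
atoms_in_moltype("SOL", (3,))  # served from the cache thereafter
```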

* `impropers`, `dihedrals`, `angles`, and `bonds` data
structures in the TPR construction now partially leverage
NumPy data structures and operations--danger--the testsuite
only appears to be sensitive to `bonds`, and I intentionally
did not carry over the arithmetic from the `remap_` functions
where omitting it didn't cause problems in the testsuite. It
therefore seems likely that we have some testing issues to
address before we can be confident in perf enhancements here:
more tests for proper atom indices in those other data structures
might be needed, unless they get handled as a backup somewhere
else, I suppose.

* I also didn't really pay much attention to anything else
the testsuite wasn't sensitive to--if a comment says that a topology
data structure is a bunch of tuples in a list, I felt free to
ignore that and use a NumPy data structure if the testsuite couldn't
detect a change

* large TPR local read-in time drops from 71 seconds to 65 seconds
* large TPR read-in time drops from 65 seconds
to 54 seconds with the full test suite passing
@github-actions

Linter Bot Results:

Hi @tylerjereddy! Thanks for making this PR. We linted your code and found the following:

Some issues were found with the formatting of your code.

| Code Location | Outcome |
| --- | --- |
| main package | ⚠️ Possible failure |
| testsuite | ⚠️ Possible failure |

Please have a look at the darker-main-code and darker-test-code steps here for more details: https://github.com/MDAnalysis/mdanalysis/actions/runs/4548555943/jobs/8019703680


Please note: the black linter is purely informational; you can safely ignore these outcomes if there are no flake8 failures!

```python
dihedrals.extend(curr_dihedrals)
if mt.impr is not None:
    curr_impropers = np.tile(mt.impr, mb.molb_nmol).reshape(
        (mb.molb_nmol * len(mt.impr), 4))
    impropers.extend(curr_impropers)
```
tylerjereddy (Member Author):

If you look at how, e.g., mt.remap_impr works, I think it may be surprising that I can get away with this re: test coverage.
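
To make the tiling concrete, here is what that expression produces for toy inputs (illustrative values only). Note that the replicated rows come out grouped per improper, and no per-molecule atom offset has been applied here--presumably the kind of arithmetic the `remap_` functions would normally contribute:

```python
import numpy as np

impr = np.array([[0, 1, 2, 3],
                 [1, 2, 3, 4]])  # two impropers within one molecule type
nmol = 3                         # molecule type replicated three times

print(np.tile(impr, nmol).reshape((nmol * len(impr), 4)))
# [[0 1 2 3]
#  [0 1 2 3]
#  [0 1 2 3]
#  [1 2 3 4]
#  [1 2 3 4]
#  [1 2 3 4]]
```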

```python
# but the test suite says otherwise, or at least is not sensitive...
if mt.bonds is not None:
    curr_bonds = np.tile(mt.bonds, mb.molb_nmol).reshape(
        (mb.molb_nmol * len(mt.bonds), 2))
    adder = np.repeat(
        np.arange(0, mt.number_of_atoms() * mb.molb_nmol,
                  mt.number_of_atoms()),
        2 * len(mt.bonds)).reshape(-1, 2) + atom_start_ndx
```
tylerjereddy (Member Author):

good that the testsuite forced me to think carefully here, but apparently not for the quantities below?
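
Since the index arithmetic above is dense, a standalone toy illustration of the remapping idea may help (the numbers and the exact tile/repeat spelling here are illustrative and deliberately simplified relative to the snippet): replicate the molecule's bonds as whole copies, then add a per-molecule atom offset plus the block's starting index:

```python
import numpy as np

bonds = np.array([[0, 1], [1, 2]])  # bonds within one 3-atom molecule
natoms, nmol, start = 3, 2, 10      # two copies; block starts at atom 10

curr = np.tile(bonds, (nmol, 1))    # whole-molecule copies, stacked
offsets = np.repeat(np.arange(0, natoms * nmol, natoms), len(bonds))
print(curr + offsets[:, None] + start)
# [[10 11]
#  [11 12]
#  [13 14]
#  [14 15]]
```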

```python
moltypes.extend([molblock] * (mt.number_of_atoms() * mb.molb_nmol))
charges.extend(np.tile(charges_mol, mb.molb_nmol))
masses.extend(np.tile(masses_mol, mb.molb_nmol))
elements.extend(np.tile(elements_mol, mb.molb_nmol))
```
tylerjereddy (Member Author):

also, as a general comment--this is kind of partial adoption of NumPy, since we could try to embrace NumPy data structures more completely at the higher level rather than extending, but this is where I landed based on the initial bottlenecks from line_profiler
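
As a sketch of that fuller adoption (illustrative only, not part of this PR): gather one array per molecule block and concatenate once at the end, instead of extending a Python list element by element:

```python
import numpy as np

# toy per-block data: (per-molecule charges, number of molecules)
blocks = [(np.array([0.5, -0.5]), 3), (np.array([1.0]), 2)]

per_block = [np.tile(charges_mol, nmol) for charges_mol, nmol in blocks]
charges = np.concatenate(per_block)  # one allocation instead of many appends
print(charges)  # [ 0.5 -0.5  0.5 -0.5  0.5 -0.5  1.   1. ]
```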

A project member commented:

I think ultimately we need to move TPR parsing to compiled code. I was holding off on this in the hope that the TNG format (which was meant to hold topology info, etc.) might be a good replacement, but I don't think that's going to happen. So I'd probably take these nice improvements, and ultimately we might have to cythonize this module.

codecov bot commented Mar 28, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: -0.03 ⚠️

Comparison is base (c4f998a) 93.57% compared to head (232c3cf) 93.54%.

Additional details and impacted files
```
@@             Coverage Diff             @@
##           develop    #4098      +/-   ##
===========================================
- Coverage    93.57%   93.54%   -0.03%
===========================================
  Files          191      192       +1
  Lines        25086    25165      +79
  Branches      4051     4057       +6
===========================================
+ Hits         23473    23540      +67
- Misses        1093     1105      +12
  Partials       520      520
```
| Impacted Files | Coverage Δ |
| --- | --- |
| package/MDAnalysis/topology/tpr/obj.py | 81.53% <100.00%> (-18.47%) ⬇️ |
| package/MDAnalysis/topology/tpr/utils.py | 95.72% <100.00%> (+0.24%) ⬆️ |

... and 3 files with indirect coverage changes



@jbarnoud jbarnoud self-assigned this Mar 30, 2023
@jbarnoud (Contributor) commented:

Do you have tests to suggest that would reassure you?

@tylerjereddy (Member Author) commented:

> Do you have tests to suggest that would reassure you?

I guess the easiest thing might be to assume that we're correct at the moment and add regression tests to preserve the current data structures that don't look fully tested, like dihedrals? I'm happy to be wrong here, though, if the feeling is that we're already well covered.

One other thing I should note for Richard before I forget: I noticed that something like 20 seconds was spent remapping string indices in `_StringInternerMixin`. I didn't get anywhere performance-wise with just minor changes, but I'll paste the idea I was playing around with below the fold. I think there were 93 million pure-Python loop hits on some of the lines for my ~25 M particle system.

```diff
--- a/package/MDAnalysis/core/topologyattrs.py
+++ b/package/MDAnalysis/core/topologyattrs.py
@@ -696,6 +696,7 @@ class _StringInternerMixin:
        Mashed together the different implementations to keep it DRY.
     """
 
+    #@profile
     def __init__(self, vals, guessed=False):
         self._guessed = guessed
 
@@ -705,16 +706,14 @@ class _StringInternerMixin:
 
         self.nmidx = np.zeros_like(vals, dtype=int)  # the lookup for each atom
         # eg Atom 5 is 'C', so nmidx[5] = 7, where name_lookup[7] = 'C'
-
-        for i, val in enumerate(vals):
-            try:
-                self.nmidx[i] = self.namedict[val]
-            except KeyError:
-                nextidx = len(self.namedict)
-                self.namedict[val] = nextidx
-                name_lookup.append(val)
-
-                self.nmidx[i] = nextidx
+        unique_val_strings = set(vals)
+        for idx, val in enumerate(unique_val_strings):
+            self.namedict[val] = idx
+            name_lookup.append(val)
+
+        for unique_val in unique_val_strings:
+            mask = (vals == unique_val)
+            self.nmidx[mask] = self.namedict[unique_val]
 
         self.name_lookup = np.array(name_lookup, dtype=object)
         self.values = self.name_lookup[self.nmidx]
diff --git a/package/MDAnalysis/topology/tpr/utils.py b/package/MDAnalysis/topology/tpr/utils.py
index a72fb5dce..90d27b939 100644
--- a/package/MDAnalysis/topology/tpr/utils.py
+++ b/package/MDAnalysis/topology/tpr/utils.py
@@ -354,8 +354,8 @@ def do_mtop(data, fver, tpr_resid_from_one=False):
             res_start_ndx += mt.number_of_residues()
 
     atomids = Atomids(np.array(atomids, dtype=np.int32))
-    atomnames = Atomnames(np.array(atomnames, dtype=object))
-    atomtypes = Atomtypes(np.array(atomtypes, dtype=object))
+    atomnames = Atomnames(np.array(atomnames, dtype="U"))
+    atomtypes = Atomtypes(np.array(atomtypes, dtype="U"))
     charges = Charges(np.array(charges, dtype=np.float32))
     masses = Masses(np.array(masses, dtype=np.float32))
```

There are some comments around the module like "woe betide anyone" who changes this, so probably some messy choices to think about. I wonder if we'd be freer to choose much faster data structure options if we were allowed to break backward compatibility--maybe something to think about longer-term.

@richardjgowers (Member) commented:

@tylerjereddy thanks for the tip on the string slowness. I think the string interning makes selection much faster, but I'll admit that the current construction is probably lacking. I'll have a think about whether there's a way to have the best of both worlds here; I think your example code is in the right direction.
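
One candidate for that best-of-both-worlds construction (a sketch, not part of this PR): np.unique with return_inverse builds both the lookup table and the per-atom index array in a single vectorized call. One caveat: the resulting indices follow sorted order rather than the first-seen order of the current interning loop, which matters if anything relies on index stability:

```python
import numpy as np

vals = np.array(["C", "H", "H", "O", "C"], dtype=object)
name_lookup, nmidx = np.unique(vals, return_inverse=True)
print(name_lookup)         # ['C' 'H' 'O']
print(nmidx)               # [0 1 1 2 0]
print(name_lookup[nmidx])  # reconstructs vals
```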

@orbeckst (Member) commented:

@richardjgowers @jbarnoud could one of you review this PR and just be generally responsible for it?

@orbeckst (Member) commented:

@jbarnoud are you still able to look after this PR?

@jbarnoud (Contributor) commented Mar 29, 2024 via email

@orbeckst orbeckst assigned richardjgowers and unassigned jbarnoud Mar 30, 2024
@orbeckst (Member) commented:

@richardjgowers might you be able to take over from @jbarnoud to look after the PR, please?
