WIP, ENH: faster TPR topology building for large systems #4098
base: develop
Conversation
* use an LRU cache to avoid some costly looping in TPR topology construction for large systems
* this seems to drop the time required to read in the data for gh-4058 by 10 %, though reviewers may want to verify on their own massive GROMACS systems; I don't know if we have suitable `asv` benchmarks that are sensitive to this
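The diff itself isn't reproduced here, so as a rough illustration only (all names hypothetical, not the PR's actual code): the caching idea is that `functools.lru_cache` can memoize a pure per-molecule-type computation, so the per-molecule-block loop pays the cost once per *type* rather than once per block.

```python
from functools import lru_cache

# Hypothetical sketch of the LRU-cache idea, not the PR's actual code:
# memoize an expensive pure function of a molecule type so repeated
# blocks of the same type reuse the cached result.
@lru_cache(maxsize=None)
def per_type_atom_count(residue_sizes):
    # residue_sizes must be hashable (a tuple) for lru_cache to work
    return sum(residue_sizes)

# (type's per-residue atom counts, number of molecules of that type);
# note the first and third blocks share a molecule type.
blocks = [((3, 3), 1000), ((5,), 500), ((3, 3), 2500)]
total_atoms = sum(per_type_atom_count(sizes) * nmol for sizes, nmol in blocks)
print(total_atoms)
```

The third block hits the cache because its `residue_sizes` tuple matches the first block's, which is the effect the PR is after in the TPR construction loop.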
* hoist some unnecessarily repeated operations in the TPR topology construction loops
* the `impropers`, `dihedrals`, `angles`, and `bonds` data structures in the TPR construction now partially leverage NumPy data structures and operations. Danger: the testsuite only appears to be sensitive to `bonds`, and I intentionally did not carry over the arithmetic from the `remap_` functions when omitting it didn't cause test failures. It seems likely that we have some testing issues to address before we can be confident in perf enhancements here; more tests for proper atom indices in those other data structures might be needed, or perhaps they get handled as a backup somewhere else, I suppose
* I also didn't pay much attention to anything else the testsuite wasn't sensitive to: if a comment says that a topology data structure is a bunch of tuples in a list, I felt free to ignore that and use a NumPy data structure if the testsuite couldn't detect a change
* large TPR local read-in time drops from 71 seconds to 65
* large TPR read-in time drops from 65 seconds to 54 with full test suite passing
Linter Bot Results: Hi @tylerjereddy! Thanks for making this PR. We linted your code and found the following: Some issues were found with the formatting of your code.
```python
        dihedrals.extend(curr_dihedrals)
    if mt.impr is not None:
        curr_impropers = np.tile(mt.impr, mb.molb_nmol).reshape((mb.molb_nmol * len(mt.impr), 4))
        impropers.extend(curr_impropers)
```
If you look at how, e.g., `mt.remap_impr` works, I think it may be surprising that I can get away with this re: test coverage.
```python
        # but the test suite says otherwise, or at least is not sensitive...
    if mt.bonds is not None:
        curr_bonds = np.tile(mt.bonds, mb.molb_nmol).reshape((mb.molb_nmol * len(mt.bonds), 2))
        adder = np.repeat(np.arange(0, mt.number_of_atoms() * mb.molb_nmol, mt.number_of_atoms()), 2 * len(mt.bonds)).reshape(-1, 2) + atom_start_ndx
```
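For readers checking the index arithmetic, here is a small self-contained sketch of the vectorized remapping idea with made-up numbers (written molecule-major, not the PR's literal code): tile the per-molecule bond list once per copy, then add each copy's atom offset, and confirm it matches the naive per-molecule loop.

```python
import numpy as np

# Toy molecule type: 3 atoms, 2 bonds; 4 copies starting at global atom 10.
bonds = np.array([[0, 1], [1, 2]])
natoms, nmol, atom_start_ndx = 3, 4, 10

# Vectorized remap: replicate the bond rows molecule-major, then add each
# copy's atom offset (copy m starts at atom_start_ndx + m * natoms).
curr_bonds = np.tile(bonds, (nmol, 1))
offsets = np.repeat(np.arange(nmol) * natoms, len(bonds))[:, None]
remapped = curr_bonds + offsets + atom_start_ndx

# Naive loop doing the same remapping one molecule at a time.
naive = np.array([[i + m * natoms + atom_start_ndx for i in bond]
                  for m in range(nmol) for bond in bonds])
print(np.array_equal(remapped, naive))
```

This kind of equivalence check against the old loop is what a regression test for `angles`, `dihedrals`, and `impropers` could assert as well.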
good that the testsuite forced me to think carefully here, but not so for the quantities below?
```python
    moltypes.extend([molblock] * (mt.number_of_atoms() * mb.molb_nmol))
    charges.extend(np.tile(charges_mol, mb.molb_nmol))
    masses.extend(np.tile(masses_mol, mb.molb_nmol))
    elements.extend(np.tile(elements_mol, mb.molb_nmol))
```
also, as a general comment: this is kind of a partial adoption of NumPy, since we could try to embrace NumPy data structures more completely at the higher level rather than extending lists, but this is where I landed based on the initial bottlenecks from `line_profiler`
I think ultimately we need to move TPR parsing into being compiled. I was holding off on this in the hope that the TNG format (which was meant to hold topology info etc.) might be a good replacement, but I don't think that's going to happen. So I'd probably take these nice improvements, and ultimately we might have to cythonize this module.
Codecov Report: Patch coverage:
Additional details and impacted files

```
@@ Coverage Diff @@
##           develop    #4098      +/-   ##
===========================================
- Coverage    93.57%   93.54%    -0.03%
===========================================
  Files          191      192        +1
  Lines        25086    25165       +79
  Branches      4051     4057        +6
===========================================
+ Hits         23473    23540       +67
- Misses        1093     1105       +12
  Partials       520      520
```

... and 3 files with indirect coverage changes. ☔ View full report in Codecov by Sentry.
Do you have tests to suggest that would reassure you?
I guess the easiest thing might be to assume that we're correct at the moment and add regression tests to preserve the current data structures that don't look fully tested. One other thing I should note for Richard before I forget: I noticed something like 20 seconds was spent remapping string indices and related bookkeeping, so I experimented with the following patch:

```diff
--- a/package/MDAnalysis/core/topologyattrs.py
+++ b/package/MDAnalysis/core/topologyattrs.py
@@ -696,6 +696,7 @@ class _StringInternerMixin:
     Mashed together the different implementations to keep it DRY.
     """
+    #@profile
     def __init__(self, vals, guessed=False):
         self._guessed = guessed
@@ -705,16 +706,14 @@ class _StringInternerMixin:
         self.nmidx = np.zeros_like(vals, dtype=int)  # the lookup for each atom
         # eg Atom 5 is 'C', so nmidx[5] = 7, where name_lookup[7] = 'C'
-
-        for i, val in enumerate(vals):
-            try:
-                self.nmidx[i] = self.namedict[val]
-            except KeyError:
-                nextidx = len(self.namedict)
-                self.namedict[val] = nextidx
-                name_lookup.append(val)
-
-                self.nmidx[i] = nextidx
+        unique_val_strings = set(vals)
+        for idx, val in enumerate(unique_val_strings):
+            self.namedict[val] = idx
+            name_lookup.append(val)
+
+        for unique_val in unique_val_strings:
+            mask = (vals == unique_val)
+            self.nmidx[mask] = self.namedict[unique_val]
         self.name_lookup = np.array(name_lookup, dtype=object)
         self.values = self.name_lookup[self.nmidx]

diff --git a/package/MDAnalysis/topology/tpr/utils.py b/package/MDAnalysis/topology/tpr/utils.py
index a72fb5dce..90d27b939 100644
--- a/package/MDAnalysis/topology/tpr/utils.py
+++ b/package/MDAnalysis/topology/tpr/utils.py
@@ -354,8 +354,8 @@ def do_mtop(data, fver, tpr_resid_from_one=False):
     res_start_ndx += mt.number_of_residues()

     atomids = Atomids(np.array(atomids, dtype=np.int32))
-    atomnames = Atomnames(np.array(atomnames, dtype=object))
-    atomtypes = Atomtypes(np.array(atomtypes, dtype=object))
+    atomnames = Atomnames(np.array(atomnames, dtype="U"))
+    atomtypes = Atomtypes(np.array(atomtypes, dtype="U"))
     charges = Charges(np.array(charges, dtype=np.float32))
     masses = Masses(np.array(masses, dtype=np.float32))
```

There are some comments around the module like "woe betide anyone" who changes this, so probably some messy choices to think about. I wonder if we'd be more free to choose much faster data structure options if we were allowed to break back compat; maybe something to think about longer-term.
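As a possible further simplification of the interning loop (an untested idea, not part of the patch above): `np.unique` with `return_inverse=True` builds both the name table and the per-atom index array in one call. Note it assigns indices in sorted order rather than first-seen order, which could matter for back compat.

```python
import numpy as np

# Sketch: vectorized string interning via np.unique. The lookup table
# comes out sorted, unlike the original first-seen insertion order.
vals = np.array(["C", "H", "H", "O", "C"], dtype=object)
name_lookup, nmidx = np.unique(vals, return_inverse=True)
values = name_lookup[nmidx]  # round-trips back to the original values
print(name_lookup.tolist(), nmidx.tolist())
```

This replaces both loops in the patched `__init__` with a single call, at the cost of a sort over the values.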
@tylerjereddy thanks for the tip on the string slowness. I think the string interning makes selection much faster, but I'll admit that the current construction is probably lacking. I'll have a think about whether there's a way to have the best of both worlds here; I think your example code is in the right direction.
@richardjgowers @jbarnoud could one of you review this PR and just be generally responsible for it?
@jbarnoud are you still able to look after this PR?
I have been very optimistically keeping the GitHub notification on my list for many months, but it is very unlikely I'll find time to look at it.
…On 29 March 2024 23:04:54 CET, Oliver Beckstein ***@***.***> wrote:
@jbarnoud are you still able to look after this PR?
@richardjgowers might you be able to take over from @jbarnoud to look after the PR, please?
Related to gh-4058.

* I did not carry over the `remap_` index arithmetic for `impropers`, `dihedrals`, and `angles`, since the testsuite only caught me not doing it for `bonds` (if this is a real issue, this should be useful information for the team re: test coverage)
* I don't know whether our `asv` benchmarks are sensitive to these changes, nor did I investigate whether the large system-tailored changes here add a large constant overhead for smaller systems, or systems with different distributions of topological entities (i.e., larger num molecules to total atom ratio, etc.)

Still a lot of work to do before we're IO-bound though:
📚 Documentation preview 📚: https://readthedocs-preview--4098.org.readthedocs.build/en/4098/