-
Notifications
You must be signed in to change notification settings - Fork 637
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GSoC] Parallelisation of AnalysisBase with multiprocessing and dask #4162
base: develop
Are you sure you want to change the base?
Conversation
Linter Bot Results:Hi @marinegor! Thanks for making this PR. We linted your code and found the following: Some issues were found with the formatting of your code.
Please have a look at the Please note: The |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## develop #4162 +/- ##
===========================================
- Coverage 93.38% 93.23% -0.15%
===========================================
Files 171 12 -159
Lines 21744 1079 -20665
Branches 4014 0 -4014
===========================================
- Hits 20305 1006 -19299
+ Misses 952 73 -879
+ Partials 487 0 -487 ☔ View full report in Codecov by Sentry. |
Great to see things are moving @marinegor! I'll have a proper look at this PR when back from holidays, but I just wanted to point out that GitHub is not very smart, so "Does not fix #4162" will actually automatically close #4162 when this is merged. "Related to #4162" or something similar (with no "fix" before the PR number) would avoid this issue. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good start. I left a few comments on the code.
@yuxuanzhuang any initial comments so that @marinegor can move forward? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm done with my replies!
I put in my calendar for early next week "review dask PR". |
I installed a Python 3.11 conda env (macOS) with this branch and I am trying out the example from the docs https://mdanalysis--4162.org.readthedocs.build/en/4162/documentation_pages/analysis_modules.html import multiprocessing
import MDAnalysis as mda
from MDAnalysisTests.datafiles import PSF, DCD
from MDAnalysis.analysis.rms import RMSD
from MDAnalysis.analysis.align import AverageStructure
# initialize the universe
u = mda.Universe(PSF, DCD)
# calculate average structure for reference
avg = AverageStructure(mobile=u).run()
ref = avg.results.universe
# initialize RMSD run
rmsd = RMSD(u, ref, select='backbone')
rmsd.run(backend='multiprocessing', n_workers=multiprocessing.cpu_count()) I am running in a Jupyter lab notebook. When I imported MDAnalysis, I got the warning
The However, I get the following messages: When I ran the first time:
When I ran a second time the same command rmsd.run(backend='multiprocessing', n_workers=multiprocessing.cpu_count()) no such messages appeared. When I run with a different number of workers rmsd.run(backend='multiprocessing', n_workers=2) the following shorter messages appear:
Run the same again:
Run again
I am fairly confused what this is about. Can someone else reproduce? But I know that I don't want our users to see these kind of messages. Note that running in a script #!/usr/bin/env python
# -*- coding: utf-8 -*-
# https://github.com/MDAnalysis/mdanalysis/pull/4162
import multiprocessing
import MDAnalysis as mda
from MDAnalysisTests.datafiles import PSF, DCD
from MDAnalysis.analysis.rms import RMSD
from MDAnalysis.analysis.align import AverageStructure
if __name__ == "__main__":
# must all be in __main__, otherwise multiprocessing loops indefinitely
# initialize the universe
u = mda.Universe(PSF, DCD)
# calculate average structure for reference
avg = AverageStructure(mobile=u).run()
ref = avg.results.universe
# initialize RMSD run
rmsd = RMSD(u, ref, select='backbone')
rmsd.run(backend='multiprocessing', n_workers=multiprocessing.cpu_count()) does not produce the above messages, so it may be specific for jupyter lab. |
I can reproduce these messages with my Mac (but not in Linux by default) in jupyter. This issue is associated with macOS using The part that doesn't play well with multiprocessing in the Jupyter Notebook environment is the |
Good detective work, @yuxuanzhuang ! @marinegor mentioned that progress bar is not available in parallel analysis, so I didn't expect the bar to appear. (And if I try to get the progress bar with |
On the note of progress bars: Why do I see multiple progress bars when I run with the serial backend but accidentally set It's fine that I get the warning about |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am test-driving the PR locally and check what a user would do. Apart from a simple doc fix, so far I found the two issues that I commented on, both inside Jupyter Lab (which is commonly used):
- highly confusing messages that @yuxuanzhuang tracked down to how multiprocessing spawns on macOS (see [GSoC] Parallelisation of AnalysisBase with multiprocessing and dask #4162 (comment)). No error/warning messages should be issued unless the code that is being run fails.
- Multiple progress bars are shown with the serial backend when
n_workers
is set (see [GSoC] Parallelisation of AnalysisBase with multiprocessing and dask #4162 (comment)). There should be only one progress bar in serial.
If I find other things I will add them here.
often trivially parallelizable, meaning you can analyze all frames | ||
independently, and then merge them in a single object. This approach is also | ||
known as "split-apply-combine", and isn't new to MDAnalysis users, since it was | ||
first introduced in [pmda](https://github.com/mdanalysis/pmda). Version 2.8.0 of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix link as inline reST link
first introduced in [pmda](https://github.com/mdanalysis/pmda). Version 2.8.0 of | |
first introduced in `PMDA <https://github.com/mdanalysis/pmda>`_. Version 2.8.0 of |
package/MDAnalysis/analysis/base.py
Outdated
@classmethod | ||
def _is_parallelizable(cls): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this a method? Can't we make it a managed attribute ... or just a class attribute?
I find it confusing when is_XXX
has a call signature.
Furthermore, in transformations.TransformationBase
we have the attribute parallelizable
self.parallelizable = kwargs.pop('parallelizable', True) |
_is_parallelizable
into the attribute parallelizable
. It should not be a class attribute of the base class to avoid someone accidentally changing it in a derived class without having overriden it first (which would ultimately mess up the parent class) — or make it a managed attribute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, it's a valid concern.
It first was a @property @classmethod
, but:
Class properties are deprecated in Python 3.11 and will not be supported in Python 3.13 // Pylance
So I changed it to be a @classmethod
only. The (initial) reason why it's not an attribute is that there's no such thing as @classattribute
, and I wanted parallelizable
to be a readonly property of a class, and not of its instance, since we don't always know the parameters of the class __init__
to check if it's parallelizable.
As for now, I see two possible solutions to it (see also a gist):
- have
__is_parallelizable
attribute of allAnalysisBase
subclasses, and a@property
that returns it. The downside is that someone could change it to beTrue
, and everything will break -- but I guess that's what you get when you change double-underscore parameter which isn't supposed to be changed. - rename
_is_parallelizable
to something likeget_parallelizability()
and leave it as@classmethod
, hence maintaining the semantics that you prefer, i.e. a verb-containing method is callable.
Also, we wanted to have a default AnalysisBase
attribute/property/method/whatever to make parallelization opt-in, so I can't see a way to make it instantiated in __init__
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's just not mess around with double-underscore attributes. If we really wanted to we could un-do the dunder-name mangling but given that Python just does not allow us to really lock away anything, just use convention and if users want to mess up things then they are allowed to. I suggest we use properties with simple class attributes in the following way:
class AnalysisBase:
# class authors: override _analysis_algorithm_is_parallelizable
# in derived classes and only set to True if you have confirmed
# that your algorithm works reliably when parallelized with
# the split-apply-combine approach (see docs)
_analysis_algorithm_is_parallelizable = False
def __init__(self, traj, **kwargs):
...
@property
def parallelizable(self):
"""Read-only attribute that indicates if this analysis can be
performed in parallel.
.. warning::
The author of the code has to determine if the algorithm
can be run in parallel. If users change this setting in any
way then **wrong results** will likely be produced.
"""
return self._analysis_algorithm_is_parallelizable
and then a parallelizable derived class can just use
class MyParallelAnalysis(AnalysisBase):
_analysis_algorithm_is_parallelizable = True
Enterprising users will be able to change _analysis_algorithm_is_parallelizable
but we tell them they shouldn't so that becomes their problem, not ours.
Using an un-documented single underscore attr with a long name should also make clear that this is not something that users ought to touch.
At the end of the day, all of Python APIs are just conventions so I am ok with doing this by convention.
- add reST/sphinx markup for methods and classes and ensure that (most of them) resolve; add intersphinx mapping to dask docs - added cross referencing between parallelization and backends docs - restructured analysis landing page with additional numbered headings for general use and parallelization - add citation for PMDA - fixed links - edited text for flow and readability - added SeeAlsos (eg for User Guide) - added notes/warnings
@classmethod | ||
def is_parallelizable(self): | ||
return True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Propose to make this an attribute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See discussion above
results = dask.compute(computations, | ||
scheduler="processes", | ||
chunksize=1, | ||
n_workers=self.n_workers)[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dask.compute
only recognizes num_workers
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed, thanks.
@@ -186,95 +186,99 @@ def correct_values_backbone_group(self): | |||
return [[0, 1, 0, 0, 0], | |||
[49, 50, 4.6997, 1.9154, 2.7139]] | |||
|
|||
def test_rmsd(self, universe, correct_values): | |||
def test_rmsd(self, universe, correct_values, client_RMSD): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the client_RMSD
fixture is used for the first time, it's a bit mysterious where it comes from and what it does because it's not defined in the same file.
Add a comment explaining where client_RMSD
is defined and what it does.
def test_parallelizable_transformations(): | ||
from MDAnalysis.transformations import NoJump | ||
u = mda.Universe(XTC) | ||
u.trajectory.add_transformations(NoJump(parallelizable=True)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what the test is doing.
It does not really make sense to set parallelizable=True
for NoJump
because it is inherently not parallelizable.
And even if parallelizable=True
then why does the parallel execution below fail with ValueError?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not really make sense to set parallelizable=True for NoJump because it is inherently not parallelizable.
I just wanted to have any transformation that would allow me to set this attribute. Should I pick another one? Though I changed it to parallelizable=False
, so I think it's not necessary.
And even if parallelizable=True then why does the parallel execution below fail with ValueError?
You're right, it should be that a) NoJump(parallelizable=False)
b) serial works and c) serial fails. I changed the code of AnalysisBase
and the test.
({'stop': 30}, np.arange(30)), | ||
({'step': 10}, np.arange(0, 98, 10)) | ||
]) | ||
def test_start_stop_step_parallel(u, run_kwargs, frames, client_FrameAnalysis): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment describing where client_FrameAnalysis
is defined and what it does.
Global fixture definitions in pytest seem necessary but they make for very unreadable code — too much magic going on. Thus, comments are necessary.
@@ -332,15 +396,15 @@ def test_results_type(u): | |||
(20, 50, 2, 15), | |||
(20, 50, None, 30) | |||
]) | |||
def test_AnalysisFromFunction(u, start, stop, step, nframes): | |||
def test_AnalysisFromFunction(u, start, stop, step, nframes, client_AnalysisFromFunction): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment about location of definition for client_AnalysisFromFunction
and purpose.
Once there are a few of these comments, someone new to the tests will pick up the pattern.
- mark analysis docs as documenting MDAnalysis.analysis so that references resolve properly - link fixes
package/MDAnalysis/analysis/rms.py
Outdated
@classmethod | ||
def _is_parallelizable(cls): | ||
return True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current documentation https://mdanalysis--4162.org.readthedocs.build/en/4162/documentation_pages/analysis/parallelization.html#adding-parallelization-to-your-own-analysis-class states that the method to change is called is_parallelizable()
(no leading underscore):
@classmethod
def is_parallelizable(self):
return True
Docs and code need to be consistent.
(In any case, as said in other comments, I'd prefer an attribute parallelizable
to be consistent with the transformations.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed; also see discussion above.
@@ -2560,3 +2561,14 @@ def no_copy_shim(): | |||
else: | |||
copy = False | |||
return copy | |||
|
|||
|
|||
def is_installed(modulename: str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this a new function, add an explicit test for it.
I think, at the moment we get coverage for it because it is used in code but that's not a test of the function itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a test.
I edited the docs directly instead of making a bunch of comments. In addition to my review, I added additional comments. Please have a look. My only major API question is how to mark an analysis as parallelizable. In the current PR we use a class function |
…n), fixed tests for `is_installed`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comments regarding _is_parallelizable
(and fixed documentation), fixed tests for is_installed
.
results = dask.compute(computations, | ||
scheduler="processes", | ||
chunksize=1, | ||
n_workers=self.n_workers)[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed, thanks.
""" | ||
for check, msg in self._get_checks().items(): | ||
if not check: | ||
raise ValueError(msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see, so someone could use this function to know how they should configure their backend?
Yes, more or less -- or try to configure it few times without explicitly catching an exception.
I think the usual way to do this would be with docs and tutorials showing to configure the backend
I don't mind changing it, but still think that this way is slightly more flexible to the potential developer.
package/MDAnalysis/analysis/base.py
Outdated
@classmethod | ||
def _is_parallelizable(cls): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, it's a valid concern.
It first was a @property @classmethod
, but:
Class properties are deprecated in Python 3.11 and will not be supported in Python 3.13 // Pylance
So I changed it to be a @classmethod
only. The (initial) reason why it's not an attribute is that there's no such thing as @classattribute
, and I wanted parallelizable
to be a readonly property of a class, and not of its instance, since we don't always know the parameters of the class __init__
to check if it's parallelizable.
As for now, I see two possible solutions to it (see also a gist):
- have
__is_parallelizable
attribute of allAnalysisBase
subclasses, and a@property
that returns it. The downside is that someone could change it to beTrue
, and everything will break -- but I guess that's what you get when you change double-underscore parameter which isn't supposed to be changed. - rename
_is_parallelizable
to something likeget_parallelizability()
and leave it as@classmethod
, hence maintaining the semantics that you prefer, i.e. a verb-containing method is callable.
Also, we wanted to have a default AnalysisBase
attribute/property/method/whatever to make parallelization opt-in, so I can't see a way to make it instantiated in __init__
.
package/MDAnalysis/analysis/rms.py
Outdated
@classmethod | ||
def _is_parallelizable(cls): | ||
return True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed; also see discussion above.
@@ -2560,3 +2561,14 @@ def no_copy_shim(): | |||
else: | |||
copy = False | |||
return copy | |||
|
|||
|
|||
def is_installed(modulename: str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a test.
@classmethod | ||
def is_parallelizable(self): | ||
return True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See discussion above
Fixes #4158
Also fixes # 4259 as a check inno it doesn't, I'll do another PR that does that.AnalysisBase.run()
Related to #4158, does not help f-i-x-i-n-g (cudos to github bot) the issue per se but paves the way towards that.Changes made in this Pull Request:
analysis.base.AnalysisBase
, implementing backend configuration, splitting of the frames for analysis, computation, and results aggregationanalysis.backends
introducesBackendBase
class, as well as built-in backendsBackendMultiprocessing
,BackendSerial
andBackendDask
, implementing theapply
method for computations using various backendsanalysis.results
introducesResultsGroup
class that allows for merging of multiple uniformResults
objects from the same module, given appropriate aggregation functionsPR Checklist
📚 Documentation preview 📚: https://mdanalysis--4162.org.readthedocs.build/en/4162/