
Future of osx GPU support #2489

Open · jchodera opened this issue Nov 28, 2019 · 198 comments

@jchodera
Copy link
Member

Now that Apple has deprecated OpenCL support in osx 10.14 and NVIDIA will no longer provide CUDA support for osx after CUDA 10.2, OpenMM will be left with no officially supported GPU-accelerated platform on osx going forward.

What's our plan for future GPU support on osx? Should we look into implementing a Metal platform, or into extending the new unified OpenCL/CUDA platform to support it?

cc: #2486 #2471

@peastman
Copy link
Member

They've officially declared OpenCL to be "deprecated", but it's not clear what that means. They also declared OpenGL to be deprecated, but it's hard to imagine them actually removing it. After all, it's the basis of WebGL, a standard that every browser needs to support. So it's not clear whether Apple really plans to remove either one, or if this is just a publicity move to push people toward Metal.

Given that uncertainty, I'm hesitant to take any concrete steps right now. Hopefully at some point it will become clear what their real plans are. If it turns out they really do plan to remove OpenCL, we can evaluate options at that point, which could include adopting a third party OpenCL implementation, adding Metal support, or just dropping support for GPUs on Macs.

@theschles
Copy link

@peastman @jchodera I joined F@H after I read that F@H was doing COVID-19 research. I have a 2017 MacBook Pro with a built-in Radeon Pro 560 that's idling because F@H doesn't currently support GPU folding on OSX. I'm sure there are plenty of people who bought Radeon-equipped MBPs as well who would love to partially dedicate their GPUs to this effort.

(Yes, I did read the posts about how support was discontinued and how Apple is deprecating OpenCL and replacing it with Metal.)

I would do the fixes myself but my only experience with hardware programming is Arduino -- I'm primarily a web engineer / dev ops.

Who can I lobby to get movement on this? I can be very politely persistent. Is there a specific person at ATI/AMD and/or Apple I should contact to get things moving?

Paste from my System Info below with exact details of my hardware:

  Chipset Model:	Radeon Pro 560
  Type:	GPU
  Bus:	PCIe
  PCIe Lane Width:	x8
  VRAM (Total):	4 GB
  Vendor:	AMD (0x1002)
  Device ID:	0x67ef
  Revision ID:	0x00c0
  ROM Revision:	113-C980AJ-927
  VBIOS Version:	113-C9801AU-A02
  EFI Driver Version:	01.A0.927
  Automatic Graphics Switching:	Supported
  gMux Version:	4.0.29 [3.2.8]
  Metal:	Supported, feature set macOS GPUFamily2 v1
  Displays:
Duet:
  Resolution:	2732 x 2048
  UI Looks like:	1366 x 1024
  Framebuffer Depth:	24-Bit Color (ARGB8888)
  Main Display:	Yes
  Automatically Adjust Brightness:	No

@jchodera
Copy link
Member Author

@theschles: Thanks for your interest in helping out in the Folding@Home effort!

While OpenMM currently supports osx GPUs through the OpenCL platform, we just haven't had the developer effort to build osx cores. We'd need the help of a C++ programmer with extensive osx expertise to craft the build automation for osx core builds using virtual machine images.

@theschles
Copy link

theschles commented Mar 31, 2020

How's this for starters?

@theschles
Copy link

theschles commented Mar 31, 2020

Another thought: should I try booting a Linux LiveUSB distribution that offers support for AMD GPUs and then run the F@H client from there? I could run that overnight...

@peastman
Copy link
Member

That certainly ought to work. I'd be curious what sort of performance it gets.

@DurvalMenezes
Copy link

DurvalMenezes commented Apr 12, 2020

Folks, there's something wrong; according to Folding@Home's own "OS Stats" page, there are exactly 65 active MacOSX AMD GPUs right now:

[Screenshot: Folding@Home "OS Stats" page showing active GPUs by operating system]

How come?!

Whatever it is that these folks are doing, I want in too: I sysadmin a MacOSX server with two very capable AMD GPUs that are completely idle because the machine is headless (ie, has no keyboard/mouse/display) and is only used remotely via SSH... would love to contribute this otherwise wasted resource to the F@H COVID-19 effort. And I know some other people in the same situation...

@jchodera
Copy link
Member Author

Huh! We'll look into the stats to see what's going on here.

I'm certainly super excited to build an osx core, but we'll need some free developer bandwidth and likely some help to automate the core builds. We're trying to get more devs onboarded soon, so stay tuned.

@DurvalMenezes
Copy link

@jchodera Thanks for your response.

I'm certainly super excited to build an osx core

That makes two of us! ;-)

but we'll need some free developer bandwidth and likely some help to automate the core builds. We're trying to get more devs onboarded soon, so stay tuned.

Please count me in! Just sent you an email with my contact info and experience details.

Cheers,
-- Durval.


@DurvalMenezes
Copy link

Hello @jchodera,

Huh! We'll look into the stats to see what's going on here.

More data has been posted in a Folding@Home forum thread discussing the matter; please see from this post onward: https://foldingforum.org/viewtopic.php?f=83&t=32895&p=325535#p325008

Cheers,
-- Durval.

@theschles
Copy link

Hi all, I bit the bullet and decided to take the Boot Camp plunge until native MacOS support is ready. I'm now running FAH on Windows with ~1.5x the points/day, with full use of my 2017 MBP's Radeon Pro 560; GPU-Z shows ~100% load on the GPU.

@theschles
Copy link

I'm certainly super excited to build an osx core, but we'll need some free developer bandwidth and likely some help to automate the core builds. We're trying to get more devs onboarded soon, so stay tuned.

Hi @jchodera -- any luck since April with building in GPU-on-MacOS support?

peastman closed this as completed on Dec 3, 2020
@DurvalMenezes
Copy link

@peastman, why was this closed? Is it implemented?

@peastman
Copy link
Member

peastman commented Dec 3, 2020

There's nothing to implement. We decided to stick with OpenCL, and Apple has given no indication they plan to remove it. It's even supported on the new ARM based Macs. So unless something changes in the future, there's nothing we need to do.
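
For anyone who wants to verify this on their own Mac, a minimal sketch of explicitly requesting the OpenCL platform from the Python API might look like the following (the one-particle System is just a placeholder):

    from openmm import Context, Platform, System, VerletIntegrator
    from openmm.unit import picoseconds

    # Placeholder system: a single particle with no forces.
    system = System()
    system.addParticle(1.0)

    # Request the OpenCL platform explicitly rather than letting OpenMM pick one.
    platform = Platform.getPlatformByName('OpenCL')
    context = Context(system, VerletIntegrator(0.001*picoseconds), platform)
    print(context.getPlatform().getName())   # prints 'OpenCL' if the platform loaded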

@theschles
Copy link

@peastman hang on -- if OpenCL continues to work on MacOS Catalina and Big Sur, why doesn't Folding@Home use my AMD GPU, then?

@jchodera
Copy link
Member Author

@theschles : We haven't yet had a chance to build an osx OpenMM-based core for Folding@home! We're working toward this, though. Thanks to @dotsdl, we're automating the core builds and simplifying the build process to make it easier to expand to other architectures, including osx. We don't have a definite date yet, but hopefully soon!

@theschles
Copy link

@jchodera ok good! If you need a beta tester, please let me know!

@theschles
Copy link

@jchodera ok good! If you need a beta tester, please let me know!

Hi @jchodera and @dotsdl -- following up -- any updates on an OSX GPU OpenMM-based core for Folding@Home?

@jchodera
Copy link
Member Author

@theschles : Thanks to @dotsdl, we have migrated the build infrastructure to use the OpenMM conda-forge packages, which do support both osx-64 and osx-arm64 architectures! We'll be able to give osx builds a try shortly after we tackle the update of win/linux cores to the latest OpenMM 7.6.0.

If you're still willing to be a beta tester, please reach out to me at my email address listed here and we'll hook you up!
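
As a quick sanity check that a conda-forge install exposes the expected platforms on your machine, something like this should work (assuming OpenMM >= 7.6, which imports as openmm rather than simtk.openmm):

    import openmm

    print(openmm.version.version)
    print([openmm.Platform.getPlatform(i).getName()
           for i in range(openmm.Platform.getNumPlatforms())])
    # 'python -m openmm.testInstallation' runs a more thorough check of each platform.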

@jankais3r
Copy link

If you're still willing to be a beta tester, please reach out to me at my email address listed here and we'll hook you up!

Hi John, what's the current status of the above? I reached out via email a few months ago but never heard back...

@DurvalMenezes
Copy link

If you're still willing to be a beta tester, please reach out to me at my email address listed here and we'll hook you up!

Hi John, what's the current status of the above? I reached out via email a few months ago but never heard back...

Ditto here. After not hearing anything back, I thought the Mac port had just been abandoned. Even though I'm about to lose access to my 2-GPU Mac Pro, I'd love to hear otherwise...

@jchodera
Copy link
Member Author

jchodera commented Feb 26, 2022 via email

@jchodera
Copy link
Member Author

jchodera commented Feb 26, 2022 via email

@swails
Copy link
Contributor

swails commented Feb 26, 2022

shoot me an email at @.***

I want a vanity email address, too!

@jchodera
Copy link
Member Author

jchodera commented Feb 26, 2022 via email

@philipturner
Copy link
Contributor

I'd be interested in helping you guys maintain support for Apple GPUs. What would it take to help you test it?

@peastman
Copy link
Member

It would be great if we could get the M1 test runner working again so it would be tested on CI. I do a lot of my development on an M1 Pro MacBook Pro, so it still gets a lot of testing.

@philipturner
Copy link
Contributor

philipturner commented Aug 14, 2022

I have an M1 Max MacBook Pro, and I'm quite skilled at debugging software. I also have a lot of expertise with Metal and plans for highly performant FP64 emulation. It's best to just hyperlink so I don't repeat myself:

https://gromacs.bioexcel.eu/t/gpu-acceleration-on-mac-m1-mini/2938/5

I don't think OpenMM is going to deprecate OpenCL, but I would like to see if there are any pain points I could alleviate for your team. I read that Apple's Metal team tried getting you to switch to Metal - did they help you with CI stuff?

@theschles
Copy link

I have an M1 Max MacBook Pro, and I'm quite skilled at debugging software. I also have a lot of expertise with Metal and plans for highly performant FP64 emulation. It's best to just hyperlink so I don't repeat myself:

https://gromacs.bioexcel.eu/t/gpu-acceleration-on-mac-m1-mini/2938/5

I don't think OpenMM is going to deprecate OpenCL, but I would like to see if there are any pain points I could alleviate for your team. I read that Apple's Metal team tried getting you to switch to Metal - did they help you with CI stuff?

Yes please help. I have a 2017 MBP with a Radeon Pro 560 4GB that's been waiting for someone to get this going again.

Please look for @jchodera 's email address at https://www.choderalab.org/members to get involved.

@philipturner
Copy link
Contributor

My computer also has 16 GPU cores. At least, I'm pretty sure it does. Querying CL_DEVICE_MAX_COMPUTE_UNITS in OpenCL returns 16.

We had a lot of discussions about performance, and I assumed you had 14 cores the entire time. Did you purchase a model with 6 or 8 performance CPU cores activated?

@peastman
Copy link
Member

Eight performance cores plus two efficiency cores.

@tristanic
Copy link
Contributor

You might be able to code that as a CustomIntegrator, using CompoundIntegrator to switch between it and the regular integrator. That would let it run on the GPU. Just something to think about.

That's a great tip! I haven't really explored custom integrators before, but looking at it now it seems the sensible thing would be to have the CustomIntegrator just find the speed of the fastest-moving atom and write that to a global variable that I can query in my control code? Looks really neat - will give it a go.

@tristanic
Copy link
Contributor

Getting well off topic from the subject of this thread, but the other thing that happens at those "velocity check" steps is an update of an exponential moving average of the coordinates (which is what's actually shown to the user, to help see through the thermal jitter). That generally has to happen at a rate that's independent of the graphics update interval (otherwise things get really weird if the user sets a particularly long period between updates). It looks like I could also do that in a CustomIntegrator by storing the smoothed coordinates in a per-DOF variable, and reading from there rather than taking the Context->getState() approach. Just wanted to check if that makes sense before properly diving in.
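
To make that concrete, here is a minimal sketch of the idea (plain leapfrog steps with no constraints or thermostat; the variable names 'alpha' and 'xsmooth' are illustrative, not ISOLDE's actual code):

    from openmm import CustomIntegrator
    from openmm.unit import picoseconds

    integrator = CustomIntegrator(0.002*picoseconds)
    integrator.addGlobalVariable('alpha', 0.05)     # per-step smoothing factor
    integrator.addPerDofVariable('xsmooth', 0.0)    # smoothed coordinates
    integrator.addUpdateContextState()
    integrator.addComputePerDof('v', 'v + dt*f/m')  # leapfrog velocity update
    integrator.addComputePerDof('x', 'x + dt*v')    # position update
    integrator.addComputePerDof('xsmooth', 'alpha*x + (1-alpha)*xsmooth')

    # After creating the Context, seed the average with the starting coordinates,
    # then read the smoothed positions directly instead of calling getState():
    #   integrator.setPerDofVariableByName('xsmooth', positions)
    #   integrator.step(50)
    #   smoothed = integrator.getPerDofVariableByName('xsmooth')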

@tristanic
Copy link
Contributor

Oh - and for the fast-moving atoms: if I understand correctly I can get a count of the fast-moving atoms with something like:

        self.addGlobalVariable('fast_count', 0.0)
        self.addComputeSum('fast_count', f'step(sqrt(_x(v)^2+_y(v)^2+_z(v)^2)/({self.MAX_SPEED}*3)-1)')

... with the factor of 3 in the denominator appearing because of this warning in the documentation:

Functions that appear to return a scalar really return a vector whose components are all the same. For example, _z(a) returns the vector (a.z, a.z, a.z). Likewise, wherever a constant appears in the expression, it really means a vector whose components all have the same value.

Am I on the right track?

@tristanic
Copy link
Contributor

I tried implementing both the fast-atom count and the smoothing as CustomIntegrators - a bunch of glue-code rearrangement was needed to get all the controls working again, but the mechanism works fine. Although (somewhat surprisingly), at least on the Mac the speedup is pretty minimal (maybe 5%). I think it's worth doing anyway - for elegance if nothing else - and I'm guessing the benefit might be bigger for systems where there's a bigger mismatch between CPU and GPU performance.

I did notice that in my initial smoothing implementation, two sequentially declared operations (addComputePerDof followed by addComputeSum) were being run in non-deterministic order. According to my understanding of the documentation, that shouldn't happen. Will file a separate bug report later today.

@tristanic
Copy link
Contributor

Sorry... make that addComputePerDof followed by addComputeGlobal.

@peastman
Copy link
Member

That approach looks correct. Steps should always be executed in exactly the order you specify.

@philipturner
Copy link
Contributor

Very nice! Just compiled this into universal dylibs and tried out in ISOLDE

@tristanic are you using LocalEnergyMinimizer? I just got a test failure on AMD GPUs. Everything else is fine though (except the "very long tests", which I haven't gotten data for yet).

@tristanic
Copy link
Contributor

tristanic commented Apr 26, 2023 via email

@philipturner
Copy link
Contributor

I found where LocalEnergyMinimizer is failing on AMD GPUs. Could somebody reproduce this failure on the OpenCL platform with another AMD GPU?

void testLargeForces() {
    // Create a set of particles that are almost on top of each other so the initial
    // forces are huge.
    
    const int numParticles = 10;
    System system;
    NonbondedForce* nonbonded = new NonbondedForce();
    system.addForce(nonbonded);
    for (int i = 0; i < numParticles; i++) {
        system.addParticle(1.0);
        nonbonded->addParticle(1.0, 0.2, 1.0);
    }
    vector<Vec3> positions(numParticles);
    OpenMM_SFMT::SFMT sfmt;
    init_gen_rand(0, sfmt);
    for (int i = 0; i < numParticles; i++)
        positions[i] = Vec3(genrand_real2(sfmt), genrand_real2(sfmt), genrand_real2(sfmt))*1e-10;

    // Minimize it and verify that it didn't blow up.                                                                               

    VerletIntegrator integrator(0.01);
    Context context(system, integrator, platform);
    context.setPositions(positions);
    LocalEnergyMinimizer::minimize(context, 1.0);
    State state = context.getState(State::Positions);
    double maxdist = 0.0;
    for (int i = 0; i < numParticles; i++) {
        Vec3 r = state.getPositions()[i];
        maxdist = max(maxdist, sqrt(r.dot(r)));
    }
    ASSERT(maxdist > 0.1);
    ASSERT(maxdist < 10.0);
}

@ex-rzr
Copy link
Contributor

ex-rzr commented Apr 28, 2023

@philipturner

I've tested on an MI210 (OpenCL from ROCm 5.4.2), and the LocalEnergyMinimizer tests pass. Only one of the 164 OpenCL tests fails: TestOpenCLDrudeNoseHooverMixed: "Expected 300, found 309.221 (This test is stochastic and may occasionally fail)" (but it is always 309.221 :))

@philipturner
Copy link
Contributor

@peastman would you be able to run the test with a Mac AMD GPU on the in-tree OpenCL backend? It's kind of tedious to get to the point where you can compile OpenMM from source (e.g. it's missing something like this that runs out of the box); otherwise I could ask @theschles.

@theschles
Copy link

Hi @philipturner, do you want me to pull a particular branch of openmm, compile it, and then try your Metal plugin?

@philipturner
Copy link
Contributor

philipturner commented May 1, 2023

It takes a bit of setup to configure the build. Plus, the tests aren’t sorted by length, so even a simple sanity check could take minutes.

@peastman
Copy link
Member

peastman commented May 1, 2023

Which branch of which repository do you want me to test?

@philipturner
Copy link
Contributor

Let's start with the 8.0 release currently on Conda, then try again with the main branch.

@peastman
Copy link
Member

peastman commented May 1, 2023

I can reproduce the problem. It's from the conversion between floating point and fixed point. Here's the routine that does it:

// Converts a real to 64-bit fixed point with 32 fractional bits.
inline long realToFixedPoint(real x) {
    return (long) (x * 0x100000000);
}

And here's what it produces for some values of x.

x                           result
nan                         0
inf                         -1
-inf                        1
huge positive (e.g. 1e25)   -1
huge negative               1

Its handling of overflows clearly isn't IEEE compliant. I'm not sure what we can do about it.

@philipturner
Copy link
Contributor

// Possible fix: clamp out-of-range values before the float-to-long conversion,
// instead of relying on its (non-IEEE) overflow behavior. 'epsilon' is an
// unspecified small safety margin for rounding near LONG_MAX.
real scaled_x = x * real(0x100000000);
#if VENDOR_AMD
if (abs(scaled_x) > real(LONG_MAX) - epsilon) {
    return (scaled_x > 0) ? LONG_MAX : -LONG_MAX;
}
#endif
return (long) scaled_x;

@peastman
Copy link
Member

peastman commented May 1, 2023

Yeah, I thought of doing something like that. But I worry it could add significant overhead. And there are a few reasons it may not be worth it.

  • This is only an issue for old GPUs, or maybe only on old Macs. With an RDNA GPU on Linux, the test passes.
  • It only affects an obscure corner case (values that are too large to represent in fixed point).
  • There's an easy workaround: use the CPU platform for the energy minimization (a minimal sketch follows below). That's what we always used to tell people to do, before we added the automatic switchover when large forces were detected. So the behavior is no worse than what it used to be everywhere.
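
A minimal sketch of that workaround from the Python API (assuming 'system', 'positions', and 'gpu_context' already exist):

    from openmm import Context, LocalEnergyMinimizer, Platform, VerletIntegrator
    from openmm.unit import picoseconds

    # Minimize in a throwaway CPU context, then copy the relaxed positions over.
    cpu_context = Context(system, VerletIntegrator(0.001*picoseconds),
                          Platform.getPlatformByName('CPU'))
    cpu_context.setPositions(positions)
    LocalEnergyMinimizer.minimize(cpu_context, 10.0)   # tolerance in kJ/mol/nm
    relaxed = cpu_context.getState(getPositions=True).getPositions()
    gpu_context.setPositions(relaxed)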

@tristanic
Copy link
Contributor

For what it's worth this would have been a problem for ISOLDE ~6 months ago, but since I updated it to use softcore nonbonded potentials it's virtually impossible for force components to exceed the FP32 max.

In case it's useful/interesting, I've attached the results of some benchmarking runs in ISOLDE, comparing CUDA vs. OpenCL in Windows (on an RTX 3070) and Metal vs. OpenCL on a borrowed M2 Mac (I'm also running them on my M1 Pro - difference looks to be ~10-15% compared to the M2). Three different conditions ("high", "medium" and "low" fidelity varying in the cutoff distances and use of implicit solvent). These are somewhat less informative than pure simulations in some respects given the amount of overhead going into graphics updates etc., but the short story is that the Metal plugin gives a substantial real-world improvement in this context. While the absolute numbers are of course nowhere comparable, the relative speedup for Metal vs. OpenCL is pretty similar to CUDA vs. OpenCL on my Windows machine. Will significantly improve usability on the Macs.

isolde_benchmarks_mac_windows.zip

@philipturner
Copy link
Contributor

philipturner commented May 5, 2023

@tristanic your usage pattern might be activating a low-power state, where the GPU throttles clock speed. When rendering, it decreases the clock speed to ensure the frame finishes just in time, with minimum power consumption. It might think you're rendering a graphics application because of the repetitive periods of idleness.

If that's true, utilization is not proportional to power; it is proportional to sqrt(power). 16 W might be 80% of peak FLOPS and 25 W 100%. Can you run powermetrics and paste the duty cycle for the clock speed at multiple random times? Ideally it should be mostly 1296 MHz and not any of the intermediate frequencies.

@tristanic
Copy link
Contributor

For a cryo-EM model (i.e. with a static, unchanging volumetric map), GPU active residency sits pretty comfortably at 98-99% at the 1296 MHz clock speed, with idle residency at 1-2% (and ~0.05% at 389 MHz - all zeros in between). GPU power is around 14 W, package power 20 W. For my small crystallographic test case, where the maps update ~3x per second, the GPU residency drops to ~90% (GPU power ~7 W, package power 27.5 W). That's not surprising - there's a lot more CPU-bound activity in this case that can't easily/safely be farmed off to other threads (initiating structure factor calculations, collecting the results, recontouring and remasking the maps for display...). All in all, I'm happy with that.

@philipturner
Copy link
Contributor

philipturner commented May 11, 2023

I'm investigating a significant optimization for very small (0-2K atoms) and medium-large (5K-20K atoms) systems. Look at the red curve, at 100K-1M atoms, where it's scaling linearly. Project the tangent line all the way until it intersects 60,000* ns/day.

* This is theory, achieved with noble gases (2 Metal kernels/step). Reality is probably 10,000 ns/day.

[Chart: Molecular Simulation Speed vs. system size]

@tristanic can you state the system sizes you work with (atoms), time step (fs), and asymptotic nonbonded force complexity (ns/day @ 100K, 1M, and 10M atoms)? I also need to know the typical input to system.step; the optimization works by encoding 100-1,000 steps into a single Metal command buffer.

Here are predictions for water, M1 Max. "Before" is the current Metal plugin with no cutoff (<8K atoms) and cutoff (>8K atoms). If you're currently using a cutoff/PME, removing that provides a 4x speedup for asymptotically small systems. My prediction assumes you already employed the no-cutoff optimization.

Atoms    Before (ns/day)   After (ns/day)   Speedup
355025   27.2              27.2             ~1.000×
98880    91.3              <92.2            <1.009×
33840    220               <269             <1.22×
12255    351               <743             <2.11×
8301     504               <1097            <2.18×
7221     638               <1261            <1.98×
6282     787               <1450            <1.84×
5310     1000              <1716            <1.72×
4527     1220              <2133            <1.75×
3762     1380              <2422            <1.76×
3222     1660              <2828            <1.70×
2661     2010              <3424            <1.70×
2094     2510              <4351            <1.70×
1692     2900              <5386            <1.86×
1335     3190              <6825            <2.14×
1038     3530              <8779            <2.49×
774      3720              <11773           <3.16×
555      3950              <16419           <4.16×
387      4010              <23546           <5.87×
255      4410              <35736           <8.10×

@tristanic
Copy link
Contributor

tristanic commented May 11, 2023 via email

@philipturner
Copy link
Contributor

Beyond that, behind the scenes every 10 timesteps it runs two

Even 10 timesteps would provide some speedup (theoretically up to 10x), but it's possible to squeeze out a little extra performance by going to 1,000. Afterward, there is no difference.

I'm not sure how to interpret your question on asymptotic nonbonded force complexity.

Basically, how complex is the nonbonded part relative to 12-6 Lennard Jones? For example, adding the Coulomb interaction would roughly double the cost for large-enough systems. I need to quantify the ratio of compute cost compared to my reference implementation.
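
For concreteness, the per-pair baseline here is the standard 12-6 Lennard-Jones term, with Coulomb as the typical addition:

$$E_{ij}(r) = 4\varepsilon_{ij}\left[\left(\frac{\sigma_{ij}}{r}\right)^{12} - \left(\frac{\sigma_{ij}}{r}\right)^{6}\right] + \frac{q_i q_j}{4\pi\epsilon_0 r}$$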

A typical interactive simulation on the Mac would be in the ballpark of 3,000 atoms (if it makes a difference, in most cases about half of these would be fixed in space).

A reasonable estimate might be 50-100% speedup, depending on whether you use a cutoff for nonbonded forces.

@tristanic
Copy link
Contributor

The nonbonded force implementation is at https://github.com/tristanic/isolde/blob/8a7a84a95e5f8ee7f645615be19c72ef513d7ee5/isolde/src/openmm/custom_forces.py#L1749. In itself it's not incredibly costly compared to the standard NonbondedForce, but keep in mind that ISOLDE also uses implicit solvent (GBSAGBnForce, defined as class GBSAGBnForce(CustomAmberGBForceBase)), which is much more expensive (but a necessary evil).

I wasn't actually aware that things could run faster for small systems without a nonbonded cutoff. Currently ISOLDE uses either 1.7 or 0.9 nm depending on the user's choice (on a widget allowing them to trade off fidelity for speed).
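
As a rough sketch with the standard NonbondedForce (CustomNonbondedForce has the analogous setNonbondedMethod/setCutoffDistance calls; the 8,000-atom threshold is only illustrative, taken from the table above), switching could look like:

    from openmm import NonbondedForce
    from openmm.unit import nanometer

    # 'nonbonded' and 'n_atoms' are assumed to exist already.
    if n_atoms < 8000:
        # Small systems: dropping the cutoff entirely may be the faster choice.
        nonbonded.setNonbondedMethod(NonbondedForce.NoCutoff)
    else:
        nonbonded.setNonbondedMethod(NonbondedForce.CutoffNonPeriodic)
        nonbonded.setCutoffDistance(0.9*nanometer)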

@philipturner
Copy link
Contributor

The nonbonded force seems quite cheap, so a lot of my performance claims from noble gases should apply. The only limiter is the $O(n)$ part (which I assume is the complexity of GBSA), which decreases the height of the cyan plateau. Eventually I want to shift the plateau rightward by maintaining a nearest-neighbor list that's entirely GPU-driven.

@philipturner
Copy link
Contributor

philipturner commented May 12, 2023

I managed to make the plateau sharper. 20,279 ns/day @ 1K atoms. 20,279 with 5 digits.

[Chart: Molecular Simulation Speed vs. system size, updated]

@peastman this is what Apple meant when asking you to switch to Metal. Driver latency is an Achilles heel - the reason PyTorch is so slow on Apple GPUs.
