
JIT CSE Optimization - Add a gymnasium environment for reinforcement learning #101856

Merged: 78 commits into dotnet:main on May 9, 2024

Conversation

@leculver (Contributor) commented May 3, 2024

Implement a gymnasium environment, JitCseEnv, to allow us to rapidly iterate on features, reward functions, and model/neural network architecture for JIT CSE optimization. This change:

  • Creates a hook in the JIT's common subexpression elimination (CSE) optimization so it can be driven by an environment variable.
  • Uses SuperPMI, with the new CSE hook, to drive CSE decision making in the JIT.
  • Implements a gym environment to manipulate features, rewards, and architecture of the reinforcement learning model to find what works and what doesn't.
  • Provides a mechanism to see live updates of the training process via Tensorboard, and post-training evaluation against the default CSE Heuristic.

This implements the bare minimum rewards and features needed to experiment with CSE optimization. The current non-normalized features and simple reward function produce a model that is almost as good as the current, hand-written CSE Heuristic in the JIT. Further development and improvement will likely happen offline; this change is meant to be the shared skeleton of the project.

More information can be found in the README.md included in this pull request.
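For orientation, here is a minimal sketch of the surface a gymnasium environment like this exposes. It is illustrative only: the class name, feature count, and space sizes below are placeholders, not the actual JitCseEnv implementation in this PR.

```python
# Illustrative sketch of a gymnasium environment for stepwise CSE decisions.
# Names, space sizes, and the zeroed observations are placeholders; the real
# JitCseEnv in this PR queries SuperPMI through the new JIT CSE hook.
import gymnasium as gym
import numpy as np
from gymnasium import spaces

class CseEnvSketch(gym.Env):
    """Pick one CSE candidate per step until the 'stop' action is chosen."""

    def __init__(self, max_cse=16, num_features=8):
        # One action per CSE candidate, plus action 0 as 'stop'.
        self.action_space = spaces.Discrete(max_cse + 1)
        self.observation_space = spaces.Box(-np.inf, np.inf, (num_features,), dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        obs = np.zeros(self.observation_space.shape, dtype=np.float32)
        return obs, {}

    def step(self, action):
        # A real environment would apply the chosen CSE via SuperPMI and
        # re-measure the method's perfscore to compute the reward.
        obs = np.zeros(self.observation_space.shape, dtype=np.float32)
        terminated = (action == 0)
        return obs, 0.0, terminated, False, {}
```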

Contributes to: #92915.

@leculver added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) on May 3, 2024
@leculver requested review from TIHan and AndyAyersMS on May 3, 2024 at 18:05
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@AndyAyersMS (Member)

@leculver skimmed through and this looks awesome! Will need a bit of time to review. Will try and get you some feedback by early next week.

FYI @dotnet/jit-contrib @matouskozak @mikelle-rogers

Also FYI @NoahIslam @LogarithmicFrog1: you might find the approach Lee is taking here a bit more accessible and/or familiar, if you're still up for some collaboration.

@leculver (Contributor Author) commented May 3, 2024

No problem, take the time you need.

@AndyAyersMS (Member) left a comment

Overall this looks great. Happy to merge this as is.

Mostly my comments are about clarification and trying to match up what you have here with what I have done previously.

Member

I'd like to see a bit more of a writeup about the overall approach, either here or somewhere else. Things like

  • are we learning from full rollouts and eventually from this deducing per-step values (for say A2C), or are you building an incremental reward model by building up longer sequences from shorter ones?
  • are the rewards discounted or undiscounted?
  • how are you handling the fact that reward magnitudes can vary greatly from one method to the next?
  • what sort of neural net topology gets built? Why is this a good choice?
  • how are you producing the aggregate score across all the methods?

Contributor Author

I'm 100% in agreement about needing to create more writeup and documentation.

I guess I should have been a bit clearer about the intention of this pull request. I consider the code here the absolute minimum starting point that other folks (and I) can play with to make improvements. It's meant to be that playground for use over the next couple of months.

When I'm further along in experimenting with different approaches, model architecture, and so on, that's when I plan to write everything up. Some of the techniques will certainly change after I've had more time to experiment in the space, so I didn't write down too much about this base design because I expect a lot of it to be different.

Contributor Author

Here are quick answers to your questions:

are we learning from full rollouts and eventually from this deducing per-step values (for say A2C), or are you building an incremental reward model by building up longer sequences from shorter ones?

This version uses incremental rewards by building up a sequence of decisions.

are the rewards discounted or undiscounted?

Rewards are discounted, but not heavily. Actually, we currently just use the stable-baselines default gamma of 0.99. I intentionally haven't tuned hyperparameters in this checkin. Again trying to keep it as simple as possible.
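For concreteness, a minimal training sketch using those stable-baselines defaults (PPO and the CartPole stand-in env are illustrative choices here, not necessarily what this PR's training script uses; gamma=0.99 is the stable-baselines3 default):

```python
# Minimal stable-baselines3 training sketch using the default gamma of 0.99.
# CartPole is only a stand-in; swap in the JitCseEnv from this PR to train on CSE decisions.
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, gamma=0.99, verbose=1, tensorboard_log="./tb")
model.learn(total_timesteps=50_000)
```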

how are you handling the fact that reward magnitudes can vary greatly from one method to the next?

Currently, we use % change in the perfscore. This keeps rewards relatively within the same magnitude. Obviously some methods are longer than others and the change in perfscore for choosing a CSE likely doesn't scale with method length, so this is a place for improvement.

My overall goal with this checkin was simplicity and being able to understand what it's doing. Since the model trains successfully (though it doesn't beat the current CSE Heuristic), I haven't tried to refine things further yet.

what sort of neural net topology gets built? Why is this a good choice?

Currently, it's the default for stable-baselines. I can give you the topology, but this was also a non-choice so far. The default network trained successfully, so I haven't dug further into the design (yet).
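If it helps, the default topology can be printed straight from a stable-baselines3 model object (again using CartPole as a stand-in env; the hidden-layer structure shown is the library default, while the input/output sizes follow the env's spaces):

```python
# Inspect the default network stable-baselines3 builds for MlpPolicy.
import gymnasium as gym
from stable_baselines3 import PPO

model = PPO("MlpPolicy", gym.make("CartPole-v1"))
print(model.policy)   # prints the actor/critic MLP modules
```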

how are you producing the aggregate score across all the methods?

I'm just averaging the change in perfscore. I like your method better and will update to that next checkin.

Member

JIT changes look good.

There is some overlap with things from the other RL heuristic but I think it's ok and probably simpler for now to keep them distinct.

    return REWARD_SCALE * (prev - curr) / prev

def _is_valid_action(self, action, method):
    # Terminating is only valid if we have performed a CSE. Doing no CSEs isn't allowed.
Member

Is this because you track the "no cse" cases separately, so when learning you're always doing some cses?

There will certainly be some instances where doing no cses is the best policy.

Contributor Author

My overall goal with this checkin is to get something relatively simple and understandable as the baseline for future work. In this case, my (intentionally) simple reward function isn't capable of understanding an initial "no" choice without adding extra complexity.

A more refined version of this project can and will handle the case where we choose no CSEs to perform, but I did not want to overcomplicate the initial version.

if np.isclose(prev, 0.0):
    return 0.0

return REWARD_SCALE * (prev - curr) / prev
Member

Maybe this answers my question about how the variability in rewards is handled? Is prev here some fixed policy result (say no cse or the current heuristic)?

Contributor Author

The architecture of this model is to individually choose each CSE one after another until "none" is selected. The prev score is the score of the previous decision. For example, let's say the model eventually chooses [3, 1, 5, stop]. In the first iteration, prev will be the perfscore of the method with no CSEs and curr will be the perfscore with only CSE 3 chosen. On the second iteration, prev will be the perfscore with only CSE 3 chosen, and curr will be the perfscore with CSEs [3, 1] chosen. And so on.

This isn't the only way to build training, but it's the one I started with.
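As a purely illustrative walk-through of that sequence, with made-up perfscores for the [3, 1, 5, stop] example (REWARD_SCALE and the numbers are placeholders):

```python
# Made-up perfscores for the example: no CSEs, then after CSE 3, [3, 1], and [3, 1, 5].
REWARD_SCALE = 1.0
perfscores = [100.0, 96.0, 95.0, 94.5]

for prev, curr in zip(perfscores, perfscores[1:]):
    reward = REWARD_SCALE * (prev - curr) / prev   # lower perfscore => positive reward
    print(f"prev={prev:.1f} curr={curr:.1f} reward={reward:+.4f}")
```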

no_jit_failure = result[result['failed'] != ModelResult.JIT_FAILED]

# next calculate how often we improved on the heuristic
improved = no_jit_failure[no_jit_failure['model_score'] < no_jit_failure['heuristic_score']]
Member

I think this touches on how the aggregate score is computed.

Generally I like to use the geomean. If we have $N$ methods, each with base score $b_i$ and diff score $d_i$, then the aggregate geomean $G$ is

$$ G = e^{\frac{1}{N} \sum_i \log(d_i/b_i)} $$

(here lower is better) and I expect the "best possible" improvement to be around 0.99 (my policy gets about 0.994).
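For reference, a minimal numpy sketch of that aggregate; pairing model_score as the diff score against heuristic_score as the base score is an assumption here, following the DataFrame columns in the quoted snippet:

```python
import numpy as np

def geomean_ratio(diff_scores, base_scores):
    """Aggregate geomean G = exp(mean(log(d_i / b_i))); lower is better."""
    ratios = np.asarray(diff_scores, dtype=float) / np.asarray(base_scores, dtype=float)
    return float(np.exp(np.mean(np.log(ratios))))

# e.g. geomean_ratio(no_jit_failure['model_score'], no_jit_failure['heuristic_score'])
```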

Contributor Author

Ah interesting. I will add this to the next update, thanks for the suggestion!

from .method_context import MethodContext

MIN_CSE = 3
MAX_CSE = 16
Member

Is this a starting range for the min and max CSE counts, and if this works, will we extend it further?

Contributor Author

That's correct. This was the starting point to get something working. We need to think through how to give the model the ability to see and select all CSEs (up to 64, which is the JIT's max). Defining a new architecture is yet another project to work on. I filed that as an issue here: leculver/jitml#8
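Purely as an illustration of one possible direction (not a design in this PR), a fixed-size action space covering the JIT's maximum of 64 candidates plus a 'stop' action could pair with a validity mask, similar in spirit to the _is_valid_action check quoted above:

```python
# Hypothetical sketch: Discrete action space over 64 candidate slots plus 'stop',
# with a mask so only real candidates are selectable and 'stop' is only valid
# after at least one CSE has been performed.
import numpy as np
from gymnasium import spaces

MAX_JIT_CSE = 64
action_space = spaces.Discrete(MAX_JIT_CSE + 1)   # index 0 = stop, 1..64 = CSE candidates

def action_mask(num_candidates: int, performed_any_cse: bool) -> np.ndarray:
    mask = np.zeros(MAX_JIT_CSE + 1, dtype=bool)
    mask[1:num_candidates + 1] = True              # only candidates present in the method
    mask[0] = performed_any_cse                    # 'stop' only after at least one CSE
    return mask
```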

@leculver merged commit 279dbe1 into dotnet:main on May 9, 2024
106 of 108 checks passed
@leculver deleted the jitml branch on May 9, 2024 at 13:44