
JIT CSE Optimization - Add a gymnasium environment for reinforcement learning #101856

Merged: 78 commits into dotnet:main on May 9, 2024

Conversation

@leculver (Contributor) commented May 3, 2024

Implement a gymnasium environment, JitCseEnv, to allow us to rapidly iterate on features, reward functions, and model/neural network architecture for JIT CSE optimization. This change:

  • Creates a hook in the JIT's common subexpression elimination (CSE) optimization so it can be driven by an environment variable.
  • Uses SuperPMI, with the new CSE hook, to drive CSE decision making in the JIT.
  • Implements a gym environment to manipulate features, rewards, and architecture of the reinforcement learning model to find what works and what doesn't.
  • Provides a mechanism to see live updates of the training process via Tensorboard, and post-training evaluation against the default CSE Heuristic.

This implements the bare minimum rewards and features needed to experiment with CSE optimization. The current non-normalized features and simple reward function produce a model that is almost as good as the current, hand-written CSE Heuristic in the JIT. Further development and improvement will likely happen offline; this change is meant to be the shared skeleton of the project.

More information can be found in the README.md included in this pull request.
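For orientation, here is a minimal sketch of the surface a gymnasium environment like this exposes. It is illustrative only: the class name, feature count, and space sizes below are placeholders, not the actual JitCseEnv implementation in this PR.

```python
# Illustrative sketch of a gymnasium environment for stepwise CSE decisions.
# Names, space sizes, and the zeroed observations are placeholders; the real
# JitCseEnv in this PR queries SuperPMI through the new JIT CSE hook.
import gymnasium as gym
import numpy as np
from gymnasium import spaces

class CseEnvSketch(gym.Env):
    """Pick one CSE candidate per step until the 'stop' action is chosen."""

    def __init__(self, max_cse=16, num_features=8):
        # One action per CSE candidate, plus action 0 as 'stop'.
        self.action_space = spaces.Discrete(max_cse + 1)
        self.observation_space = spaces.Box(-np.inf, np.inf, (num_features,), dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        obs = np.zeros(self.observation_space.shape, dtype=np.float32)
        return obs, {}

    def step(self, action):
        # A real environment would apply the chosen CSE via SuperPMI and
        # re-measure the method's perfscore to compute the reward.
        obs = np.zeros(self.observation_space.shape, dtype=np.float32)
        terminated = (action == 0)
        return obs, 0.0, terminated, False, {}
```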

Contributes to: #92915.

@leculver added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) on May 3, 2024
@leculver requested review from TIHan and AndyAyersMS on May 3, 2024 at 18:05
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@AndyAyersMS (Member)

@leculver skimmed through and this looks awesome! Will need a bit of time to review. Will try and get you some feedback by early next week.

FYI @dotnet/jit-contrib @matouskozak @mikelle-rogers

Also FYI @NoahIslam @LogarithmicFrog1: you might find the approach Lee is taking here a bit more accessible and/or familiar, if you're still up for some collaboration.

@leculver (Contributor Author) commented May 3, 2024

No problem, take the time you need.

@AndyAyersMS (Member) left a comment

Overall this looks great. Happy to merge this as is.

Mostly my comments are about clarification and trying to match up what you have here with what I have done previously.

Member

I'd like to see a bit more of a writeup about the overall approach, either here or somewhere else. Things like

  • are we learning from full rollouts and eventually from this deducing per-step values (for say A2C), or are you building an incremental reward model by building up longer sequences from shorter ones?
  • are the rewards discounted or undiscounted?
  • how are you handling the fact that reward magnitudes can vary greatly from one method to the next?
  • what sort of neural net topology gets built? Why is this a good choice?
  • how are you producing the aggregate score across all the methods?

Contributor Author

I'm 100% in agreement about needing to create more writeup and documentation.

I guess I should have been a bit clearer about the intention of this pull request. I consider the code here the absolute minimum starting point that other folks (and I) can play with to make improvements. It's meant to be that playground for use over the next couple of months.

When I'm further along in experimenting with different approaches, model architecture, and so on, that's when I plan to write everything up. Some of the techniques will certainly change after I've had more time to experiment in the space, so I didn't write down too much about this base design because I expect a lot of it to be different.

Contributor Author

Here are quick answers to your questions:

are we learning from full rollouts and eventually from this deducing per-step values (for say A2C), or are you building an incremental reward model by building up longer sequences from shorter ones?

This version uses incremental rewards by building up a sequence of decisions.

are the rewards discounted or undiscounted?

Rewards are discounted, but not heavily. Actually, we currently just use the stable-baselines default gamma of 0.99. I intentionally haven't tuned hyperparameters in this checkin. Again trying to keep it as simple as possible.
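For concreteness, a minimal training sketch using those stable-baselines defaults (PPO and the CartPole stand-in env are illustrative choices here, not necessarily what this PR's training script uses; gamma=0.99 is the stable-baselines3 default):

```python
# Minimal stable-baselines3 training sketch using the default gamma of 0.99.
# CartPole is only a stand-in; swap in the JitCseEnv from this PR to train on CSE decisions.
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, gamma=0.99, verbose=1, tensorboard_log="./tb")
model.learn(total_timesteps=50_000)
```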

how are you handling the fact that reward magnitudes can vary greatly from one method to the next?

Currently, we use % change in the perfscore. This keeps rewards relatively within the same magnitude. Obviously some methods are longer than others and the change in perfscore for choosing a CSE likely doesn't scale with method length, so this is a place for improvement.

My overall goal with this checkin was simplicity and being able to understand what it's doing. Since the model trains successfully (though it doesn't beat the current CSE Heuristic), I haven't tried to refine things further yet.

what sort of neural net topology gets built? Why is this a good choice?

Currently, it's the default for stable-baselines. I can give you the topology, but this was also a non-choice so far. The default network trained successfully, so I haven't dug further into the design (yet).
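If it helps, the default topology can be printed straight from a stable-baselines3 model object (again using CartPole as a stand-in env; the hidden-layer structure shown is the library default, while the input/output sizes follow the env's spaces):

```python
# Inspect the default network stable-baselines3 builds for MlpPolicy.
import gymnasium as gym
from stable_baselines3 import PPO

model = PPO("MlpPolicy", gym.make("CartPole-v1"))
print(model.policy)   # prints the actor/critic MLP modules
```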

how are you producing the aggregate score across all the methods?

I'm just averaging the change in perfscore. I like your method better and will update to that next checkin.

Member

JIT changes look good.

There is some overlap with things from the other RL heuristic but I think it's ok and probably simpler for now to keep them distinct.

    return REWARD_SCALE * (prev - curr) / prev

def _is_valid_action(self, action, method):
    # Terminating is only valid if we have performed a CSE. Doing no CSEs isn't allowed.
Member

Is this because you track the "no cse" cases separately, so when learning you're always doing some cses?

There will certainly be some instances where doing no cses is the best policy.

Contributor Author

My overall goal with this checkin is to get something relatively simple and understandable as the baseline for future work. In this case, my (intentionally) simple reward function isn't capable of understanding an initial "no" choice without adding extra complexity.

A more refined version of this project can and will handle the case where we choose no CSEs to perform, but I did not want to overcomplicate the initial version.

if np.isclose(prev, 0.0):
    return 0.0

return REWARD_SCALE * (prev - curr) / prev
Member

Maybe this answers my question about how the variability in rewards is handled? Is prev here some fixed policy result (say no cse or the current heuristic)?

Contributor Author

The architecture of this model is to individually choose each CSE one after another until "none" is selected. The prev score is the score of the previous decision. For example, let's say the model eventually chooses [3, 1, 5, stop]. In the first iteration, prev will be the perfscore of the method with no CSEs and curr will be the perfscore with only CSE 3 chosen. On the second iteration, prev will be the perfscore with only CSE 3 chosen, and curr will be the perfscore with CSEs [3, 1] chosen. And so on.

This isn't the only way to build training, but it's the one I started with.
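As a purely illustrative walk-through of that sequence, with made-up perfscores for the [3, 1, 5, stop] example (REWARD_SCALE and the numbers are placeholders):

```python
# Made-up perfscores for the example: no CSEs, then after CSE 3, [3, 1], and [3, 1, 5].
REWARD_SCALE = 1.0
perfscores = [100.0, 96.0, 95.0, 94.5]

for prev, curr in zip(perfscores, perfscores[1:]):
    reward = REWARD_SCALE * (prev - curr) / prev   # lower perfscore => positive reward
    print(f"prev={prev:.1f} curr={curr:.1f} reward={reward:+.4f}")
```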

no_jit_failure = result[result['failed'] != ModelResult.JIT_FAILED]

# next calculate how often we improved on the heuristic
improved = no_jit_failure[no_jit_failure['model_score'] < no_jit_failure['heuristic_score']]
Member

I think this touches on how the aggregate score is computed.

Generally I like to use the geomean. If we have $N$ methods, each with base score $b_i$ and diff score $d_i$, then the aggregate geomean $G$ is

$$ G = e^{\frac{1}{N} \sum_i \log(d_i/b_i)} $$

(here lower is better) and I expect the "best possible" improvement to be around 0.99 (my policy gets about 0.994).
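For reference, a minimal numpy sketch of that aggregate; pairing model_score as the diff score against heuristic_score as the base score is an assumption here, following the DataFrame columns in the quoted snippet:

```python
import numpy as np

def geomean_ratio(diff_scores, base_scores):
    """Aggregate geomean G = exp(mean(log(d_i / b_i))); lower is better."""
    ratios = np.asarray(diff_scores, dtype=float) / np.asarray(base_scores, dtype=float)
    return float(np.exp(np.mean(np.log(ratios))))

# e.g. geomean_ratio(no_jit_failure['model_score'], no_jit_failure['heuristic_score'])
```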

Contributor Author

Ah interesting. I will add this to the next update, thanks for the suggestion!

from .method_context import MethodContext

MIN_CSE = 3
MAX_CSE = 16
Member

Is this a starting range for the min and max CSE counts, and if this works, will we extend it further?

Contributor Author

That's correct. This was the starting point to get something working. We need to think through how to give the model the ability to see and select all CSEs (up to 64, which is the JIT's max). Defining a new architecture is yet another project to work on. I filed that as an issue here: leculver/jitml#8
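Purely as an illustration of one possible direction (not a design in this PR), a fixed-size action space covering the JIT's maximum of 64 candidates plus a 'stop' action could pair with a validity mask, similar in spirit to the _is_valid_action check quoted above:

```python
# Hypothetical sketch: Discrete action space over 64 candidate slots plus 'stop',
# with a mask so only real candidates are selectable and 'stop' is only valid
# after at least one CSE has been performed.
import numpy as np
from gymnasium import spaces

MAX_JIT_CSE = 64
action_space = spaces.Discrete(MAX_JIT_CSE + 1)   # index 0 = stop, 1..64 = CSE candidates

def action_mask(num_candidates: int, performed_any_cse: bool) -> np.ndarray:
    mask = np.zeros(MAX_JIT_CSE + 1, dtype=bool)
    mask[1:num_candidates + 1] = True              # only candidates present in the method
    mask[0] = performed_any_cse                    # 'stop' only after at least one CSE
    return mask
```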

@leculver merged commit 279dbe1 into dotnet:main on May 9, 2024
106 of 108 checks passed
@leculver deleted the jitml branch on May 9, 2024 at 13:44