[NOT FOR MERGING] Simulator for inference autoscaling #107954
base: main
Conversation
This looks good. I think the main consideration is whether we can make the estimators more effective. For example, I think one natural choice would be to try and estimate rate dynamics (say the derivative). Also, I think we should consider rejigging the estimate of inference time to allow it to change as a function of allocation count. My expectation is this might yield significantly lower latency spikes.
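To make the derivative suggestion concrete, here is a minimal sketch of tracking the rate together with its trend using a two-state (level plus derivative) Kalman filter. The class and method names are hypothetical and not part of this PR; the PR's Estimator keeps a single scalar state.

import numpy as np


class RateTrendEstimator:
    """Hypothetical sketch: track [rate, d(rate)/dt] with a constant-velocity Kalman filter.

    Extrapolating with the estimated derivative would let the autoscaler anticipate a
    ramp in request rate rather than react after the fact.
    """

    def __init__(self, initial_rate, initial_variance, process_noise):
        self.x = np.array([initial_rate, 0.0])   # state: rate and its derivative
        self.P = np.eye(2) * initial_variance    # state covariance
        self.q = process_noise                   # how quickly the trend itself may change

    def add(self, value, variance, dt):
        # Predict: the rate drifts with its current derivative over the interval dt.
        F = np.array([[1.0, dt], [0.0, 1.0]])
        Q = self.q * np.array([[dt ** 3 / 3, dt ** 2 / 2], [dt ** 2 / 2, dt]])
        self.x = F @ self.x
        self.P = F @ self.P @ F.T + Q
        # Update: we only observe the rate, not its derivative.
        H = np.array([1.0, 0.0])
        S = self.P[0, 0] + variance
        K = self.P @ H / S
        self.x = self.x + K * (value - self.x[0])
        self.P = self.P - np.outer(K, H) @ self.P

    def forecast(self, horizon):
        # Extrapolated rate a short horizon ahead; this could feed the allocation decision.
        return self.x[0] + self.x[1] * horizon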
…s a function of time and allocation count
…rch into autoscaling-simulator
Mainly I think it is worthwhile commenting in a few places so that other people can follow this code later. Also, I have one comment regarding initialisation. Anyway, results LGTM.
def add(self, value, variance, dynamics_changed=False) -> None:
    process_variance = variance / self.smoothing_factor

    if dynamics_changed or abs(value - self.value)**2 / (self.variance + variance) > 100:
I would add a comment here for the record...
Suggested change:
# If we know we likely had a change in the quantity we're estimating or the prediction is 10 std off,
# we inject extra noise in the dynamics for this step.
if dynamics_changed or abs(value - self.value)**2 / (self.variance + variance) > 100:
One last thought. The estimator will behave badly with outliers: we'll immediately update the state to something in the vicinity of the outlier because of the 10 sigma rule.
For rate this is probably not so likely, but for inference time estimation there could be environmental factors which cause occasional very slow inferences. I would try simulating with occasional 10-20 x inference time values. As a minimum I would probably drop the 10 sigma rule for the inference time Q estimate. The approach which estimates the constant of proportionality between average inference time and allocation count as an extra state variable will also be robust to outliers.
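As an illustration of the kind of outlier test meant here, a sketch that stretches roughly 1 in 100 simulated inference times by 10-20x. The constants and the function name are illustrative, not taken from the PR.

import random

OUTLIER_PROBABILITY = 0.01        # roughly 1 in 100 inferences is pathological
OUTLIER_FACTOR_RANGE = (10, 20)   # outliers are 10-20x slower than normal


def sample_inference_time(mean_duration, noise):
    """Mean duration + uniform noise, occasionally stretched by a 10-20x outlier."""
    duration = mean_duration + random.uniform(-noise, noise)
    if random.random() < OUTLIER_PROBABILITY:
        duration *= random.uniform(*OUTLIER_FACTOR_RANGE)
    return duration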
random.seed(170681)


class Estimator:
I think it is useful to discuss the basic ideas behind this for anyone else coming to this code
Suggested change:
class Estimator:
    """
    This implements a 1-d Kalman filter with manoeuvre detection. Rather than a derived dynamics model
    we simply fix how much we want to smooth in the steady state.
    """
return self.value + math.sqrt(self.variance)


class Simulator:
As above I would give some motivation for this:
Suggested change:
class Simulator:
    """
    We simulate a Poisson process for inference arrivals with time varying rate parameter. This models a
    variety of rate dynamics: smooth ramp, smooth periodic, shock and steady.

    Inference is modelled as mean duration + uniform noise. The mean duration captures the behaviour of
    vCPUs which is that throughput is largely constant after all physical cores on a node are occupied.

    We assume perfect load balancing, i.e. that every allocation of inference is never waiting whilst there are
    inferences to be done. Inferences can only be picked off once they have arrived and allocations can only
    pick an inference "inference duration" after they picked their last inference. We assume they are selected
    FIFO.

    The key user parameters are the number of waiting inferences and the average and maximum delay to
    receive each inference which can be calculated from the difference between when inference calls arrived
    and when they are available.
    """
TIME_STEP = 0.001
time = START_TIME + TIME_STEP / 2
while time < END_TIME:
    if random.random() < self.get_request_rate(time) * TIME_STEP:
This saturates if the rate is > 1 / 0.001. I guess it doesn't matter for the parameters you use for simulation, but it would be a little more robust to generate the next arrival time instead as t += numpy.random.exponential(scale=1/self.get_request_rate(time)).
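A sketch of this alternative, assuming the rate is frozen at the current time for each inter-arrival draw (thinning would be exact for a time-varying rate); the function name and signature are hypothetical.

import numpy


def generate_arrivals(get_request_rate, start_time, end_time):
    """Draw inter-arrival times from an exponential distribution rather than Bernoulli
    sampling per fixed TIME_STEP, so arrivals do not saturate as the rate approaches
    1 / TIME_STEP."""
    arrivals = []
    time = start_time
    while True:
        time += numpy.random.exponential(scale=1.0 / get_request_rate(time))
        if time >= end_time:
            break
        arrivals.append(time)
    return arrivals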
Reading on I see this will have an impact on implementation of run stat estimation. This could all be resolved, but let's not bother to do that now. I think a comment is warranted that the simulator should only use rate << 1 / TIME_STEP.
latency_estimator = Estimator(0.1, 100, 1e6)
rate_estimator = Estimator(0, 100, 1e3)
It's important to tie the initial variance to the measurement variance.
Alternatively, my inclination would probably be to initialise variance as None and fix the gain for the first measurement, i.e. gain = 0.99 if variance is None else (self.variance + process_variance) / (self.variance + process_variance + variance).
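A sketch of that variant, keeping the rest of the update as above; how the state variance is seeded from the first measurement is filled in as an assumption, and the manoeuvre branch is omitted for brevity.

class EstimatorNoneInit:
    """Hypothetical variant of Estimator: the state variance starts as None and the first
    measurement is taken almost at face value (gain fixed at 0.99)."""

    def __init__(self, value, smoothing_factor):
        self.value = value
        self.variance = None
        self.smoothing_factor = smoothing_factor

    def add(self, value, variance, dynamics_changed=False):
        process_variance = variance / self.smoothing_factor
        if self.variance is None:
            gain = 0.99
            predicted_variance = variance   # assumption: seed the state variance from the measurement
        else:
            predicted_variance = self.variance + process_variance
            gain = predicted_variance / (predicted_variance + variance)
        self.value += gain * (value - self.value)
        self.variance = (1.0 - gain) * predicted_variance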
self.real_loads.append((time, self.get_request_rate(time) * self.get_avg_inference_time() / self.num_allocations))

# autoscaling
needed_allocations_lower = latency_estimator.lower() * rate_estimator.lower()
Not important here, but in the actual system, if we suddenly stop receiving inference requests after having received them at a high rate, we will never trigger down scaling if this is event driven. You will instead need some sort of heartbeat to check scaling.
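For illustration only, a heartbeat along these lines might look like the following; the interval and callback name are hypothetical, not part of the PR or the actual system.

import threading

SCALING_CHECK_INTERVAL_SECONDS = 10.0   # hypothetical heartbeat period


def start_scaling_heartbeat(check_and_apply_scaling):
    """Re-evaluate scaling on a timer so that a sudden drop in request rate still triggers
    down scaling even when no new inference events arrive to drive the decision."""
    def tick():
        check_and_apply_scaling()
        timer = threading.Timer(SCALING_CHECK_INTERVAL_SECONDS, tick)
        timer.daemon = True
        timer.start()
    tick()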
[Simulation plots. Legend: request count / inference time; wait time. Note: all times are in seconds.]