[azure provider] state machine for creating / deleting resources #3058

djmitche · 2020-06-15T16:27:44Z

Move the state machine that handles creation / deletion of Azure resources into its own class with its own state tracking and separate testing.

This lets us ensure this implementation is really bulletproof, while relying on it in the rest of the provider to simply "return a VM or fail trying" and "delete all resources associated with this VM".

djmitche · 2020-09-11T15:06:17Z

The big advantage to this refactor would be more easily testable code and secondarily more readable/modifyable code (say, if we find a need to create different resources later). The downside is introducing new bugs :)

There are a few "weird" things about this process that make it not a "normal" state machine:

state is both in the Azure API and in the workers DB table
some things can happen in parallel (disks) while others must be serialized (ip -> nic -> vm).
there are no "events" that transition from state to state -- we just poll the API

So, I think implementing this as a "normal" state machine with functions for each "state transition" would be difficult and not result in something especially readable or testable.

The current implementation of provisionResources reads something like

provision a NIC
if not done, return
provision an IP
if not done, return
provision a VM
if not done, return
on error, switch to removing the worker

which is a pretty good summary of what's happening. There are some callbacks and functions to modify the worker's provisionerData that don't read too clearly.

Another structure I could think of to handle this would be to set up nested function calls and use an exception to signal that execution is blocked and should be retried on the next poll. Something like

fn provisionResources():
  try:
    createVm()
  on error = Blocked:
    return
  on any other error:
    removeWorker

fn createVm():
  if providerData.vm is complete:
    return vm
  if providerData.vm indicates creation is in progress:
    if Azure API says creation is complete:
      get information on disks and store it in providerData.vm
      return vm
    else:
      raise blocked
  nic = createNic()
  call Azure API to create the VM with 'nic', store the operation in providerData.vm
  raise blocked

fn createNic():
  if providerData.nic is complete:
    return nic
  if providerData.nic indicates creation is in progress:
    if Azure API says creation is complete:
      return nic
    else:
      raise blocked
  ip = createIp()
  call Azure API to create the NIC with that IP, store the operation in providerData.nic
  raise blocked

fn createIp():
  if providerData.ip is complete:
    return ip
  if providerData.ip indicates creation is in progress:
    if Azure API says creation is complete:
      return ip
    else:
      raise blocked
  call Azure API to create the IP, store the operation in providerData.ip
  raise blocked

I'm not sure this is clearer than what we have! Also, the exception trick won't work too well with parallel operations. And, this is definitely no easier to test than what we have.

I'll think about this some more, but maybe what we have is good enough? Ideas?

djmitche created this issue from a note in Worker Lifecycle Management (Ideas) Jun 15, 2020

djmitche added the Estimate: Small label Jun 15, 2020

ccooper added the provisioning label Jun 15, 2020

imbstack added this to the Provisioning is Efficient milestone Jun 17, 2020

ccooper removed the Estimate: Small label Sep 7, 2020

ccooper modified the milestones: Provisioning is Efficient, Service Engineering Backlog Sep 7, 2020

djmitche mentioned this issue Sep 9, 2020

Simplify the logic in checkWorker #3480

Closed

djmitche self-assigned this Sep 11, 2020

djmitche mentioned this issue Sep 14, 2020

[azure provider] factor out provisioning/deprovisioning resources into functions, handle status better #3505

Merged

djmitche closed this as completed in #3505 Sep 15, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[azure provider] state machine for creating / deleting resources #3058

[azure provider] state machine for creating / deleting resources #3058

djmitche commented Jun 15, 2020

djmitche commented Sep 11, 2020

[azure provider] state machine for creating / deleting resources #3058

[azure provider] state machine for creating / deleting resources #3058

Comments

djmitche commented Jun 15, 2020

djmitche commented Sep 11, 2020