Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[azure provider] state machine for creating / deleting resources #3058

Closed
djmitche opened this issue Jun 15, 2020 · 1 comment · Fixed by #3505
Closed

[azure provider] state machine for creating / deleting resources #3058

djmitche opened this issue Jun 15, 2020 · 1 comment · Fixed by #3505

Comments

@djmitche
Copy link
Collaborator

Move the state machine that handles creation / deletion of Azure resources into its own class with its own state tracking and separate testing.

This lets us ensure this implementation is really bulletproof, while relying on it in the rest of the provider to simply "return a VM or fail trying" and "delete all resources associated with this VM".

@djmitche
Copy link
Collaborator Author

The big advantage to this refactor would be more easily testable code and secondarily more readable/modifyable code (say, if we find a need to create different resources later). The downside is introducing new bugs :)

There are a few "weird" things about this process that make it not a "normal" state machine:

  • state is both in the Azure API and in the workers DB table
  • some things can happen in parallel (disks) while others must be serialized (ip -> nic -> vm).
  • there are no "events" that transition from state to state -- we just poll the API

So, I think implementing this as a "normal" state machine with functions for each "state transition" would be difficult and not result in something especially readable or testable.

The current implementation of provisionResources reads something like

provision a NIC
if not done, return
provision an IP
if not done, return
provision a VM
if not done, return
on error, switch to removing the worker

which is a pretty good summary of what's happening. There are some callbacks and functions to modify the worker's provisionerData that don't read too clearly.

Another structure I could think of to handle this would be to set up nested function calls and use an exception to signal that execution is blocked and should be retried on the next poll. Something like

fn provisionResources():
  try:
    createVm()
  on error = Blocked:
    return
  on any other error:
    removeWorker

fn createVm():
  if providerData.vm is complete:
    return vm
  if providerData.vm indicates creation is in progress:
    if Azure API says creation is complete:
      get information on disks and store it in providerData.vm
      return vm
    else:
      raise blocked
  nic = createNic()
  call Azure API to create the VM with 'nic', store the operation in providerData.vm
  raise blocked

fn createNic():
  if providerData.nic is complete:
    return nic
  if providerData.nic indicates creation is in progress:
    if Azure API says creation is complete:
      return nic
    else:
      raise blocked
  ip = createIp()
  call Azure API to create the NIC with that IP, store the operation in providerData.nic
  raise blocked

fn createIp():
  if providerData.ip is complete:
    return ip
  if providerData.ip indicates creation is in progress:
    if Azure API says creation is complete:
      return ip
    else:
      raise blocked
  call Azure API to create the IP, store the operation in providerData.ip
  raise blocked

I'm not sure this is clearer than what we have! Also, the exception trick won't work too well with parallel operations. And, this is definitely no easier to test than what we have.

I'll think about this some more, but maybe what we have is good enough? Ideas?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
3 participants