---
authors: Pedro Palazón Candel <pedro@joyent.com>, Trent Mick <trent@joyent.com>
contributors: Robert Mustacchi <rm@joyent.com>, Joshua Clulow <jclulow@joyent.com>
state: draft
---

# RFD 3 Triton Compute Nodes Reboot

## Introduction

This RFD proposes a new `sdcadm experimental reboot-plan` collection of commands (an unpromised interface) for working towards controlled and safe reboots of selected servers in a typical Triton (formerly SmartDataCenter or SDC) setup.

One of the least specified and hardest parts of Triton upgrades right now is managing the reboots of CNs and the headnode safely. In particular:

  • controlling reboots of the "core" servers (those with Triton core components, esp. the HA binders and manatees)
  • reasonably helpful tooling for rebooting (subsets of) the other servers in a DC: rolling reboots, reboot rates

The desired set of subcommands should be able to handle:

  • Creation (and eventually queue) of a new reboot plan
  • Check status of currently queued/in-progress reboot plan
  • Historical details of reboot plans already executed

## Terminology

### How is a server reboot executed in Triton?

In order to reboot a Triton server, we execute the following CNAPI request

sdc-cnapi /servers/<UUID>/reboot -X POST

which will result in the creation of a reboot job (whose UUID is returned as part of the response).
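For illustration, capturing the created job's UUID might look like the following sketch. The `sdc-cnapi` CLI is stubbed out here (as `sdc_cnapi`) with a canned response so the example is self-contained, and the exact JSON shape of the reply is an assumption:

```shell
#!/bin/bash
# Stub of the real `sdc-cnapi` CLI; the response shape below is an
# assumption, not CNAPI's documented output.
sdc_cnapi() {
    echo '{"job_uuid": "3d2a1f6e-0000-1111-2222-333344445555"}'
}

server_uuid="564dce05-0000-1111-2222-333344445555"   # hypothetical server

# POST the reboot and pull the created job's UUID out of the response.
job_uuid=$(sdc_cnapi "/servers/$server_uuid/reboot" -X POST |
    sed -n 's/.*"job_uuid": *"\([^"]*\)".*/\1/p')
echo "reboot job: $job_uuid"
```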

### What does the reboot job do?

The reboot job will send the server the reboot message (exit status 113) and set the server's `transitional_status` property to `rebooting`.

Note that the job does not wait for reboot completion.

### How is the completion of a server reboot checked?

We check for server reboot completion by looking at the value of the status property of CNAPI's server object. Obviously, this requires CNAPI to be up and running, and this approach cannot be used when, for example, we reboot a server holding a manatee shard's primary member.

In this case - and, in general, in any case involving a CN hosting a manatee shard member - the better way to poll for server reboot completion is to look at the manatee shard state itself, using manatee-adm show from any of the other manatee shard members.
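For the non-core case, the poll-until-up loop might be sketched like this. The status check is stubbed so the example is self-contained; a real implementation would ask CNAPI for the server's `status` and sleep between polls:

```shell
#!/bin/bash
# Stubbed status check: pretends the server comes back on the third poll.
polls=0
status="rebooting"
check_status() {
    polls=$((polls + 1))
    # Real version (sketch): status=$(sdc-cnapi /servers/$uuid | json -H status)
    if [ "$polls" -ge 3 ]; then status="running"; else status="rebooting"; fi
}

until [ "$status" = "running" ]; do
    check_status
    # A real loop would sleep here and raise an alarm once the plan's
    # maximum allowed offline time is exceeded.
done
echo "server is back after $polls polls"
```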

## Reboot plan

It is proposed to use a "reboot plan" similar to the upgrade plans already generated by sdcadm. This plan will contain the following information:

  • name of reboot plan (optional)
  • start time for reboot plan (could be immediate at first pass)
  • list of servers to reboot (uuids)
  • maximum number of servers that can be offline at once
  • maximum time each server should be offline before an alarm is triggered
  • state of the reboot plan: one of created, stopped, running, canceled, or complete (perhaps others)

When the reboot plan is created, a UUID identifying the stored plan will be generated, so that we can refer to the reboot plan in the future.

The reboot plan should also store historical data about how long each server was offline, so that we can make better estimates of downtime in future plans.

This data will be kept individually for each server, so that dealing with many servers - on the order of thousands - remains possible, and so that recording the completion of each server's reboot stays simple.
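Putting those fields together, a stored reboot plan might look roughly like this (every field name, value, and UUID here is a sketch of what is described above, not a settled schema):

```json
{
  "uuid": "0b5a2f64-9c1d-4e2a-8b3f-6d7e8f9a0b1c",
  "name": "platform-rollout",
  "state": "created",
  "start_time": "2015-09-01T08:00:00.000Z",
  "concurrency": 3,
  "max_offline_minutes": 20,
  "servers": [
    "9f3c0d2e-1a2b-4c3d-8e4f-5a6b7c8d9e0f",
    "1c7e5b90-2b3c-4d4e-9f5a-6b7c8d9e0f1a"
  ]
}
```

Per-server started/finished data would live alongside the plan, one record per server, as discussed above.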

## Execution of the reboot plan

Execution of the reboot plan consists of the creation of the plan, queuing of the different servers to reboot at the provided concurrency, and the verification of the server reboot, either through CNAPI for servers not hosting core Triton components, or through a manatee instance for servers hosting core components.

### Creation, modification and retrieval of reboot plan information

The creation of the plan will consist of a POST request to a given API end-point (POST /reboot-plans), which will result in a 201 Created response, together with the plan UUID.

Subsequent retrievals of the plan will be done using GET /reboot-plans/<uuid>

Updates of the reboot plan by the process executing it could be done using PUT /reboot-plans/<uuid>.
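Assuming those endpoints, the plan lifecycle looks like the following sketch. The endpoints are proposals in this RFD, not current CNAPI API, so `sdc-cnapi` is stubbed here (as `sdc_cnapi`) with canned, assumed response bodies:

```shell
#!/bin/bash
# Stub of `sdc-cnapi` returning canned responses for the three proposed
# reboot-plan endpoints; the response bodies are assumptions.
sdc_cnapi() {
    case "$1 $2" in
        "POST /reboot-plans")   echo '{"uuid": "a1b2c3d4-0000-1111-2222-333344445555"}' ;;
        "GET /reboot-plans/"*)  echo '{"state": "running"}' ;;
        "PUT /reboot-plans/"*)  echo '{"state": "stopped"}' ;;
    esac
}

# Create the plan (201 Created) and capture its UUID.
plan_uuid=$(sdc_cnapi POST /reboot-plans |
    sed -n 's/.*"uuid": *"\([^"]*\)".*/\1/p')
echo "created plan $plan_uuid"

sdc_cnapi GET "/reboot-plans/$plan_uuid"   # retrieve the plan
sdc_cnapi PUT "/reboot-plans/$plan_uuid"   # update it (e.g. record a reboot)
```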

The logical storage for these plans is Moray, through CNAPI, given that CNAPI is the application in charge of anything related to servers. This has the drawback that the plan's started/finished members cannot be updated for core servers while manatee is down. However, since we would reboot these core servers sequentially, we can update both started_at and finished_at for any of those CNs once we have waited for manatee shard availability.

### Execution of the plan

Execution of the reboot plan needs to be driven in a way that a failure or interruption of the process running the plan will not leave the plan permanently incomplete.

Therefore, the process in charge of creating/running reboot plans will first check for the existence of a reboot plan whose state isn't "finished", and continue with that one or, optionally, ask the user to cancel such plan before attempting the execution of a new one (TBD).

Once the new plan is created, this process will begin either with the reboot of the first core server, when dealing with CNs hosting Triton core components, or with the reboot of the first batch of CNs not hosting core components.

The process will first check Workflow API to verify that the reboot jobs have been successfully created and executed, and then will poll either CNAPI for reboot status of CNs, or use manatee-adm to check for the state of the manatee Triton shard when rebooting CNs with core components.

This routine will be repeated until all the CNs hosting Triton core members have been rebooted, or until we complete the reboot of all the CNs not hosting Triton core instances, at the provided concurrency.

As specified above, every time a server reboot has been completed, the process will update the reboot plan object using PUT requests to the reboot plan URI.
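For the non-core servers, the batched portion of that loop can be sketched as follows. All of the CNAPI interactions - reboot job creation, status polling, and the PUT back to the plan - are stubbed with echo so the control flow is self-contained:

```shell
#!/bin/bash
# Roll through servers at a fixed concurrency. The stub functions stand
# in for the real CNAPI calls described above.
servers="cn1 cn2 cn3 cn4 cn5 cn6 cn7"
concurrency=3
rebooted=0

reboot_server()   { echo "reboot requested: $1"; rebooted=$((rebooted + 1)); }
wait_for_server() { echo "reboot complete:  $1"; }            # poll CNAPI status
update_plan()     { echo "recording $1 in the reboot plan"; } # PUT /reboot-plans/:uuid

run_batch() {
    for b in $1; do reboot_server "$b"; done
    for b in $1; do wait_for_server "$b"; update_plan "$b"; done
}

batch=""; count=0
for s in $servers; do
    batch="$batch $s"; count=$((count + 1))
    if [ "$count" -eq "$concurrency" ]; then
        run_batch "$batch"; batch=""; count=0
    fi
done
[ -n "$batch" ] && run_batch "$batch"    # final partial batch, if any
echo "rebooted $rebooted servers"
```

The core-server path would instead run one server at a time and check the manatee shard state between reboots, as described above.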

## Reboot of servers hosting core and non-core components

CNs with core components need to be rebooted before any other CNs, using a sequence established by manatee's shard administration. The main problem is keeping track of the reboots of these core CNs - in particular, how long it took to reboot each node. Additionally, if we drive these reboots from the sdcadm process itself, we will not be able to schedule the reboot of CNs hosting core Triton components. I would say that makes sense, since a failure in the reboot of these CNs will likely compromise the state of the whole Triton setup and should therefore be an attended operation.

The proposed solution is to always reboot CNs hosting any Triton core component first - even those that are not members of a manatee shard or binder's cluster of ZooKeepers - for example, a CN hosting an imgapi instance.

Reasoning: in many cases we will be running the reboot plan right after core service upgrades. It would be nice to get out of maint soon and reach a state where "the system itself is on the latest"; non-core CNs can then be rebooted at leisure.

## Which process should be in charge of executing the reboot plan

### Option 1: CNAPI

Given that the reboot plan should also handle reboots of the "core" servers, we'll need to add a bunch of manatee-adm related logic - already present in the sdcadm code base - to CNAPI.

CNAPI itself depends on manatee, and having it handle reboots of the CNs hosting manatee's shard members could cause problems. In fact, the way we check for a successful reboot of core CNs is by checking the status of the manatee shard from one of the available shard members, while we check the availability of rebooted non-core CNs using CNAPI (and manatee).

At the moment, CNAPI doesn't have any code for scheduling the execution of a reboot plan or anything similar. It is merely the HTTP server process.

Additionally, eventual HA of CNAPI would require multiple instances to elect a leader here, which also makes us lean away from this option.

### Option 2: Workflow Runner

To make it possible to decide when a reboot plan should execute, we could use node-workflow and create a job with the servers to be rebooted. There would be a workflow, which CNAPI could use to "queue" the reboot of the selected servers. It would receive as arguments the servers to be rebooted and the desired start time for the plan's execution. It could even poll the servers for successful reboot and eventually fire an alarm when a reboot takes too long.

Currently, workflow is unable to run parallel tasks, so we would have to drive the concurrent reboot of CNs through a monolithic workflow task.

It also has the drawback that any outage of wf-runner or moray would result in the workflow job being cancelled before the wf-runner process is restarted.

### Option 3: SdcAdm

sdcadm itself could drive the process, provided that the subcommands in charge of creating reboot plans also check for an unfinished reboot plan, to cover the cases where the sdcadm process exits during the course of a reboot plan's execution.

That is to say: if we issue sdcadm reboot --concurrency=10 against a Triton setup with 30 servers, and the sdcadm process exits in the middle of the reboots, we should make sure that the next invocation of this command - whatever the provided arguments - asks the user about the unfinished reboot plan, and whether to resume it, or to flag it as finished and begin a new one.

The main drawback of this approach is that it would not be able to queue the execution of the reboot plan; the plan would need to be executed immediately.

### Option 4: New sdcadm-rebooter service, running in HN GZ

Another option to address this queuing problem is to create a new sdcadm-related service in charge of executing the reboot processes, instead of doing so straight from the sdcadm CLI at the moment the command is typed.

Having an idling node process wasting memory when there is no reboot plan doesn't sound like the best possible idea. Instead, we could have a transient service which only lives for as long as it takes to find a current reboot plan and run it. sdcadm reboot-plan run could start it at the moment a new reboot plan is created. Being a transient service would also make it possible for it to re-attach to an existing reboot plan after a reboot of the headnode, once the core services have come up.

## Failure handling

A failure in the reboot of one or more nodes included in a reboot plan should make the whole plan fail or, at least, report that the reboot of those nodes failed, and allow the operator to choose whether continuing is really an option or the remainder of the reboot plan should be canceled.

Perhaps the reason for the failure of one or more nodes (when reboots happen concurrently) is that a platform image won't boot, and we don't want to attempt reboots of more CNs using it.

## Proposal for reboot coordination and the CLI

### Scenario A: 1 HN, 9 CNs (2 core, 7 non-core). Rebooting everything.

(Will skip the "experimental" part in examples).

$ sdcadm reboot-plan --help
... help output ...
Commands:
    create      Create a reboot plan.
    run         Execute/continue the reboot plan.
    status      Show status of the current reboot plan.
    watch       Watch (and wait for) the current running reboot plan.
    stop        Stop execution of the current (running) reboot plan.
    cancel      Cancel the current reboot plan.

Create the reboot plan:

$ sdcadm reboot-plan create --all -W
Warning: The following servers will reboot without a platform change
(use '--skip-current' to exclude servers already on target boot platform):
    $server_hostname ($server_uuid)
    ...

Warning: The following servers will reboot with a platform *downgrade*:
    $server_hostname ($server_uuid)
    ...

Created reboot plan $uuid (10 servers, max concurrency 3):
    Reboot headnode: platform $platold1 -> $platnew1
    Reboot 2 core servers: platform $platold2 -> $platnew1
    Reboot 4 servers: platform $platold3 -> $platnew1
    Reboot 1 server: platform $platold2 -> $platnew2
    ...

Notes:

  • The summary shows a separate line for each ($old, $new) platform tuple. A single line for each server is (a) too much and (b) less useful.
  • Warnings about possible mistakes are highlighted. After all warnings are emitted, this errors out; use '-W,--ignore-warnings' to create the reboot plan anyway. Other specific options work around specific warnings.

Run the plan and optionally --wait for it to complete (with progress info). Optionally can give the reboot-plan UUID (to guard against another operator slipping in a different plan).

# Usage: sdcadm reboot-plan run [--wait] [--yes] [UUID]
$ sdcadm reboot-plan run --wait
This will run a plan to reboot 10 servers (rebooting a maximum of 3
servers at a time). Details:
    Reboot headnode (platform $platold1 -> $platnew1)
    Reboot 2 core servers (platform $platold2 -> $platnew1)
    Reboot 4 servers (platform $platold3 -> $platnew1)
    Reboot 1 server (platform $platold2 -> $platnew2)
    ...

    Warning: This plan includes a reboot of the headnode, which will
        terminate this login session. Reboots will *continue*. Use
        `sdcadm reboot-plan watch` to re-attach after the headnode
        reboots.

Would you like to continue? [y/N] y

Running reboot plan $uuid (use `sdcadm reboot-plan watch` to re-attach)
...

This is where sdcadm hands off to the equivalent of sdcadm reboot-plan watch. We get a progress bar and status messages for the plan:

...
Reboot plan $uuid  [=====>                    ]  $progbar_stuff

Progress messages, for core servers. Note: I'm proposing here that 'sdcadm reboot-plan run' not consider moving on from a core server until all core services are up on that server.

...
Rebooting core server $hostname: platform $current_platform -> $boot_platform
Rebooted core server $hostname: $start - $end (3m10s)
Wait for core services to come online on server $hostname

Progress messages, for non-core servers:

...
Rebooting server $hostname: platform $current_platform -> $boot_platform
Rebooted server $hostname: $start - $end (3m10s)

When complete:

...
Completed reboot plan $uuid (rebooted 10 servers in 1h20m15s)

Ctrl+C'ing doesn't stop the reboots. That might not be clear to a user trying to cancel part way through. So we should trap that and try to be clear:

...
Running reboot plan $uuid (use `sdcadm reboot-plan watch` to re-attach)
Reboot plan $uuid  [=====>                    ]  $progbar_stuff
^C
Stopped watching reboot plan $uuid
**Note: The reboot plan is still running! Use `sdcadm reboot-plan stop`
to stop it.**

Done!

The other commands (see the 'sdcadm reboot-plan --help' output above) are self-explanatory for now.

## Affected APIs

  • CNAPI gets two buckets to store reboot plan info:

      bucket cnapi_reboot_plans
          uuid (UUID)
          concurrency (Number)
          state (String)          # Note not "execution" please.
                                  # Values are: "created", "stopped",
                                  #   "running", "canceled", "complete".
                                  #   Perhaps others.
          ...
    
      bucket cnapi_reboots
          reboot_plan (UUID)      # or "reboot_plan_uuid" if we want to keep using that pattern
          server_uuid (UUID)
          server_hostname (String)  # maybe include this to make it nicer for reporting
          started_at              # debatable on naming and how to store date for timestamps
          finished_at             #    name: started_at vs started vs started_timestamp
                                  #    type: integer ms vs toISOString()
          job_uuid (UUID)
          ...
    

    The reason for two buckets is to be able to cope with many many servers, e.g. on the order of 1000s.

  • New CNAPI endpoints to control this:

      RebootPlanCreate (POST /reboot-plans)
      RebootPlanGet (GET /reboot-plans/:uuid)
      RebootPlanGetActive (GET /reboot-plans/active)
      RebootPlanUpdate (PUT /reboot-plans/:uuid)
    
  • Modify CNAPI /servers/<uuid>/reboot to accept a ?reboot_plan=$reboot_plan_uuid argument, used to initialize the reboot job for servers when there's a reboot plan in progress, and to avoid the failure response which the reboot end-point would otherwise return as part of the mechanism to prevent manual reboots while a reboot plan is in progress.

  • sdcadm gets a new sub-class to handle the RebootPlan CLI's associated set of sub-commands.

  • The new sdcadm-rebooter transient service needs to be added to sdcadm too.

  • Server selection can be done using sdc-server lookup, as it's done for other sdcadm sub-commands.

    [root@headnode (coal) ~]# sdcadm experimental reboot help core
    Reboots Compute Nodes hosting manatee shard instances.
    
    Usage:
         sdcadm reboot-plan create [ -a ] [ -c ] [ -o ] [ -h ] [-n ] [ -y ] \
         [ -W ] [ -s ] [ -r ] [SERVER] [SERVER]...
    
    Options:
        -y, --yes             Answer yes to all confirmations.
        -h, --help            Show this help.
        -n, --dry-run         Go through the motions without actually rebooting.
        -r INT, --rate=INT    Number of servers to reboot simultaneously. Default: 5.
    -W,--ignore-warnings  Create the reboot plan despite emitting warnings
                          for servers already on the target platform (or
                          other warnings).
    -s, --skip-current    Skip the reboot of servers already on the target
                          boot platform.
    
    Server selection:
      -c, --core              Reboot the servers with Triton core components.
                              Note that this will include the headnode.
      -o, --non-core          Reboot the servers without Triton core
                              components.
      -a, --all               Reboot all the servers.
    
    (Maybe add an option to allow skipping headnode reboot?)
    
Use "--all" to reboot all the setup servers or pass a specific set
    of SERVERs. A "SERVER" is a server UUID or hostname. In a larger datacenter,
    getting a list of the wanted servers can be a chore. The
    "sdc-server lookup ..." tool is useful for this.
    
    Examples:
    
        # Reboot all non-core servers.
        sdcadm reboot-plan create --non-core
    
        # Reboot non-core setup servers with the "pkg=aegean" trait.
        sdcadm reboot-plan create \
            $(sdc-server lookup setup=true traits.pkg=aegean | xargs)
    
    # Reboot non-core setup servers, excluding those with an "internal=PKGSRC" trait.
        sdcadm reboot-plan create \
            $(sdc-server lookup setup=true 'traits.internal!~PKGSRC' | xargs)
    

## Extras

Some extra work which should be a part of this RFD. We might want to prioritize these and do only some of them.

  1. DAPI should prefer rebooted servers. While we are rolling reboots across a chunk of the fleet, it would be nice to bring the DC out of maint as soon as possible. We might want to do this after a subset of all the DC's servers are rebooted. It would then be nice if DAPI preferred already-rebooted servers. I think it should be a soft preference only.

  2. Avoid rebooting servers with active CNAPI tasks or "inflight" provisions. We should avoid rebooting a CN while there is a CNAPI task running on that CN or an "inflight" provision assigned to that CN. Are there long-running tasks that we should ignore? E.g. a hanging provision? a super long image creation? Is there an active CNAPI task for a docker logs -f? If there is a conflict, the API/CLI should report those and give the operator an override. This logic could be handled in the CNAPI ServerReboot endpoint.

  3. Consider a need for a way to "drain" a server. What if there are constant tasks coming to a server (e.g. someone snapshotting or restarting vms constantly)... it would be nice to have a per-server way to drain tasks from it. I think we probably don't need/want this complexity right now.

  4. When going through planned reboots, should we take note of the current_platform at creation time and compare it at reboot time? If they don't match, consider options: the server may have been manually rebooted already. Perhaps warn and skip it ('state=skipped').

  5. I'd like if CNAPI "/reboot-plans/$plan" supported a short uuid (e.g. the first chunk) natively. That would allow us to use the short uuid prefix in status messages.

  6. CNAPI's ServerReboot should check for an active reboot-plan. If there is one, it bails unless ServerReboot was given the plan's UUID. I.e. guard against manual reboots during a running reboot-plan.

  7. Are there other warnings that sdcadm reboot-plan create should include? Ask ops/support what might surprise. E.g. warn on unsetup CNs? CNs that are down? CNs in any state other than "setup"?

  8. Get a one-liner:

     sdcadm reboot-plan create --all --run --wait
    

    which should mostly just be create handing off to run handing off to wait.

  9. Consider what a manual reboot process could be. I.e. Say the operator wants to have the last word on when to reboot the next box. Instead of sdcadm reboot-plan run (or additionally to), we could have sdcadm reboot-plan step or next (like a debugger) to have it do the next step. This would confirm the next step:

     $ sdcadm reboot-plan next
     Reboot plan $uuid (10 servers, max concurrency 3)
     Next step: reboot 3 servers
         server $hostname1: $current_platform -> $boot_platform
         server $hostname2: $current_platform -> $boot_platform
         server $hostname3: $current_platform -> $boot_platform
    
     Would you like to continue? [y/N]
    

    and so on, until complete.

  10. Add support for handling a (non-core) server that takes too long to reboot. With the current plan, i.e. doing nothing, we'd end up rebooting all the other servers, and polling endlessly for that last server to come up. Might be the same behaviour if Ur crashes on that server. Better might be to stop reboots of other servers if hitting that issue, e.g. what if a new platform image won't boot. Naively we could wipe out the entire DC.

  11. What to do if core services don't come back on a core server. If it takes a long time to (manually) get those services back up... how surprised will we be if reboots then start happening again? Should that process go stale after, say, an hour? Should one be able to see that a reboot-plan is active even when core systems are down? E.g. sdcadm could have a cache of the latest known state in its /var area even if manatee is down (because of a manatee server update gone wrong).

  12. Consider starting the long wanted sdcadm status to give a quick status of the DC. Near the top of the list should be: - currently in maint? - is there a reboot plan running?

  13. PhillipS requests "an option to reboot only empty CNs", where in his experience they've had to query zoneadm list on CNs instead of vmadm because of "problems in the past where tools like vmadm were too old to see the running instances". Perhaps something like this:

    $ sdcadm reboot-plan create \
        $(sdc-server lookup ...someway to query empty servers... | xargs)
    

## Implementation Details

### cn-agent

  • Needs a way to keep track of currently in-progress tasks.
  • Needs to provide a mechanism to stop accepting new tasks until a new order is given. This could be used by the update_agent task, which currently doesn't check whether the execution of the self-update task would interrupt any other tasks in progress. Right now, neither server reboot nor cn-agent self-update cares at all about in-progress tasks.

Both could be new tasks, as follows:

  • tasks_in_progress
  • pause_task_handler
  • resume_task_handler

"In progress" tasks could be just saved in memory, together with a counter.

Additionally, the missing 'show_tasks' task should be added in order to fix the already documented CNAPI end-point ServerTaskHistory (GET /servers/:server_uuid/task-history), which currently returns an internal error.

### CNAPI

  • Provide an end-point to report running tasks for a given server: GET /servers/:server_uuid/running-tasks.
  • Create the cnapi_reboot_plans and cnapi_reboots Moray buckets.
  • Create models for aforementioned buckets.
  • Modify POST /servers/:server_uuid/reboot to accept an optional reboot_plan=$reboot_plan_uuid argument.
  • Modify POST /servers/:server_uuid/reboot to return an error response (409 Conflict) when somebody attempts a reboot of a server currently included in an in-progress reboot plan.
  • Modify POST /servers/:server_uuid/reboot so that it waits for all the tasks being run by the server's cn-agent before actually sending the reboot. This could be done by modifying the workflow to include a call to pause_task_handler as its first step, then calls to tasks_in_progress until we get a zero-length response. This would also require an end-point for "pause_task_handler", since the only way for workflow to do such a thing is through CNAPI end-points; it cannot talk straight to the CN.
  • Provide GET /reboot-plans end-point to make possible for ops to get a list of reboot plans.
  • Provide POST /reboot-plans end-point to support creation of a new reboot plan.
  • Provide GET /reboot-plans/:reboot_plan_uuid to get details of the given reboot plan.
  • Provide PUT /reboot-plans/:reboot_plan_uuid to allow updates of the reboot plan. The different actions - "stop", "run/continue", or "next" if we implement step-based execution - will result in the sdcadm-rebooter service updating either the reboot plan itself, or the per-server reboots associated with the plan, with the start/finish times of the different reboots.
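The pause-drain-reboot sequence described above can be sketched as follows. The handlers named here (pause_task_handler, tasks_in_progress) are the cn-agent tasks proposed in this RFD, stubbed so the flow is runnable; the stub pretends two in-flight tasks finish, one per poll:

```shell
#!/bin/bash
# Sketch of "pause the task handler, drain in-flight tasks, then reboot".
remaining=2

pause_task_handler() { echo "cn-agent task handler paused"; }
tasks_in_progress()  { echo "$remaining"; }            # proposed cn-agent task
task_finishes()      { remaining=$((remaining - 1)); } # stands in for time passing

pause_task_handler
while [ "$(tasks_in_progress)" -gt 0 ]; do
    task_finishes   # a real workflow would sleep and re-poll via CNAPI here
done
echo "no tasks in progress; sending reboot"
```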

### sdcadm

  • New RebootPlan CLI class, including sub-commands for create, run, status, watch, stop, cancel, and next (if we decide to go ahead with step-based execution of the reboot plan requiring confirmation).
  • New sdcadm-rebooter HN GZ service, including transient SMF manifest.
  • Required modifications to allow re-use of shared logic between this service and the sdcadm CLI. (This might be complex and require some extra time).
  • Logic to figure out "core" servers, manatee's shard mgmt. (shared with sdcadm up manatee), ...