
Support for AKS node pool cordonDrainTimeoutInMinutes parameter #13066

Closed
SteveLeach-Keytree opened this issue Aug 19, 2021 · 22 comments · Fixed by hashicorp/pandora#3556 or #26137

@SteveLeach-Keytree

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Description

AKS feature aks-engine#1276 made the cordon drain timeout configurable, but this attribute does not yet appear to be exposed through the azurerm provider.

We have some pods in a deployment that take a long time to start up as they load all the data they need. We generally have 4 pods in the ReplicaSet and a minimum of 2 (defined in a pod disruption budget). During an AKS upgrade we need to allow time for 2 new pods to start up on new nodes before terminating the last 2 on the old nodes, but the cordon & drain times out before they are ready.
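For context, a disruption budget like the one described above can be expressed with the Terraform Kubernetes provider. This is a minimal sketch; the resource exists in the hashicorp/kubernetes provider, but the names and label selector here are hypothetical:

```hcl
# Illustrative only: a pod disruption budget keeping at least 2 of the
# 4 replicas available during voluntary disruptions such as node drains.
# The resource name and labels are hypothetical.
resource "kubernetes_pod_disruption_budget_v1" "slow_start" {
  metadata {
    name = "slow-start-pdb"
  }

  spec {
    min_available = 2 # keep at least 2 of the 4 replicas running

    selector {
      match_labels = {
        app = "slow-start"
      }
    }
  }
}
```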

New or Affected Resource(s)

azurerm_kubernetes_cluster_node_pool

Potential Terraform Configuration

resource "azurerm_kubernetes_cluster_node_pool" "blue_node_pool" {
  name                  = "bluepool"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.k8s.id
  vm_size               = var.vm_size
  node_count            = var.node_pool_count
  os_disk_size_gb       = var.os_disk_size
  vnet_subnet_id        = azurerm_subnet.k8s_subnet.id
  orchestrator_version  = var.k8s_version

  enable_auto_scaling   = false
  enable_node_public_ip = false
  max_pods              = 110

  timeouts {
      drain = 40
  }

  lifecycle {
    ignore_changes = [
      node_count
    ]
  }
}

References

@aristosvo
Collaborator

aristosvo commented Aug 20, 2021

@SteveLeach-Keytree Do you know if this feature has made it into the AKS API already? I can't find it.

If not, I recommend first filing this issue upstream for the AKS team to expose it in the API.

@SteveLeach-Keytree
Author

> @SteveLeach-Keytree Do you know if this feature has made it into the AKS API already? I can't find it.
>
> If not, I recommend first filing this issue upstream for the AKS team to expose it in the API.

Is that not what the issue I linked in the reference was? I'm not entirely sure I fully understand what all the different projects are.

@aristosvo
Collaborator

@SteveLeach-Keytree Don't worry, there is a lot indeed 😄 aks-engine features might end up in AKS, but in general a feature in aks-engine is not exposed in AKS by default.

If there is no documentation or API feature we can use, you should ask the AKS team first.

@SteveLeach-Keytree
Author

> @SteveLeach-Keytree Don't worry, there is a lot indeed 😄 aks-engine features might end up in AKS, but in general a feature in aks-engine is not exposed in AKS by default.
>
> If there is no documentation or API feature we can use, you should ask the AKS team first.

That's what I'm saying - I linked the AKS team's change that makes this possible.

@aristosvo
Collaborator

I understand what you are saying, but you are pointing towards a change in aks-engine. The Terraform provider for Azure interacts with the Azure AKS APIs, which are a wrapper around the service, which probably uses some version of the aks-engine you mentioned.

We have to work with that API, as does the Azure CLI. The functionality you are interested in is, AFAIK, not exposed via the Azure AKS API. I cannot find it in the Azure CLI docs either.

To get more technical: this link points to the Go models we can use to configure the AKS node pool from Terraform, and I cannot find it there either. To get the feature exposed, or for more information on how it could be done, your best bet is to file an issue here.

@SteveLeach-Keytree
Author

Ah, got you now. I'd assumed that the AKS team were the people maintaining the AKS engine and was getting very confused.
I'll see if I can get that raised and link it.

@SteveLeach-Keytree
Author

Azure/AKS#2508

@stephybun stephybun added sdk/not-yet-supported Support for this does not exist in the upstream SDK at this time upstream/microsoft Indicates that there's an upstream issue blocking this issue/PR labels Feb 15, 2022
@BenjaminHerbert

Just FYI: the issue was closed on the AKS side due to inactivity 🤷

@cailyoung

Regardless of the AKS issue, it appears that at some point this arrived in the REST API: https://learn.microsoft.com/en-us/rest/api/aks/agent-pools/get?view=rest-aks-2023-08-01&tabs=HTTP#agentpoolupgradesettings

@cailyoung

@stephybun @katbyte @aristosvo apologies for the mentions, but I think this issue could progress, given that the API now appears to support it.

@norm-dole-dbtlabs

The code in the original submission predates the API change.

This needs to go under upgrade_settings, alongside max_surge, to match the Azure node pool settings logically:

resource "azurerm_kubernetes_cluster" "aks" {
  name                = "aks-${var.namespace}-${var.environment}"
  location            = var.region
  resource_group_name = var.resource_group_name

  default_node_pool {
    name                        = var.default_node_pool_name
    node_count                  = var.k8s_node_count
    vm_size                     = var.k8s_node_size
    temporary_name_for_rotation = "tempvmsize"
    vnet_subnet_id              = var.private_subnet_id

    upgrade_settings {
      max_surge     = 3
      drain_timeout = 1440 ##### new section #############
    }
  }
}
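For the azurerm_kubernetes_cluster_node_pool resource from the original request, the equivalent shape would presumably look like the sketch below. The attribute name drain_timeout_in_minutes is an assumption derived from the API field drainTimeoutInMinutes; check the provider documentation for the release that actually ships this:

```hcl
# Hypothetical sketch: the same drain timeout on a separate node pool
# resource, nested under upgrade_settings next to max_surge. The
# drain_timeout_in_minutes name is assumed, not confirmed by the provider.
resource "azurerm_kubernetes_cluster_node_pool" "blue_node_pool" {
  name                  = "bluepool"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.k8s.id
  vm_size               = var.vm_size
  node_count            = var.node_pool_count

  upgrade_settings {
    max_surge                = "33%"
    drain_timeout_in_minutes = 40
  }
}
```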

@norm-dole-dbtlabs

This value will also need to be respected when setting a temporary pool name for an upgrade. Currently, if you change the VM size of an AKS cluster in Terraform, you give it a temporary node pool name; Terraform creates that pool, moves your workload over, and then re-creates the default node pool. The drain_timeout for both the temporary pool and the re-created default pool should match the original custom setting, rather than falling back to the default of 30 minutes.

@aristosvo
Collaborator

aristosvo commented Jan 2, 2024

I'll take a quick look at how hard this will be to implement and will let you know ASAP.

From the first look of things it seems we need to update azurerm_kubernetes_cluster to a newer API version (this one), other than that it seems like a minor change. upgrade_settings.max_surge is already there, so it looks like just adding drain_timeout in that same code block would suffice after the SDK upgrade.

@aristosvo aristosvo added sdk/requires-upgrade This is dependent upon upgrading an SDK and removed upstream/microsoft Indicates that there's an upstream issue blocking this issue/PR sdk/not-yet-supported Support for this does not exist in the upstream SDK at this time labels Jan 2, 2024
@aristosvo aristosvo reopened this Jan 4, 2024
@akrock

akrock commented Feb 29, 2024

@aristosvo It looks like the dependencies listed in the PR have now been resolved. Is anything blocking the attached PR from moving forward?

@dkirrane

dkirrane commented Mar 1, 2024

Could you add support for the soak setting as well?

So all upgrade_settings as per the API?

  • maxSurge
  • drainTimeoutInMinutes
  • nodeSoakDurationInMinutes
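Mapped onto the provider's snake_case convention, those three API fields would presumably land in upgrade_settings like this. Only max_surge existed in the provider at the time; the other two attribute names are assumptions derived from the API fields:

```hcl
# Hypothetical mapping of the three AgentPoolUpgradeSettings API fields
# onto provider attributes. drain_timeout_in_minutes and
# node_soak_duration_in_minutes are assumed names, not yet shipped.
resource "azurerm_kubernetes_cluster" "example" {
  name                = "example-aks"
  location            = "westeurope"
  resource_group_name = "example-rg"
  dns_prefix          = "example"

  default_node_pool {
    name       = "default"
    node_count = 1
    vm_size    = "Standard_D2s_v3"

    upgrade_settings {
      max_surge                     = "33%" # maxSurge
      drain_timeout_in_minutes      = 40    # drainTimeoutInMinutes
      node_soak_duration_in_minutes = 10    # nodeSoakDurationInMinutes
    }
  }

  identity {
    type = "SystemAssigned"
  }
}
```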

@zioproto
Contributor

This is possible on AKS today, and the features are all GA:
https://learn.microsoft.com/en-US/azure/aks/upgrade-aks-cluster?tabs=azure-cli#customize-node-surge-upgrade

Generally Available: AKS support for node soak duration for upgrades
https://azure.microsoft.com/en-gb/updates/generally-available-node-soak-duration-for-upgrades/

@stephybun @ms-henglu we need to have this in the provider. We already have an issue for the AKS module: Azure/terraform-azurerm-aks#530

@itay-grudev
Contributor

itay-grudev commented Apr 26, 2024

Without the

  • --drainTimeoutInMinutes
  • --node-soak-duration

parameters, we are currently experiencing downtime during automatic node upgrades. Nodes are upgraded in very rapid succession, before the services inside have a chance to recover.

I would call this a really high priority problem as it genuinely leaves services offline for a while. Could you please focus some efforts on this?

@norm-dole-dbtlabs

norm-dole-dbtlabs commented Apr 26, 2024

Hahah yeah...come on IBM let's get this done; it should have been updated in November 2023

@aristosvo
Collaborator

I'll take a quick look. Last time it was blocked on Microsoft's API behavior, which was reported here but has triggered no response so far.

@norm-dole-dbtlabs Not cool. I'm an independent contributor; nobody on the Terraform or HashiCorp side can be blamed for anything in this regard.

@stephybun
Member

@zioproto is this something you could help with by bringing it to the attention of the AKS team?

@aristosvo
Collaborator

aristosvo commented May 15, 2024

I've brought the PR up to date and completed things like the docs.

The issue is still there; implicitly resetting the upgrade settings to the defaults is therefore not included in the current implementation. Whether that is a problem is up to the reviewers.

Support for node_soak_duration_in_minutes is left out of scope, as that requires an update to a newer API version.

@norm-dole-dbtlabs

@aristosvo that IBM shot wasn't at you; I think it's noble that you're doing this work. But it's also the responsibility of HashiCorp/Microsoft, through whatever partnership they have, to keep this up to date in real time rather than hoping community members shore up their enterprise support on a best-effort basis.
