
Unexpected Node Pool scaling on unrelated value change #2249

Open
eduardOrthopy opened this issue Aug 3, 2023 · 5 comments

Terraform provider version

provider registry.terraform.io/opentelekomcloud/opentelekomcloud v1.35.4

Affected Resource(s)

  • opentelekomcloud_cce_node_pool_v3

Terraform Configuration Files

upon request

Debug Output/Panic Output

None

Steps to Reproduce

Minimal reproduction

  1. Set up a CCE cluster with one node pool and autoscaling enabled, and set initial_node_count to a value larger than the minimum node count
  2. Wait for the autoscaler to scale down the node pool
  3. Add a k8s_tag, or perform any other modification, on the node pool (see the sketch below)
  4. terraform apply
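
For illustration, here is a minimal sketch of the kind of unrelated change meant in step 3; the tag key and value are placeholders and not taken from the original report:

resource "opentelekomcloud_cce_node_pool_v3" "node_pool" {
  # ... all other attributes unchanged ...

  # Adding or changing a single tag is enough to trigger the unexpected rescale.
  k8s_tags = {
    "example.com/purpose" = "reproduction" # hypothetical tag
  }
}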

Expected Behavior

Either: only the modifications that were made are applied.
Or at least: all modifications are shown in the output of apply or plan.

Actual Behavior

The node pool is scaled back to initial_node_count. This is not shown in the output of plan and happens regardless of whether the initial_node_count property is listed in ignore_changes or not.
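
For reference, a minimal sketch of the ignore rule referred to here (the same block also appears in the full configuration further down this thread); even with it in place, the rescale still happens:

  lifecycle {
    ignore_changes = [
      initial_node_count,
    ]
  }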

Important Factoids

  • The change is not displayed in the plan, regardless of ignore options
  • In the minimal example above the unexpected scaling goes up; if it goes down instead (because the autoscaler had scaled out beyond initial_node_count), this is very likely to cause outages

References

GH-1961 already references this issue. It was closed with the comment that this is expected behavior, together with a proposed workaround that essentially amounts to being careful when touching these resources.

Remarks

  • If this is the wrong issue tracker to get this addressed, please let me know.
  • From a user/customer perspective, the resolution of "Resizing node pool with terraform causes unwanted scale-in" #1961 is not a very satisfactory conclusion. I would ask you to at least display the upcoming change in the plan.
  • At the very least, the name initial is misleading in my mind. Maybe desired would better convey that this value is more than a "set once" thing.
@anton-sidelnikov
Member

anton-sidelnikov commented Aug 29, 2023

@eduardOrthopy Hi, could you share more details, ideally with your tf configs? I cannot reproduce this issue; my nodes are not scaled down if initial_node_count > min_node_count. It is also not possible to rename initial_node_count to desired or anything else, because the Terraform SDK does not support aliasing for attributes, and renaming a required parameter would require changes in all existing configurations.

@eduardOrthopy
Author

@anton-sidelnikov Thanks for looking into this. I can get something together, but it will be the week after next. Did you wait for the autoscaler to change the node count before running the update?

Regarding the renaming: OK, I understand that this would be a breaking change. It might still be worth considering should you ever plan a major version release.

@anton-sidelnikov
Member

@eduardOrthopy Yes, I did a lot of different checks, but no luck. Of course we can look into it together, just ping me when you're ready.

@eduardOrthopy
Author

eduardOrthopy commented Sep 6, 2023

@anton-sidelnikov I was just about to start working on a minimal reproduction template, and I re-read your reply.
You said that

my nodes are not scaled down if initial_node_count > min_node_count.

Do you mean that the autoscaler did not work for you?

That would be a different issue. I am talking about an issue with terraform apply after the autoscaler has adapted the node pool, i.e. removed some nodes because they were idle.

I set my node pool to 7 nodes initially, with a minimum of 3, and the following autoscaler config (please adapt the redacted and region values to your test region):

resource "opentelekomcloud_cce_addon_v3" "autoscaler" {
  template_name    = "autoscaler"
  template_version = "1.23.17"
  cluster_id       = opentelekomcloud_cce_cluster_v3.cluster.id

  values {
    basic = {
      "cceEndpoint" = YourRegionEndoint
      "ecsEndpoint" = YourRegionEndoint
      "region"      = YourRegion
      "swr_addr"    = Redacted
      "swr_user"    = Redacted
    }
    custom = {
      "cluster_id"                     = opentelekomcloud_cce_cluster_v3.cluster.id
      "tenant_id"                      = data.opentelekomcloud_identity_project_v3.current.id
      "coresTotal"                     = 16000
      "expander"                       = "priority"
      "logLevel"                       = 4
      "maxEmptyBulkDeleteFlag"         = 11
      "maxNodesTotal"                  = 100
      "memoryTotal"                    = 64000
      "scaleDownDelayAfterAdd"         = 15
      "scaleDownDelayAfterDelete"      = 15
      "scaleDownDelayAfterFailure"     = 3
      "scaleDownEnabled"               = true
      "scaleDownUnneededTime"          = 7
      "scaleDownUtilizationThreshold"  = "0.2"
      "scaleUpCpuUtilizationThreshold" = "0.60"
      "scaleUpMemUtilizationThreshold" = "0.75"
      "scaleUpUnscheduledPodEnabled"   = true
      "scaleUpUtilizationEnabled"      = true
      "unremovableNodeRecheckTimeout"  = 7
    }
  }
}

From a node-size perspective, I was using s3.large.4 for testing.

Here is an excerpt of the node pool config I used for testing (again, please make changes as per your testing region and setup):

resource "random_id" "id" {
  byte_length = 4
}

resource "opentelekomcloud_cce_node_pool_v3" "node_pool" {
  cluster_id         = var.cluster_id
  name               = var.name != "" ? var.name : "node-pool-${random_id.id.hex}"
  flavor             = var.node_flavor
  initial_node_count = 7
  availability_zone  = var.availability_zone
  key_pair           = var.keypair_name
  os                 = var.os

  scale_enable             = true
  min_node_count           = 3
  max_node_count           = 10
  scale_down_cooldown_time = 15
  priority                 = 1
  user_tags                = var.tags
  k8s_tags                 = var.k8s_tags

  docker_base_size = 20

  root_volume {
    size       = 50
    volumetype = "SSD"
  }

  data_volumes {
    size       = 50
    volumetype = "SSD"
  }

  lifecycle {
    ignore_changes = [
      initial_node_count,
    ]
    create_before_destroy = true
  }

  timeouts {
    create = "60m"
    update = "60m"
    delete = "60m"
  }

}

Now, as stated: deploy a cluster with the node pool and the autoscaler, wait until the autoscaler has removed some nodes, then deploy some changes to the node pool. As a result, the node pool will scale back up to initial_node_count.

So in our example above:

  • Provision the cluster with the autoscaler add-on and the node pool. There is now a pool with 7 nodes.
  • Wait some time. The autoscaler will remove some nodes. Let's say there are now 5 nodes in the pool.
  • Change something in the node pool template, for example add a new k8s tag, and run terraform apply.
  • The number of nodes will go back up to 7.

As I said, scaling up is not terrible, as it only costs money. But if there are currently more nodes than initial_node_count because the autoscaler scaled out, the apply will remove nodes, and that will lead to downtime in most cases.
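
The workaround from #1961 essentially amounts to keeping initial_node_count in sync with the live node count before every apply. A rough sketch of how that could look (the variable name is hypothetical, and this is not an official workaround):

variable "current_node_count" {
  description = "Node count currently reported for the pool; update this before applying unrelated changes"
  type        = number
  default     = 7
}

resource "opentelekomcloud_cce_node_pool_v3" "node_pool" {
  # ... all other attributes as in the excerpt above ...
  initial_node_count = var.current_node_count
}

# Example invocation after the autoscaler has scaled the pool down to 5 nodes:
#   terraform apply -var="current_node_count=5"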

@anton-sidelnikov
Member

@eduardOrthopy Hello, yes, I can confirm this issue now, thanks for the details. It is also reproducible in the UI; an internal bug report has been created: https://jira.tsi-dev.otc-service.com/browse/BM-2993
