
Pod stuck in Terminating state despite StatefulSet replica adjustment to 0 in Kubernetes v1.26.2 cluster #5811

Open
orangeguo3 opened this issue Mar 17, 2024 · 8 comments
Labels
Waiting on feedback Issues that require feedback from User/Other community members

Comments

@orangeguo3

orangeguo3 commented Mar 17, 2024

Describe the bug

I recently updated my Kubernetes version from v1.25.12 to v1.26.2. Previously, everything was running smoothly with Kubernetes version v1.25.12 and fabric8 k8s client 6.3.0.

However, after the update to v1.26.2, I encountered an issue. When using the k8s-client API, pods mounted with PVCs get stuck in the Terminating state, whether I delete the StatefulSet directly or scale its replicas down to 0. The PVC is stuck in Terminating as well, and there are no events on the pod.

Strangely, if I use kubectl directly in the terminal to delete the StatefulSet, both the StatefulSet and pods can be deleted successfully.

I attempted to resolve this by updating my fabric8 k8s client version to 6.10.0, but unfortunately, I still faced the same error.

The pods are created via StatefulSets, and finalizers are set on the PVCs. However, I don't expect to have to manually modify or remove those finalizers.

Fabric8 Kubernetes Client version

6.10.0

Steps to reproduce

final StatefulSet statefulSet = client.apps().statefulSets()
        .load(getClass().getClassLoader().getResourceAsStream("Statefulset.yaml"))
        .item();

client.apps().statefulSets()
        .inNamespace(namespace)
        .withName(statefulSet.getMetadata().getName())
        .delete();

Pods after the delete call:
NAME                  READY   STATUS        RESTARTS   AGE
opensearch-data-0     1/1     Terminating   0          99m
opensearch-master-0   1/1     Terminating   0          99m

$ k g pvc -n aaaaaaaadg2cbif7hcycp7mocvwbluu24a3oi2bcgz25kuorvhhu4ozc54wa0
NAME                                    STATUS        VOLUME                                                                   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
opensearch-data-opensearch-data-0       Terminating   data-aaaaaaaadg2cbif7hcycp7mocvwbluu24a3oi2bcgz25kuorvhhu4ozc54wa0-0     50Gi       RWO            oci-bv         59m
opensearch-master-opensearch-master-0   Terminating   master-aaaaaaaadg2cbif7hcycp7mocvwbluu24a3oi2bcgz25kuorvhhu4ozc54wa0-0   50Gi       RWO            oci-bv         59m

I also attach the description of the terminating pod here:

$ k d pod opensearch-master-0 -n aaaaaaaadg2cbif7hcycp7mocvwbluu24a3oi2bcgz25kuorvhhu4ozc54wa0
Name:                      opensearch-master-0
Namespace:                 aaaaaaaadg2cbif7hcycp7mocvwbluu24a3oi2bcgz25kuorvhhu4ozc54wa0
Priority:                  0
Service Account:           default
Node:                      10.0.14.221/10.0.14.221
Start Time:                Sun, 17 Mar 2024 04:25:58 -0700
Labels:                    app=opensearch-master
                           chart=opensearch
                           controller-revision-hash=opensearch-master-7f9456c94d
                           heritage=Helm
                           nodeNamespaceKey=m-4a3oi2bcgz25kuorvhhu4ozc54wa0
                           release=RELEASE-NAME
                           statefulset.kubernetes.io/pod-name=opensearch-master-0
Annotations:               <none>
Status:                    Terminating (lasts 40m)
Termination Grace Period:  120s
IP:                        172.17.6.195
IPs:
  IP:           172.17.6.195
Controlled By:  StatefulSet/opensearch-master
Init Containers:
  configure-sysctl:
    Container ID:  cri-o://f5ef4a758122c224ea3f01dd393131bbe4a10562b74ba21dddd4f59e9b2732cc
    Image:         iad.ocir.io/axoxdievda5j/oci-opensearch:2.3.0.26.16
    Image ID:      iad.ocir.io/axoxdievda5j/oci-opensearch@sha256:8b7e047d3b3a53c4a31cf9c99e462e921ec8f166e1a6f2cd8918d64e4e86e807
    Port:          <none>
    Host Port:     <none>
    Command:
      sysctl
      -w
      vm.max_map_count=262144
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sun, 17 Mar 2024 04:26:34 -0700
      Finished:     Sun, 17 Mar 2024 04:26:34 -0700
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kbgqv (ro)
Containers:
  opensearch:
    Container ID:   cri-o://3932f1359acd7f6c13907886f655da3309dd20088c5d80bca3aeb59129e2cbd4
    Image:          iad.ocir.io/axoxdievda5j/oci-opensearch:2.3.0.26.16
    Image ID:       iad.ocir.io/axoxdievda5j/oci-opensearch@sha256:8b7e047d3b3a53c4a31cf9c99e462e921ec8f166e1a6f2cd8918d64e4e86e807
    Ports:          9200/TCP, 9300/TCP, 9200/TCP, 9300/TCP
    Host Ports:     0/TCP, 0/TCP, 9200/TCP, 9300/TCP
    State:          Running
      Started:      Sun, 17 Mar 2024 04:26:35 -0700
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:      1
      memory:   2Gi
    Readiness:  exec [sh -c #!/usr/bin/env bash -e
# If the node is starting up wait for the cluster to be ready (request params: 'wait_for_status=red&timeout=1s&local=true' )
# Once it has started only check that the node itself is responding
START_FILE=/tmp/.es_start_file

http () {
    local path="${1}"
    curl -XGET -s -k --fail --insecure https://127.0.0.1:9200${path}
}

if [ -f "${START_FILE}" ]; then
    echo 'Cluster is already running, lets check the node is healthy and there are master nodes available'
    http "/_cluster/health?timeout=0s&local=true"
else
    echo 'Waiting for cluster to become ready (request params: "wait_for_status=red&timeout=1s&local=true" )'
    if http "/_cluster/health?wait_for_status=red&timeout=1s&local=true" ; then
        touch ${START_FILE}
        exit 0
    else
        echo 'Cluster is not yet ready (request params: "wait_for_status=red&timeout=1s&local=true" )'
        exit 1
    fi
fi
] delay=10s timeout=5s period=10s #success=3 #failure=3
    Environment:
      node.name:                                                opensearch-master-0 (v1:metadata.name)
    Mounts:
      /etc/oci-pki from etc-oci-pki (rw)
      /etc/pki from etc-pki (rw)
      /etc/rbcp_core_regions_artifacts from dynamic-regions-default (ro)
      /etc/region from etc-region (ro)
      /usr/share/opensearch/data from opensearch-master (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kbgqv (ro)
      /var/run/secrets/resource-principal from resource-principal (ro)
      /var/run/secrets/resource-principal-snapshots from resource-principal-snapshots (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  opensearch-master:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  opensearch-master-opensearch-master-0
    ReadOnly:   false
  etc-pki:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/pki
    HostPathType:  Directory
  etc-oci-pki:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/oci-pki
    HostPathType:  Directory
  etc-region:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/region
    HostPathType:  
  dynamic-regions-default:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/rbcp_core_regions_artifacts
    HostPathType:  Directory
  resource-principal:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  resource-principal-opensearchcluster-unstable-ocid1.opensearchcluster.region1.sea.aaaaaaaadg2cbif7hcycp7mocvwbluu24a3oi2bcgz25kuorvhhu4ozc54wa-0
    Optional:    false
  resource-principal-snapshots:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  resource-principal-opensearch-unstable-ocid1.opensearchcluster.region1.sea.aaaaaaaadg2cbif7hcycp7mocvwbluu24a3oi2bcgz25kuorvhhu4ozc54wa-0
    Optional:    false
  kube-api-access-kbgqv:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              nodeNamespaceKey=m-4a3oi2bcgz25kuorvhhu4ozc54wa0
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>

Expected behavior

The pod should be deleted together with the StatefulSet, with no need to manually remove the finalizer from the PVC.

Runtime

Kubernetes (vanilla)

Kubernetes API Server version

next (development version)

Environment

OCI cloud

Fabric8 Kubernetes Client Logs

No response

Additional context

No response

@shawkins
Contributor

Presumably this is due to https://kubernetes.io/docs/concepts/storage/persistent-volumes/#persistentvolume-deletion-protection-finalizer - can you confirm what finalizer is in play here?

I don't see why the statefulset or pods would stick around unless they also had finalizers, or you are using foreground deletion.
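If foreground deletion is the factor, one thing to try (a rough sketch only, not necessarily what your code does; the StatefulSet name is just an example) is an explicit background propagation policy on the delete:

import io.fabric8.kubernetes.api.model.DeletionPropagation;

// Background propagation: the StatefulSet object is removed immediately and
// the garbage collector cleans up the dependent pods afterwards, so the
// delete call does not block on pod termination.
client.apps().statefulSets()
        .inNamespace(namespace)
        .withName("opensearch-master")
        .withPropagationPolicy(DeletionPropagation.BACKGROUND)
        .delete();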

@orangeguo3
Author

orangeguo3 commented Mar 18, 2024

Hi @shawkins, yes, we have the pvc-protection finalizer on our PVC:

  finalizers:
  - kubernetes.io/pvc-protection

We also have finalizers on the PV:

  finalizers:
    - kubernetes.io/pv-protection
    - external-attacher/blockvolume-csi-oraclecloud-com

But we don't have finalizers on the pod.

  1. My statefulset.yaml file:
    StatefulSet.txt
  2. pod -o yaml file:
    pod_yaml.txt
  3. pvc -o yaml file
    pvc_yaml.txt

It seems there might be an issue with unmounting. I have another pod that doesn't have any PV mounts; it's deployed using a Deployment, and I can successfully delete it with the API. (Although I believe the issue isn't related to Deployment vs. StatefulSet.)

If I directly delete the StatefulSet using kubectl, the Pod disappears without getting stuck in the "Terminating" state. The PVC remains, but it's not bound to any Pod. Then, I can delete the PVC using the Kubernetes client API.
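A minimal sketch of that PVC deletion via the client (namespace and claim name are placeholders for illustration) looks like:

// Delete the leftover PVC once it is no longer bound to any pod.
client.persistentVolumeClaims()
        .inNamespace(namespace)
        .withName("opensearch-master-opensearch-master-0")
        .delete();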

$ kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.3", GitCommit:"434bfd82814af038ad94d62ebe59b133fcb50506", GitTreeState:"clean", BuildDate:"2022-10-12T10:57:26Z", GoVersion:"go1.19.2", Compiler:"gc", Platform:"darwin/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.2", GitCommit:"5eaeff77954588568039a1d73ed9ae0ee7c9ba71", GitTreeState:"clean", BuildDate:"2023-03-20T19:34:00Z", GoVersion:"go1.19.6 A1727 X:boringcrypto", Compiler:"gc", Platform:"linux/amd64"}

@shawkins
Contributor

shawkins commented Mar 18, 2024

There are two things going on here. The first is that statefulset pods have blockOwnerDeletion set to true, which is effectively forcing the foreground deletion behavior. The other is the pv-protection finalizer, which is keeping the pv around until the pvc is deleted.

The fix for the statefulset deletion may be to first scale the statefulset to 0. I recall needing to do that in several operators. If the kubectl client is doing that automatically, then there's a case to be made for an enhancement to fabric8 to do the same. EDIT: I should clarify that this doesn't appear to be strictly required - trying with a simple statefulset based upon the examples in the Kubernetes docs, they delete just fine with multiple replicas. Can you try scaling to 0 and see if you still get pods stuck in the terminating state? If so, this will clarify that there is a general termination issue that is not related to the statefulset deletion; if they do terminate successfully, scaling the statefulset to 0 before deletion should be a viable workaround.

Why a pod gets stuck in the terminating state is not clear from what you have above. From your first comment it looks like you may have first attempted to delete the pv without deleting the pvc, but locally that didn't cause the termination issue for me. Based upon the grace period, it will take up to 2 minutes for the pod to actually be terminated - at which point the pod and statefulset will go away.

As you reference above, the expected behavior with the delete reclaim policy is that the pv and pvc will remain after the statefulset and pod are deleted.
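For reference, a minimal sketch of that workaround with the fabric8 client (the StatefulSet name is assumed for illustration; error handling omitted) would be:

// Scale down to 0 and wait for the scale to be observed, then delete
// the StatefulSet once its pods are gone.
client.apps().statefulSets()
        .inNamespace(namespace)
        .withName("opensearch-master")
        .scale(0, true);

client.apps().statefulSets()
        .inNamespace(namespace)
        .withName("opensearch-master")
        .delete();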

@orangeguo3
Author

orangeguo3 commented Mar 21, 2024

Hi @shawkins, I tried scaling down, but no luck.

client.apps()
        .statefulSets()
        .inNamespace(namespace)
        .withName(statefulSet.getMetadata().getName())
        .scale(0, true);

But this time I am able to see an error event on the terminating pod. And this time the PVC is still Bound (the PVC is Terminating if I delete the StatefulSet directly).

  Events:
  Type     Reason                  Age                From                     Message
  ----     ------                  ----               ----                     -------
  Normal   Scheduled               15m                default-scheduler        Successfully assigned aaaaaaaavsghlprzxy5jmqpuukn2tq53art2ht2mtlhtxqyntvqxb35bnaoa0/opensearch-master-0 to 10.0.13.199
  Normal   SuccessfulAttachVolume  15m                attachdetach-controller  AttachVolume.Attach succeeded for volume "master-aaaaaaaavsghlprzxy5jmqpuukn2tq53art2ht2mtlhtxqyntvqxb35bnaoa0-0"
  Normal   Pulled                  15m                kubelet                  Container image "iad.ocir.io/axoxdievda5j/oci-opensearch:2.3.0.26.16" already present on machine
  Normal   Created                 15m                kubelet                  Created container configure-sysctl
  Normal   Started                 15m                kubelet                  Started container configure-sysctl
  Normal   Pulled                  15m                kubelet                  Container image "iad.ocir.io/axoxdievda5j/oci-opensearch:2.3.0.26.16" already present on machine
  Normal   Created                 15m                kubelet                  Created container opensearch
  Normal   Started                 15m                kubelet                  Started container opensearch
  Warning  Unhealthy               14m (x4 over 14m)  kubelet                  Readiness probe failed: Waiting for cluster to become ready (request params: "wait_for_status=red&timeout=1s&local=true" )
Cluster is not yet ready (request params: "wait_for_status=red&timeout=1s&local=true" )
  Warning  NodeNotReady  10m  node-controller  Node is not ready

We have this readiness probe configured in the pod. Is it possible this is the reason? But it was running well with Kubernetes 1.25.12 and fabric8 k8s-client 6.3.0.

readinessProbe:
        exec:
          command:
            - sh
            - '-c'
            - >
              #!/usr/bin/env bash -e

              # If the node is starting up wait for the cluster to be ready
              (request params: 'wait_for_status=red&timeout=1s&local=true' )

              # Once it has started only check that the node itself is
              responding

              START_FILE=/tmp/.es_start_file


              http () {
                  local path="${1}"
                  curl -XGET -s -k --fail --insecure https://127.0.0.1:9200${path}
              }


              if [ -f "${START_FILE}" ]; then
                  echo 'Cluster is already running, lets check the node is healthy and there are master nodes available'
                  http "/_cluster/health?timeout=0s&local=true"
              else
                  echo 'Waiting for cluster to become ready (request params: "wait_for_status=red&timeout=1s&local=true" )'
                  if http "/_cluster/health?wait_for_status=red&timeout=1s&local=true" ; then
                      touch ${START_FILE}
                      exit 0
                  else
                      echo 'Cluster is not yet ready (request params: "wait_for_status=red&timeout=1s&local=true" )'
                      exit 1
                  fi
              fi
        initialDelaySeconds: 10
        timeoutSeconds: 5
        periodSeconds: 10
        successThreshold: 3
        failureThreshold: 3

@orangeguo3
Author

orangeguo3 commented Mar 21, 2024

May I also ask why you think "you may have first attempted to delete the pv without deleting the pvc"? In our code we delete the namespace first (which cleans up all the resources, including the PVC), and then we delete the PV.

Our PV's persistentVolumeReclaimPolicy is Retain

PV yaml:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: data-aaaaaaaavsghlprzxy5jmqpuukn2tq53art2ht2mtlhtxqyntvqxb35bnaoa0-0
  uid: d18f781b-1a16-402e-a316-48b318e6a9df
  resourceVersion: '3659822'
  creationTimestamp: '2024-03-21T23:18:53Z'
  finalizers:
    - kubernetes.io/pv-protection
    - external-attacher/blockvolume-csi-oraclecloud-com
  managedFields:
    - manager: fabric8-kubernetes-client
      operation: Update
      apiVersion: v1
      time: '2024-03-21T23:18:53Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:finalizers:
            .: {}
            v:"external-attacher/blockvolume-csi-oraclecloud-com": {}
            v:"kubernetes.io/pv-protection": {}
        f:spec:
          f:accessModes: {}
          f:capacity:
            .: {}
            f:storage: {}
          f:claimRef:
            .: {}
            f:kind: {}
            f:name: {}
            f:namespace: {}
            f:uid: {}
          f:csi:
            .: {}
            f:driver: {}
            f:fsType: {}
            f:volumeAttributes:
              .: {}
              f:vpusPerGB: {}
            f:volumeHandle: {}
          f:nodeAffinity:
            .: {}
            f:required: {}
          f:persistentVolumeReclaimPolicy: {}
          f:storageClassName: {}
          f:volumeMode: {}
    - manager: kube-controller-manager
      operation: Update
      apiVersion: v1
      time: '2024-03-21T23:18:53Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
          f:phase: {}
      subresource: status
  selfLink: >-
    /api/v1/persistentvolumes/data-aaaaaaaavsghlprzxy5jmqpuukn2tq53art2ht2mtlhtxqyntvqxb35bnaoa0-0
status:
  phase: Bound
spec:
  capacity:
    storage: 50Gi
  csi:
    driver: blockvolume.csi.oraclecloud.com
    volumeHandle: >-
      ocid1.volume.oc1.iad.abuwcljtsak3qtpgf7rqy2ualvjbrnymtvmz27kbm75xgaotatbenlk3oqoa
    fsType: ext4
    volumeAttributes:
      vpusPerGB: '10'
  accessModes:
    - ReadWriteOnce
  claimRef:
    kind: PersistentVolumeClaim
    namespace: aaaaaaaavsghlprzxy5jmqpuukn2tq53art2ht2mtlhtxqyntvqxb35bnaoa0
    name: opensearch-data-opensearch-data-0
    uid: c86d5a33-79c1-4332-a457-371bb53a759f
  persistentVolumeReclaimPolicy: Retain
  storageClassName: oci-bv
  volumeMode: Filesystem
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: failure-domain.beta.kubernetes.io/zone
              operator: In
              values:
                - US-ASHBURN-AD-3

@shawkins
Contributor

shawkins commented Mar 22, 2024

But this time I am able to see an Error Event in terminating pod.

I think that just confirms it's taking a long time for the pod to terminate and it's failing the readiness probe while doing so. The best guess is that this is a symptom that your pod is not responding to termination signals properly and so it's taking until the end of the termination grace period to fully go away.

We have this command running in the pod. Is it possible this is the reason? But it is running well with K8S 1.25.12 and fabric k9s-client 6.3.0

I don't think this behavior has anything to do with the kubernetes client. The scaling operation is simply an adjustment to the StatefulSet, then the client optionally waiting to observe that the operation completed.

If the behavior in kubectl is different, then I would guess there is some legacy default there that is forcing the deletion of pods rather than waiting for the natural termination.

May I also ask why you think "you may have first attempted to delete the pv without deleting the pvc"?

Because the PVs were stuck in termination waiting for the deletion of the PVCs.

@rohanKanojia rohanKanojia added the Waiting on feedback Issues that require feedback from User/Other community members label Apr 2, 2024
@orangeguo3
Author

orangeguo3 commented Apr 2, 2024

Hi @shawkins, I added a force delete for the pods, and now I can successfully delete them.
But when I use kubectl to delete, even with a very large grace period, the pod is deleted immediately instead of getting stuck in Terminating.
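Roughly, the force delete I added looks like this sketch (grace period 0; the pod name is just an example):

// Grace period 0 asks the API server to remove the pod immediately,
// skipping the graceful termination wait.
client.pods()
        .inNamespace(namespace)
        .withName("opensearch-master-0")
        .withGracePeriod(0)
        .delete();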

When posting this issue, I see the latest version listed is v1.25.3. Do you still consider v1.26 a development version?

@shawkins
Contributor

shawkins commented Apr 3, 2024

But when I use kubectl to delete, even if I use a very large grace period, the pod will be deleted immediately instead of getting stuck in terminating.

You mean deleting the statefulset, correct? Can you double-check the kubectl source and see if it's defaulting to a forced deletion of the pods? If so, we could do the same in the kubernetes client.

When posting this issue, I see the latest version is v1.25.3. Do you still consider v1.26 as development version?

Generally the client is forwards compatible with later kubernetes versions. There is nothing that the client would be doing differently here based upon the kubernetes version.

Also note that updating the client so that its built-in model classes track later kubernetes versions is generally non-breaking; however, if you rely on deprecated functionality from an older kubernetes release, a newer client may no longer provide it.
