[🐛 Bug]: Nodes couldn't active when enabling autoscaling and deployed on EKS #2232

Open
fazizsoltani opened this issue Apr 28, 2024 · 5 comments
Labels
I-autoscaling-k8s Issue relates to autoscaling in Kubernetes, or the scaler in KEDA

Comments

@fazizsoltani

What happened?

When we enable autoscaling in the Helm chart, it doesn't work properly.
I'm using the Selenium Grid Helm chart on EKS. It works with autoscaling disabled, but when I enable autoscaling, I don't see any active nodes in the Grid.
[image]

Command used to start Selenium Grid with Docker (or Kubernetes)

value.yml for helm charts
    hub:
      serviceType: NodePort

    autoscaling:
      enabled: true

    ingress:
      enabled: true
      nginx: !
      annotations:
        "kubernetes.io/ingress.class": "alb"
        "alb.ingress.kubernetes.io/scheme": "internal"
        "alb.ingress.kubernetes.io/group.name": "alb-name"
        "alb.ingress.kubernetes.io/group.order": "300"
        "alb.ingress.kubernetes.io/listen-ports": "[{\"HTTPS\":443}, {\"HTTP\":80}]"
        "alb.ingress.kubernetes.io/ssl-redirect": "443"
        "alb.ingress.kubernetes.io/healthcheck-port": "8080"
        "alb.ingress.kubernetes.io/certificate-arn": "certificate-arn"

Relevant log output

kubectl logs keda-operator-bf9546dd-km68s
...
2024-04-28T18:48:30Z    ERROR   cert-rotation   Webhook not found. Unable to update certificate.        {"name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "error": "ValidatingWebhookConfiguration.admissionregistration.k8s.io \"keda-admission\" not found"}
github.com/open-policy-agent/cert-controller/pkg/rotator.(*ReconcileWH).ensureCerts
        /workspace/vendor/github.com/open-policy-agent/cert-controller/pkg/rotator/rotator.go:816
github.com/open-policy-agent/cert-controller/pkg/rotator.(*ReconcileWH).Reconcile
        /workspace/vendor/github.com/open-policy-agent/cert-controller/pkg/rotator/rotator.go:785
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227
2024-04-28T18:48:30Z    INFO    cert-rotation   Ensuring CA cert        {"name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService", "name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService"}
2024-04-28T18:48:30Z    INFO    cert-rotation   no cert refresh needed
2024-04-28T18:48:30Z    ERROR   cert-rotation   Webhook not found. Unable to update certificate.        {"name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "error": "ValidatingWebhookConfiguration.admissionregistration.k8s.io \"keda-admission\" not found"}
github.com/open-policy-agent/cert-controller/pkg/rotator.(*ReconcileWH).ensureCerts
        /workspace/vendor/github.com/open-policy-agent/cert-controller/pkg/rotator/rotator.go:816
github.com/open-policy-agent/cert-controller/pkg/rotator.(*ReconcileWH).Reconcile
        /workspace/vendor/github.com/open-policy-agent/cert-controller/pkg/rotator/rotator.go:785
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227
2024-04-28T18:48:30Z    INFO    cert-rotation   Ensuring CA cert        {"name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService", "name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService"}
2024-04-28T18:48:32Z    INFO    cert-rotation   CA certs are injected to webhooks
...
2024-04-28T18:48:42Z    ERROR   scaleexecutor   failed to patch Objects {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium", "error": "client rate limiter Wait returned an error: context canceled"}
github.com/kedacore/keda/v2/pkg/status.TransformObject
        /workspace/pkg/status/status.go:195
github.com/kedacore/keda/v2/pkg/scaling/executor.(*scaleExecutor).setCondition
        /workspace/pkg/scaling/executor/scale_executor.go:106
github.com/kedacore/keda/v2/pkg/scaling/executor.(*scaleExecutor).setActiveCondition
        /workspace/pkg/scaling/executor/scale_executor.go:120
github.com/kedacore/keda/v2/pkg/scaling/executor.(*scaleExecutor).RequestJobScale
        /workspace/pkg/scaling/executor/scale_jobs.go:76
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).checkScalers
        /workspace/pkg/scaling/scale_handler.go:263
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).startScaleLoop
        /workspace/pkg/scaling/scale_handler.go:182
2024-04-28T18:48:42Z    ERROR   scaleexecutor   Error setting active condition when triggers are not active     {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium", "error": "client rate limiter Wait returned an error: context canceled"}
github.com/kedacore/keda/v2/pkg/scaling/executor.(*scaleExecutor).RequestJobScale
        /workspace/pkg/scaling/executor/scale_jobs.go:77
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).checkScalers
        /workspace/pkg/scaling/scale_handler.go:263
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).startScaleLoop
        /workspace/pkg/scaling/scale_handler.go:182
...
2024-04-28T18:48:44Z    ERROR   scaleexecutor   failed to patch Objects {"scaledJob.Name": "selenium-edge-node", "scaledJob.Namespace": "selenium", "error": "client rate limiter Wait returned an error: context canceled"}
github.com/kedacore/keda/v2/pkg/status.TransformObject
        /workspace/pkg/status/status.go:195
github.com/kedacore/keda/v2/pkg/scaling/executor.(*scaleExecutor).setCondition
        /workspace/pkg/scaling/executor/scale_executor.go:106
github.com/kedacore/keda/v2/pkg/scaling/executor.(*scaleExecutor).setActiveCondition
        /workspace/pkg/scaling/executor/scale_executor.go:120
github.com/kedacore/keda/v2/pkg/scaling/executor.(*scaleExecutor).RequestJobScale
        /workspace/pkg/scaling/executor/scale_jobs.go:76
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).checkScalers
        /workspace/pkg/scaling/scale_handler.go:263
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).startScaleLoop
        /workspace/pkg/scaling/scale_handler.go:182
2024-04-28T18:48:44Z    ERROR   scaleexecutor   Error setting active condition when triggers are not active     {"scaledJob.Name": "selenium-edge-node", "scaledJob.Namespace": "selenium", "error": "client rate limiter Wait returned an error: context canceled"}
github.com/kedacore/keda/v2/pkg/scaling/executor.(*scaleExecutor).RequestJobScale
        /workspace/pkg/scaling/executor/scale_jobs.go:77
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).checkScalers
        /workspace/pkg/scaling/scale_handler.go:263
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).startScaleLoop
        /workspace/pkg/scaling/scale_handler.go:182
...
2024-04-28T18:48:45Z    ERROR   scaleexecutor   failed to patch Objects {"scaledJob.Name": "selenium-firefox-node", "scaledJob.Namespace": "selenium", "error": "client rate limiter Wait returned an error: context canceled"}
github.com/kedacore/keda/v2/pkg/status.TransformObject
        /workspace/pkg/status/status.go:195
github.com/kedacore/keda/v2/pkg/scaling/executor.(*scaleExecutor).setCondition
        /workspace/pkg/scaling/executor/scale_executor.go:106
github.com/kedacore/keda/v2/pkg/scaling/executor.(*scaleExecutor).setActiveCondition
        /workspace/pkg/scaling/executor/scale_executor.go:120
github.com/kedacore/keda/v2/pkg/scaling/executor.(*scaleExecutor).RequestJobScale
        /workspace/pkg/scaling/executor/scale_jobs.go:76
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).checkScalers
        /workspace/pkg/scaling/scale_handler.go:263
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).startScaleLoop
        /workspace/pkg/scaling/scale_handler.go:182
2024-04-28T18:48:45Z    ERROR   scaleexecutor   Error setting active condition when triggers are not active     {"scaledJob.Name": "selenium-firefox-node", "scaledJob.Namespace": "selenium", "error": "client rate limiter Wait returned an error: context canceled"}
github.com/kedacore/keda/v2/pkg/scaling/executor.(*scaleExecutor).RequestJobScale
        /workspace/pkg/scaling/executor/scale_jobs.go:77
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).checkScalers
        /workspace/pkg/scaling/scale_handler.go:263
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).startScaleLoop
        /workspace/pkg/scaling/scale_handler.go:182
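
The output above interleaves log entries with Go stack-trace frames, which makes it hard to see which distinct errors occur. As a rough aid, here is a small Python sketch that counts ERROR entries per logger and message. The helper name `summarize_errors` is hypothetical, and the line layout (timestamp, level, logger, message, then a JSON payload) is assumed from the excerpt above rather than from any KEDA documentation.

```python
import re
from collections import Counter

# Assumed entry layout: "<timestamp> <LEVEL> <logger> <message> {json payload}".
ENTRY = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\S+)\s+(?P<level>[A-Z]+)\s+(?P<logger>\S+)\s+(?P<rest>.*)$"
)


def summarize_errors(log_text):
    """Count ERROR entries by (logger, message), ignoring stack-trace frames."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ENTRY.match(line)
        if match is None or match.group("level") != "ERROR":
            # Stack-trace frames and non-error entries are skipped.
            continue
        # Drop the trailing JSON payload so identical messages group together.
        message = match.group("rest").split("{", 1)[0].strip()
        counts[(match.group("logger"), message)] += 1
    return counts
```

Given the text produced by `kubectl logs` for the KEDA operator pod, it returns a `Counter` keyed by `(logger, message)`.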

Operating System

Kubernetes, EKS

Docker Selenium version (image tag)

4.20.0-20240425

Selenium Grid chart version (chart version)

0.30.0


@fazizsoltani, thank you for creating this issue. We will troubleshoot it as soon as we can.


Info for maintainers

Triage this issue by using labels.

If information is missing, add a helpful comment and then I-issue-template label.

If the issue is a question, add the I-question label.

If the issue is valid but there is no time to troubleshoot it, consider adding the help wanted label.

If the issue requires changes or fixes from an external project (e.g., ChromeDriver, GeckoDriver, MSEdgeDriver, W3C), add the applicable G-* label, and it will provide the correct link and auto-close the issue.

After troubleshooting the issue, please add the R-awaiting answer label.

Thank you!

@VietND96
Member

Did this work before and break only after upgrading to a new chart version?

@fazizsoltani
Author

No, I had problems with previous versions too.

@VietND96
Member

VietND96 commented May 3, 2024

I see something in the logs that could be related to certificates or the SSL connection:

2024-04-28T18:48:30Z    INFO    cert-rotation   Ensuring CA cert        {"name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService", "name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService"}
2024-04-28T18:48:32Z    INFO    cert-rotation   CA certs are injected to webhooks

I also see that the Hub is configured with NodePort:

    hub:
      serviceType: NodePort

Can you run kubectl describe scaledJob to show the details of a node ScaledJob? I want to see this section:

triggers:
  - type: selenium-grid
    metadata:
...
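
If the full `describe` output is too long to share, the trigger section alone can be pulled from the JSON form of the resource. A minimal Python sketch, assuming `kubectl` is on the PATH and the KEDA ScaledJob CRD is installed; the function names are hypothetical:

```python
import json
import subprocess


def extract_triggers(scaled_job):
    """Return the list under spec.triggers, or an empty list if absent."""
    return scaled_job.get("spec", {}).get("triggers", [])


def fetch_scaled_job(name, namespace):
    """Read one ScaledJob as a dict via `kubectl get ... -o json`."""
    result = subprocess.run(
        ["kubectl", "get", "scaledjob", name, "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

For example, `extract_triggers(fetch_scaled_job("selenium-chrome-node", "selenium"))` would return only the trigger definitions for the Chrome node ScaledJob.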

@fazizsoltani
Author

fazizsoltani commented May 3, 2024

kubectl describe scaledJob selenium-chrome-node

Namespace:    selenium
Labels:       app=selenium-chrome-node
              app.kubernetes.io/component=selenium-grid-4.20.0-20240425
              app.kubernetes.io/instance=selenium
              app.kubernetes.io/managed-by=helm
              app.kubernetes.io/name=selenium-chrome-node
              app.kubernetes.io/version=4.20.0-20240425
              component.autoscaling=true
              helm.sh/chart=selenium-grid-0.30.0
Annotations:  helm.sh/hook: post-install,post-upgrade,post-rollback,pre-delete
API Version:  keda.sh/v1alpha1
Kind:         ScaledJob
Metadata:
  Creation Timestamp:  2024-04-28T18:48:42Z
  Finalizers:
    finalizer.keda.sh
  Generation:  3
  Managed Fields:
    API Version:  keda.sh/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .:
          v:"finalizer.keda.sh":
      f:spec:
        f:rollout:
    Manager:      keda
    Operation:    Update
    Time:         2024-04-28T18:48:42Z
    API Version:  keda.sh/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:helm.sh/hook:
        f:labels:
          .:
          f:app:
          f:app.kubernetes.io/component:
          f:app.kubernetes.io/instance:
          f:app.kubernetes.io/managed-by:
          f:app.kubernetes.io/name:
          f:app.kubernetes.io/version:
          f:component.autoscaling:
          f:helm.sh/chart:
      f:spec:
        .:
        f:failedJobsHistoryLimit:
        f:jobTargetRef:
          .:
          f:backoffLimit:
          f:completions:
          f:parallelism:
          f:template:
            .:
            f:metadata:
              .:
              f:annotations:
                .:
                f:checksum/event-bus-configmap:
                f:checksum/logging-configmap:
                f:checksum/node-configmap:
                f:checksum/server-configmap:
              f:labels:
                .:
                f:app:
                f:app.kubernetes.io/component:
                f:app.kubernetes.io/instance:
                f:app.kubernetes.io/managed-by:
                f:app.kubernetes.io/name:
                f:app.kubernetes.io/version:
                f:helm.sh/chart:
            f:spec:
              .:
              f:containers:
              f:restartPolicy:
              f:serviceAccount:
              f:serviceAccountName:
              f:terminationGracePeriodSeconds:
              f:volumes:
        f:maxReplicaCount:
        f:minReplicaCount:
        f:pollingInterval:
        f:scalingStrategy:
          .:
          f:strategy:
        f:successfulJobsHistoryLimit:
        f:triggers:
    Manager:      terraform-provider-helm_v2.11.0_x5
    Operation:    Update
    Time:         2024-04-28T18:48:42Z
    API Version:  keda.sh/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:conditions:
    Manager:      keda
    Operation:    Update
    Subresource:  status
    Time:         2024-04-28T18:48:53Z
    API Version:  keda.sh/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:kubectl.kubernetes.io/last-applied-configuration:
    Manager:         kubectl-client-side-apply
    Operation:       Update
    Time:            2024-04-28T18:48:53Z
  Resource Version:  108763431
  UID:               0b1ddbc0-0e66-4849-9e35-5de9dca3179a
Spec:
  Failed Jobs History Limit:  0
  Job Target Ref:
    Backoff Limit:  0
    Completions:    1
    Parallelism:    1
    Template:
      Metadata:
        Annotations:
          checksum/event-bus-configmap:  4e264bd45e78bf454c38
          checksum/logging-configmap:    c7f18f9e715bc62bca7234
          checksum/node-configmap:       bd257694e2cfebd395a9
          checksum/server-configmap:     4af2ca96bbaebd763d5
        Labels:
          App:                           selenium-chrome-node
          app.kubernetes.io/component:   selenium-grid-4.20.0-20240425
          app.kubernetes.io/instance:    selenium
          app.kubernetes.io/managed-by:  helm
          app.kubernetes.io/name:        selenium-chrome-node
          app.kubernetes.io/version:     4.20.0-20240425
          helm.sh/chart:                 selenium-grid-0.30.0
      Spec:
        Containers:
          Env:
            Name:   SE_OTEL_SERVICE_NAME
            Value:  selenium-chrome-node
            Name:   SE_NODE_PORT
            Value:  5555
            Name:   SE_NODE_REGISTER_PERIOD
            Value:  60
            Name:   SE_NODE_REGISTER_CYCLE
            Value:  5
          Env From:
            Config Map Ref:
              Name:  selenium-event-bus
            Config Map Ref:
              Name:  selenium-node-config
            Config Map Ref:
              Name:  selenium-logging-config
            Config Map Ref:
              Name:  selenium-server-config
            Secret Ref:
              Name:           selenium-secrets
          Image:              selenium/node-chrome:4.20.0-20240425
          Image Pull Policy:  IfNotPresent
          Lifecycle:
            Pre Stop:
              Exec:
                Command:
                  bash
                  -c
                  /opt/selenium/nodePreStop.sh
          Name:  selenium-chrome-node
          Ports:
            Container Port:  5555
            Protocol:        TCP
          Resources:
            Limits:
              Cpu:     1
              Memory:  1Gi
            Requests:
              Cpu:     1
              Memory:  1Gi
          Startup Probe:
            Exec:
              Command:
                bash
                -c
                /opt/selenium/nodeProbe.sh Startup
            Failure Threshold:  12
            Period Seconds:     5
            Success Threshold:  1
            Timeout Seconds:    60
          Volume Mounts:
            Mount Path:                    /dev/shm
            Name:                          dshm
            Mount Path:                    /opt/selenium/nodePreStop.sh
            Name:                          selenium-node-config
            Sub Path:                      nodePreStop.sh
            Mount Path:                    /opt/selenium/nodeProbe.sh
            Name:                          selenium-node-config
            Sub Path:                      nodeProbe.sh
        Restart Policy:                    Never
        Service Account:                   selenium-serviceaccount
        Service Account Name:              selenium-serviceaccount
        Termination Grace Period Seconds:  30
        Volumes:
          Config Map:
            Default Mode:  493
            Name:          selenium-node-config
          Name:            selenium-node-config
          Empty Dir:
            Medium:      Memory
            Size Limit:  1Gi
          Name:          dshm
  Max Replica Count:     8
  Min Replica Count:     0
  Polling Interval:      10
  Rollout:
  Scaling Strategy:
    Strategy:                     accurate
  Successful Jobs History Limit:  0
  Triggers:
    Metadata:
      Browser Name:          chrome
      Platform Name:         linux
      Session Browser Name:  chrome
      Trigger Index:         0
      Unsafe Ssl:            true
      URL:                   http://admin:admin@selenium-hub.selenium:4444/graphql
    Type:                    selenium-grid
Status:
  Conditions:
    Message:  ScaledJob is defined correctly and is ready to scaling
    Reason:   ScaledJobReady
    Status:   True
    Type:     Ready
    Message:  Scaling is not performed because triggers are not active
    Reason:   ScalerNotActive
    Status:   False
    Type:     Active
    Status:   Unknown
    Type:     Fallback
    Status:   Unknown
    Type:     Paused
Events:       <none>
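
The `Active` condition above is `False` with reason `ScalerNotActive`, which may simply mean the scaler found no queued sessions when it polled the Grid; with `Min Replica Count: 0`, no node Jobs would be expected until a session request is waiting. One way to check what the endpoint in the trigger `URL` returns is to send it a GraphQL query by hand. The following is a minimal Python sketch, not KEDA's actual implementation: the query fields (`nodeCount`, `maxSession`, `sessionCount`, `sessionQueueRequests`) are assumed from the Selenium Grid GraphQL schema, and all helper names are hypothetical.

```python
import base64
import json
import urllib.request
from urllib.parse import urlsplit, urlunsplit

# Assumed query shape; adjust the fields if your Grid version differs.
QUERY = (
    "{ grid { nodeCount maxSession sessionCount } "
    "sessionsInfo { sessionQueueRequests } }"
)


def build_request(url, query=QUERY):
    """Build a POST request, moving user:password from the URL into a Basic auth header."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.port:
        host = f"{host}:{parts.port}"
    clean_url = urlunsplit((parts.scheme, host, parts.path, parts.query, ""))
    body = json.dumps({"query": query}).encode("utf-8")
    request = urllib.request.Request(clean_url, data=body, method="POST")
    request.add_header("Content-Type", "application/json")
    if parts.username is not None:
        credentials = f"{parts.username}:{parts.password or ''}".encode("utf-8")
        token = base64.b64encode(credentials).decode("ascii")
        request.add_header("Authorization", f"Basic {token}")
    return request


def summarize(payload):
    """Reduce a GraphQL response dict to the numbers relevant for scaling."""
    data = payload.get("data") or {}
    grid = data.get("grid") or {}
    info = data.get("sessionsInfo") or {}
    return {
        "nodes": grid.get("nodeCount", 0),
        "max_sessions": grid.get("maxSession", 0),
        "active_sessions": grid.get("sessionCount", 0),
        "queued": len(info.get("sessionQueueRequests") or []),
    }


def query_grid(url, timeout=10):
    """Send the query to the Grid and return the summarized response."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return summarize(json.load(response))
```

Run from a pod inside the cluster, `query_grid("http://admin:admin@selenium-hub.selenium:4444/graphql")` should either fail with a connection or authentication error, which would point at the URL or credentials, or return counts. A result with `queued` at 0 would be consistent with the `ScalerNotActive` status shown above.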

VietND96 changed the title from "[🐛 Bug]:" to "[🐛 Bug]: Nodes couldn't active when enabling autoscaling and deployed on EKS" on May 6, 2024
VietND96 added the I-autoscaling-k8s label (Issue relates to autoscaling in Kubernetes, or the scaler in KEDA) and removed the needs-triaging label on May 6, 2024