[nri-bundle] Error: couldn't find key cluster-id in Secret newrelic/pl-cluster-secrets #661

Open
luisdavim opened this issue Jan 19, 2022 · 12 comments


luisdavim commented Jan 19, 2022

Bug description

I'm trying to install the nri-bundle-3.3.0 chart using Terraform. Sometimes, but not always, the installation fails because one of the pods fails to start within the wait time set for the Helm release.
I'm setting a Helm timeout of 900 seconds, and sometimes even that is not enough.

When I inspect the Pod that is failing to start, I see the following error in its events:

Error: couldn't find key cluster-id in Secret newrelic/pl-cluster-secrets

If I wait long enough, it eventually works. One way to speed it up is to delete the failed Pod until it succeeds, but I don't think this is viable: we're using Terraform to provision our clusters, and we end up wasting time on this when it should be able to run unattended.

Version of Helm and Kubernetes

helm version
version.BuildInfo{Version:"v3.7.2", GitCommit:"663a896f4a815053445eec4153677ddc24a0a361", GitTreeState:"clean", GoVersion:"go1.16.10"}
kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.2", GitCommit:"092fbfbf53427de67cac1e9fa54aaa09a28371d7", GitTreeState:"clean", BuildDate:"2021-06-16T12:59:11Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.11-eks-f17b81", GitCommit:"f17b810c9e5a82200d28b6210b458497ddfcf31b", GitTreeState:"clean", BuildDate:"2021-10-15T21:46:21Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}

Which chart?

The chart is nri-bundle-3.3.0

What happened?

The Helm release fails while waiting for all the resources to become ready within 900 seconds.

What you expected to happen?

I would expect the deployment to succeed. This looks like a race condition in which a secret (pl-cluster-secrets) is created or updated after the pod that needs it in order to start, so I'd expect that secret to be ready before the deployment is created.
I would also expect 900 seconds to be enough time for any Helm release.

How to reproduce it?

Just a normal helm install as described in the README. These are the values I'm using:

global:
  cluster: ${clusterName}
  licenseKey: ${newRelicLicenseKey}
  lowDataMode: true
kubeEvents:
  enabled: true
webhook:
  enabled: true
prometheus:
  enabled: true
logging:
  enabled: true
ksm:
  enabled: false
newrelic-infrastructure:
  privileged: true
newrelic-pixie:
  apiKey: ${pixieApiKey}
  enabled: true
pixie-chart:
  clusterName: ${clusterName}
  deployKey: ${pixieChartKey}
  enabled: true

This seems to be similar to #539

@davidgit davidgit added the triage/pending Issue or PR is pending for triage and prioritization. label Jan 20, 2022
@luisdavim
Author

In case this helps, this is the Terraform code we're using to deploy the chart:

resource "helm_release" "newrelic" {
  count            = var.enable_newrelic ? 1 : 0
  chart            = "nri-bundle"
  name             = "newrelic-bundle"
  repository       = "https://helm-charts.newrelic.com"
  version          = "3.3.0"
  create_namespace = true
  namespace        = "newrelic"
  timeout          = 900
  max_history      = 10

  values = [
    templatefile("${path.module}/templates/values-newrelic.yaml",
      {
        clusterName        = var.cluster_name
        newRelicLicenseKey = local.shared-licences["newrelicLicenseKey"]
        pixieApiKey        = local.shared-licences["newrelicPixieApiKey"]
        pixieChartKey      = local.shared-licences["newrelicPixieChartKey"]
    }),
  ]

  depends_on = [
    kubectl_manifest.pixie-viziers,
    kubectl_manifest.pixie-crd,
  ]
}

resource "kubectl_manifest" "pixie-viziers" {
  count     = var.enable_newrelic ? length(data.kubectl_file_documents.pixie-viziers[0].documents) : 0
  yaml_body = element(data.kubectl_file_documents.pixie-viziers[0].documents, count.index)

  depends_on = [
    null_resource.cluster_up
  ]
}

resource "kubectl_manifest" "pixie-crd" {
  count     = var.enable_newrelic ? length(data.kubectl_file_documents.pixie-crd[0].documents) : 0
  yaml_body = element(data.kubectl_file_documents.pixie-crd[0].documents, count.index)

  depends_on = [
    null_resource.cluster_up
  ]
}

data "kubectl_file_documents" "pixie-viziers" {
  count   = var.enable_newrelic ? 1 : 0
  content = data.http.pixie-viziers[0].body
}

data "kubectl_file_documents" "pixie-crd" {
  count   = var.enable_newrelic ? 1 : 0
  content = data.http.pixie-crd[0].body
}

data "http" "pixie-viziers" {
  count = var.enable_newrelic ? 1 : 0
  url   = "https://raw.githubusercontent.com/pixie-labs/pixie/release/cloud/prod/1642205277/k8s/operator/crd/base/px.dev_viziers.yaml"
}

data "http" "pixie-crd" {
  count = var.enable_newrelic ? 1 : 0
  url   = "https://raw.githubusercontent.com/pixie-labs/pixie/release/cloud/prod/1642205277/k8s/operator/helm/crds/olm_crd.yaml"
}

data "aws_ssm_parameter" "licences" {
  count = var.enable_newrelic ? 1 : 0
  name  = "/shared/licences"
}

Where the values file template is:

global:
  cluster: ${clusterName}
  licenseKey: ${newRelicLicenseKey}
  lowDataMode: true
kubeEvents:
  enabled: true
webhook:
  enabled: true
prometheus:
  enabled: true
logging:
  enabled: true
ksm:
  enabled: false
newrelic-infrastructure:
  privileged: true
newrelic-pixie:
  apiKey: ${pixieApiKey}
  enabled: true
pixie-chart:
  clusterName: ${clusterName}
  deployKey: ${pixieChartKey}
  enabled: true

We're deploying to an EKS cluster. From what I've observed, the pl-cluster-secrets Secret is created without the cluster-id key and gets updated with it later, which takes about 3 to 5 minutes. If we then manually delete the failed Pod, the Helm release resumes and finishes successfully, but most of the time the release fails unless there is manual intervention.
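
One possible way to automate the wait described above, as a sketch only: a null_resource that polls until the cluster-id key shows up in the Secret. This is not part of the chart or of the configuration posted above. The resource name wait_for_pixie_cluster_id, the 10-minute budget, and the use of a local kubectl already pointed at the cluster are all assumptions, and it only helps if the Helm release itself does not block waiting for readiness.

```hcl
# Hypothetical helper, not from the original configuration.
# Assumes kubectl is installed locally and its current context targets the cluster.
resource "null_resource" "wait_for_pixie_cluster_id" {
  count = var.enable_newrelic ? 1 : 0

  provisioner "local-exec" {
    interpreter = ["/bin/sh", "-c"]
    command     = <<-EOT
      # Poll every 10 seconds, up to 60 times (about 10 minutes).
      i=0
      while [ "$i" -lt 60 ]; do
        # The Secret exists early on; succeed only once it contains the cluster-id key.
        if kubectl get secret pl-cluster-secrets -n newrelic -o json 2>/dev/null | grep -q '"cluster-id"'; then
          exit 0
        fi
        i=$((i + 1))
        sleep 10
      done
      echo "cluster-id still missing from pl-cluster-secrets" >&2
      exit 1
    EOT
  }

  # Start polling only after the chart has been submitted.
  depends_on = [helm_release.newrelic]
}
```

Once this resource succeeds, the key is present, so a Pod recreated from that point on should no longer fail on the missing key. Whether that is enough on newer chart versions is not confirmed in this thread.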

@luisdavim
Author

Our workaround for now is to set wait = false on the Helm release and, once everything is provisioned, come back and manually delete the pod in the CreateContainerConfigError state. It would be great if we didn't need this manual step to get it working.
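
For readers following along, a minimal sketch of what that change could look like in the helm_release resource posted earlier. Only the wait argument is new; the other arguments are copied from that snippet, and the values and depends_on blocks are left out for brevity. This illustrates the described workaround and is not necessarily the exact configuration in use.

```hcl
resource "helm_release" "newrelic" {
  count            = var.enable_newrelic ? 1 : 0
  chart            = "nri-bundle"
  name             = "newrelic-bundle"
  repository       = "https://helm-charts.newrelic.com"
  version          = "3.3.0"
  create_namespace = true
  namespace        = "newrelic"
  timeout          = 900
  max_history      = 10

  # Workaround: do not block until every resource is ready, so the release
  # is not marked as failed while pl-cluster-secrets still lacks cluster-id.
  wait = false

  # values and depends_on as in the earlier snippet
}
```

With wait = false, Terraform returns as soon as the manifests are submitted, so the Pod stuck in CreateContainerConfigError still has to be deleted afterwards, as described above.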

@mehmetyazicioglu

mehmetyazicioglu commented Mar 11, 2022

I am having the same issue ("pl-cluster-secrets" cannot be found) when I try to install the Helm chart.

I have my secrets:

newrelic-bundle-nri-metadata-injection-admission   Opaque                                3   6h35m
newrelic-kube-events-token-56mf9                   kubernetes.io/service-account-token   3   3h20m
newrelic-pixie-1646847142-newrelic-pixie-secrets   Opaque                                2   3s
newrelic-token-hxv4j                               kubernetes.io/service-account-token   3   3h20m
sh.helm.release.v1.newrelic-pixie-1646847142.v1    helm.sh/release.v1                    1   3s

But it is still not working. I tried deleting the pods, but that did not help.

@luisdavim
Author

luisdavim commented Mar 23, 2022

After updating to the latest chart version (3.4.0), the workaround no longer seems to work: deleting the failed pod does not solve the problem anymore.

@jz-wilson

I think I found the problem. I am having the same issue with one of the vizier-metadata pods:

E0427 15:29:49.227982       1 reflector.go:138] external/io_k8s_client_go/tools/cache/reflector.go:167: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: endpoints is forbidden: User "system:serviceaccount:monitoring:metadata-service-account" cannot list resource "endpoints" in API group "" at the cluster scope

I think this is preventing Vizier from updating that secret.

@kang-makes
Contributor

kang-makes commented May 10, 2022

The point here is that creating that secret is not the job of newrelic-pixie but of the Pixie operator. So newrelic-pixie is going to stay in CreateContainerError or CreateContainerConfigError until the operator creates that secret.

It is a circular dependency that, for now, we cannot solve.

Raise the amount of --wait time or get rid of it.

@kang-makes kang-makes added the triage/needs-information Indicates an issue needs more information in order to work on it. label May 10, 2022
@luisdavim
Copy link
Author

Isn't pixie part of newrelic? Where can we raise an issue for this to get solved in the pixie operator?

@kang-makes
Contributor

We are all from New Relic but on different teams, in the same way that nri-bundle installs components from infrastructure, logging, and synthetics.

I have asked in the internal Slack for somebody to take a look. They should prioritize issues on their boards so they can take some time to answer.

@kang-makes kang-makes removed the triage/needs-information Indicates an issue needs more information in order to work on it. label May 11, 2022
@aimichelle
Contributor

Hello! I am from the Pixie team at New Relic. This isn't an issue with the Pixie operator, but how the current NR/Pixie integration works. This issue should be fixed by an update to how the whole NR/Pixie integration mechanism will work. This should be out by the end of the month, and we will update here when that is ready.

@davidgit davidgit added team/pixie and removed triage/pending Issue or PR is pending for triage and prioritization. labels Jun 17, 2022
@IliaGe

IliaGe commented Nov 23, 2022

Hey,
I'm facing the same issue: "Error: secret "pl-cluster-secrets" not found".
Deleting the Pod doesn't help :(
Is there a workaround I can use to trigger the secret creation?

@vuqtran88

vuqtran88 commented Dec 20, 2022

Yes, I also got the error "Error: secret "pl-cluster-secrets" not found" when installing Pixie via the guided installer. While searching for the cause, I landed on this page, and it looks like a known issue. However, the issue seems to be intermittent: it worked for me after reinstalling a few times.

@thakerb

thakerb commented Dec 21, 2022

@aimichelle: can you please take a look and suggest the next steps?
