Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get a working build for the etcd operator #1

Draft
wants to merge 42 commits into
base: master
Choose a base branch
from
Draft

Conversation

neoaggelos
Copy link
Owner

Summary

Update dependencies and upgrade to Go 1.17

@xrl
Copy link

xrl commented Mar 18, 2022

I have code for an "out of cluster" config -- it lets me run the operator from my computer as a regular process, respecting my ~/.kube/config. It has made development very straight-forward. Should I open a PR against this branch?

@neoaggelos
Copy link
Owner Author

I have code for an "out of cluster" config -- it lets me run the operator from my computer as a regular process, respecting my ~/.kube/config. It has made development very straight-forward. Should I open a PR against this branch?

@xrl Feel free, thanks!

@xrl
Copy link

xrl commented Mar 23, 2022

Turns out the architecture is not compatible with running in local mode, the operator wants to use an etcd client against the replicas. I can't do that easily from an out of cluster process. So I'm just going to forget about it.

@xrl
Copy link

xrl commented Mar 28, 2022

The helm chart isn't accessible the way it has been configured, github's raw HTTP server won't follow the symlink on latest:

works: https://raw.githubusercontent.com/neoaggelos/etcd-operator/revive/chart/v0.1.0/etcd-operator-0.1.0.tgz
does not work (404): https://raw.githubusercontent.com/neoaggelos/etcd-operator/revive/chart/v0.1.0/etcd-operator-0.1.0.tgz

another issue, the index.yaml lists the 0.1.0 version twice: https://github.com/neoaggelos/etcd-operator/blob/revive/chart/index.yaml#L23 and https://github.com/neoaggelos/etcd-operator/blob/revive/chart/index.yaml#L13 . I bet removing the latest flavor would work. As it is, the helm repo isn't usable.

My hack for testing:

helm repo add etcd-operator https://raw.githubusercontent.com/neoaggelos/etcd-operator/revive/chart/
helm repo update
helm upgrade --install internal etcd-operator/etcd-operator -f values/internal.yaml --kube-context=ents-us-dev --namespace=kv --version=0.1.0 --debug
history.go:56: [debug] getting history for release internal
Error: failed to fetch https://raw.githubusercontent.com/neoaggelos/etcd-operator/revive/chart/latest/etcd-operator-0.1.0.tgz : 404 Not Found
helm.go:84: [debug] failed to fetch https://raw.githubusercontent.com/neoaggelos/etcd-operator/revive/chart/latest/etcd-operator-0.1.0.tgz : 404 Not Found
helm.sh/helm/v3/pkg/getter.(*HTTPGetter).get
[[[ SNIP ]]]

@xrl
Copy link

xrl commented Mar 28, 2022

I am running this in a rancher cluster with some security constraints in place, I had to add this to the deployment.yaml:

      securityContext:
        runAsUser: 1000
        runAsGroup: 1000

might be good to make this a mountable object in the helm config, something like:

      securityContext:
        {{- toYaml .Values.securityContext | nindent 12 }}

what do you think?

@neoaggelos
Copy link
Owner Author

@xrl Thank you for the catch.

Fixed the latest issue, the latest symlink is not even required it seems.

Also added the securityContext option under operator.securityContext

I have been keeping this PR open for a while in case issues crop up, but I am considering merging it soon

@xrl
Copy link

xrl commented Apr 11, 2022

@neoaggelos
Copy link
Owner Author

Thanks for the heads up

@xrl
Copy link

xrl commented Apr 18, 2022

I have hit a problem with the TLS example, I ran the example gen-cert.sh and then uploaded the certs/keys as kube secrets, the example tls cluster was marked as failed because:

time="2022-04-18T15:46:59Z" level=error msg="cluster failed to setup: stat /tmp: no such file or directory" cluster-name=example cluster-namespace=kv pkg=cluster
time="2022-04-18T15:46:59Z" level=warning msg="fail to handle event: ignore failed cluster (example). Please delete its CR" pkg=controller

I'll work to sort it out now. I wonder why I don't have a tmp folder.

@xrl
Copy link

xrl commented Apr 21, 2022

I recommend changing pkg/apis/etcd/v1beta3/cluster.go to set DefaultEtcdVersion = "3.2.13" to DefaultEtcdVersion = "3.5.3". 3.5.3 is out now, by the way.

@neoaggelos
Copy link
Owner Author

Awesome, thanks for the heads up!

Have you had any luck/progress with the TLS issues?

@neoaggelos
Copy link
Owner Author

I think I will wait for etcd-io/etcd#13948 to be resolved, and skip to 3.5.4 directly.

@xrl
Copy link

xrl commented Apr 23, 2022

I did get TLS working! I ended up switching the operator image to ubuntu so it has a /tmp folder. My colleague was saying there's a trick to creating a /tmp with permissions inside of a scratch image but I couldn't get it working. Do you think it's important to stay with the scratch image of the operator?

Now the next problem is how to automate the management the CA and generation the peer/server/client certs, but that's out of scope for this repo. Here's the script for creating an appropriate set of secrets for the example/tls/example-tls-cluster.yaml:

kubectl create secret generic etcd-peer-tls --from-file=peer.crt --from-file=peer.key --from-file=peer-ca.crt
kubectl create secret generic etcd-client-tls --from-file=etcd-client.crt --from-file=etcd-client.key --from-file=etcd-client-ca.crt
kubectl create secret generic etcd-server-tls --from-file=server.crt --from-file=server.key --from-file=server-ca.crt

I would submit PRs to you but I have an internal branch which includes some drone-ci tooling, hopefully these breadcrumbs aren't too tiresome.

Also, waiting for 3.5.4 makes sense. A colleague mentioned there was a SRV DNS issue and I didn't know what they were talking about, thanks for the link.

@neoaggelos
Copy link
Owner Author

The fact that /tmp is implicitly required is intriguing to say the least. I wonder if one can get away with just making /tmp an emptyDir: {} volume on the operator deployment.

@neoaggelos
Copy link
Owner Author

I went ahead and tried it, seems to be working OK. Would you mind trying it out?

@xrl
Copy link

xrl commented Apr 24, 2022

Yeah, that does work but I'm not sure it's the deployment's responsibility to fix a flaw in the base image. I think the Dockerfile could be modified to include a tmp file but it would require ADD tmp.tar /tmp or similar with a magic tmp folder, we have to rely on the untar routine preserving permissions from the archive.

@xrl
Copy link

xrl commented Apr 24, 2022

But I agree, it is curious the code was error about a tmp folder missing. Reading over the code it looks like the TLS config object is built from the secret and kept entirely in memory. I should break my cluster and get that exact error, sorry for the wild goose chase!

@neoaggelos
Copy link
Owner Author

Yes, indeed. Perhaps an alpine image image could be a good middle ground. This would also give the operator image a basic sh and some busybox utils for simple debugging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants