
Postgres fails to start when Kubernetes doesn't chown volume upon mount #1690

gabegorelick opened this issue Mar 30, 2021 · 3 comments
If the Postgres data directory is not owned by the same user ID as the postgres process (currently 999 [1]), kotsadm-postgres-0 will crash with the following error:

FATAL: data directory "/var/lib/postgresql/data/pgdata" has wrong ownership
HINT: The server must be started by the user that owns the data directory.
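
If the pod stays up long enough to exec into (it will usually be crash-looping, so this may take a couple of tries), the mismatch is easy to confirm; a quick sketch, assuming the default pod name:

# Compare the directory's owner against the postgres UID (999)
kubectl exec kotsadm-postgres-0 -- stat -c '%u:%g %n' /var/lib/postgresql/data/pgdata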

With persistent volumes backed by most block storage devices, Kubernetes will recursively call chown and chmod on the mounted files and directories inside the volume, so the Postgres data directory will have the correct owner. But Kubernetes will not call chown/chmod for all volume types. The details are best summarized in this blog post:

Traditionally if your pod is running as a non-root user (which you should), you must specify a fsGroup inside the pod’s security context so that the volume can be readable and writable by the Pod.
...
But one side-effect of setting fsGroup is that, each time a volume is mounted, Kubernetes must recursively chown() and chmod() all the files and directories inside the volume - with a few exceptions noted below.
...
For certain multi-writer volume types, such as NFS or Gluster, the cluster doesn’t perform recursive permission changes even if the pod has a fsGroup. Other volume types may not even support chown()/chmod(), which rely on Unix-style permission control primitives.

All this means that if you use something like NFS for your persistent volumes, Kubernetes won't [2] call chown when the volume is mounted. Thus, Postgres will fail to start.
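
For illustration, here's roughly what the securityContext that triggers the recursive chown looks like; this is a minimal sketch, not kotsadm's actual manifest, and all names are hypothetical:

apiVersion: v1
kind: Pod
metadata:
  name: pg-example                  # hypothetical pod, for illustration only
spec:
  securityContext:
    runAsUser: 999                  # the postgres UID
    fsGroup: 999                    # on most block-backed volumes, kubelet recursively
                                    # chowns/chmods the volume to this group on mount;
                                    # on NFS/Gluster that step is skipped
  containers:
    - name: postgres
      image: postgres
      volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: pg-example-data  # hypothetical claim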

The traditional solution to this is to run chown on the relevant directories before Postgres starts, e.g. via an entrypoint or an init container that mounts the volume and runs chown first (a sketch of that pattern follows below). But to do that, you'd have to be able to edit kotsadm's Postgres deployment, and kots doesn't really make it possible to edit its resources before installing. Instead, all you can do is patch the postgres StatefulSet manually after kotsadm is installed.
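
For reference, a hedged sketch of that init-container pattern (the volume name is a guess at kotsadm's, and as noted further down, a plain chown like this fails on NFS shares with root squashing enabled):

initContainers:
  - name: fix-permissions                   # hypothetical init container
    image: busybox
    command: ['sh', '-c', 'chown -R 999:999 /var/lib/postgresql/data']
    securityContext:
      runAsUser: 0                          # chown requires root
    volumeMounts:
      - name: kotsadm-postgres              # assumed volume name
        mountPath: /var/lib/postgresql/data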

A better solution may be for kots to chown the Postgres data directory itself on startup.

[1]

RunAsUser: util.IntPointer(999),
FSGroup: util.IntPointer(999),

[2] At some point, Kubernetes will stabilize the fsGroupPolicy interface, and CSI drivers will be able to opt in to this behavior explicitly. But even then, enabling that on the default storage class just so kotsadm's Postgres works probably isn't the best solution.
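
For context, the opt-in would look something like this once fsGroupPolicy stabilizes (driver name hypothetical):

apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: example.csi.vendor.com   # hypothetical driver
spec:
  fsGroupPolicy: File            # always honor fsGroup and recursively change
                                 # ownership/permissions when the volume is mounted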


gabegorelick commented Mar 30, 2021

Here's what seems to be working for me. After installing kotsadm, wait for kotsadm-postgres to fail. Then, patch it (kubectl patch statefulset kotsadm-postgres) with the following patch. EDIT: updated to work across pod restarts.

spec:
  template:
    spec:
      securityContext:
        # Run as root so we can create user to match NFS UID. We will drop permissions later on.
        fsGroup: 0
        runAsUser: 0
      containers:
        - name: kotsadm-postgres
          command: ['/bin/bash']
          args:
            - '-x'
            - '-c'
            - |
              # This should match mountPath of the container
              mountpath=/var/lib/postgresql/data

              # Path to file that tells us whether DB has already been initialized
              kotsadm_initialized="$mountpath/kotsadm_initialized"

              # Grab group ID and user ID of the mounted volume.
              # Many storage implementations will generate these every time the volume is mounted.
              gid="$(stat -c '%g' "$mountpath")"
              uid="$(stat -c '%u' "$mountpath")"

              # User to run postgres as. When restarting this pod, a pguser account may already exist.
              pguser="pgdataowner$uid"

              if ! id "$pguser" &> /dev/null; then
                echo "Adding user $pguser as $uid:$gid"
                groupadd --system --gid "$gid" "$pguser"
                useradd --system --uid "$uid" -g "$pguser" --shell /usr/sbin/nologin "$pguser"
              fi

              if [ ! -e "$kotsadm_initialized" ]; then
                # Delete half-initialized data from last time this container ran initdb.
                # We want docker-entrypoint.sh to rerun initdb with the correct owner.
                rm -rf "$mountpath"/*

                # Don't delete data next time this pod restarts.
                touch "$kotsadm_initialized"
              fi

              # Run regular entrypoint as our custom user,
              # with extra logging so that we can confirm it's initialized everything correctly
              gosu "$pguser:$pguser" bash -x docker-entrypoint.sh postgres

This effectively forces postgres to run as the UID that owns the mount directory. While chowning the mount directory to the postgres UID (999) would be simpler, that doesn't work on NFS shares that have root squashing enabled.

Once the StatefulSet is patched, you'll then have to delete the existing kotsadm-postgres-0 pod yourself due to how forced rollback works for StatefulSets; the broken pod won't be replaced automatically.
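
For concreteness, the apply-and-restart steps might look like this (patch file name assumed; run in the namespace where kotsadm is installed):

# Apply the patch above, saved as e.g. patch.yaml (strategic merge patch)
kubectl patch statefulset kotsadm-postgres --patch "$(cat patch.yaml)"

# Recreate the pod so it picks up the patched spec
kubectl delete pod kotsadm-postgres-0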

After kotsadm-postgres is online, the kotsadm-migrations pod that was failing during all of this should finally finish successfully, although I'm not sure if there are timeouts that would prevent that.

One last note: it looks like there may be plans to change the postgres container to alpine, which would probably require tweaking this patch. Although since we're already patching the StatefulSet, we can always substitute in whatever Docker image we want.

gabegorelick commented

Here's the relevant PG code that enforces the ownership check. It seems like there's no way to relax this in PG.
https://github.com/postgres/postgres/blob/c30f54ad732ca5c8762bb68bbe0f51de9137dd72/src/backend/utils/init/miscinit.c#L331-L347


gabegorelick commented Apr 23, 2021

FYI, #1695, which is included in v1.37.0+, breaks the above workaround since the alpine container doesn't include things like useradd.

EDIT: more importantly, it's now using a read-only volume for /etc/passwd, so you can't add users.
