
A second health GRPC endpoint to differentiate Kubernetes liveness/readiness checks (e.g.) #53

Open
wekb opened this issue Apr 17, 2019 · 5 comments



wekb commented Apr 17, 2019

NOTE: I have a change I can submit for review, given branch access.

TL;DR: This issue proposes a second gRPC health endpoint for use with HTTP-to-gRPC transcoders running within Kubernetes. Kubernetes supports both liveness and readiness probes, but for HTTP->gRPC transcoding use cases, two separate gRPC methods must be registered to differentiate their functionality. (Kubernetes probe definitions)

With grpc-gateway and Envoy, for example, transcoding is 1:1 from HTTP endpoint to gRPC method, and the transcoder can't multiplex two HTTP endpoints onto a single gRPC endpoint.

One alternative is to embed a stub HTTP server within an otherwise pure gRPC service; however, the business logic distinguishing liveness from readiness would then have to be duplicated in the HTTP layer rather than living in the parallel gRPC layer.

The proposed change essentially duplicates the Health gRPC service under the new "Readiness" name. This approach was chosen for its simplicity and small change footprint.

An alternative approach could be to add two additional endpoints, one for Liveness and one for Readiness, each perhaps with its own (duplicate) Request and Response structures. There's a certain naming cleanliness here, but it seems overly complex.

Feedback welcome!

Quick summary of liveness vs readiness use cases:

  • A liveness check should be used to verify the application/service itself
    is up and serving GRPC.
    IMPORTANT: Kubernetes will restart containers that fail a liveness check.
  • A readiness check should be used to verify that the application/service
    has completed loading any prerequisite data, and that the service's
    upstream dependencies (e.g., MySQL, Kafka) are available and working.
    IMPORTANT: Kubernetes will remove the container from the service load
    balancers, but will not restart the container.

In other words, liveness represents a permanent failure, while readiness
represents a transient one, and the application-side implementations of
these checks should differ accordingly.
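For context, the two probes above are wired up in the pod spec. A minimal illustrative fragment (the port, paths, and timings are hypothetical, not part of any proposal here):

```yaml
# Hypothetical pod spec fragment; paths match the /health and /ready
# endpoints discussed below.
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10
  failureThreshold: 3   # after 3 consecutive failures, k8s restarts the container
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5      # failing only removes the pod from Service endpoints
```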


ejona86 commented Apr 17, 2019

> NOTE: I have a change I can submit for review, given branch access.

You don't need branch access to create PRs. Create a GitHub clone, put your branch there, and then make the PR.

> (Kubernetes probe definitions)

Okay, just by reading the definitions: gRPC health is k8s readiness. gRPC doesn't have a liveness service, although honestly I tend to expect those to be call outs from the service to a watchdog. But I can still understand the calling being inverted.

> The proposed change essentially duplicates the Health GRPC and wraps it around the new "Readiness" name.

Liveness needs to check that the service is actually functional. How is it expected to work? Each service in a server registers a callback to the readiness service under a particular name. When the readiness service is called, the service calls the callback for the name requested. Once the callback completes the service completes the RPC; if the callback fails then it fails the RPC?

Since a failure of any service should cause a restart, it seems the "name" part of the health service isn't useful. Instead the liveness service should maybe loop through all of its services and issue their callbacks. That does sound bad though, as it could take a while to run.

Maybe have services register themselves and have the liveness service be a watchdog? Each service registered would be responsible for "pinging" the liveness service at their registered frequency. If they fail to do so, then the liveness service starts responding "dead".

The k8s documentation keeps using /healthz for liveness. In that case it sort of seems like using the same Health service we have now, but maybe services just register a _nonrecursive variant of their name. Then instead of hitting /grpc.health.v1.Health/Check you could hit /grpc.health.v1.Health/Check?service=mine_nonrecursive or /grpc.health.v1.Health/Check/mine_nonrecursive.


wekb commented Apr 17, 2019

> (Kubernetes probe definitions)
>
> Okay, just by reading the definitions: gRPC health is k8s readiness. gRPC doesn't have a liveness service, although honestly I tend to expect those to be call outs from the service to a watchdog. But I can still understand the calling being inverted.

Yes, the HTTP calls are *from* k8s to the service on, say, /health and /ready.

> The proposed change essentially duplicates the Health GRPC and wraps it around the new "Readiness" name.

> Liveness needs to check that the service is actually functional. How is it expected to work? Each service in a server registers a callback to the readiness service under a particular name. When the readiness service is called, the service calls the callback for the name requested. Once the callback completes the service completes the RPC; if the callback fails then it fails the RPC?

Yes, if the liveness check doesn't return a 200 in a timely fashion, the instance is considered dead and will be culled by k8s. If the readiness check doesn't return a 200 in a timely fashion, the instance is considered "stalled" and is removed from the load balancer. The only criterion k8s uses to judge these calls is a 2xx status code; the response body isn't used.

> Since a failure of any service should cause a restart, it seems the "name" part of the health service isn't useful. Instead the liveness service should maybe loop through all of its services and issue their callbacks. That does sound bad though, as it could take a while to run.

I'm not sure I'm grokking what you mean by callbacks. The liveness/readiness checks are simple inbound HTTP requests, which a gRPC service requires transcoding to gRPC; each of these RPCs then returns 2xx (or not) based on the business logic of liveness and, primarily, readiness (e.g., MySQL is down, so I can't function correctly right now).

> Maybe have services register themselves and have the liveness service be a watchdog? Each service registered would be responsible for "pinging" the liveness service at their registered frequency. If they fail to do so, then the liveness service starts responding "dead".

This is an interesting approach, but it's not what k8s supports AFAIK.

> The k8s documentation keeps using /healthz for liveness. In that case it sort of seems like using the same Health service we have now, but maybe services just register a _nonrecursive variant of their name. Then instead of hitting /grpc.health.v1.Health/Check you could hit /grpc.health.v1.Health/Check?service=mine_nonrecursive or /grpc.health.v1.Health/Check/mine_nonrecursive

Again, I think this could work, but I'm not sure it's workable in the context of pure transcoding, à la grpc-gateway and Envoy et al. Given a gRPC service using reflection in the context of Go, how would one register the same gRPC service as two different "endpoints"?

That would probably be a viable option, but if it's possible, we haven't seen examples of it.


wekb commented Apr 25, 2019

I also wanted to point out that the alternative "everyone" will need to adopt is to duplicate the health gRPC as a separate liveness/readiness service in each project, or in a shared repo for every engineering org. Example below. That much duplicated code suggests to me that it belongs in one place.

syntax = "proto3";

import "google/api/annotations.proto";

message LivenessCheckRequest {
  string service = 1;
}

message LivenessCheckResponse {
  enum ServingStatus {
    UNKNOWN = 0;
    SERVING = 1;
    NOT_SERVING = 2;
  }
  ServingStatus status = 1;
}

message ReadinessCheckRequest {
  string service = 1;
}

message ReadinessCheckResponse {
  enum ServingStatus {
    UNKNOWN = 0;
    SERVING = 1;
    NOT_SERVING = 2;
  }
  ServingStatus status = 1;
}

service HealthQuery {
  rpc LivenessCheck (LivenessCheckRequest)
      returns (LivenessCheckResponse) {
    option (google.api.http) = {
      get: "/health"
    };
  }

  rpc ReadinessCheck (ReadinessCheckRequest)
      returns (ReadinessCheckResponse) {
    option (google.api.http) = {
      get: "/ready"
    };
  }
}


wekb commented May 10, 2019

@ejona86 Any additional comments or feedback? Thanks.


wekb commented Sep 4, 2019

@ejona86 It would be great to make progress here. Thanks.
