gRPC notes

[Note: This transition has been achieved and we now use gRPC exclusively for our RPC. We're quite happy with it.]

Our homegrown RPC has served us well, but the amount of copy and paste it requires, its inconsistent serialization, the prospect of future bugs and feature requests, and its bug track record so far make it undesirable to keep using. The copy-and-paste problem in particular has seriously hampered our ability to modify, maintain, and harden our internal APIs. RPC timeouts are also hard to debug in test and prod, because a slowdown can occur in the client, in RabbitMQ, or in the server, rather than just between a client and a server.

Instead, we'd like to use gRPC.

gRPC is an RPC library and code generator with many tricks we'd like to see in our own RPC layer. Like our current RPC, gRPC has TLS support and uses full-fledged types to represent messages. But gRPC is also multiplexed and proxyable, lets clients set deadlines on requests and cancel them, has authentication support, provides ways to attach per-request metadata such as trace IDs, and is built on top of HTTP/2 (which is what makes the earlier attributes in this list possible). It includes a code generator, so we can cut the hefty copy-and-paste cost of making and changing RPCs, it's cross-language, and it's open source and supported by Google, a reputable engineering organization. And, since gRPC runs over HTTP/2, we could replace our current MySQL-over-VPN connections to the other datacenter (which are not multiplexed, are secured only by the VPN, and use a protocol not designed for high-latency links) by contacting a gRPC'ed SA in the other datacenter instead of the MySQL database directly.
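To make the deadline and cancellation points concrete, here's a minimal sketch of what a deadlined gRPC call looks like from the client side. The message type, stub interface, and method shown (Registration, GetRegistration) are hypothetical stand-ins for whatever the generated code ends up looking like:

```go
package notes

import (
	"context"
	"time"

	"google.golang.org/grpc"
)

// Registration and registrationGetter stand in for the message type and
// client stub that protoc would generate for the SA service; the names and
// signatures here are hypothetical.
type Registration struct{ ID int64 }

type registrationGetter interface {
	GetRegistration(ctx context.Context, id int64, opts ...grpc.CallOption) (*Registration, error)
}

// dialSA opens one multiplexed HTTP/2 connection to the SA. In production we
// would pass TLS credentials rather than WithInsecure.
func dialSA(addr string) (*grpc.ClientConn, error) {
	return grpc.Dial(addr, grpc.WithInsecure())
}

// getWithDeadline makes a single call with a deadline attached. The deadline
// travels with the request, so the server (and any proxy in between) can see
// it, and cancel() aborts the call from the client side.
func getWithDeadline(c registrationGetter, id int64) (*Registration, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	return c.GetRegistration(ctx, id)
}
```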

There is one additional complication with moving to gRPC. One property of our RabbitMQ RPC layer was that no coordination was needed over which server process would handle any given client request. The servers would all just pluck requests off the incoming queue and respond to the correct randomly-generated client queue.

For gRPC, we're going to have to configure the clients so that they know which hosts to talk to.

One implementation of this would be to add command line flags to each service for each set of hosts it needs to talk to. For instance, the WFE would gain -ra_hosts and -sa_hosts flags containing comma-separated IP addresses and ports (or hostnames and ports). An alternative would be to ship a config file mapping service names to the IP addresses or hostnames (with ports) of the servers that can satisfy those services' requests.
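As a rough sketch of the flag-based variant (the config-file variant would just unmarshal the same mapping instead), assuming the -sa_hosts flag described above:

```go
package notes

import (
	"flag"
	"strings"
)

// saHosts is the comma-separated host:port list described above.
var saHosts = flag.String("sa_hosts", "", "comma-separated host:port list for the SA")

// parseHosts splits the flag value into individual addresses, dropping any
// empty entries.
func parseHosts(value string) []string {
	var hosts []string
	for _, h := range strings.Split(value, ",") {
		if h = strings.TrimSpace(h); h != "" {
			hosts = append(hosts, h)
		}
	}
	return hosts
}
```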

In the actual code, boulder would implement a Picker that randomly chooses between the hosts given. That interface is labeled as experimental, but it has been implemented privately enough that folks are just waiting on a load-balanced Picker (as mentioned on the gRPC mailing list) to remove that annotation.
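Since the Picker interface is experimental and its exact shape may change, it isn't reproduced here; the sketch below only shows the selection policy boulder would plug into it, applied at dial time for simplicity:

```go
package notes

import (
	"math/rand"

	"google.golang.org/grpc"
)

// dialRandom picks one of the configured hosts at random and dials it. A real
// Picker would make this choice inside gRPC's balancing machinery rather than
// at Dial time; this only illustrates the policy.
func dialRandom(hosts []string) (*grpc.ClientConn, error) {
	addr := hosts[rand.Intn(len(hosts))]
	return grpc.Dial(addr, grpc.WithInsecure())
}
```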

Currently, gRPC doesn't have a built-in way to intercept requests and responses, which would make things like latency measurements easy. Instead of adding them with a bunch of copy and paste, we would use the grpcinstrument plugin for the protobuf compiler. That plugin generates an Instrumentator interface and builder functions we would use to gather metrics.
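As a hand-rolled illustration of the kind of measurement that generated code would take care of per method (the actual Instrumentator interface produced by grpcinstrument will look different):

```go
package notes

import "time"

// observeLatency wraps a single RPC call and reports how long it took. The
// stats function stands in for whatever metrics sink we use; generated
// instrumentation would do this for every method without the boilerplate.
func observeLatency(method string, stats func(name string, d time.Duration), call func() error) error {
	start := time.Now()
	err := call()
	stats("RPC."+method, time.Since(start))
	return err
}
```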

With those pieces in place, we'd then run the gRPC clients and servers in parallel with the current RabbitMQ RPC code, behind config flags for each RPC method. We could maintain the current code paths with some translation from gRPC-generated types to the boulder structs. Along the way, we could slim down the old-style APIs first in order to make the translations easier.
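A sketch of how the per-method flag could work on the client side, with hypothetical type and flag names (both implementations would present the same boulder-struct interface, so callers don't change):

```go
package notes

import "context"

// Registration is a stand-in for the boulder struct returned by the SA.
type Registration struct{ ID int64 }

// saGetter is the narrow slice of the SA client used here; both the
// AMQP-backed and gRPC-backed clients would satisfy it after translating
// generated types back into boulder structs.
type saGetter interface {
	GetRegistration(ctx context.Context, id int64) (Registration, error)
}

// dualClient routes each method to one of the two implementations based on a
// config flag, so traffic can move over one RPC at a time with an easy
// fallback. All names here are hypothetical.
type dualClient struct {
	amqp, grpc           saGetter
	useGRPCRegistrations bool
}

func (c dualClient) GetRegistration(ctx context.Context, id int64) (Registration, error) {
	if c.useGRPCRegistrations {
		return c.grpc.GetRegistration(ctx, id)
	}
	return c.amqp.GetRegistration(ctx, id)
}
```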

After the migration is finished for a service, we'd delete the old RabbitMQ code.