
High availability support for Federation #349

Closed
shaneu opened this issue Nov 21, 2019 · 30 comments

@shaneu

shaneu commented Nov 21, 2019

First let me say we love federation on our team. We split our monolith into approx. 11 microservices and have been very pleased with the ease of the refactor and the results.

There were some issues around availability that we had to roll ourselves, and it led me to think maybe these would be better included in the federation/gateway package.

First of all, the gateway attempts to connect to the federated services, and if it is unable to, it still gets initialized, only without the services it failed to connect to. This is troublesome for an application that is deployed automatically, since it requires human intervention to verify the gateway initialized successfully.

Our workaround was to create a readinessProbe that performs the same query the gateway makes to the federated services, with the goal of making sure the services are available and ready to receive traffic. This probe has a configurable number of retries and a timeout before throwing. If the probe is unsuccessful we let the application crash, as it doesn't make sense for us to allow it to initialize without all services being available.

Perhaps this feature could be built into the gateway, along with a configurable number of retries? If connecting is unsuccessful, the gateway crashes and does not initialize. That behavior could itself be a configuration option, e.g. mustConnect: true/false (I'm sure there's a better name for the property).
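A minimal sketch of the proposed "mustConnect" behavior might look like the following. This is not the gateway's real API; `retryOrThrow` and its options are hypothetical, standing in for whatever step connects to the subgraphs at startup:

```typescript
// Sketch of the proposed "mustConnect" behavior: retry an async startup
// step a fixed number of times, then give up and let the process crash.
type RetryOptions = { retries: number; delayMs: number };

async function retryOrThrow<T>(
  fn: () => Promise<T>,
  { retries, delayMs }: RetryOptions,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Wait before the next attempt (skipped after the final one).
      if (attempt < retries) {
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}
```

With mustConnect: true, the gateway could wrap its initial subgraph fetch in something like `retryOrThrow(() => gateway.load(), { retries: 5, delayMs: 1000 })` and exit non-zero on failure, so an orchestrator can restart it.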

Other than that we let Kubernetes manage the availability of the other services via liveness probes, but for teams not using an orchestrator like Kubernetes, perhaps there is a way the gateway could check the availability of its downstream services?

Also, periodic reloading of the schema would be helpful. Take the case where we update the schema in one of our services and redeploy it. The gateway doesn't know the service's schema has changed, so we must restart the gateway for it to pick up the changes. This hurts availability by causing downtime while the gateway is redeployed. If the gateway refreshed its schema on a configurable polling interval, that issue could be avoided. We've developed a hack using Helm charts to tell the gateway to redeploy when one of the downstream services has been updated, but that still causes downtime.
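The polling idea can be sketched as below. This is an illustration, not the gateway's real API; `fetchSdl` and `onChange` are assumptions standing in for introspecting the subgraphs and swapping the gateway's schema:

```typescript
// Minimal sketch of schema polling with change detection.
class SchemaPoller {
  private lastSdl: string | undefined;

  constructor(
    private fetchSdl: () => Promise<string>,
    private onChange: (sdl: string) => void,
  ) {}

  // One polling step; returns true if the schema changed.
  async poll(): Promise<boolean> {
    const sdl = await this.fetchSdl();
    if (sdl === this.lastSdl) return false;
    this.lastSdl = sdl;
    this.onChange(sdl);
    return true;
  }

  // Fire poll() on a configurable interval; errors are swallowed so a
  // transient subgraph outage doesn't kill the loop.
  start(intervalMs: number): NodeJS.Timeout {
    return setInterval(() => this.poll().catch(() => {}), intervalMs);
  }
}
```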

If any of these ideas seems like a good addition to the project I would be more than happy to make a PR. I truly appreciate the work the Apollo team is doing and I'd love to contribute.

@moimikey

upvoteeeee 👏 👏 👏 👏 👏

@rtymchyk

rtymchyk commented Feb 5, 2020

It sounds to me like managed federation solves both of these issues? In managed mode, you will rarely need to re-deploy the gateway. Once it's up and running, it will get updated as federated services notify it via service:push.

@shaneu
Author

shaneu commented Feb 10, 2020

Managed isn't an option for every team, mine included. It would be nice to have this built into the gateway, or into a service that can be deployed alongside the gateway.

@rtymchyk

@shaneu can you explain why managed isn't an option for your team? I am just curious and playing the devil's advocate :)

@shaneu
Author

shaneu commented Feb 10, 2020

@rtymchyk I work for an enterprise bank that, due to internal policy and compliance rules, must manage all infrastructure on prem. Unless I'm incorrect, Apollo Graph Manager does not have an on-prem solution. I'm sure my team's situation in that regard is not unique.

@alanhoff

I'm also blocked by this... If a federated container needs to restart, the gateway becomes pretty much useless, since it won't try to reconnect/reload and will just start throwing 503s.

@rtymchyk

rtymchyk commented Feb 20, 2020

@alanhoff Is this an issue with federation? If your container needs to restart, you should be doing rolling restarts (or rolling updates for deployments), i.e. there shouldn't be a state where your service is down during a restart/deploy, regardless of whether or not it is used for federation.

@Quantumplation

Quantumplation commented Feb 21, 2020

@rtymchyk
That works in a perfect world, but making rolling deployments a requirement of the infrastructure / CI/CD pipelines limits the audience significantly, and can be quite fragile.

When spinning up new ephemeral environments, for example, Apollo federation induces a startup order on services. In very large organizations with hundreds of microservices, this kind of hard coded ordering of services is a PAIN to maintain.

As another example, when doing continuous deployment, a new version of a federated service means that you need to remember to redeploy the Apollo gateway(s) that depend on that service, hard coding this interdependency in the build pipeline, rather than in the service code itself.

Like @shaneu mentioned above, a managed-off-prem dependency is a non-starter for us.

Ideally, Apollo federation would have a mechanism to indicate "hey, the schema graph you constructed on startup may be out of date, please re-fetch from the same services", allowing us to orchestrate the refresh policy in a way that meets our own needs.

@Glen-Moonpig

Managed federation still doesn't re-initialize if it fails to pick up the schema. The polling timer never gets started, so you just have a dud running service.

@AndreBtt

AndreBtt commented Jan 5, 2021

@shaneu You said that you manage the availability of your services using Kubernetes. How do you update the gateway when some service goes down or a new service becomes available?

I'm asking because I'm also using Kubernetes, and to manage these services I dynamically update the array Apollo Gateway uses as its serviceList.

Take this code as an example:

const gateway = new ApolloGateway({ serviceList: services });

Once a service goes down or comes back up, I update the services array and use gateway.load(); to update the gateway without taking the server down. I'm not comfortable with this solution because it seems the load function restarts all my services, not only the one that changed.

Is there a better way to do this?

@abernix abernix transferred this issue from apollographql/apollo-server Jan 15, 2021
@Nicoowr

Nicoowr commented Jun 4, 2021

@rtymchyk Does the gateway immediately update its configuration whenever the registry is updated in managed mode? I thought the gateway polled the registry at an interval, which could introduce short disruptions on schema changes.

@rtymchyk

rtymchyk commented Jun 4, 2021

@Nicoowr It's not immediate, but it's close. It shouldn't matter too much though if your API evolution is always backwards compatible.

@Nicoowr

Nicoowr commented Jun 4, 2021

@rtymchyk Very clear, thanks. As a side note, breaking changes happen from time to time in our case; hope this won't be a problem though.

@rhzs

rhzs commented Jun 10, 2021

> Managed federation still doesn't re-initialize if it fails to pick up the schema. The polling timer never gets started, so you just have a dud running service.

After a year, I don't see improvement on this one. I am curious why this is not highest priority / critical. Does anyone use federated GraphQL in production at high scale without hitting this issue and blocker?

Can anyone point me to a best-practices example repo for deploying GraphQL services with a CI/CD pipeline that accommodates re-initialization if some services are offline?

@rtymchyk

> Managed federation still doesn't re-initialize if it fails to pick up the schema. The polling timer never gets started, so you just have a dud running service.
>
> After a year, I don't see improvement on this one. I am curious why this is not highest priority / critical. Does anyone use federated GraphQL in production at high scale without hitting this issue and blocker?
>
> Can anyone point me to a best-practices example repo for deploying GraphQL services with a CI/CD pipeline that accommodates re-initialization if some services are offline?

Are you talking about initial boot or mid lifecycle?

@glasser
Member

glasser commented Jun 10, 2021

> Managed federation still doesn't re-initialize if it fails to pick up the schema. The polling timer never gets started, so you just have a dud running service.
>
> After a year, I don't see improvement on this one. I am curious why this is not highest priority / critical. Does anyone use federated GraphQL in production at high scale without hitting this issue and blocker?

@rhzs The relevant change here is in Apollo Server. Starting with AS v2.22, the listen() method of apollo-server (or the new start() method if you're using an integration like apollo-server-express) will throw if the schema is not loaded properly on startup. If startup succeeds, the polling timer will get started and should run reliably (including if future attempts to update the schema have errors). There have also been some reliability improvements in Gateway itself around the polling logic, most notably in Gateway 0.23.

@rtymchyk

@glasser's change to listen has been working great in our Kubernetes environment :) No more dead pods on boot.

@kevinschaffter

Apart from going down the path of managed federation, is there any way to prevent the entire GQL gateway from crashing if one of the federated services is not available?

@rhzs

rhzs commented Jun 12, 2021

> Are you talking about initial boot or mid lifecycle?

Initial boot. Mid-lifecycle works (when a service went down and came back up again).

> Managed federation still doesn't re-initialize if it fails to pick up the schema. The polling timer never gets started, so you just have a dud running service.
>
> After a year, I don't see improvement on this one. I am curious why this is not highest priority / critical. Does anyone use federated GraphQL in production at high scale without hitting this issue and blocker?
>
> @rhzs The relevant change here is in Apollo Server. Starting with AS v2.22, the listen() method of apollo-server (or the new start() method if you're using an integration like apollo-server-express) will throw if the schema is not loaded properly on startup. If startup succeeds, the polling timer will get started and should run reliably (including if future attempts to update the schema have errors). There have also been some reliability improvements in Gateway itself around the polling logic, most notably in Gateway 0.23.

If startup succeeds, does that mean we need to resolve all services first? IMAGINE I have 100s of services in production. Does it make sense to wait for 100s of services to be UP and RUNNING for the gateway's "initial startup" to succeed?

I have noticed the improvement in the polling for handling service unavailability (kudos to the team), but that only works after you successfully start up all the services.

@glasser
Member

glasser commented Jun 14, 2021

Yes, your server can't start unless it knows its schema, which requires knowing all the subgraph schemas or being started directly with a supergraph schema. Polling a bunch of individual subgraphs is certainly a big process, which is part of why we've developed managed federation and other mechanisms for getting schemas into Gateway.
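The "started directly with a supergraph schema" route can be sketched roughly as follows, assuming a supergraph file composed ahead of time (e.g. during CI) and shipped with the deployment, so boot doesn't depend on every subgraph being reachable. The `loadSupergraph` helper and the file path are illustrative, not part of the gateway's API:

```typescript
import { readFileSync } from "node:fs";

// Read a pre-composed supergraph schema from disk and sanity-check it
// before handing it to the gateway, so a bad deploy fails fast.
function loadSupergraph(path: string): string {
  const supergraphSdl = readFileSync(path, "utf8");
  if (supergraphSdl.trim().length === 0) {
    throw new Error(`supergraph file at ${path} is empty`);
  }
  return supergraphSdl;
}

// Illustrative usage (ApolloGateway accepts a static supergraphSdl string):
// const gateway = new ApolloGateway({
//   supergraphSdl: loadSupergraph("./supergraph.graphql"),
// });
```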

@rhzs

rhzs commented Jun 15, 2021

@glasser that should at least be in this project's README / the Apollo docs on Apollo Federation scalability. There aren't many articles discussing Apollo Federation scalability; I didn't find this out until I experienced it on my own.

@rhzs

rhzs commented Jun 15, 2021

@glasser I am curious to know: couldn't we just check the GraphQL schema hash (using SHA) and cache the schema in a local file, rather than fetching all remote schemas again, and only fetch the necessary remote schema (or maybe just part of it)?
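The hash-and-cache idea could be built roughly like this. A minimal sketch with hypothetical names (`sdlHashes`, `schemaChanged`); persisting the map to a local file is left out:

```typescript
import { createHash } from "node:crypto";

// Keep a SHA-256 digest per subgraph and only recompose the supergraph
// when a fetched SDL's hash differs from the cached one.
const sdlHashes = new Map<string, string>();

function sdlHash(sdl: string): string {
  return createHash("sha256").update(sdl).digest("hex");
}

// Returns true when `name`'s schema is new or changed (recompose needed),
// false when the cached copy is still current.
function schemaChanged(name: string, sdl: string): boolean {
  const hash = sdlHash(sdl);
  if (sdlHashes.get(name) === hash) return false;
  sdlHashes.set(name, hash);
  return true;
}
```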

@glasser
Member

glasser commented Jun 15, 2021

Sure, that is something you could build with the primitives.

@j

j commented Jun 25, 2021

tldr; Please allow cached federated service configuration for HA.

@rhzs That makes way too much sense to do though. =P

Just like most GraphQL implementations allow you to emit a schema file (useful for development and for having other libraries utilize the schema, i.e. client-side generators, etc.).

If Apollo Federation did something similar, that'd make a ton more sense. Yes, maybe in development mode the gateway can fail if a service isn't live, but in production mode, if you could fall back on a cached file describing the services in case one is down, that would be a whole lot better than the gateway failing.

Likely scenario: I have a gateway running on GCR (Docker) and all my services running on GCR. If a service goes down and GCR scales my gateway instances up, I'll end up with a bunch of failed Docker containers, when I'd rather have the gateway queue up requests and serve gateway-timeout errors after a certain time, ideally still serving requests for all the services that are up. I get that we can have an entirely huge E2E test suite that runs on CI before going live, but sometimes we have to do a rushed push (configuration changes, etc.).

Also, let's say there are less-mission-critical services (admin analytics) and mission-critical services (payment processing), and we break the admin analytics service somehow or have downtime on it... our entire application is down? We're not making money?

This is one of the bigger reasons people move to a microservice architecture. I keep seeing "devil's advocate" responses pushing people to the managed service instead of actually acknowledging this can be an issue, though I get your reasons for not pushing people to a good solution?

I worked all day on moving to Federation, but I think I'm going to go back to a hand-coded GraphQL "gateway" with gRPC microservices, because I ran into this issue and can see where it'll become a PITA in production and cause revenue loss.

/cc @rtymchyk @glasser

@j

j commented Jun 30, 2021

@rhzs I found a public attempt at re-creating a schema registry.

https://github.com/pipedrive/graphql-schema-registry

It's pretty much the route I was attempting (the example gateway).

They're extending ApolloGateway, though, and recent changes making everything private will likely break a library like this, as the Apollo team doesn't really want people to extend ApolloGateway, etc. (which I get; they just need to support things like this better and more cleanly).

After playing with caching attempts, I'd like to figure out how to do the following (pseudo-code):

class ApolloGatewayFactory {
  protected gateway: ApolloGateway;
  protected pubsub: PubSub;

  async create(): Promise<ApolloGateway> {
    const { supergraphSdl } = await this.load();

    this.gateway = new ApolloGateway({ supergraphSdl });

    this.pubsub.subscribe('schema_changed', async () => this.update());

    return this.gateway;
  }

  async load(): Promise<SupergraphSdlGatewayConfig> {
    // ...get the supergraph from wherever you'd like, however you'd like
  }

  async update(): Promise<void> {
    const { supergraphSdl } = await this.load();

    this.gateway.update({ supergraphSdl });
  }
}

async function start(): Promise<void> {
  const factory = new ApolloGatewayFactory();

  const gateway = await factory.create();

  const server = new ApolloServer({
    gateway,
  });

  // ...
}

This is just one example. Here, I could build my own polling. I could create an endpoint to force-reload the gateway. I could use etcd, Consul, etc. I could use managed federation but create a fallback for higher availability just in case managed Apollo sees downtime.

Personally, I'd like to use "Managed Federation" just for CI and not service discovery. I'd like to be able to submit my schema changes and run validations, backwards-compatibility checks, etc. Then on success (or on failure if I'm feeling antsy), I want to deploy services however I want.

For us, we just don't rely on SaaS products, simply because we've lost thousands and thousands of dollars from downtime.

Once we did a $250k USD media buy for a single day of traffic and a service went down that we relied on.

Long story short, we lost about $100k due to that.

@kindermax

kindermax commented Jun 30, 2021

There is another OSS registry https://github.com/StarpTech/graphql-registry. Just FYI.
For me, it seems that without a registry it will be very hard to deploy and support federated graphs in production (I am talking about an on-prem registry). So either the Apollo team will open-source some kind of simple registry (which would be fantastic) or the community will figure it out on its own.

@j

j commented Jun 30, 2021

@kindermax That's cool too. Good to see people see the issue.

There just needs to be better baked-in support in the @apollo/* libraries for third parties to really build on. At any moment the graphql-registry you mentioned could just break. I did see that https://github.com/mercurius-js/mercurius implemented their own federation support using the spec. Part of me feels sad that this is an option.

@j

j commented Jun 30, 2021

@kindermax Ironically, their examples already show code similar to what I mentioned:

https://mercurius.dev/#/docs/federation

setTimeout(async () => {
  const schema = await server.graphql.gateway.refresh()

  if (schema !== null) {
    server.graphql.replaceSchema(schema)
  }
}, 10000)

@Borduhh

Borduhh commented Apr 20, 2022

Looking at how IntrospectAndCompose() works, would it be easiest to just implement individual retry logic when it attempts to retrieve a data source?

Here is the code behind that:

return dataSource
  .process({
    kind: GraphQLDataSourceRequestKind.LOADING_SCHEMA,
    request,
    context: {},
  })
  .then(({ data, errors }): ServiceDefinition => {
    if (data && !errors) {
      const typeDefs = data._service.sdl as string;
      const previousDefinition = serviceSdlCache.get(name);
      // this lets us know if any downstream service has changed
      // and we need to recalculate the schema
      if (previousDefinition !== typeDefs) {
        isNewSchema = true;
      }
      serviceSdlCache.set(name, typeDefs);
      return {
        name,
        url,
        typeDefs: parse(typeDefs),
      };
    }

    throw new Error(errors?.map((e) => e.message).join('\n'));
  })
  .catch((err) => {
    const errorMessage =
      `Couldn't load service definitions for "${name}" at ${url}` +
      (err && err.message ? ': ' + err.message || err : '');

    throw new Error(errorMessage);
  });

Then if you have 100 services and service number 58 fails, it will only re-retrieve the schema for that service instead of also re-retrieving services 1 through 57.

Happy to submit a PR if I am not missing something.
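The per-subgraph retry described above could be sketched like this. The names here (`loadAllWithRetry`, `loadSdl`) are hypothetical; `loadSdl` stands in for the LOADING_SCHEMA request shown in the excerpt:

```typescript
// Keep successfully loaded definitions and re-fetch only the subgraphs
// that failed, instead of restarting the whole introspection pass.
type LoadSdl = (name: string) => Promise<string>;

async function loadAllWithRetry(
  names: string[],
  loadSdl: LoadSdl,
  maxRounds: number,
): Promise<Map<string, string>> {
  const loaded = new Map<string, string>();
  let pending = names;
  for (let round = 0; round < maxRounds && pending.length > 0; round++) {
    const failed: string[] = [];
    for (const name of pending) {
      try {
        loaded.set(name, await loadSdl(name)); // cache success, never refetch
      } catch {
        failed.push(name); // only these are retried next round
      }
    }
    pending = failed;
  }
  if (pending.length > 0) {
    throw new Error(`Couldn't load service definitions for: ${pending.join(", ")}`);
  }
  return loaded;
}
```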

@trevor-scheer
Member

@shaneu I believe the gateway now behaves in the way you'd like (fails to start on init if subgraphs are unavailable). Additionally, the somewhat recent introduction of the supergraphSdl field (aka SupergraphManagers) permits a lot of flexibility in userland code as well (see #1246 for more details).

I see a handful of other concerns in this issue which I believe are also addressed by SupergraphManagers, but if not I'd welcome anyone to open a separate issue with a specific ask.

@Borduhh that sounds reasonable, would you mind opening a separate issue for discussion?

Thanks all. I'll gladly reopen this issue if the original issue is in fact not resolved.
