Meteor oplog query causing MongoDB to use 100% CPU #11993

Closed
arggh opened this issue Mar 30, 2022 · 25 comments

@arggh
Contributor

arggh commented Mar 30, 2022

We're trying to migrate to the latest Meteor (or 2.6.1 to be specific) and latest MongoDB (5.0.6).

For reasons unknown, Meteor's oplog queries are apparently getting flagged as "slow queries", and the number of those queries keeps creeping up until our MongoDB is using all of its CPU to handle them.

An example of one of these slow oplog queries, as logged by mongod (the same operations also show up in db.currentOp({"secs_running": {$gte: 3}, op: 'query' })):

{
  "t": { "$date": "2022-03-27T16:19:14.457+00:00" },
  "s": "I",
  "c": "COMMAND",
  "id": 51803,
  "ctx": "conn4072",
  "msg": "Slow query",
  "attr": {
    "type": "command",
    "ns": "local.oplog.rs",
    "command": {
      "find": "oplog.rs",
      "filter": {
        "ns": {
          "$regularExpression": {
            "pattern": "^(?:vendorsdev\\.|admin\\.\\$cmd)",
            "options": ""
          }
        },
        "$or": [
          { "op": { "$in": ["i", "u", "d"] } },
          { "op": "c", "o.drop": { "$exists": true } },
          { "op": "c", "o.dropDatabase": 1 },
          { "op": "c", "o.applyOps": { "$exists": true } }
        ],
        "ts": { "$gt": { "$timestamp": { "t": 1648249369, "i": 2 } } }
      },
      "tailable": true,
      "awaitData": true,
      "lsid": { "id": { "$uuid": "fabe6175-5735-47eb-a45c-96cf1c347a3a" } },
      "$clusterTime": {
        "clusterTime": { "$timestamp": { "t": 1648397948, "i": 1 } },
        "signature": {
          "hash": {
            "$binary": {
              "base64": "lNULMeGvId5JQPHrco+HO/pzQq4=",
              "subType": "0"
            }
          },
          "keyId": 7074586468303241220
        }
      },
      "$db": "local"
    },
    "planSummary": "COLLSCAN",
    "cursorid": 430247436557313352,
    "keysExamined": 0,
    "docsExamined": 174341,
    "numYields": 195,
    "nreturned": 0,
    "reslen": 228,
    "locks": {
      "Global": { "acquireCount": { "r": 196 } },
      "Mutex": { "acquireCount": { "r": 1 } }
    },
    "readConcern": { "level": "local", "provenance": "implicitDefault" },
    "storage": {},
    "remote": "10.0.2.3:48140",
    "protocol": "op_msg",
    "durationMillis": 1078
  }
}
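
For anyone else trying to see these operations in flight, here's a minimal mongosh sketch (the 3-second threshold is arbitrary) that lists the currently running oplog tails and their plan summaries; COLLSCAN there corresponds to what the log entry above shows:

```js
// Sketch: list long-running oplog tailing queries via db.currentOp().
// Run against the replica set primary in mongosh.
db.currentOp({
  op: "query",
  ns: "local.oplog.rs",
  secs_running: { $gte: 3 },
}).inprog.forEach((op) =>
  printjson({
    opid: op.opid,
    secs_running: op.secs_running,
    planSummary: op.planSummary, // COLLSCAN = the tail is scanning the whole oplog
    numYields: op.numYields,
    client: op.client,
  })
);
```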

Stopping the Meteor app instantly reduces MongoDB CPU usage back to ~2%.

When restarting the Meteor app, it runs smoothly at first, but eventually the queries accumulate and the CPU usage jumps.

A timeline of the CPU graph for our (otherwise ~zero-usage) staging environment DB server, where you can see the load become progressively worse until we finally kill the Meteor app (side note: I can't remember whether, or how many times, the app was restarted during this period).

[Screenshot 2022-03-30 at 14:51: staging DB server CPU usage climbing steadily until the app is stopped]

Probably a crucial detail: we are connecting from our app to a second database, using a collection driver configuration like so:

// use the 'vendorsdev' database instead of the default
const opts = isRemote
  ? {
      _driver: new MongoInternals.RemoteCollectionDriver(mongoUrl, { oplogUrl }),
    }
  : {};

export const Documents = new Mongo.Collection('documents', opts);

...and all of those "slow query" commands appear to concern the remote database ("vendorsdev"), not the main database, since they contain this pattern field:

"pattern": "^(?:vendorsdev\\.|admin\\.\\$cmd)",
  • The version of Meteor showing the problem: 2.6.1
  • The operating system you're running Meteor on: Ubuntu 20.04
  • The expected behavior: modest CPU usage
  • The actual behavior: 100% CPU usage
@arggh
Contributor Author

arggh commented Apr 3, 2022

I believe I have an explanation. There might be some gaps or flaws in my understanding of oplog tailing, or of the oplog in general, so I'm not yet 100% certain, but I have managed to both reproduce and prevent the problem by acting on my hypothesis, so... promising.

In our example scenario, we have two somewhat interconnected apps, app1 and app2, which mostly use their own databases. The databases do, however, live in the same MongoDB replica set, so they share a common oplog. App1 had basically no usage at all, it just lay dormant, while app2 was being tested either manually or automatically (E2E via Cypress, load testing).

Every single time, these tests caused thousands of log messages complaining about slow oplog queries from Meteor, ever-growing in duration, until the CPU usage of our MongoDB reached ~100%. This could also be seen in our monitoring dashboard's "Document Query Executor" graph:

[Screenshot 2022-04-03 at 9:07: Document Query Executor graph during and after the load test]

🟥 The red rectangle is the time range where we ran a load test on app2.

🟦 The blue rectangle is after the load test, showcasing how the number of scanned objects remained high even after the test had concluded.

What is happening here, according to my hypothesis, is that the oplog tailing from the dormant app1 is constantly trying to find any entries that would belong to it, thus being forced to scan hundreds of thousands of documents in the oplog that is filled only with entries concerning operations in app2's database.

The fix?

Just make sure there's at least one entry in the oplog for every oplog tailing process (e.g. Meteor app) every now and then ✨

For example, in a very ad-hoc manner, I wrote a Puppeteer script to create a new user in the dormant, not-being-tested app1 every 5 seconds or so while running the load tests on app2.
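
A less ad-hoc way to do the same thing from inside the dormant app itself would be a tiny server-side heartbeat. This is only a sketch of the idea; the `oplogHeartbeats` collection name and the 5-second interval are my own placeholders, not anything Meteor provides:

```js
// server/oplog-heartbeat.js (sketch)
// Periodically write one small document into this app's own database so its
// oplog tailing cursor always finds at least one matching entry.
import { Meteor } from 'meteor/meteor';
import { Mongo } from 'meteor/mongo';

const OplogHeartbeats = new Mongo.Collection('oplogHeartbeats');

Meteor.startup(() => {
  Meteor.setInterval(() => {
    // Upsert a single fixed _id so the collection never grows.
    OplogHeartbeats.upsert({ _id: 'heartbeat' }, { $set: { at: new Date() } });
  }, 5000);
});
```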

Here's how the same load test went after this "fix":

[Screenshot 2022-04-03 at 9:18: the same graph during a load test run with the dummy writes in place]

🟪 The purple rectangle is when we restarted the dormant app(s) that were stuck tailing the oplog for entries they couldn't find.

🟩 The green rectangle is another load test run, the only change being a few dummy transactions generated in the dormant app's database every 5 seconds. The DB load never got out of control and returned to normal after the load test ended.

The implications

In a setup where two or more Meteor apps share the same cluster or replica set and all use oplog tailing, the database will eventually slow to a crawl whenever one or more of the apps sees no activity for some period of time, because those apps' oplog tailing processes have to keep scanning through the entire oplog trying to find relevant entries.

(Possibly related to #10808, an unresolved and closed issue from 2019 about slow oplog tailing.)

@radekmie
Collaborator

radekmie commented Apr 4, 2022

I can confirm that the exact same thing has been happening to us for a long time now (at least a couple of years). We moved everything that was easy to move to other clusters (new ones, or at least ones with smaller write throughput) and it helped. Reducing the oplog size also helped (docs).

It'd be extremely easy and cheap to add a scheduled trigger that'd do something in each database. It could even use a separate collection that receives a single insert and then gets dropped immediately afterwards.
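
A rough sketch of that variant (the collection name and interval are placeholders); note that both the insert and the drop produce oplog entries that Meteor's filter matches, the drop via the `op: "c", o.drop` branch shown in the query above:

```js
// server/oplog-scratch.js (sketch of the insert-then-drop idea)
import { Meteor } from 'meteor/meteor';
import { Mongo } from 'meteor/mongo';

const Scratch = new Mongo.Collection('oplogScratch');

Meteor.startup(() => {
  Meteor.setInterval(async () => {
    // The insert implicitly recreates the collection after each drop.
    await Scratch.rawCollection().insertOne({ at: new Date() });
    await Scratch.rawCollection().drop();
  }, 60 * 1000);
});
```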

@lynchem

lynchem commented Jun 8, 2022

So we had a similar-ish case, so I thought I'd comment here rather than open a new issue. We had staging and production in the same cluster, and when I launched 2.7 (previously 2.5) to staging, I noticed our DB cluster started sending more alerts than usual for slow queries with 20k docs scanned (our alarm threshold is usually just >2k).

These graphs are Process CPU & Disk Utilisation respectively.
[Screenshot: Process CPU and Disk Utilisation graphs]

Where it drops off is when I deployed staging using a separate cluster.

So now, with the staging DB moved to its own cluster, when there are no users on staging it's constantly firing the "scanned documents > 20K" (or even > 30K) alarm. If I log in and start doing regular operations, it stops. I can switch off the alarms for staging so it doesn't really matter there, but I'm worried that deploying this to production will also set those alarms off when all our users are asleep, or during brief periods of inactivity.

Is there a better way of solving this than writing some dummy record to the DB every few seconds? It'd also be great to understand what the underlying changes to the oplog handling in 2.6 were that provoked this. I saw mention of:
we are maintaining a translation layer for the new oplog format, so if you read or rely on any behavior of it please read our oplogv2converter.js code
Is there other work planned on this?

@fredmaiaarantes
Member

fredmaiaarantes commented Jun 10, 2022

Hey! Afaik, the translation layer wouldn't be the cause, as it's just reading the new messages from the oplog and converting them to the format that Meteor understands.
Unfortunately, we have not yet been able to prioritize this issue, but we will try to do so.

The code for the translation layer is here, and there are tests for it in case anyone wants to check it out in the meantime.

@lynchem

lynchem commented Jun 12, 2022

Thanks @fredmaiaarantes for looking at this. Following on from @arggh 's hypothesis.

What is happening here, according to my hypothesis, is that the oplog tailing from the dormant app1 is constantly trying to find any entries that would belong to it, thus being forced to scan hundreds of thousands of documents in the oplog that is filled only with entries concerning operations in app2's database.

If there's just one DB on the cluster but it's idle, couldn't that cause the same issue? I.e. scanning the entire oplog, causing all these scanned-documents warnings and the high disk utilisation?

@lynchem

lynchem commented Jul 4, 2022

Any update on this, @fredmaiaarantes? This is still blocking us from upgrading past 2.5.8.

@evolross
Contributor

evolross commented Jul 7, 2022

So you’re all using Meteor’s mongo oplog tailing? Wasn’t that considered bad form back in 2015 due to scaling issues, hence redis-oplog, etc? Or are you using it for a different purpose?

@derwaldgeist

derwaldgeist commented Jul 11, 2022

If there's just one DB on the cluster but it's idle, couldn't that cause the same issue? I.e. scanning the entire oplog, causing all these scanned-documents warnings and the high disk utilisation?

This might actually be the root cause for #10808 / #11578. We're only using one database on the cluster, and get error messages regularly. There is not much going on in the database for some time periods, which might explain why this is happening.

@lynchem

lynchem commented Jul 12, 2022

Yeah, I think they're all related @derwaldgeist although I think the addition of the oplogv2converter in 2.6 made matters worse. Running 2.5 I don't see any mention of the oplog in slow queries but move to 2.6 and it's constant. It'd be great if we could get someone from Meteor to have a look at this @fredmaiaarantes.

@fredmaiaarantes
Member

Yeah, I think they're all related @derwaldgeist although I think the addition of the oplogv2converter in 2.6 made matters worse. Running 2.5 I don't see any mention of the oplog in slow queries but move to 2.6 and it's constant. It'd be great if we could get someone from Meteor to have a look at this @fredmaiaarantes.

Hey! Thanks for sharing more scenarios and related issues.
Next week we plan to try to reproduce the reported scenarios and find the root cause of the problem.

@bricous

bricous commented Jul 27, 2022

Hi,
Maybe related? My single-instance replica set on MongoDB 4 went to nearly 100% CPU after moving my app to 2.6, and everything became very, very slow. I had to upgrade MongoDB from 4 to 4.4 to get back to the usual values (<3%); 4.2 did not resolve the 100% CPU either. I have never tried MongoDB > 4.4.

@lynchem

lynchem commented Jul 27, 2022

That's interesting @bricous. I hadn't even thought about the mongo server version. We're running 4.2 and I see in the changelog for 4.4 that a variety of things were changed related to the oplog. I'll try with a cluster running 4.4 and see if it goes away.

@sampaiodiego

I have found that the new MongoDB driver has removed support for the oplogReplay flag, because starting from MongoDB 4.4 it is applied automatically by the MongoDB server. But for MongoDB versions <= 4.2, trying to tail the oplog on big servers can take a long time without it. In fact, after we upgraded Rocket.Chat to Meteor 2.7, the oplog query started to time out on some servers, causing performance issues on the MongoDB servers due to a COLLSCAN over the entire oplog.rs collection, and real-time data stopped working in Rocket.Chat.

I've filed this issue with the MongoDB Node driver: https://jira.mongodb.org/projects/NODE/issues/NODE-4467

I've also submitted this PR hoping they accept it: mongodb/node-mongodb-native#3337
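
For anyone who wants to see where the flag fits, here is a rough sketch (not Meteor's or Rocket.Chat's actual code) of an oplog tail written directly against the Node driver. With the 3.x driver the oplogReplay option below is sent to the server as a cursor flag so it can seek by ts instead of collection-scanning; per the issue above, the 4.x driver currently drops it on the wire:

```js
// Sketch: tailing the oplog with the raw MongoDB Node driver.
import { MongoClient } from 'mongodb';

async function tailOplog(uri, lastTs /* a BSON Timestamp to resume from */) {
  const client = await MongoClient.connect(uri);
  const oplog = client.db('local').collection('oplog.rs');

  const cursor = oplog.find(
    { ts: { $gt: lastTs } },
    {
      tailable: true,
      awaitData: true,
      // Only meaningful on MongoDB <= 4.2; 4.4+ applies it automatically.
      oplogReplay: true,
    }
  );

  for await (const entry of cursor) {
    console.log(entry.ns, entry.op);
  }
}
```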

@derwaldgeist

I'll try with a cluster running 4.4 and see if it goes away.

Stupid question: How do you swap out the driver? I thought it's baked into the Meteor framework somehow.

@lynchem

lynchem commented Aug 3, 2022

Stupid question: How do you swap out the driver? I thought it's baked into the Meteor framework somehow.

Yeah, the Node driver version is determined by the Meteor version you're using. What I was referring to was the mongo server version. So I believe upgrading to 4.4 on our server might make the issue disappear.
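
If it helps anyone planning the same move, a quick shell sketch for checking what the cluster is actually running first; as far as I know, the featureCompatibilityVersion has to be at "4.2" before the 4.2 to 4.4 binary upgrade, and is then raised afterwards:

```js
// Sketch: confirm server version and feature compatibility in mongosh.
db.version(); // e.g. "4.2.x" -- the server only applies oplogReplay automatically from 4.4
db.adminCommand({ getParameter: 1, featureCompatibilityVersion: 1 });

// After upgrading the binaries to 4.4, opt in to the new behaviour:
// db.adminCommand({ setFeatureCompatibilityVersion: "4.4" });
```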

@derwaldgeist

Thanks for the clarification!

@Grubba27
Contributor

Hello there! I'm currently working on this, and I do see some big metrics on an app running version 2.7.4. I will try downgrading versions (currently testing 2.5.6) until I find the one that does not consume so many resources. I'm also looking at what changed in the oplog handling between these versions to have this much of an impact.

@sampaiodiego

Hi @Grubba27, the changes you're looking for were introduced in version 4.0 of the MongoDB Node driver. I've included more details in the issue I opened with them: https://jira.mongodb.org/browse/NODE-4467

tl;dr: starting from mongodb@4.0 they're no longer passing the oplogReplay cursor flag through the wire, which causes issues on MongoDB 4.2 and below. The MongoDB server applies this flag automatically starting from 4.4, which is why they removed it.

I've already provided a patch to the MongoDB Node driver, as you can see in mongodb/node-mongodb-native#3337, but apparently it will only ship in 4.9.

@Floriferous
Contributor

Floriferous commented Aug 30, 2022

We're not 100% sure yet, but it seems like our update from 2.5.2 to 2.7.3 just crashed our entire production deployment, and this issue looks like the culprit.

EDIT: We reverted all our systems back to 2.5.2 and the problem went away. This was a scary one for us, as it took 4 days since the deploy of 2.7 until the slowdown happened, making the diagnosis a bit more challenging.

@lynchem

lynchem commented Aug 30, 2022

What version of MongoDB is your server running, @Floriferous?

@radekmie
Collaborator

Also, the MongoDB driver 4.9 includes a potential fix for this, which will be released in the next 2.8 beta.

@fredmaiaarantes
Member

This potential fix is out on 2.8-beta.7, right @Grubba27?

@Grubba27
Contributor

Yes!! I've been doing some tests using this new beta version, and I think we are reaching a solution for this matter.
A big thanks to @sampaiodiego for his PR!
If anyone wants to give the new beta version a try, you can use:
meteor update --release 2.8-beta.7

@Floriferous
Contributor

What version of MongoDB is your server running, @Floriferous?

We're on 4.2.22

@Grubba27
Contributor

Grubba27 commented Nov 8, 2022

Hey @arggh, this is solved in 2.8.x.
If this occurs again for any reason, I will be happy to reopen this issue.

@Grubba27 Grubba27 closed this as completed Nov 8, 2022