Meteor oplog query causing MongoDB to use 100% CPU #11993

Closed
arggh opened this issue Mar 30, 2022 · 25 comments

@arggh
Contributor

arggh commented Mar 30, 2022

We're trying to migrate to the latest Meteor (or 2.6.1 to be specific) and latest MongoDB (5.0.6).

For reasons unknown, Meteor's oplog queries are apparently getting flagged as "slow queries", and the number of those queries keeps creeping up until our MongoDB is using all of its CPU to handle them.

An example of one of these slow oplog queries, as logged by mongod (the same operations also show up in db.currentOp({"secs_running": {$gte: 3}, op: 'query' })):

{
  "t": { "$date": "2022-03-27T16:19:14.457+00:00" },
  "s": "I",
  "c": "COMMAND",
  "id": 51803,
  "ctx": "conn4072",
  "msg": "Slow query",
  "attr": {
    "type": "command",
    "ns": "local.oplog.rs",
    "command": {
      "find": "oplog.rs",
      "filter": {
        "ns": {
          "$regularExpression": {
            "pattern": "^(?:vendorsdev\\.|admin\\.\\$cmd)",
            "options": ""
          }
        },
        "$or": [
          { "op": { "$in": ["i", "u", "d"] } },
          { "op": "c", "o.drop": { "$exists": true } },
          { "op": "c", "o.dropDatabase": 1 },
          { "op": "c", "o.applyOps": { "$exists": true } }
        ],
        "ts": { "$gt": { "$timestamp": { "t": 1648249369, "i": 2 } } }
      },
      "tailable": true,
      "awaitData": true,
      "lsid": { "id": { "$uuid": "fabe6175-5735-47eb-a45c-96cf1c347a3a" } },
      "$clusterTime": {
        "clusterTime": { "$timestamp": { "t": 1648397948, "i": 1 } },
        "signature": {
          "hash": {
            "$binary": {
              "base64": "lNULMeGvId5JQPHrco+HO/pzQq4=",
              "subType": "0"
            }
          },
          "keyId": 7074586468303241220
        }
      },
      "$db": "local"
    },
    "planSummary": "COLLSCAN",
    "cursorid": 430247436557313352,
    "keysExamined": 0,
    "docsExamined": 174341,
    "numYields": 195,
    "nreturned": 0,
    "reslen": 228,
    "locks": {
      "Global": { "acquireCount": { "r": 196 } },
      "Mutex": { "acquireCount": { "r": 1 } }
    },
    "readConcern": { "level": "local", "provenance": "implicitDefault" },
    "storage": {},
    "remote": "10.0.2.3:48140",
    "protocol": "op_msg",
    "durationMillis": 1078
  }
}
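
For anyone else trying to see these operations in flight, here's a minimal mongosh sketch (the 3-second threshold is arbitrary) that lists the currently running oplog tails and their plan summaries; COLLSCAN there corresponds to what the log entry above shows:

```js
// Sketch: list long-running oplog tailing queries via db.currentOp().
// Run against the replica set primary in mongosh.
db.currentOp({
  op: "query",
  ns: "local.oplog.rs",
  secs_running: { $gte: 3 },
}).inprog.forEach((op) =>
  printjson({
    opid: op.opid,
    secs_running: op.secs_running,
    planSummary: op.planSummary, // COLLSCAN = the tail is scanning the whole oplog
    numYields: op.numYields,
    client: op.client,
  })
);
```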

Stopping the Meteor app instantly reduces MongoDB CPU usage back to ~2%.

When restarting the Meteor app, it runs smoothly at first, but eventually the queries accumulate and the CPU usage jumps.

A timeline of the CPU graph for our (otherwise ~zero-usage) staging environment DB server, where you can see the load become progressively worse until we finally kill the Meteor app (side note: I can't remember whether, or how many times, the app was restarted during this period).

[Screenshot 2022-03-30 at 14:51: staging DB server CPU usage climbing steadily until the app is stopped]

Probably a crucial detail: we are connecting from our app to a second database, using a collection driver configuration like so:

// use the 'vendorsdev' database instead of the default
const opts = isRemote
  ? {
      _driver: new MongoInternals.RemoteCollectionDriver(mongoUrl, { oplogUrl }),
    }
  : {};

export const Documents = new Mongo.Collection('documents', opts);

...and all of those "slow query" commands appear to concern the remote database ("vendorsdev"), not the main database, since they contain this pattern field:

"pattern": "^(?:vendorsdev\\.|admin\\.\\$cmd)",
  • The version of Meteor showing the problem: 2.6.1
  • The operating system you're running Meteor on: Ubuntu 20.04
  • The expected behavior: modest CPU usage
  • The actual behavior: 100% CPU usage
@arggh
Contributor Author

arggh commented Apr 3, 2022

I believe I have an explanation. There might be some gaps or flaws in my understanding of oplog tailing, or of the oplog in general, so I'm not yet 100% certain, but I have managed to both reproduce and prevent the problem by acting on my hypothesis, so... promising.

In our example scenario, we have two somewhat interconnected apps, app1 and app2, which mostly use their own databases. The databases do, however, live in the same MongoDB replica set, so they share a common oplog. App1 had basically no usage at all, it just lay dormant, while app2 was being tested either manually or automatically (E2E via Cypress, load testing).

Every single time, these tests caused thousands of log messages complaining about slow oplog queries from Meteor, ever-growing in duration, until the CPU usage of our MongoDB reached ~100%. This could also be seen in our monitoring dashboard's "Document Query Executor" graph:

[Screenshot 2022-04-03 at 9:07: Document Query Executor graph during and after the load test]

🟥 The red rectangle is the time range where we ran a load test on app2.

🟦 The blue rectangle is after the load test, showcasing how the number of scanned objects remained high even after the test had concluded.

What is happening here, according to my hypothesis, is that the oplog tailing from the dormant app1 is constantly trying to find any entries that would belong to it, thus being forced to scan hundreds of thousands of documents in the oplog that is filled only with entries concerning operations in app2's database.

The fix?

Just make sure there's at least one entry in the oplog for every oplog tailing process (e.g. Meteor app) every now and then ✨

For example, in a very ad-hoc manner, I wrote a Puppeteer script to create a new user in the dormant, not-being-tested app1 every 5 seconds or so while running the load tests on app2.
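
A less ad-hoc way to do the same thing from inside the dormant app itself would be a tiny server-side heartbeat. This is only a sketch of the idea; the `oplogHeartbeats` collection name and the 5-second interval are my own placeholders, not anything Meteor provides:

```js
// server/oplog-heartbeat.js (sketch)
// Periodically write one small document into this app's own database so its
// oplog tailing cursor always finds at least one matching entry.
import { Meteor } from 'meteor/meteor';
import { Mongo } from 'meteor/mongo';

const OplogHeartbeats = new Mongo.Collection('oplogHeartbeats');

Meteor.startup(() => {
  Meteor.setInterval(() => {
    // Upsert a single fixed _id so the collection never grows.
    OplogHeartbeats.upsert({ _id: 'heartbeat' }, { $set: { at: new Date() } });
  }, 5000);
});
```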

Here's how the same load test went after this "fix":

[Screenshot 2022-04-03 at 9:18: the same graph during a load test run with the dummy writes in place]

🟪 The purple rectangle is when we restarted the dormant app(s) that were stuck tailing the oplog for entries they couldn't find.

🟩 The green rectangle is another load test run, the only change being a few dummy transactions generated in the dormant app's database every 5 seconds. The DB load never got out of control and returned to normal after the load test ended.

The implications

In a setup where two or more Meteor apps share the same cluster or replica set and all use oplog tailing, the database will eventually slow to a crawl whenever one or more of the apps sees no activity for some period of time, because those apps' oplog tailing processes have to keep scanning through the entire oplog trying to find relevant entries.

(Possibly related to #10808, an unresolved and closed issue from 2019 about slow oplog tailing.)

@radekmie
Collaborator

radekmie commented Apr 4, 2022

I can confirm that the exact same thing has been happening to us for a long time now (at least a couple of years). We moved everything that was easy to move to other clusters (new ones, or at least ones with smaller write throughput) and it helped. Reducing the oplog size also helped (docs).

It'd be extremely easy and cheap to add a scheduled trigger that'd do something in each database. It could even use a separate collection that receives a single insert and then gets dropped immediately afterwards.
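
A rough sketch of that variant (the collection name and interval are placeholders); note that both the insert and the drop produce oplog entries that Meteor's filter matches, the drop via the `op: "c", o.drop` branch shown in the query above:

```js
// server/oplog-scratch.js (sketch of the insert-then-drop idea)
import { Meteor } from 'meteor/meteor';
import { Mongo } from 'meteor/mongo';

const Scratch = new Mongo.Collection('oplogScratch');

Meteor.startup(() => {
  Meteor.setInterval(async () => {
    // The insert implicitly recreates the collection after each drop.
    await Scratch.rawCollection().insertOne({ at: new Date() });
    await Scratch.rawCollection().drop();
  }, 60 * 1000);
});
```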

@lynchem

lynchem commented Jun 8, 2022

So we had a similar-ish case, so I thought I'd comment here rather than open a new issue. We had staging and production in the same cluster, and when I launched 2.7 (previously 2.5) to staging, I noticed our DB cluster started sending more alerts than usual for slow queries with 20k docs scanned (our alarm threshold is usually just >2k).

These graphs are Process CPU & Disk Utilisation respectively.
[Screenshot: Process CPU and Disk Utilisation graphs]

Where it drops off is when I deployed staging using a separate cluster.

So now, with the staging DB moved to its own cluster, when there are no users on staging it's constantly firing the "scanned documents > 20K" (or even > 30K) alarm. If I log in and start doing regular operations, it stops. I can switch off the alarms for staging so it doesn't really matter there, but I'm worried that deploying this to production will also set those alarms off when all our users are asleep, or during brief periods of inactivity.

Is there a better way of solving this than writing some dummy record to the DB every few seconds? It'd also be great to understand what the underlying changes to the oplog handling in 2.6 were that provoked this. I saw mention of:
we are maintaining a translation layer for the new oplog format, so if you read or rely on any behavior of it please read our oplogv2converter.js code
Is there other work planned on this?

@fredmaiaarantes
Member

fredmaiaarantes commented Jun 10, 2022

Hey! Afaik, the translation layer wouldn't be the cause, as it's just reading the new messages from the oplog and converting them to the format that Meteor understands.
Unfortunately, we have not yet been able to prioritize this issue, but we will try to do so.

The code for the translation layer is here, and there are tests for it in case anyone wants to check it out in the meantime.

@lynchem

lynchem commented Jun 12, 2022

Thanks @fredmaiaarantes for looking at this. Following on from @arggh 's hypothesis.

What is happening here, according to my hypothesis, is that the oplog tailing from the dormant app1 is constantly trying to find any entries that would belong to it, thus being forced to scan hundreds of thousands of documents in the oplog that is filled only with entries concerning operations in app2's database.

If there's just one DB on the cluster but it's idle, couldn't that cause the same issue? I.e. scanning the entire oplog, causing all these scanned-documents warnings and the high disk utilisation?

@lynchem

lynchem commented Jul 4, 2022

Any update on this, @fredmaiaarantes? This is still blocking us from upgrading past 2.5.8.

@evolross
Contributor

evolross commented Jul 7, 2022

So you’re all using Meteor’s mongo oplog tailing? Wasn’t that considered bad form back in 2015 due to scaling issues, hence redis-oplog, etc? Or are you using it for a different purpose?

@derwaldgeist

derwaldgeist commented Jul 11, 2022

If there's just one DB on the cluster but it's idle, couldn't that cause the same issue? I.e. scanning the entire oplog, causing all these scanned-documents warnings and the high disk utilisation?

This might actually be the root cause for #10808 / #11578. We're only using one database on the cluster, and get error messages regularly. There is not much going on in the database for some time periods, which might explain why this is happening.

@lynchem

lynchem commented Jul 12, 2022

Yeah, I think they're all related @derwaldgeist although I think the addition of the oplogv2converter in 2.6 made matters worse. Running 2.5 I don't see any mention of the oplog in slow queries but move to 2.6 and it's constant. It'd be great if we could get someone from Meteor to have a look at this @fredmaiaarantes.

@fredmaiaarantes
Member

Yeah, I think they're all related @derwaldgeist although I think the addition of the oplogv2converter in 2.6 made matters worse. Running 2.5 I don't see any mention of the oplog in slow queries but move to 2.6 and it's constant. It'd be great if we could get someone from Meteor to have a look at this @fredmaiaarantes.

Hey! Thanks for sharing more scenarios and related issues.
Next week we plan to try to reproduce the reported scenarios and find the root cause of the problem.

@bricous

bricous commented Jul 27, 2022

Hi,
Maybe related? My single-instance replica set on MongoDB 4 went to nearly 100% CPU after moving my app to 2.6, and everything became very, very slow. I had to upgrade MongoDB from 4 to 4.4 to get back to the usual values (<3%); 4.2 did not resolve the 100% CPU either. I have never tried MongoDB > 4.4.

@lynchem

lynchem commented Jul 27, 2022

That's interesting @bricous. I hadn't even thought about the mongo server version. We're running 4.2 and I see in the changelog for 4.4 that a variety of things were changed related to the oplog. I'll try with a cluster running 4.4 and see if it goes away.

@sampaiodiego

I have found that the new MongoDB driver has removed support for the oplogReplay flag, because starting from MongoDB 4.4 it is applied automatically by the MongoDB server. But for MongoDB versions <= 4.2, trying to tail the oplog on big servers can take a long time without it. In fact, after we upgraded Rocket.Chat to Meteor 2.7, the oplog query started to time out on some servers, causing performance issues on the MongoDB servers due to a COLLSCAN over the entire oplog.rs collection, and real-time data stopped working in Rocket.Chat.

I've filed this issue with the MongoDB Node driver: https://jira.mongodb.org/projects/NODE/issues/NODE-4467

I've also submitted this PR hoping they accept it: mongodb/node-mongodb-native#3337
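
For anyone who wants to see where the flag fits, here is a rough sketch (not Meteor's or Rocket.Chat's actual code) of an oplog tail written directly against the Node driver. With the 3.x driver the oplogReplay option below is sent to the server as a cursor flag so it can seek by ts instead of collection-scanning; per the issue above, the 4.x driver currently drops it on the wire:

```js
// Sketch: tailing the oplog with the raw MongoDB Node driver.
import { MongoClient } from 'mongodb';

async function tailOplog(uri, lastTs /* a BSON Timestamp to resume from */) {
  const client = await MongoClient.connect(uri);
  const oplog = client.db('local').collection('oplog.rs');

  const cursor = oplog.find(
    { ts: { $gt: lastTs } },
    {
      tailable: true,
      awaitData: true,
      // Only meaningful on MongoDB <= 4.2; 4.4+ applies it automatically.
      oplogReplay: true,
    }
  );

  for await (const entry of cursor) {
    console.log(entry.ns, entry.op);
  }
}
```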

@derwaldgeist

I'll try with a cluster running 4.4 and see if it goes away.

Stupid question: How do you swap out the driver? I thought it's baked into the Meteor framework somehow.

@lynchem

lynchem commented Aug 3, 2022

Stupid question: How do you swap out the driver? I thought it's baked into the Meteor framework somehow.

Yeah, the Node driver version is determined by the Meteor version you're using. What I was referring to was the mongo server version. So I believe upgrading to 4.4 on our server might make the issue disappear.
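
If it helps anyone planning the same move, a quick shell sketch for checking what the cluster is actually running first; as far as I know, the featureCompatibilityVersion has to be at "4.2" before the 4.2 to 4.4 binary upgrade, and is then raised afterwards:

```js
// Sketch: confirm server version and feature compatibility in mongosh.
db.version(); // e.g. "4.2.x" -- the server only applies oplogReplay automatically from 4.4
db.adminCommand({ getParameter: 1, featureCompatibilityVersion: 1 });

// After upgrading the binaries to 4.4, opt in to the new behaviour:
// db.adminCommand({ setFeatureCompatibilityVersion: "4.4" });
```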

@derwaldgeist

Thanks for the clarification!

@Grubba27
Contributor

Hello there! I'm currently working on this, and I do see some big metrics on an app running version 2.7.4. I will try downgrading versions (currently testing 2.5.6) until I find the one that does not consume so many resources. I'm also looking at what changed in the oplog handling between these versions to have this much of an impact.

@sampaiodiego

Hi @Grubba27, the changes you're looking for were introduced in version 4.0 of the MongoDB Node driver. I've included more details in the issue I opened with them: https://jira.mongodb.org/browse/NODE-4467

tl;dr: starting from mongodb@4.0 they're no longer passing the oplogReplay cursor flag through the wire, which causes issues on MongoDB 4.2 and below. The MongoDB server applies this flag automatically starting from 4.4, which is why they removed it.

I've already provided a patch to the MongoDB Node driver, as you can see in mongodb/node-mongodb-native#3337, but apparently it will only ship in 4.9.

@Floriferous
Contributor

Floriferous commented Aug 30, 2022

We're not 100% sure yet, but it seems like our update from 2.5.2 to 2.7.3 just crashed our entire production deployment, and this issue looks like the culprit.

EDIT: We reverted all our systems back to 2.5.2 and the problem went away. This was a scary one for us, as it took 4 days since the deploy of 2.7 until the slowdown happened, making the diagnosis a bit more challenging.

@lynchem

lynchem commented Aug 30, 2022

What version of MongoDB is your server running, @Floriferous?

@radekmie
Collaborator

Also, the MongoDB driver 4.9 includes a potential fix for this, which will be released in the next 2.8 beta.

@fredmaiaarantes
Member

This potential fix is out on 2.8-beta.7, right @Grubba27?

@Grubba27
Contributor

Yes!! I've been doing some tests using this new beta version, and I think we are reaching a solution for this matter.
A big thanks to @sampaiodiego for his PR!
If anyone wants to give the new beta version a try, you can use:
meteor update --release 2.8-beta.7

@Floriferous
Contributor

What version of MongoDB is your server running, @Floriferous?

We're on 4.2.22

@Grubba27
Contributor

Grubba27 commented Nov 8, 2022

Hey @arggh, this is solved in 2.8.x.
If this occurs again for any reason, I will be happy to reopen this issue.

@Grubba27 Grubba27 closed this as completed Nov 8, 2022