Meteor oplog query causing MongoDB to use 100% CPU #11993
I believe I have an explanation. There might be some gaps or flaws in my understanding of oplog tailing, or of the oplog in general, so I'm not yet 100% certain, but I have managed to reproduce and prevent the problem by taking certain actions based on my hypothesis, so... promising.

In our example scenario, we have two somewhat interconnected apps which mostly use their own databases. Every single time we load tested one of the apps, the tests caused thousands of log messages complaining about slow oplog queries from Meteor, ever-growing in duration, until the CPU usage of our MongoDB reached ~100%. This could also be seen in our monitoring dashboard's "Document Query Executor" graph:

🟥 The red rectangle is the time range where we ran a load test on one of the apps.
🟦 The blue rectangle is after the load test, showcasing how the number of scanned objects remains high even after the test concluded.

What is happening here, according to my hypothesis, is that the oplog tailing process of the dormant app has to keep scanning through the entire oplog for relevant entries it never finds.

The fix? Just make sure there's at least one entry in the oplog for every oplog tailing process (e.g. Meteor app) every now and then ✨ For example, in a very ad-hoc manner, I wrote a Puppeteer script to periodically create a new user in the dormant, not-being-tested app.

Here's how the same load test went after this "fix":

🟪 The purple rectangle is when we restarted the dormant app(s) that were stuck tailing the oplog for entries they couldn't find.
🟩 The green rectangle is another load test run, the only change being a few dummy transactions generated in the dormant app's database every 5 seconds. The DB load never got out of control and returned to normal after the load test ended.

The implications: in a setup of two or more Meteor apps sharing the same cluster or replica set, all using oplog tailing, the database will eventually slow down to a crawl when there's no activity for some period of time in any (one or more) of the apps, because their oplog tailing process will have to keep scanning through the entire oplog trying to find relevant entries.

(Possibly related to #10808, an unresolved and closed issue from 2019 about slow oplog tailing.)
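For illustration, here is a minimal sketch of that workaround as a periodic in-app "heartbeat" write, assuming a Meteor server; the `oplogHeartbeats` collection name is made up for the example:

```js
import { Meteor } from 'meteor/meteor';
import { Mongo } from 'meteor/mongo';

// Throwaway collection whose only job is to generate oplog entries.
const Heartbeats = new Mongo.Collection('oplogHeartbeats');

Meteor.startup(() => {
  Meteor.setInterval(() => {
    // Any write works; it just has to land in this app's database so the
    // oplog tail finds at least one matching entry per interval.
    Heartbeats.upsert('beat', { $set: { at: new Date() } });
  }, 5000); // every 5 seconds, like the dummy transactions described above
});
```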
I can confirm that the exact same thing has been happening to us for a long time now (at least a couple of years). We moved everything that was easy to move to other clusters (new ones, or at least ones with smaller write throughput) and it helped. Reducing the oplog size also helped (docs). It'd be extremely easy and cheap to add a scheduled trigger that'd do something in each database. It could even use a separate collection that would receive a single insert and then get dropped immediately after, as shown in the sketch below.
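A sketch of that scheduled insert-then-drop idea using the plain MongoDB Node driver; the function name, collection name, and database names are placeholders, not from this thread:

```js
const { MongoClient } = require('mongodb');

// Insert one document into each listed database, then drop the scratch
// collection again. The insert is enough to produce a fresh oplog entry.
async function touchOplog(uri, dbNames) {
  const client = await MongoClient.connect(uri);
  try {
    for (const name of dbNames) {
      const coll = client.db(name).collection('oplog_touch');
      await coll.insertOne({ at: new Date() }); // generates one oplog entry
      await coll.drop(); // clean up so nothing accumulates
    }
  } finally {
    await client.close();
  }
}

// e.g. run from cron every few minutes, with placeholder database names:
// touchOplog(process.env.MONGO_URL, ['db1', 'db2']);
```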
Hey! AFAIK, the translation layer wouldn't be the cause, as it's just reading the new messages from the oplog and converting them to the format that Meteor understands. The code for the translation layer is here, and there are tests for it in case anyone wants to check it out in the meantime.
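For context: MongoDB 5.0 records updates in the oplog as `$v: 2` diffs, and the converter's job is to flatten those back into the `$set`/`$unset` style that Meteor's observe drivers expect. A simplified, illustrative version of that flattening (this is not Meteor's actual code, and array diffs are omitted):

```js
// Flatten a "$v: 2" oplog diff into { $set, $unset } form.
// Keys: 'i' = inserted fields, 'u' = updated fields, 'd' = deleted fields,
// 's<field>' = a nested sub-document diff.
function convertV2Diff(diff, prefix = '') {
  const out = { $set: {}, $unset: {} };
  for (const [key, value] of Object.entries(diff)) {
    if (key === 'i' || key === 'u') {
      for (const [field, v] of Object.entries(value)) {
        out.$set[prefix + field] = v;
      }
    } else if (key === 'd') {
      for (const field of Object.keys(value)) {
        out.$unset[prefix + field] = true;
      }
    } else if (key.startsWith('s')) {
      const nested = convertV2Diff(value, `${prefix}${key.slice(1)}.`);
      Object.assign(out.$set, nested.$set);
      Object.assign(out.$unset, nested.$unset);
    }
  }
  return out;
}

// convertV2Diff({ u: { name: 'x' }, sprofile: { d: { age: false } } })
// -> { $set: { name: 'x' }, $unset: { 'profile.age': true } }
```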
Thanks @fredmaiaarantes for looking at this. Following on from @arggh's hypothesis:
if there's just one DB on the cluster but it's idle, couldn't that have the same issue? I.e., scanning the entire oplog, causing all these scanned-documents warnings and the high disk utilisation?
Any update on this @fredmaiaarantes? This is still blocking us from upgrading past 2.5.8.
So you're all using Meteor's mongo oplog tailing? Wasn't that considered bad form back in 2015 due to scaling issues, hence redis-oplog, etc.? Or are you using it for a different purpose?
This might actually be the root cause for #10808 / #11578. We're only using one database on the cluster, and get error messages regularly. There is not much going on in the database for some time periods, which might explain why this is happening.
Yeah, I think they're all related, @derwaldgeist, although I think the addition of the oplogv2converter in 2.6 made matters worse. Running 2.5, I don't see any mention of the oplog in slow queries, but after moving to 2.6 it's constant. It'd be great if we could get someone from Meteor to have a look at this, @fredmaiaarantes.
Hey! Thanks for sharing more scenarios and related issues. |
Hi, |
That's interesting, @bricous. I hadn't even thought about the mongo server version. We're running 4.2, and I see in the changelog for 4.4 that a variety of things were changed related to the oplog. I'll try with a cluster running 4.4 and see if it goes away.
I have found that the new MongoDB driver has removed support for a flag used when tailing the oplog. I've filed this issue with the MongoDB Node driver: https://jira.mongodb.org/projects/NODE/issues/NODE-4467. I've also submitted this PR hoping they accept it: mongodb/node-mongodb-native#3337
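For context, a tailable oplog cursor opened with the Node driver looks roughly like the sketch below. Assuming, per the linked ticket, that the flag in question is the `oplogReplay` cursor option: it hints the server to seek directly to the requested timestamp instead of scanning the oplog from the beginning. (MongoDB 4.4+ applies this optimization automatically for oplog queries filtered on `ts`, which may explain the server-version observations above.)

```js
const { MongoClient } = require('mongodb');

// lastTs: a BSON Timestamp of the last oplog entry already processed.
async function tailOplog(uri, lastTs) {
  const client = await MongoClient.connect(uri);
  const oplog = client.db('local').collection('oplog.rs');
  const cursor = oplog.find(
    { ts: { $gt: lastTs } },
    {
      tailable: true,    // keep the cursor open as new entries are written
      awaitData: true,   // let the server block briefly instead of busy-polling
      oplogReplay: true, // the (assumed) flag: seek straight to the requested
                         // ts rather than scanning the whole oplog
    }
  );
  for await (const entry of cursor) {
    console.log(entry.ts, entry.ns, entry.op);
  }
}
```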
Stupid question: How do you swap out the driver? I thought it was baked into the Meteor framework somehow.
Yeah, the node driver version is determined by the Meteor version you're using. What I was referring to was the mongo server version. So I believe upgrading to 4.4 on our server might make the issue disappear.
Thanks for the clarification!
Hello there! I'm currently working on this, and I do see some big metrics on an app on version 2.7.4. I will try downgrading versions (currently testing 2.5.6) until I find the one that does not consume so many resources. I'm also looking at what changed between these versions in the oplog handling to have this much of an impact.
Hi @Grubba27, the changes you're looking for were introduced in version 4.0 of the MongoDB Node driver. I've included more details in the issue I opened with them: https://jira.mongodb.org/browse/NODE-4467. TL;DR: starting from driver 4.0, the flag is no longer honored; the linked issue has the specifics. I've already provided a patch to the MongoDB Node driver, as you can see in mongodb/node-mongodb-native#3337... but apparently this will only ship in 4.9.
We're not 100% sure yet, but it seems like our update from 2.5.2 to 2.7.3 just crashed our entire production deployment, and this might be the culprit. EDIT: We reverted all our systems back to 2.5.2 and the problem went away. This was a scary one for us, as it took 4 days after the deploy of 2.7 until the slowdown happened, making the diagnosis a bit more challenging.
What version of MongoDB is your server running, @Floriferous?
Also, the MongoDB driver 4.9 includes a potential fix for this, which will be released in the next 2.8 beta.
This potential fix is out in the latest 2.8 beta.
Yes!! I've been doing some tests using this new beta version, and I think we're getting close to a solution for this matter.
We're on 4.2.22
Hey @arggh. This is solved in 2.8.x
We're trying to migrate to the latest Meteor (or 2.6.1 to be specific) and latest MongoDB (5.0.6).
For reasons unknown, Meteor's oplog queries apparently get flagged as "slow queries", and the number of those queries keeps creeping up until our MongoDB is using all of its CPU to handle them.
Example output from `db.currentOp({"secs_running": {$gte: 3}, op: 'query'})`:
Stopping the Meteor app instantly reduces MongoDB CPU usage back to ~2%.
When restarting the Meteor app, it runs smoothly at first, but eventually the queries accumulate and the CPU usage jumps.
A timeline of the (~zero usage) staging environment's DB server CPU graph, where you can see it get progressively worse until we finally kill the Meteor app (side note: I can't remember if, or how many times, the app was restarted during this timeline).
Probably a crucial detail is that we are connecting from our app to another database using a separate connection string.
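A typical way to wire up a second database in Meteor looks roughly like this sketch; the hostnames, credentials, and collection name are placeholders, and only the `vendorsdev` name comes from this report:

```js
import { Mongo, MongoInternals } from 'meteor/mongo';

// Placeholder URLs and credentials; substitute your own.
const vendorsDriver = new MongoInternals.RemoteCollectionDriver(
  'mongodb://user:pass@db.example.com:27017/vendorsdev?replicaSet=rs0',
  // An oplogUrl pointing at the remote replica set's "local" database is
  // what turns on oplog tailing for this second connection.
  { oplogUrl: 'mongodb://user:pass@db.example.com:27017/local' }
);

export const Vendors = new Mongo.Collection('vendors', { _driver: vendorsDriver });
```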
...and all those "slow query" commands appear to be about the remote database ("vendorsdev"), not the main database, since they contain this `pattern` field:
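For reference, the tailing query's selector is roughly of this shape (illustrative only; Meteor's actual query is more involved). The `ns` regex is what surfaces in `currentOp` output as a `$regularExpression` with a `pattern` field, and when a database is idle there are no matching entries, so the scan covers the whole oplog:

```js
const { Timestamp } = require('mongodb');

// Placeholder for the timestamp of the last oplog entry already processed.
const lastProcessedTs = new Timestamp({ t: Math.floor(Date.now() / 1000), i: 0 });

const selector = {
  ts: { $gt: lastProcessedTs },     // only entries newer than the last one seen
  ns: { $regex: '^vendorsdev\\.' }, // only entries touching the tailed database
};
```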