Maintenance taking too much resources since 2.2.222 #3909
This change in behaviour is most likely due to the following commit.
I am very curious to see whether that number (30-40 GB) is for your 2.1.214 case, and if so, how it changed (if at all) with the latest version 2.2.224.
@andreitokar that sounds like the DB is never managing to finish a compaction properly, and is just periodically re-doing the compaction without making any actual progress.
I have a direct comparison only with 2.2.220. With 2.2.220 the DB was around 15 GB before SHUTDOWN COMPACT; with 2.2.222 and 2.2.224 it is ~35-45 GB. That really sounds like what @grandinj says: the auto compaction doesn't work properly in my case. I will experiment with AUTO_COMPACT_FILL_RATE and post the results.
oops
@andreitokar On the other hand, 13740e1 is working wonders for me. In my scenario, with 2.2.200 and earlier, my DB grew by several GBs every day. With 2.2.222 it grows by no more than several hundred MBs per day. That's a 5x-10x improvement, which is a lot.
Hello, I was also eager to get this housekeeping improvement/fix in order to have a database in 2.X that does not grow wildly (cf. #3848). But unfortunately, I have the same issue as the one reported by @V0174 😢 It looks like the H2 housekeeping keeps writing all the time. A quick profiling (60 sec.) shows this: Let me know if you need anything from me to help the investigation of this issue. Kind regards,
@vreuland If I understand correctly, the problem is that H2, while still "idle" (no client requests), keeps consuming resources without any perceivable benefit, and this goes on forever, right? While the db is in that state, can you periodically query the INFORMATION_SCHEMA.SETTINGS table and see what the values are for:
This might give some insights as to what went wrong with housekeeping. Another question: if you shut down the database, then restart it and keep it idle, will it continue its vicious cycle? And if so, can you share such a database file?
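For reference, a minimal sketch of such a periodic check over plain JDBC. The URL, credentials, and sampling interval are placeholders; the column names are those of the INFORMATION_SCHEMA.SETTINGS table in H2 2.x.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Hedged sketch of the periodic check suggested above: dump the contents of
// INFORMATION_SCHEMA.SETTINGS while the database sits "idle". JDBC URL,
// credentials and interval are placeholders.
public class SettingsProbe {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./testdb", "sa", "");
             Statement st = conn.createStatement()) {
            for (int i = 0; i < 10; i++) {
                try (ResultSet rs = st.executeQuery(
                        "SELECT SETTING_NAME, SETTING_VALUE FROM INFORMATION_SCHEMA.SETTINGS ORDER BY SETTING_NAME")) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1) + " = " + rs.getString(2));
                    }
                }
                Thread.sleep(60_000); // sample once a minute while idle
            }
        }
    }
}
```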
This helped a lot, thanks. It was the weekend so there was a bit less data, but the difference is clear. With AUTO_COMPACT_FILL_RATE set to 50, the average CPU went down 93% (!) and, counterintuitively, even the DB size before compacting went down by about half. Still, "manual" compacting on shutdown reduces the size to about one fourth of the original size, but that is much better than the one eighth before. I will experiment some more, but it will take time.
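For anyone trying the same thing, a hedged sketch of how the setting can be applied when the database is reopened; AUTO_COMPACT_FILL_RATE is a database setting appended to the JDBC URL, and the path and credentials below are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;

// Sketch: reopen the database with AUTO_COMPACT_FILL_RATE=50 (the value
// reported to help above). Path and credentials are made up.
public class OpenWithFillRate {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:h2:./data/mydb;AUTO_COMPACT_FILL_RATE=50";
        try (Connection conn = DriverManager.getConnection(url, "sa", "")) {
            // ... normal workload; background compaction now targets a lower fill rate
        }
    }
}
```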
Hi @andreitokar, Yes, correct. Even though there are no requests, it keeps consuming resources for no benefit, and that goes on forever. I will try the database shutdown/restart and keep you posted. Regards,
Hello @andreitokar I finally had a bit of time to look further into this issue. If I simply shut down the database and then reopen it, the issue is still there. In order to understand better what is happening, you can find attached a dump (using MVStoreTool -dump) of the chunks metadata: mvstore-dump.zip
You can find the log attached: tracing.txt. My observations are:
Please have a look. Thank you.
This happened to me also, but with 2.2.220. I had a database that was opening/closing fine, but every attempt to add a new row to a table failed. SHUTDOWN COMPACT somehow repaired the DB.
@wburzyns, it looks like a totally different case. Of course, it is possible that SHUTDOWN COMPACT helped in both cases. SHUTDOWN COMPACT recreates the whole database file, so it's no wonder that it fixes (or may break) something along the way, although the original problem might be quite different.
Please disregard mvstore-dump.zip attached above.
Could the fact that some blocks are "corrupted" explain the problem of the endless housekeeping? @andreitokar please let me know if I can extract anything else that could be helpful to investigate this issue. Kind regards,
@vreuland I tend to agree with you that it's most likely an MVStoreTool problem. That code was neglected for a long time and is probably not up to date with the latest MVStore changes.
Greetings! Coming from #3948.
Although the second MVStore process no longer shows up as it did before.
You can reproduce the behaviour by:
For me this is a big problem because:
Setting
Can I recommend introducing a kind of timer, so that this housekeeping occurs only when idle within a certain period (e.g. every 20 mins when idle) -- but not continuously.
Fyi, this describes our particular use-case applied to the provided database. (Which is a ledger, where new Debit/Credit entries are appended continuously at the end.)
It's getting weird:
It almost looks like the lowest AUTO_COMPACT_FILL_RATE allowed the store structure to be repaired. I will repeat more tests later, starting from scratch with an empty/new database. Update: after a while, the IO problem returns with AUTO_COMPACT_FILL_RATE=90, so this "repair" does not seem to be permanent, or indeed 90 is actually causing the problem during the housekeeping attempt.
Yes and yes ( [removed] ). As written above, closing the DB, setting AUTO_COMPACT_FILL_RATE=5, and opening the DB seems to "repair" it (after a massive initial WRITE activity). After that, closing it and setting AUTO_COMPACT_FILL_RATE=90 does no harm anymore.
@manticore-projects Thank you for the test case, it really helped.
Thank you for your work and effort @andreitokar and team! We all do appreciate it. Merry Christmas.
bring it. |
@manticore-projects you have admin access now to the H2 repo |
Hello @andreitokar Thank you for the fix you have made. Looking further into this, I could see that:
I thought at first that the best approach would be to find a way to determine, before doing the chunk rewrite, whether this rewrite will help or not. But I am not sure that this is possible using only the chunks metadata we have today (live/dead pages) or without scanning all the maps...
This is the best version I could come up with, given the different test scenarios I have performed:
I would be very glad to have your view on this and wonder if this could be officially included if I draft a proper PR. Thank you in advance.
Hi @vreuland, I agree with you on both points
Hello @andreitokar , I have replied directly in the PR regarding the rationale behind the stopping condition I used (#4000 (comment)), but it is indeed probably not flawless. Let me try your new proposal with the different tests I have. I will probably need a couple of days but will return with a clear report. Thanks again.
Hi, I quickly added some "traces" (the main variables used in the housekeeping, dumped to stdout...) to see in more depth what was going on (cf. vreuland@ba42400). Here is the log: housekeeping-refinement-traces-3.txt. We can actually see that stopIdleHousekeeping never switches to "true":
I can maybe replace the condition
I still have the feeling that we "simply" need to have 2 clear conditions involved:
The equilibrium in those conditions is definitely hard to find, but it is an interesting challenge 😄 Kr,
indeed makes more sense - if the rewritten amount was so small that the overall fill rate stays (almost) the same, what is the point of continuing?
It simply says that housekeeping needs to be resumed if rewritableChunksFillRate drops by more than 2% from its value at the time we stopped. The idea is to resume when more chunks become eligible for rewriting while the database is still idle and housekeeping is stopped - otherwise it seems like a lost opportunity. On a somewhat unrelated note: I wonder if we should decrease the default for RETENTION_TIME from 45000 to, let's say, 1000? BTW, you do not have to do any surgery on H2, but instead can do a query like
Such a query does not do any I/O and therefore won't disturb the "idle" status.
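As an illustration of the RETENTION_TIME idea above, a minimal sketch; the 1000 ms value simply mirrors the suggestion (the default is 45000 ms), and the URL and credentials are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Hedged sketch: lower RETENTION_TIME for the running database from the
// 45000 ms default to 1000 ms. Whether this value suits a given workload is
// exactly the open question above; path and credentials are made up.
public class LowerRetentionTime {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./data/mydb", "sa", "");
             Statement st = conn.createStatement()) {
            st.execute("SET RETENTION_TIME 1000");
        }
    }
}
```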
I must admit I am still not fully convinced that rewritableChunksFillRate is the right candidate to use to decide whether to resume the housekeeping, mainly because of this effect of the retention time. At the same time, we have chunksFillRate, which gives, from my understanding, an "instantaneous" view of the current fill rate (no effect of retention or asynchronous operations) and therefore looks more suited to drive the pausing/resuming of the housekeeping. And that does not mean rewritableChunksFillRate is not used at all: it is used, but a step later, after the stop/resume housekeeping condition, simply to decide whether a rewrite should be tried in the current housekeeping iteration. I will continue to test your PR (with
Would it be possible to get your sample database again (the one you shared originally in #3909 (comment))? It would be great if I could also test the latest fix proposals on it 😃
I am not asking to tweak retention_time; I am just saying that it seems unreasonably high, has a huge effect on database file size, and will change the pattern of housekeeping work.
I think we agree here on pausing - chunksFillRate directly reflects our goal and failure to improve is the reason to stop.
How about making it an option or switch?
I am really sorry, I don't have it anymore, since @andreitokar's improvement solved that problem reliably for us on all servers and we always run with the latest Git.
Hello @andreitokar
Ok. I understand your point. It is clearly an optimization to resume the housekeeping a bit sooner in certain cases (even though, to be honest, I still do not clearly see what scenario could lead to those cases). Thank you for all your effort on this. Again, it is greatly appreciated.
Hi @andreitokar The main scenario has run for more than 2 days and consists in triggering every 15 minutes a high level of requests (lasting 5 min.) towards the app and staying idle in between. There is also a housekeeping job in the app that cleans, every hour, the data older than 2 days (meaning the database size should remain almost constant after 2 days), and on top of that, an online database backup is also taken every hour. Note that this scenario makes H2 behave badly using 2.1.214 (the database size quickly explodes) or 2.2.224 (high CPU usage in the idle periods).

In terms of DB file size, the results are ok. The database size is kept more or less at the same level (even if the graph pattern is slightly different).

In terms of CPU, it is also ok. CPU remains very low in idle periods.

There is just one thing I have observed: during the high-request periods, the CPU using andreitokar:issue-3909 is twice as high... Confirmed with a profiling of the instances: H2 is serializing (writing) twice as much in those high-request periods as when using the other PR... I have run other scenarios (a constant low request rate, for instance) and H2 was behaving ok and similarly with both PRs.

In the end, I believe your proposal/PR is fine. In my specific scenario I still think my PR behaves a bit better, but it is obviously difficult to assess whether that would be the case in other scenarios. And as for the higher CPU usage pointed out above, I think it is still ok and not too worrying. So please, go ahead with merging one of those PRs. 😄
Hello @andreitokar I hope you are doing great. Best regards,
I've dropped housekeeping resumption, because it clearly does nothing useful in my tests. Merged.
Hello,
the daily usage of my H2 instance (custom Java application with H2 library) can be divided into four phases:
Up to 2.1.214, this worked mostly well, with H2 taking just negligible CPU when not queried. But when I updated to 2.2.224, suddenly after phase 1 is finished, H2 continues to use a lot of resources in phase 2 - about 20-30% of the available CPU (8 cores), plus disk and RAM. Mostly the H2 thread and GC (see the screenshot below). After the DB is shut down and reopened, it does not consume many resources anymore. That means c. 12-16 hours of several cores utilized to the max, with a lot of disk writes and memory consumed.
The virtual machine has 8 cores and 16 GB of RAM. The DB is opened with the following parameters:
WRITE_DELAY=10000;DB_CLOSE_ON_EXIT=FALSE;CACHE_SIZE=2097152;MAX_COMPACT_TIME=60
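For context, these parameters are typically appended to the JDBC URL when opening the database; a minimal sketch with a placeholder path and credentials:

```java
import java.sql.Connection;
import java.sql.DriverManager;

// Sketch of how the parameters listed above are usually passed: appended to
// the JDBC URL at open time. Database path and credentials are placeholders.
public class OpenDatabase {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:h2:/data/mydb"
                + ";WRITE_DELAY=10000"
                + ";DB_CLOSE_ON_EXIT=FALSE"
                + ";CACHE_SIZE=2097152"   // cache size is given in KB, so roughly 2 GB
                + ";MAX_COMPACT_TIME=60";
        try (Connection conn = DriverManager.getConnection(url, "sa", "")) {
            // ... application workload
        }
    }
}
```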
The database size is typically around 30-40GB before daily compacting (5GB afterwards) and it has 10 tables with just a few columns (but a lot of records) with some hash indexes.
When experimenting with different versions, I discovered that this started with 2.2.222. I didn't find anything in the changelog regarding this so I suppose it is a side effect of some other change?
Does anyone else experience this? Could this be solved by some parameter tweaking? I understand that this is difficult to troubleshoot, but I don't even know where I would start about creating a test case. I'll try to experiment further, but it takes time.