Potential fix for issue 744 #830
base: develop
Conversation
This is a potential solution for GitHub issue ohler55#744, a randomly occurring segfault when `Oj::Doc.open` with a block is used multiple times across forks and the garbage collector kicks in.

* removes the use of the `ALLOC` and `ALLOCA` macros in favor of `RB_ALLOC_N` in the `doc_open*` and `parse_json` functions
* delegates all freeing of memory to the `doc_free` function
* associates the allocated JSON string memory with the doc before the doc is type wrapped
* removes the calls to `rb_gc_disable` and `rb_gc_enable`
Can you set up the test to generate the needed files instead of adding the huge files to the repo? I assume that running the test will fail without your changes. If it all looks good I'll remove the added dependency on the parallel gem to try to keep the dependencies down.
I wanted to see if this was an acceptable solution approach for the project before seeing if I can generate data that causes it to fail. That might take a bit more time. I'm willing to do that, and it is actually what I would want to be in there too. I'd also like to see if someone else can replicate the failure. Did checking out
Sorry for the delay. I tested before and after. Your fix, as expected, indeed fixes the issue. If you can clean up the dependencies and large files I'd be glad to merge. Thanks for helping out.
I think this was already taken care of. Okay to close?
@ohler55 is there another commit that does a similar thing on main? I've been away for a while, and I'll be doing a fair bit of performance testing this week on the codebase where this error appears, so I can check whether it is fixed in another version.
If I'm mistaken (always possible) this PR can be reopened. I thought it got taken care of in a slightly different way. I'll wait for your check.
@ohler55 I kicked off some long running processes using version ec9f0d1 and after running for almost an hour, they all got OOMKilled by Linux. The only change I made to the codebase from what was running yesterday was bumping oj from 3.11.2 to ec9f0d1. Rolling back to 3.11.2 and seeing if things work fine.
Did you mean 3.11.2 or the latest version? I guess it is okay though unless Oj is the cause of the out of memory error in which case there is a different issue to address. Right? |
I had my system running with oj 3.11.2, which had the issue that it would sometimes segfault. For this additional run I updated only oj to ec9f0d1 (that was the only thing that changed), and the EC2 instance fell over with an oomkiller error against the Ruby process after about an hour of running (I had to hard reboot). I did this 3 times just to make sure that it wasn't a fluke: oomkiller all 3 times. So I rolled back my system to use 3.11.2, which has the known issue of sometimes segfaulting, and the runs all went normally (randomly segfaulting). I'm fairly confident this means there is a memory leak in ec9f0d1. It may only arise when oj is used across forked processes.
Okay, will look into it and see if your changes fix it. Thanks |
@ohler55 I'll plan on doing a long run against my patch before the end of next week too. |
Well, unfortunately, it looks like my patch also suffers from the same long-term memory/reference leak. It takes about an hour of continuous running, but it also gets oom-killed. So for the moment I would not commit my patch, and I would also back out your version of the same fix. I'm not sure if I'll be able to figure this one out in the next week or so.
ok, I'll do some cleanup. |
The trimmed-down version of the code is in my PR at https://github.com/copiousfreetime/oj/blob/bugfix/issue-744-patch/test/test_744.rb, and the data is at https://github.com/copiousfreetime/oj/tree/bugfix/issue-744-patch/test/test_744-data. I haven't tried this test on the production-level infrastructure yet; I'm going to do that this week. The machine I'm running this on is an AWS r5.8xlarge, so 256GB of RAM and 32 CPUs, and while processing a file this machine's CPUs are fully saturated using the parallel gem.
The files that get processed are in the format of https://github.com/copiousfreetime/oj/blob/bugfix/issue-744-patch/test/test_744-data/small-dataset, but each one is much larger: generally 6-14GB in size and 5-10 million rows long. It looks like on my side the process is getting oomkilled after processing about 3 files, so after approximately 30 million rows of data. My next steps are:
If you really want to get crazy, my agoo repo has a debug.h and debug.c file that sets up a memory tracking system. Oj is not set up to use that, but with some grepping and replacing it could be made to identify leaks. I can take a look at that and see if I can get Oj modified, or at least on a branch.
I have a "mem-debug" branch that might be helpful. If you add a
Thanks - I'll see if I can get it to replicate on that branch this week. |
This is a potential fix for #744 and includes a test case that fails on my laptop. See
test/test_744.rb
and the notes in it. I don't expect this PR to be merged as is, since there is now a fairly large data file in the test directory so others can attempt to replicate the issue.
I have 2 branches
This is mostly a guideline for discussion to figure out the proper solution