Potential fix for issue 744 #830
base: develop
Conversation
This is a potential solution for GitHub issue ohler55#744, a randomly occurring segfault when `Oj::Doc.open` with a block is used multiple times across forks and the garbage collector kicks in.

* removes the use of the `ALLOC` and `ALLOCA` macros in favor of `RB_ALLOC_N` in the `doc_open*` and `parse_json` functions
* delegates all freeing of memory to the `doc_free` function
* associates the allocated JSON string memory with the doc before the doc is type wrapped
* removes the calls to `rb_gc_disable` and `rb_gc_enable`
Can you set up the test to generate the needed files instead of adding the huge files to the repo? I assume that running the test will fail without your changes. If it all looks good I'll remove the added dependency on the parallel gem to try to keep the dependencies down.
I wanted to see if this was an acceptable solution approach for the project before seeing if I can generate data that causes it to fail. That might take a bit more time. I'm willing to do that, and it is actually what I would want to be in there too. I'd also like to see if someone else can replicate the failure. Did checking out
Sorry for the delay. I tested before and after. Your fix, as expected, indeed fixes the issue. If you can clean up the dependencies and large files I'd be glad to merge. Thanks for helping out.
I think this was already taken care of. Okay to close?
@ohler55 is there another commit that does a similar thing on main? I've been away for a while, and I'll be doing a fair bit of performance testing this week on the codebase where this error appears, so I can check whether it is fixed in another version.
If I'm mistaken (always possible) this PR can be reopened. I thought it got taken care of in a slightly different way. I'll wait for your check.
@ohler55 I kicked off some long running processes using version ec9f0d1 and after running for almost an hour, they all got OOMKilled by Linux. The only change I made to the codebase from what was running yesterday was bumping oj from 3.11.2 to ec9f0d1. Rolling back to 3.11.2 and seeing if things work fine.
Did you mean 3.11.2 or the latest version? I guess it is okay though unless Oj is the cause of the out of memory error in which case there is a different issue to address. Right? |
I had my system running with oj 3.11.2, which had the issue that it would sometimes segfault. For this additional run I updated only oj to ec9f0d1 (that was the only thing that changed), and the EC2 instance fell over with an oomkiller error against the Ruby process after about an hour of running (I had to hard reboot). I did this 3 times just to make sure that it wasn't a fluke: oomkiller all 3 times. So I rolled back my system to use 3.11.2, which has the known issue of sometimes segfaulting, and the runs all went normally (randomly segfaulting). I'm fairly confident this means there is a memory leak in ec9f0d1. It may only arise when oj is used across forked processes.
Okay, will look into it and see if your changes fix it. Thanks |
@ohler55 I'll plan on doing a long run against my patch before the end of next week too. |
Well, unfortunately, it looks like my patch also suffers from the same long-term memory/reference leak. It takes about an hour of continuous running, but it also gets oom-killed. So for the moment I would not commit my patch, and I would also back out your version of the same fix. I'm not sure if I'll be able to figure this one out in the next week or so.
ok, I'll do some cleanup. |
The trimmed-down version of the code is in my PR at https://github.com/copiousfreetime/oj/blob/bugfix/issue-744-patch/test/test_744.rb, and the data is at https://github.com/copiousfreetime/oj/tree/bugfix/issue-744-patch/test/test_744-data. I haven't tried this test on the production-level infrastructure yet; I'm going to do that this week. The machine I'm running this on is an AWS r5.8xlarge, so 256GB of RAM and 32 CPUs, and while processing a file this machine's CPUs are fully saturated using the parallel gem.
The files that get processed are in the format of https://github.com/copiousfreetime/oj/blob/bugfix/issue-744-patch/test/test_744-data/small-dataset, but each one is much larger: generally 6-14GB in size and 5-10 million rows long. It looks like on my side the process is getting oomkilled after processing about 3 files, so after approximately 30 million rows of data. My next steps are:
If you really want to get crazy, my agoo repo has a debug.h and debug.c file that sets up a memory tracking system. Oj is not set up to use that, but with some grepping and replacing it could be made to identify leaks. I can take a look at that and see if I can get Oj modified, or at least on a branch.
I have a "mem-debug" branch that might be helpful. If you add a
Thanks - I'll see if I can get it to replicate on that branch this week. |
This is a potential fix for #744 and includes a test case that fails on my laptop. See
test/test_744.rb
and the notes in it. I don't expect this PR to be merged as is, since there is now a fairly large data file in the test directory so others can attempt to replicate the issue.
I have 2 branches
This is mostly a guideline for discussion to figure out the proper solution