
Potential fix for issue 744 #830

Open
wants to merge 3 commits into
base: develop

Conversation

copiousfreetime

This is a potential fix for #744 - and includes a test case that fails on my laptop. See test/test_744.rb and the notes in it.

I don't expect this PR to be merged as is, as there is a fairly large data file now in the test directory so others can attempt to replicate the issue.

I have 2 branches

This is mostly a guideline for discussion to figure out the proper solution

This is a potential solution for GitHub Issue ohler55#744, which reports a randomly
occurring segfault when Oj::Doc.open with a block is used multiple times
across forked processes and the garbage collector kicks in.

* removes the use of `ALLOC` and `ALLOCA` macros in favor of
  `RB_ALLOC_N` in the `doc_open*` and `parse_json` functions
* delegates all freeing of memory to the `doc_free` function
* associates the allocated json string memory to the doc before
  the doc is type wrapped
* removes the calls to `rb_gc_disable` and `rb_gc_enable`
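
For context, here is a minimal sketch of the usage pattern described above (not the actual test/test_744.rb, which relies on a much larger data file): Oj::Doc.open with a block, called repeatedly in forked child processes while the garbage collector runs. The JSON payload, fork count, and iteration counts are placeholders.

# Minimal sketch of the pattern that triggers the reported segfault.
# The JSON payload, fork count, and iteration counts are placeholders.
require "oj"

json = Oj.dump("records" => Array.new(1_000) { |i| { "id" => i, "name" => "row-#{i}" } })

pids = 4.times.map do
  fork do
    1_000.times do
      Oj::Doc.open(json) do |doc|
        doc.each_leaf { |d| d.where? } # walk every leaf of the parsed document
      end
      GC.start if rand(50).zero? # nudge the garbage collector between parses
    end
  end
end
pids.each { |pid| Process.waitpid(pid) }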
@copiousfreetime copiousfreetime changed the title from "Potential fix #744" to "Potential fix for issue 744" on Dec 11, 2022
@ohler55
Owner

ohler55 commented Dec 11, 2022

Can you set up the test to generate the needed files instead of adding the huge files to the repo?

I assume that running the test will fail without your changes.

If it all looks good I'll remove the added dependency on the parallel gem to try and keep the dependencies down.
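
For reference, a hypothetical sketch of what generating a fixture at test time could look like instead of committing a large file; the record fields, row count, and output path are placeholders, not the actual schema or layout used by test/test_744.rb.

# Hypothetical fixture generator; fields, sizes, and path are placeholders.
require "oj"

def write_fixture(path, rows)
  File.open(path, "w") do |f|
    rows.times do |i|
      # one JSON document per line, mimicking a large line-delimited dataset
      f.puts(Oj.dump("id" => i, "payload" => "x" * 128))
    end
  end
end

write_fixture("test/test_744-data/generated-dataset", 100_000)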

@copiousfreetime
Author

copiousfreetime commented Dec 12, 2022

I wanted to see if this was an acceptable solution approach that would work for the project before seeing if I can generate data that causes it to fail. That might take a bit more time. I'm willing to do that, and it's actually what I would want to be in there too.

I'd like to see if someone else can replicate the failure. Did checking out bugfix/issue-744 fail for you? Does running the following blow up for you?

git clone git@github.com:copiousfreetime/oj.git oj-issue-744
cd oj-issue-744 && git co bugfix/issue-744
bundle install && bundle exec rake compile && bundle exec ruby test/test_744.rb

@ohler55
Owner

ohler55 commented Dec 15, 2022

Sorry for the delay. I tested before and after. Your fix, as expected, indeed fixes the issue. If you can clean up the dependencies and large files I'd be glad to merge. Thanks for helping out.

@ohler55
Owner

ohler55 commented Jan 15, 2023

I think this was already taken care of. Okay to close?

@ohler55 ohler55 closed this Jan 16, 2023
@copiousfreetime
Author

@ohler55 is there another commit that does a similar thing on main? I've been away for a while, and I'll be doing a fair bit of performance testing this week on the codebase where this error appears, so I can see if it is fixed in another version.

@copiousfreetime
Author

@ohler55 oh I see 7280870 - I’ll test the latest version this week and confirm.

@ohler55
Owner

ohler55 commented Jan 16, 2023

If I'm mistaken (always possible) this PR can be reopened. I thought it got taken care of in a slightly different way. I'll wait for your check.

@copiousfreetime
Author

@ohler55 I kicked off some long-running processes using version ec9f0d1 and, after running for almost an hour, they all got OOM-killed by Linux. The only change I made to the codebase from what was running yesterday was bumping oj from 3.11.2 to ec9f0d1. Rolling back to 3.11.2 and seeing if things work just fine.

@ohler55
Owner

ohler55 commented Jan 24, 2023

Did you mean 3.11.2 or the latest version? I guess it is okay though, unless Oj is the cause of the out-of-memory error, in which case there is a different issue to address. Right?

@copiousfreetime
Author

I had my system running with oj 3.11.2, which had the issue that it would sometimes segfault.

I updated only oj to ec9f0d1 (that was the only thing that changed for this additional run), and the EC2 instance fell over with an OOM-killer error against the Ruby process after about an hour of running (I had to hard reboot).

I did this 3 times just to make sure that it wasn't a fluke - OOM-killed all 3 times.

So I rolled back my system to use 3.11.2, which has the known issue of sometimes segfaulting, and the runs all went normally (with the occasional random segfault).

I'm fairly confident that this means there is a memory leak in ec9f0d1. It may only arise when oj is used across forked processes.

@ohler55
Owner

ohler55 commented Jan 24, 2023

Okay, will look into it and see if your changes fix it. Thanks

@ohler55 ohler55 reopened this Jan 24, 2023
@copiousfreetime
Author

@ohler55 I'll plan on doing a long run against my patch before the end of next week too.

@copiousfreetime
Author

Well, unfortunately, it looks like my patch also suffers from the same long-term memory leak/reference leak. It takes about an hour of continuous running, but it also gets OOM-killed. So for the moment I would not commit my patch, and I would also back out your version of the same fix. I'm not sure if I'll be able to figure this one out in the next week or so.

@ohler55
Owner

ohler55 commented Jan 28, 2023

ok, I'll do some cleanup.

@copiousfreetime
Author

The trimmed-down version of the code is what I have in my PR at https://github.com/copiousfreetime/oj/blob/bugfix/issue-744-patch/test/test_744.rb, and the data is at https://github.com/copiousfreetime/oj/tree/bugfix/issue-744-patch/test/test_744-data. I haven't tried this test on the production-level infrastructure yet, but I'm going to do that this week.

The machine I'm running this on is an AWS r5.8xlarge (256GB of RAM and 32 CPUs), and while processing a file, the machine's CPUs are fully saturated using the parallel gem.

  • OS - Ubuntu 20.04
  • Ruby - 2.7.7
  • RubyGems - 3.1.6
  • Bundler - 2.4.3

The files that get processed are in the format of https://github.com/copiousfreetime/oj/blob/bugfix/issue-744-patch/test/test_744-data/small-dataset, but each one is much larger: generally 6-14GB in size and 5-10 million rows in length.

It looks like, on my side, the process is getting OOM-killed after processing about 3 files, so after approximately 30 million rows of data.

My next steps are:

  • probably change the code to die and get restarted after processing a single file instead of pipelining one file after another
  • see if the performance of using the fast parser is still there for my use case, and if not, stop using it
  • try the test script out and see if I can catch the memory leak in progress, or at least get it OOM-killed
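
For context, here is a rough sketch of this kind of pipeline, assuming the parallel gem forks worker processes that each parse rows with Oj::Doc; the file glob, worker count, and per-row handling are placeholders, not the actual production code.

# Rough sketch of the forked, parallel row-processing pipeline; paths,
# worker counts, and per-row work are placeholders.
require "oj"
require "parallel"

Dir.glob("data/*.json").each do |file|
  rows = File.foreach(file).to_a # one JSON document per line
  Parallel.each(rows, in_processes: 32) do |line|
    Oj::Doc.open(line) do |doc|
      doc.each_leaf { |d| d.where? } # stand-in for the real per-row work
    end
  end
end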

@ohler55
Owner

ohler55 commented Jan 30, 2023

If you really want to get crazy, my agoo repo has a debug.h and debug.c file that sets up a memory tracking system. Oj is not set up to use that, but with some grepping and replacing it could be made to identify leaks. I can take a look at that and see if I can get Oj modified, or at least on a branch.

@ohler55
Owner

ohler55 commented Feb 6, 2023

I have a "mem-debug" branch that might be helpful. If you add -DMEM_DEBUG to the cflags in the Makefile and then call Oj.mem_report periodically, you will see the memory in use. It will be slower and will use more memory, but it can be useful for tracking memory use issues.
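
As a usage note, a short sketch of what the periodic reporting might look like, assuming a build of the mem-debug branch with -DMEM_DEBUG in the cflags as described above; the input file and loop body are placeholders.

# Periodic memory reporting while exercising Oj::Doc; requires the mem-debug
# branch compiled with -DMEM_DEBUG. The input and workload are placeholders.
require "oj"

json = File.read("sample.json") # placeholder input document
1_000.times do |i|
  Oj::Doc.open(json) do |doc|
    doc.each_leaf { |d| d.where? }
  end
  Oj.mem_report if (i % 100).zero? # print allocations still in use
end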

@copiousfreetime
Author

Thanks - I'll see if I can get it to replicate on that branch this week.
