
Fix build #98

Closed
wants to merge 6 commits into from

Conversation

thijsc (Collaborator) commented Nov 12, 2019

Attempt to fix the build. Also upgrades to librdkafka 1.2.2.

All individual tests are green based on some random sampling. It just loses connection to the broker after a while. Re-running failing specs immediately after produced a green result.

Not sure why this is breaking exactly.

thijsc (Collaborator Author) commented Nov 13, 2019

I think this might be a Docker Compose issue; no clue why it randomly stalls. We never see this happen when running this gem in production. Any ideas are welcome.

rgo commented Nov 17, 2019

Yesterday I was doing some tests and the consumer behaviour has now changed.
In the Consumer "#pause and #resume" context, the consumer doesn't get an assignment (line 65 fails).

If it helps, I modified the wait_for_assignment helper so that it raises an exception when no assignment arrives:

def wait_for_assignment(consumer)
  10.times do
    # Assignment received, nothing to raise.
    return if !consumer.assignment.empty?
    sleep 1
  end
  raise TypeError.new("No assignment for consumer")
end

I also looked for known Kafka+Docker issues, but I had no luck.

thijsc (Collaborator Author) commented Dec 2, 2019

Spent some more time on this, not getting anywhere. Tempted to get rid of the compose setup and run Kafka/Zookeeper locally again.

thijsc (Collaborator Author) commented Dec 9, 2019

Symptoms locally are that the connection to the broker is lost:

E, [2019-12-09T21:34:04.279632 #27004] ERROR -- : rdkafka: [thrd:localhost:9092/bootstrap]: localhost:9092/bootstrap: Failed to resolve 'localhost:9092': nodename nor servname provided, or not known (after 0ms in state CONNECT)
E, [2019-12-09T21:34:04.281905 #27004] ERROR -- : rdkafka: [thrd:localhost:9092/bootstrap]: localhost:9092/bootstrap: Failed to resolve 'localhost:9092': nodename nor servname provided, or not known (after 0ms in state CONNECT)
E, [2019-12-09T21:34:04.283745 #27004] ERROR -- : rdkafka: [thrd:localhost:9092/bootstrap]: 1/1 brokers are down

On Travis a bunch of other things break, but trying to get the build green locally first.

thijsc (Collaborator Author) commented Dec 9, 2019

Tried some bisection, but this is tricky since older librdkafka versions don't work anymore because of the timespec linking issue. @edenhill do you have any suggestions?

edenhill commented

What's the problem? The resolution errors for localhost?
That does look odd indeed. What we've seen in other places is that the resolver bugs out if the process runs out of fds (check the fd limit).
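
A minimal sketch (standard Ruby, no extra gems) of how the fd limit could be checked, and raised, from inside the spec process:

# Print the current file descriptor limits for this process.
soft, hard = Process.getrlimit(Process::RLIMIT_NOFILE)
puts "RLIMIT_NOFILE soft=#{soft} hard=#{hard}"

# Optionally raise the soft limit up to the hard limit before running the suite.
Process.setrlimit(Process::RLIMIT_NOFILE, hard, hard)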

thijsc (Collaborator Author) commented Dec 12, 2019

Tried running this with wider FD limits; same result. The test suite runs fine for a while, and runs fine if you run individual tests. If you run the whole set, it is unable to keep a connection after a while.

gaffneyc added a commit to gaffneyc/rdkafka-ruby that referenced this pull request Jan 3, 2020
As part of some work to look into the build failures (see karafka#98) I wanted
an easy way to run an automated bisect against different librdkafka
commits. This task is designed to be used during development and the
default task should still be used when installing the gem.
gaffneyc (Contributor) commented Jan 4, 2020

After a bit of poking I discovered that removing the AutoPointer from Config#native_kafka gets the tests running consistently on 1.2.2 for me locally.
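
For context, a rough sketch of what that change amounts to, assuming a bindings module shaped like Rdkafka::Bindings with rd_kafka_destroy attached; the helper name is illustrative, not the project's actual code:

require "ffi"

# Illustrative helper showing the two variants being compared.
def wrap_handle(handle, auto_destroy:)
  if auto_destroy
    # AutoPointer: the GC finalizer calls rd_kafka_destroy at a non-deterministic time.
    FFI::AutoPointer.new(handle, Rdkafka::Bindings.method(:rd_kafka_destroy))
  else
    # Raw pointer: nothing is destroyed implicitly; the caller owns cleanup.
    handle
  end
end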

thijsc (Collaborator Author) commented Jan 6, 2020

@gaffneyc Nice find! That suggests there's something wrong with the termination of the client. Going to poke around in this direction too.

thijsc (Collaborator Author) commented Jan 24, 2020

@gaffneyc I tried using the handle without the destroy; the test suite still loses the connection about halfway through. Did you make any other changes?

thijsc (Collaborator Author) commented Jan 24, 2020

Still completely stuck on this issue; I have no leads at the moment. Behaviour I'm seeing:

  • Running specs individually or in small groups succeeds.
  • Running the entire consumer spec makes rdkafka lose its connection to the local cluster about halfway through.

If anybody has suggestions on which direction to prod further, I'd be very happy to hear them. @edenhill any new thoughts on this maybe?

thijsc mentioned this pull request Jan 24, 2020
gaffneyc (Contributor) commented

@thijsc No, sorry, I spent a couple of hours on it one night and wasn't able to get it working. My theory was that using AutoPointer was causing issues because when objects get collected is non-deterministic, and most (all?) of them need to be freed before the handle can be destroyed, otherwise it blocks.

To test the theory I ended up starting a new client from scratch and had enough luck with it that I just... kept... going. I recently open sourced it here: https://github.com/deadmanssnitch/kafka/ and we're using it in production as of this week.

thijsc (Collaborator Author) commented Jan 24, 2020

Hm, not super happy about that honestly. I think we don't need further fragmentation and would rather welcome your contributions here.

gaffneyc (Contributor) commented Jan 24, 2020

I'm sorry 😞 it wasn't my intention to upset you or fragment the community. It started as a proof of concept and I felt that it diverged enough from rdkafka that it made sense to be its own thing.

thijsc (Collaborator Author) commented Jan 24, 2020

You are of course within your rights to do whatever you want. It just seems to me that we're better off with one wrapper gem if there is no real divergence on the approach. Shall we discuss moving these two projects back into one via another channel?

In the meantime I tried removing all destroy calls to see if that fixes the consumer spec; it doesn't. So I think incorrect cleanup being the cause is a red herring.

gaffneyc (Contributor) commented Jan 24, 2020

Shall we discuss moving these two projects back into one via another channel?

Feel free to email me at my github username at gmail.

Two thoughts to quickly decide whether it is AutoPointer causing the issues. First, when a test fails, run the GC and then check ObjectSpace to see how many AutoPointers or other Rdkafka objects are still in memory. Second, I believe calling rd_kafka_destroy with debug logging enabled will print counts of the remaining objects attached to the handle. If anything, that could tell whether cleanup is worth investigating as the cause.
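
A sketch of the first suggestion, assuming rdkafka is already loaded in the spec process (class names are the gem's public ones; the counts are only a rough signal):

require "ffi"

# Force a GC, then count wrapper objects that are still alive. A stubbornly
# non-zero AutoPointer count after a failing spec would hint that finalizers
# (and therefore rd_kafka_destroy) haven't run yet.
GC.start
live_auto_pointers = ObjectSpace.each_object(FFI::AutoPointer).count
live_consumers     = ObjectSpace.each_object(Rdkafka::Consumer).count
live_producers     = ObjectSpace.each_object(Rdkafka::Producer).count
puts "AutoPointers: #{live_auto_pointers}, consumers: #{live_consumers}, producers: #{live_producers}"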

thijsc (Collaborator Author) commented Jan 24, 2020

I actually just removed all autopointers and destroy calls, thus introducing a bunch of memory leaks. Same symptoms.

gaffneyc (Contributor) commented

@thijsc I took a pass at removing AutoPointer from TopicPartitionList (see #107) as there are some cases where librdkafka can return a pointer to a TPL (see #subscription for example) and the application is now required to call destroy on that TPL. I'm assuming this could lead to the handle not being able to be destroyed.

I was able to get the tests all passing locally with the change.
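
To illustrate the ownership change that #107 implies for callers, a hedged sketch; fetch_subscription_tpl is a hypothetical helper standing in for wherever librdkafka hands back a raw TPL pointer, and the destroy binding is assumed to be attached as in librdkafka's C API:

# With AutoPointer removed from TopicPartitionList, a TPL pointer returned by
# librdkafka has to be freed explicitly once we're done with it.
tpl_ptr = fetch_subscription_tpl(consumer) # hypothetical helper returning a raw TPL pointer
begin
  # ... copy topics/partitions out of tpl_ptr into plain Ruby objects ...
ensure
  Rdkafka::Bindings.rd_kafka_topic_partition_list_destroy(tpl_ptr)
end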

gaffneyc (Contributor) commented Jan 24, 2020

All passing except for one segfault where the Consumer or Producer's @native_kafka was a pointer to NULL and it tried to call rd_kafka_destroy on NULL.
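
One illustrative way to guard against that (a sketch only, not the fix that was eventually merged): skip the destroy when the stored pointer is already NULL.

# Illustrative guard: only call rd_kafka_destroy on a live, non-NULL handle.
def close
  return if @native_kafka.nil? || @native_kafka.null?
  Rdkafka::Bindings.rd_kafka_destroy(@native_kafka)
  @native_kafka = nil
end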

Adithya-copart (Contributor) commented Jan 29, 2020

@gaffneyc I had a similar segfault in a forked process when the AutoPointer releaser calls rd_kafka_destroy after the @native_kafka object is GC'd.

I think I made some progress in fixing the build failure by switching the docker images in #106. All the jobs failed with the same two spec failures.

Update:
I'm facing the same issue in #106 now. It happens before a fork call, which eliminates the chance of the object being GC'd in the new process and leading to a rd_kafka_destroy call.

I wonder if rd_kafka_consumer_close frees the object first and the AutoPointer finalizer is then called, leading to a segfault: https://github.com/appsignal/rdkafka-ruby/blob/d0d71f669daadbfeec056085b8abe065478be784/lib/rdkafka/consumer.rb#L24

All passing except for one segfault where the Consumer or Producer's @native_kafka was a pointer to NULL and it tried to call rd_kafka_destroy on NULL.

thijsc (Collaborator Author) commented Feb 18, 2020

Build was fixed in #108

thijsc closed this Feb 18, 2020
Adithya-copart pushed a commit to Adithya-copart/rdkafka-ruby that referenced this pull request Jun 1, 2020
thijsc deleted the fix_build branch February 28, 2021 12:32