
Fix build #98

Closed
wants to merge 6 commits into from

Conversation

thijsc (Collaborator) commented Nov 12, 2019

Attempt to fix the build. Also upgrades to librdkafka 1.2.2.

All individual tests are green based on some random sampling. It just loses connection to the broker after a while. Re-running failing specs immediately after produced a green result.

Not sure why this is breaking exactly.

thijsc (Collaborator Author) commented Nov 13, 2019

I think this might be a Docker Compose issue; no clue why it randomly stalls. We never see this happen when running this gem in production. Any ideas are welcome.

rgo commented Nov 17, 2019

Yesterday I was doing some tests and the consumer behaviour has now changed.
In the Consumer "#pause and #resume" context, the consumer doesn't get an assignment (line 65 fails).

If it helps, I modified the wait_for_assignment helper so that it raises an exception when no assignment arrives:

def wait_for_assignment(consumer)
  10.times do
    # Assignment received, nothing to raise.
    return if !consumer.assignment.empty?
    sleep 1
  end
  raise TypeError.new("No assignment for consumer")
end

I also looked for known Kafka+Docker issues, but I had no luck.

thijsc (Collaborator Author) commented Dec 2, 2019

Spent some more time on this, not getting anywhere. Tempted to get rid of the compose setup and run Kafka/Zookeeper locally again.

thijsc (Collaborator Author) commented Dec 9, 2019

Symptoms locally are that the connection to the broker is lost:

E, [2019-12-09T21:34:04.279632 #27004] ERROR -- : rdkafka: [thrd:localhost:9092/bootstrap]: localhost:9092/bootstrap: Failed to resolve 'localhost:9092': nodename nor servname provided, or not known (after 0ms in state CONNECT)
E, [2019-12-09T21:34:04.281905 #27004] ERROR -- : rdkafka: [thrd:localhost:9092/bootstrap]: localhost:9092/bootstrap: Failed to resolve 'localhost:9092': nodename nor servname provided, or not known (after 0ms in state CONNECT)
E, [2019-12-09T21:34:04.283745 #27004] ERROR -- : rdkafka: [thrd:localhost:9092/bootstrap]: 1/1 brokers are down

On Travis a bunch of other things break, but trying to get the build green locally first.

thijsc (Collaborator Author) commented Dec 9, 2019

Tried some bisection, but this is tricky since older librdkafka versions don't work anymore because of the timespec linking issue. @edenhill do you have any suggestions?

edenhill commented

What's the problem? The resolution errors for localhost?
That does look odd indeed. What we've seen in other places is that the resolver bugs out if the process runs out of fds (check the fd limit).
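
A minimal sketch (standard Ruby, no extra gems) of how the fd limit could be checked, and raised, from inside the spec process:

# Print the current file descriptor limits for this process.
soft, hard = Process.getrlimit(Process::RLIMIT_NOFILE)
puts "RLIMIT_NOFILE soft=#{soft} hard=#{hard}"

# Optionally raise the soft limit up to the hard limit before running the suite.
Process.setrlimit(Process::RLIMIT_NOFILE, hard, hard)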

thijsc (Collaborator Author) commented Dec 12, 2019

Tried running this with wider FD limits; same result. The test suite runs fine for a while, and runs fine if you run individual tests. If you run the whole set, it is unable to keep a connection after a while.

gaffneyc added a commit to gaffneyc/rdkafka-ruby that referenced this pull request Jan 3, 2020
As part of some work to look into the build failures (see karafka#98) I wanted
an easy way to run an automated bisect against different librdkafka
commits. This task is designed to be used during development and the
default task should still be used when installing the gem.
gaffneyc (Contributor) commented Jan 4, 2020

After a bit of poking I discovered that removing the AutoPointer from Config#native_kafka gets the tests running consistently on 1.2.2 for me locally.
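
For context, a rough sketch of what that change amounts to, assuming a bindings module shaped like Rdkafka::Bindings with rd_kafka_destroy attached; the helper name is illustrative, not the project's actual code:

require "ffi"

# Illustrative helper showing the two variants being compared.
def wrap_handle(handle, auto_destroy:)
  if auto_destroy
    # AutoPointer: the GC finalizer calls rd_kafka_destroy at a non-deterministic time.
    FFI::AutoPointer.new(handle, Rdkafka::Bindings.method(:rd_kafka_destroy))
  else
    # Raw pointer: nothing is destroyed implicitly; the caller owns cleanup.
    handle
  end
end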

thijsc (Collaborator Author) commented Jan 6, 2020

@gaffneyc Nice find! That suggests there's something wrong with the termination of the client. Going to poke around in this direction too.

thijsc (Collaborator Author) commented Jan 24, 2020

@gaffneyc I tried using the handle without the destroy; the test suite still loses the connection about halfway through. Did you make any other changes?

thijsc (Collaborator Author) commented Jan 24, 2020

Still completely stuck on this issue; I have no leads at the moment. Behaviour I'm seeing:

  • Running specs individually or in small groups succeeds.
  • Running the entire consumer spec makes rdkafka lose its connection to the local cluster about halfway through.

If anybody has suggestions on which direction to prod further, I'd be very happy to hear them. @edenhill any new thoughts on this maybe?

thijsc mentioned this pull request Jan 24, 2020
gaffneyc (Contributor) commented

@thijsc No, sorry, I spent a couple of hours on it one night and wasn't able to get it working. My theory was that using AutoPointer was causing issues because when objects get collected is non-deterministic, and most (all?) of them need to be freed before the handle can be destroyed, otherwise it blocks.

To test the theory I ended up starting a new client from scratch and had enough luck with it that I just... kept... going. I recently open sourced it here: https://github.com/deadmanssnitch/kafka/ and we're using it in production as of this week.

thijsc (Collaborator Author) commented Jan 24, 2020

Hm, not super happy about that honestly. I think we don't need further fragmentation and would rather welcome your contributions here.

gaffneyc (Contributor) commented Jan 24, 2020

I'm sorry 😞 it wasn't my intention to upset you or fragment the community. It started as a proof of concept and I felt that it diverged enough from rdkafka that it made sense to be its own thing.

thijsc (Collaborator Author) commented Jan 24, 2020

You are of course within your rights to do whatever you want. It just seems to me that we're better off with one wrapper gem if there is no real divergence on the approach. Shall we discuss moving these two projects back into one via another channel?

In the meantime I tried removing all destroy calls to see if that fixes the consumer spec; it doesn't. So I think incorrect cleanup being the cause is a red herring.

gaffneyc (Contributor) commented Jan 24, 2020

Shall we discuss moving these two projects back into one via another channel?

Feel free to email me at my github username at gmail.

Two thoughts to quickly decide whether it is AutoPointer causing the issues. First, when a test fails, run the GC and then check ObjectSpace to see how many AutoPointers or other Rdkafka objects are still in memory. Second, I believe calling rd_kafka_destroy with debug logging enabled will print counts of the remaining objects attached to the handle. If anything, that could tell whether cleanup is worth investigating as the cause.
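
A sketch of the first suggestion, assuming rdkafka is already loaded in the spec process (class names are the gem's public ones; the counts are only a rough signal):

require "ffi"

# Force a GC, then count wrapper objects that are still alive. A stubbornly
# non-zero AutoPointer count after a failing spec would hint that finalizers
# (and therefore rd_kafka_destroy) haven't run yet.
GC.start
live_auto_pointers = ObjectSpace.each_object(FFI::AutoPointer).count
live_consumers     = ObjectSpace.each_object(Rdkafka::Consumer).count
live_producers     = ObjectSpace.each_object(Rdkafka::Producer).count
puts "AutoPointers: #{live_auto_pointers}, consumers: #{live_consumers}, producers: #{live_producers}"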

thijsc (Collaborator Author) commented Jan 24, 2020

I actually just removed all autopointers and destroy calls, thus introducing a bunch of memory leaks. Same symptoms.

gaffneyc (Contributor) commented

@thijsc I took a pass at removing AutoPointer from TopicPartitionList (see #107) as there are some cases where librdkafka can return a pointer to a TPL (see #subscription for example) and the application is now required to call destroy on that TPL. I'm assuming this could lead to the handle not being able to be destroyed.

I was able to get the tests all passing locally with the change.
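
To illustrate the ownership change that #107 implies for callers, a hedged sketch; fetch_subscription_tpl is a hypothetical helper standing in for wherever librdkafka hands back a raw TPL pointer, and the destroy binding is assumed to be attached as in librdkafka's C API:

# With AutoPointer removed from TopicPartitionList, a TPL pointer returned by
# librdkafka has to be freed explicitly once we're done with it.
tpl_ptr = fetch_subscription_tpl(consumer) # hypothetical helper returning a raw TPL pointer
begin
  # ... copy topics/partitions out of tpl_ptr into plain Ruby objects ...
ensure
  Rdkafka::Bindings.rd_kafka_topic_partition_list_destroy(tpl_ptr)
end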

gaffneyc (Contributor) commented Jan 24, 2020

All passing except for one segfault where the Consumer or Producer's @native_kafka was a pointer to NULL and it tried to call rd_kafka_destroy on NULL.
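
One illustrative way to guard against that (a sketch only, not the fix that was eventually merged): skip the destroy when the stored pointer is already NULL.

# Illustrative guard: only call rd_kafka_destroy on a live, non-NULL handle.
def close
  return if @native_kafka.nil? || @native_kafka.null?
  Rdkafka::Bindings.rd_kafka_destroy(@native_kafka)
  @native_kafka = nil
end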

Adithya-copart (Contributor) commented Jan 29, 2020

@gaffneyc I had a similar segfault in a forked process when the AutoPointer releaser calls rd_kafka_destroy after the @native_kafka object is GC'd.

I think I made some progress in fixing the build failure by switching the docker images in #106. All the jobs failed with the same two spec failures.

Update:
I'm facing the same issue in #106 now. It happens before a fork call, which eliminates the chance of the object being GC'd in the new process and leading to a rd_kafka_destroy call.

I wonder if rd_kafka_consumer_close frees the object first and the AutoPointer finalizer is then called, leading to a segfault: https://github.com/appsignal/rdkafka-ruby/blob/d0d71f669daadbfeec056085b8abe065478be784/lib/rdkafka/consumer.rb#L24

All passing except for one segfault where the Consumer or Producer's @native_kafka was a pointer to NULL and it tried to call rd_kafka_destroy on NULL.

thijsc (Collaborator Author) commented Feb 18, 2020

Build was fixed in #108

thijsc closed this Feb 18, 2020
Adithya-copart pushed a commit to Adithya-copart/rdkafka-ruby that referenced this pull request Jun 1, 2020
thijsc deleted the fix_build branch February 28, 2021 12:32