
Support JRuby and fix travis failures #108

Merged · 15 commits · Feb 18, 2020

Conversation

@Adithya-copart (Contributor) commented Feb 4, 2020

Re-opening #106 here to eliminate the noise from the numerous debugging commits.
I find it easier to trigger Travis builds than to run the specs locally.

The changes in this PR also include @gaffneyc's changes from #107:

  1. Work around the lack of read_int64, read(:pointer), and read_string_to_null in JRuby's FFI (see the sketch after this list).
  2. Avoid FFI::AutoPointer objects inside Rdkafka::Consumer and explicitly call rd_kafka_destroy during close.
  3. Avoid calls to rd_kafka_consumer_poll after the consumer is closed.
  4. Use RbConfig::CONFIG to check the host OS, which is JRuby friendly.
  5. Switch to the confluentinc/cp-kafka and confluentinc/cp-zookeeper images, which are licensed under the Apache 2.0 license.
  6. Add JRuby to the list of rubies in .travis.yml.
  7. Remove the usage of AutoPointer, as it leads to closing the socket connections when a child process created using Kernel#fork exits, which could cause segmentation faults.
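
A minimal sketch of the workaround style in item 1, with hypothetical helper names (not the PR's exact code). JRuby's FFI at the time lacked Pointer#read(:pointer) and Pointer#read_int64, while the older typed get_* accessors work on both implementations:

```ruby
require "ffi"

# Hypothetical compatibility helpers. JRuby's FFI (as of 9.2.x) lacked
# Pointer#read(:pointer) and Pointer#read_int64; the get_* accessors
# with an explicit offset are available on both CRuby and JRuby.
def read_int64_compat(ptr)
  ptr.respond_to?(:read_int64) ? ptr.read_int64 : ptr.get_int64(0)
end

def read_pointer_compat(ptr)
  ptr.respond_to?(:read_pointer) ? ptr.read_pointer : ptr.get_pointer(0)
end
```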

Findings:

The E, [2020-01-29T06:17:58.952506 #9416] ERROR -- : rdkafka: [thrd:GroupCoordinator]: 1/1 brokers are down line in the specs is harmless, as per confluentinc/confluent-kafka-dotnet#1129 (comment).

I'll gladly rebase this PR if you wish to merge #107 first.

Adithya Pentela and others added 3 commits February 3, 2020 19:22
Free ptr objects in ensure blocks and add a workaround for the lack of the read(:pointer), read_string_to_null, and read_int64 FFI methods in JRuby

Add JRuby 9.2.9.0 to travis

WIP: Debugging.
Modify error_spec and skip the spec using fork in JRuby.
Use RbConfig to check for macOS.

WIP: Free @native_kafka in the consumer/producer during close

Manually free the AutoPointer using ruby instead of a C call

Switch to Confluent docker images

ProducerSpec: Closing the IO in forked process

Avoiding AutoPointer instances being copied into forked process

Alter the consumer spec to test for the message produced

Remove consumer commit after assertions and increase the handle wait for JRuby, for reasons still unknown

Rely on GC to call the finalizer to close @native_kafka socket

Revert "Rely on GC to call the finalizer to close @native_kafka socket"

This reverts commit 38ea6d6.

Skip AutoPointer for consumer class

Remove rd_kafka_consumer_close as rd_kafka_destroy takes care of it for consumers

Adding rd_kafka_consumer_close and ensure it is only called once

Skip fork spec

Revert "Skip fork spec"

This reverts commit d4cfb58.

Remove autorelease and free pointers accessed using read_pointer
When converting a Rdkafka::TopicPartitionList to a native type, Rdkafka
was depending on AutoPointer to eventually free them. There are a couple
of cases where a client handle returns a pointer to the application that
the application is then required to free, which could be leading to test
instability.

This takes a pass at ensuring all uses of native TopicPartitionList
instances (even those returned) are deterministically freed.
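
A sketch of the deterministic-free pattern this commit describes. Here to_native_tpl is a hypothetical conversion helper, while rd_kafka_assign and rd_kafka_topic_partition_list_destroy are real librdkafka functions attached in Rdkafka::Bindings:

```ruby
# Convert the Ruby TopicPartitionList to a native list, use it, and free it
# deterministically in an ensure block instead of relying on AutoPointer/GC.
tpl_ptr = list.to_native_tpl # hypothetical helper returning an FFI::Pointer
begin
  Rdkafka::Bindings.rd_kafka_assign(native_kafka, tpl_ptr)
ensure
  Rdkafka::Bindings.rd_kafka_topic_partition_list_destroy(tpl_ptr)
end
```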

Ensure poll returns nil after a consumer is closed
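
A simplified sketch of that guard, assuming a @closed flag set by #close (hypothetical; the gem's real #poll also wraps the native message and handles errors):

```ruby
# Once the consumer is closed, short-circuit to nil instead of calling into
# the destroyed librdkafka handle.
def poll(timeout_ms)
  return nil if @closed
  message_ptr = Rdkafka::Bindings.rd_kafka_consumer_poll(@native_kafka, timeout_ms)
  message_ptr.null? ? nil : message_ptr
end
```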

Add the read_array_of_uint64 workaround for JRuby
…stroy will result in a segmentation fault but not #free
@Adithya-copart changed the title from "Jruby" to "Support JRuby and fix travis failures" on Feb 4, 2020
@gaffneyc (Contributor) commented Feb 4, 2020

Just as a note, it looks like JRuby is working on updating their version of FFI for the new APIs.

…4_t.

Use AutoPointer#free in producer instead of rd_kafka_destroy to close @native_kafka instances if consumer#close is not called.
@Adithya-copart (Contributor, Author) commented

Even though this build is green, there is a segmentation fault in the forked process.

The AutoPointer objects result in a call to rd_kafka_destroy after GC in both the parent and the forked process. The second call to rd_kafka_destroy results in a segmentation fault because the handle has already been freed.
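
A minimal sketch of that failure mode, with a hypothetical raw_handle (FFI::AutoPointer and its releaser argument are real FFI API):

```ruby
require "ffi"

# raw_handle: pointer returned by rd_kafka_new (assumed). The releaser
# registered before fork is inherited by the child, so GC in BOTH processes
# eventually calls rd_kafka_destroy on the same handle.
handle = FFI::AutoPointer.new(raw_handle, Rdkafka::Bindings.method(:rd_kafka_destroy))

fork { exit } # child exits -> its finalizers run -> first destroy
Process.wait
# parent GC later runs the same releaser again -> second destroy -> segfault
```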

@thijsc (Collaborator) commented Feb 5, 2020

Looking very good, thanks for your hard work on this! I guess we should get rid of the last usage of AutoPointer.

@Adithya-copart (Contributor, Author) commented

The debug log level shows this before the ERROR -- : rdkafka: [thrd:GroupCoordinator]: 1/1 brokers are down line:

D, [2020-02-05T10:33:49.874983 #97694] DEBUG -- : rdkafka: [thrd:main]: Group "ruby-test-178839": delete consume_test_topic [1]
D, [2020-02-05T10:33:49.874996 #97694] DEBUG -- : rdkafka: [thrd:main]: Group "ruby-test-178839": waiting for 1 toppar(s), 2 unassignment(s), 0 commit(s) (state up, join-state wait-unassign) before terminating
D, [2020-02-05T10:33:49.875009 #97694] DEBUG -- : rdkafka: [thrd:main]: Group "ruby-test-178839" received op REPLY:FETCH_STOP in state up (join state wait-unassign, v5) for consume_test_topic [1]
D, [2020-02-05T10:33:49.875022 #97694] DEBUG -- : rdkafka: [thrd:main]: Unassign not done yet (1 wait_unassign, 1 assigned, 0 wait commit, join state wait-unassign): FETCH_STOP done
D, [2020-02-05T10:33:49.875043 #97694] DEBUG -- : rdkafka: [thrd:main]: Group "ruby-test-178839" received op PARTITION_LEAVE in state up (join state wait-unassign, v5) for consume_test_topic [2]
D, [2020-02-05T10:33:49.875058 #97694] DEBUG -- : rdkafka: [thrd:main]: Group "ruby-test-178839": delete consume_test_topic [2]
D, [2020-02-05T10:33:49.875070 #97694] DEBUG -- : rdkafka: [thrd:main]: Group "ruby-test-178839": waiting for 0 toppar(s), 1 unassignment(s), 0 commit(s) (state up, join-state wait-unassign) before terminating
D, [2020-02-05T10:33:49.875083 #97694] DEBUG -- : rdkafka: [thrd:main]: Group "ruby-test-178839" received op REPLY:FETCH_STOP in state up (join state wait-unassign, v5) for consume_test_topic [2]
D, [2020-02-05T10:33:49.875095 #97694] DEBUG -- : rdkafka: [thrd:main]: Group "ruby-test-178839": unassign done in state up (join state wait-unassign): without new assignment: FETCH_STOP done
D, [2020-02-05T10:33:49.875108 #97694] DEBUG -- : rdkafka: [thrd:main]: Group "ruby-test-178839": leave (in state up)
D, [2020-02-05T10:33:49.875133 #97694] DEBUG -- : rdkafka: [thrd:main]: Group "ruby-test-178839" changed join state wait-unassign -> init (v5, state up)
D, [2020-02-05T10:33:49.875169 #97694] DEBUG -- : rdkafka: [thrd:main]: Group "ruby-test-178839": waiting for 0 toppar(s), 0 unassignment(s), 0 commit(s), wait-leave, (state up, join-state init) before terminating
D, [2020-02-05T10:33:49.875186 #97694] DEBUG -- : rdkafka: [thrd:main]: Group "ruby-test-178839": waiting for 0 toppar(s), 0 unassignment(s), 0 commit(s), wait-leave, (state up, join-state init) before terminating
D, [2020-02-05T10:33:49.878075 #97694] DEBUG -- : rdkafka: [thrd:main]: LeaveGroup response received in state up
D, [2020-02-05T10:33:49.878115 #97694] DEBUG -- : rdkafka: [thrd:main]: Group "ruby-test-178839" changed state up -> term (v5, join-state init)
D, [2020-02-05T10:33:49.878147 #97694] DEBUG -- : rdkafka: [thrd:main]: Group "ruby-test-178839" broker localhost:9092/1 is no longer coordinator
E, [2020-02-05T10:33:49.878195 #97694] ERROR -- : rdkafka: [thrd:GroupCoordinator]: 1/1 brokers are down

@thijsc Looks like this is similar to confluentinc/confluent-kafka-dotnet#1129, and the reason is this.

@Adithya-copart (Contributor, Author) commented

Looks like Travis encountered a segmentation fault while cleaning up after running the specs, even after getting rid of AutoPointer. I'm not sure which object is triggering this segmentation fault.

We might need debugging enabled in Travis, as the specs pass locally for me against Ruby 2.6.3.

The JRuby delivery callback spec failure is intermittent on Travis and always passes locally.
I will bump the timeout and see if it becomes consistent.

@Adithya-copart (Contributor, Author) commented Feb 16, 2020

@thijsc 6d7b472 ensures that all socket connections are closed, which fixed the Travis segmentation fault.

Twelve socket connections were open by the time Kernel#fork was called in the specs, so these connections were shared between the parent and forked process and could be responsible for the segmentation fault on Travis.

I modified the spec in 3e5331a to make sure the producer is instantiated in the child process. I think sharing sockets between processes should not be encouraged; it's easier to create a new producer in the forked process (see the sketch below).
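
A sketch of that pattern, with assumed broker and topic values (Rdkafka::Config#producer, Producer#produce, and the delivery handle's wait are the gem's public API):

```ruby
# Create the producer inside the child so no librdkafka socket created in
# the parent is shared across Kernel#fork.
pid = fork do
  config = Rdkafka::Config.new("bootstrap.servers" => "localhost:9092")
  producer = config.producer
  producer.produce(topic: "consume_test_topic", payload: "from the child").wait
  producer.close
end
Process.wait(pid)
```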

I changed the behavior of Consumer#close and Producer#close to make sure rd_kafka_destroy is only called once. This has the side effect of returning true when the socket is actually closed and nil on subsequent calls, as sketched below. Is it acceptable to change the return value?
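
A minimal sketch of the changed semantics, assuming a @closed flag (simplified; rd_kafka_consumer_close and rd_kafka_destroy are the real librdkafka calls attached in Rdkafka::Bindings):

```ruby
# First call leaves the group and destroys the handle, returning true;
# subsequent calls are no-ops returning nil.
def close
  return nil if @closed
  @closed = true
  Rdkafka::Bindings.rd_kafka_consumer_close(@native_kafka)
  Rdkafka::Bindings.rd_kafka_destroy(@native_kafka)
  true
end
```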

@thijsc (Collaborator) commented Feb 18, 2020

Thanks so much for your hard work on this!

@rearden-steel commented

Seems that you have not updated lib/rdkafka/version.rb to reflect the version change.

@Adithya-copart (Contributor, Author) commented

@rearden-steel I'm assuming that it'll be done by the maintainers when publishing a new release.
