Bugfix/112 lost messages #113

Merged · 1 commit · Oct 16, 2018

Conversation

@emasab commented Sep 10, 2018

Unit test and fix for issue #112.

@emasab (Author) commented Sep 24, 2018

Hi, did you have time to take a look at it?

Thanks

@ppatierno (Member) commented Sep 24, 2018

Thanks, I will do!

@vietj (Contributor) commented Sep 24, 2018

@ppatierno we do have a 3.5.4 release soon; we can also backport this if you review it.

@ppatierno (Member) commented:

@vietj I have planned to review this one today ;)

@vietj (Contributor) commented Sep 24, 2018

if you can backport it to the 3.5 branch and add it here, it would be great: vert-x3/issues#401

@ppatierno (Member) commented:

@vietj I am thinking about this PR and the problem that @emasab is facing. He noticed that:

this.current = records.iterator()

is called even when we are still getting messages from current.
The handler where the above line lives is called from pollRecords with:

this.context.runOnContext(v -> handler.handle(records));

So that runs on the main event loop, right? How is it possible to have two such calls running concurrently?

@ppatierno (Member) commented:

Finally, another question for @emasab: do we really need the submitted variable?

@emasab (Author) commented Sep 25, 2018

@ppatierno about the submitted variable: I added it because the polling flag has to be set back to false on the ExecutorService thread only if no task was submitted to the Vert.x thread. Otherwise polling has to be set to false exactly when that task is executed by Vert.x, in the same flow of execution that overwrites the iterator.
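
Roughly, the intent is the following (a minimal sketch with simplified names, not the exact PR code):

private void pollRecords(Handler<ConsumerRecords<K, V>> handler) {
  // Runs the blocking poll on the ExecutorService (worker) thread.
  this.worker.submit(() -> {
    boolean submitted = false;
    try {
      ConsumerRecords<K, V> records = this.consumer.poll(pollTimeout);
      if (records != null && records.count() > 0) {
        submitted = true;                       // a task will reach the Vert.x thread
        this.context.runOnContext(v -> {
          handler.handle(records);              // the handler overwrites this.current = records.iterator()
          this.polling.set(false);              // released in the same flow of execution
        });
      }
    } finally {
      if (!submitted) this.polling.set(false);  // no task submitted: release here, on the worker thread
    }
  });
}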

About the concurrency problem, the flow of execution seems to be more or less this:

  • iterator is empty, schedule is called at schedule-0

  • task1 is submitted to vertx at schedule-1

  • consumer is paused and resumed, schedule is called at resume

  • task2 is submitted to vertx at schedule-1

  • task1 enters pollRecords submits kafka1 at worker

  • task2 enters pollRecords submits kafka2 at worker

From here on the bug is no longer avoidable:

  • kafka1 polls some records records1.count()>0 at poll-1

  • kafka1 submits task3 to vertx at poll-2

  • kafka2 polls some records records2.count()>0 at poll-1

  • kafka2 sends task4 to vertx at poll-2

  • task3 writes current = records1 at current

  • task4 overwrites current = records2 at current

The names correspond to these pieces of code (a condensed sketch that stitches them together follows the fragments):

schedule-0

 // Don't call pollRecords directly, but use schedule() to actually pause when the readStream is paused
schedule(0);

schedule-1

Handler<ConsumerRecord<K, V>> handler = this.recordHandler;
this.context.runOnContext(v1 -> {

resume

if (this.paused.compareAndSet(true, false)) {
   this.schedule(0);
}

pollRecords

if (this.current == null || !this.current.hasNext()) {
  this.pollRecords(records -> {

worker

this.worker.submit(() -> {

poll-1

ConsumerRecords<K, V> records = this.consumer.poll(pollTimeout);

poll-2

this.context.runOnContext(v -> handler.handle(records));

current

this.current = records.iterator();
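
Stitching these fragments together, the pre-fix flow looks roughly like this (condensed and simplified, not the actual source):

private void schedule(long delay) {                                   // delay/pause handling omitted
  Handler<ConsumerRecord<K, V>> handler = this.recordHandler;         // schedule-1
  this.context.runOnContext(v1 -> {                                   // task on the event loop
    if (this.current == null || !this.current.hasNext()) {            // pollRecords guard
      this.pollRecords(records -> {
        this.current = records.iterator();                            // current: each completed poll
        schedule(0);                                                  //   overwrites the iterator
      });
    } else {
      handler.handle(this.current.next());
      schedule(0);
    }
  });
}

private void pollRecords(Handler<ConsumerRecords<K, V>> handler) {
  this.worker.submit(() -> {                                          // worker
    ConsumerRecords<K, V> records = this.consumer.poll(pollTimeout);  // poll-1
    if (records != null && records.count() > 0) {
      this.context.runOnContext(v -> handler.handle(records));        // poll-2
    }
  });
}

Two event-loop tasks created at schedule-1 (one from the empty-iterator path, one from resume) can both pass the guard before either poll completes, so two worker polls run and the second assignment to current discards the records produced by the first; that is how messages get lost.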

@vietj (Contributor) commented Sep 25, 2018

Could you rebase this branch? I pushed a late commit on master to support the ReadStream#fetch method that is new in 3.6.

@emasab (Author) commented Sep 25, 2018

almost ready, but could you push the fetch method of the interfaces? It gives me a compile error

@vietj (Contributor) commented Sep 26, 2018

the method is in vertx core master

@emasab (Author) commented Sep 26, 2018

ah, ok. It's in ReadStream

@emasab force-pushed the bugfix/112-lost-messages branch 2 times, most recently from c046506 to 47ee65a on September 27, 2018 07:05
@emasab (Author) commented Sep 27, 2018

now it's ready to be merged

Review comment from a Contributor on this line of the diff:

if(!this.polling.compareAndSet(false, true)){

I think that we could make this method always run on the Event Loop and remove the Atomic here with:

private void pollRecords(Handler<ConsumerRecords<K, V>> handler) {
  if (Vertx.currentContext() == this.context) {
    pollRecordsImpl(handler);
  } else {
    vertx.runOnContext(v -> pollRecordsImpl(handler));
  }
}

private void pollRecordsImpl(Handler<ConsumerRecords<K, V>> handler) {
  ...
}

This way we avoid the CAS, and most of the time this will be called from the event loop anyway.

@emasab (Author) commented Sep 28, 2018

It wouldn't be the same, because there we are already inside the Vert.x thread. The AtomicBoolean is useful because the variable is accessed by the Vert.x thread here and by the ExecutorService thread at

if(!submitted) this.polling.set(false);

What the code does is avoid submitting to the worker now and try again once the iterator has been overwritten and the if at pollRecords returns false. This could cause a semi-loop (with schedule(0) it causes a real loop) until the iterator is overwritten, so there may be a cleaner solution.
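
For reference, a simplified sketch of how the guard from the diff line above and the schedule(1) retry fit together (not the exact merged code):

if (!this.polling.compareAndSet(false, true)) {
  // A poll is already in flight: don't submit another worker task now.
  // Retry shortly; by then the event-loop task spawned by that poll will have
  // overwritten this.current and set polling back to false (see the sketch above).
  this.schedule(1);
} else {
  this.pollRecords(records -> {
    this.current = records.iterator();   // on the Vert.x thread
    this.schedule(0);
  });
}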

@emasab (Author) commented Sep 28, 2018

Another thing: I've found that if you remove the schedule(1) at line 142, most of the time it works, but sometimes a partition hangs and the unit test times out. I've seen this behavior with a real Kafka broker too: at some point the partitions hang one after another until the consumer blocks completely.

@vietj (Contributor) commented Sep 29, 2018

@emasab in this case, if that's only read by the executor service, it only needs to be volatile; or at least we can avoid the compareAndSet if it's always written from the same thread and just use a set, which is cheaper.

@emasab (Author) commented Sep 29, 2018

Yes, that's why in the beginning I didn't use an AtomicBoolean; the alternative is to do

if (!this.polling) {
  this.polling = true;
  ...
} else {
  schedule(1);
}

but there is only one thread, the Vert.x thread, executing this piece of code, so it cannot happen that two threads enter the if before polling is set to true. If we use an AtomicBoolean it's better to use compareAndSet; otherwise we could use a simple boolean like in this example and make it volatile.
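
A minimal sketch of that alternative, assuming the flag is only ever written from the Vert.x thread (hypothetical names, not the merged code):

// Single writer: only the Vert.x event-loop thread writes the flag, so a plain
// volatile boolean is enough and compareAndSet is not needed.
private volatile boolean polling = false;

private void runOnEventLoop(Handler<ConsumerRecord<K, V>> handler) {
  if (!this.polling) {
    this.polling = true;
    this.pollRecords(records -> {
      // Also on the Vert.x thread: overwrite the iterator and release the flag
      // in the same flow of execution.
      this.current = records.iterator();
      this.polling = false;
      this.schedule(0);
    });
  } else {
    this.schedule(1);    // a poll is already pending, retry later
  }
}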

@emasab (Author) commented Oct 1, 2018

I've made a new change; with it the loop or semi-loop (with schedule(1)) should be completely avoided while the bug stays fixed.

Now the polling boolean is used exclusively by the Vert.x thread, so I could replace it with a simple boolean if you agree.

@emasab (Author) commented Oct 3, 2018

news?

@ppatierno (Member) commented:

@vietj is it fine with you if I merge this one? @emasab do you need to fix anything more?

@vietj (Contributor) commented Oct 16, 2018

I think it's fine but keep an eye on the CI.

I had to fix the tests because they were failing in Travis, and to make sure that the tests pass more consistently.

@ppatierno merged commit 00655e9 into vert-x3:master on Oct 16, 2018
@emasab (Author) commented Oct 16, 2018

Nothing more to fix, just this change:

Now the polling AtomicBoolean is used exclusively by the Vert.x thread, so I could replace it with a simple boolean if you agree

but it doesn't change much; do you want to make it?
