Double granting of ticket after reconnection #68

zuluman100 · 2018-04-24T20:26:16Z

We ran into a problem:

In chaka.txt:
Apr 19 13:12:24 Network failure
Apr 19 13:13:04 New election started on site2-db1 while disconnected
Apr 19 13:13:25 site2-db1 kernel: drbd dforce: role( Primary -> Secondary )

In journalctl-with-split-brain.txt:
Apr 19 13:13:38 Ticket granted to site1-db1
Apr 19 13:13:38 site1-db1 kernel: drbd dforce: role( Secondary -> Primary )

In chaka.txt:
Apr 19 13:19:55 site2-db1 boothd-site[1487]: [info] drbdticket (Lead/20/59999): granted successfully here
Apr 19 13:19:55 site2-db1 kernel: drbd dforce: role( Secondary -> Primary )

Both logs then show split brain problems, because both believe they are primary.
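The dual-Primary window in the excerpts above can be checked mechanically. This is a minimal sketch, assuming the lines from both files are merged in time order and that site2-db1 held Primary before the outage (implied by its Primary -> Secondary demotion at 13:13:25). `find_dual_primary` is a hypothetical helper written for this illustration, not part of booth or DRBD.

```python
import re

# Matches DRBD role-change lines in the format quoted in this report.
ROLE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel: drbd \S+: "
    r"role\( (?P<old>\w+) -> (?P<new>\w+) \)"
)


def find_dual_primary(lines, initial_primaries=()):
    """Return (timestamp, sorted hosts) each time more than one host is Primary.

    Assumes `lines` are already sorted by time; `initial_primaries` is the
    assumed set of Primary hosts before the first line.
    """
    primaries = set(initial_primaries)
    conflicts = []
    for line in lines:
        m = ROLE_RE.match(line)
        if not m:
            continue  # not a role-change line
        if m["new"] == "Primary":
            primaries.add(m["host"])
        else:
            primaries.discard(m["host"])
        if len(primaries) > 1:
            conflicts.append((m["ts"], sorted(primaries)))
    return conflicts


# Role-change lines copied from the excerpts above, merged by timestamp.
log = [
    "Apr 19 13:13:25 site2-db1 kernel: drbd dforce: role( Primary -> Secondary )",
    "Apr 19 13:13:38 site1-db1 kernel: drbd dforce: role( Secondary -> Primary )",
    "Apr 19 13:19:55 site2-db1 kernel: drbd dforce: role( Secondary -> Primary )",
]
conflicts = find_dual_primary(log, initial_primaries={"site2-db1"})
print(conflicts)
```

On these three lines the only conflict reported is at 13:19:55, when site2-db1 is promoted while site1-db1 is still Primary.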

These are run in CentOS 7 VM boxes. The network was disconnected by unplugging the network cable on site2-db1, then plugging it back in later.

We're wondering why site2-db1 was able to re-acquire the ticket after the connection was restored.

We've had difficulty reproducing the problem. These are logs and conf files from the reproduction. Please let us know if you need anything else.

chaka.txt
journalctl-with-split-brain.txt

booth.conf.txt
five-server-poc-setup.txt

dmuhamedagic · 2018-04-25T12:13:13Z

There's a "booth grant" run by cron once a minute, which makes it rather difficult to follow the logs. booth (the client) is really supposed to be used by the administrator and not run automatically in a loop.
The logs are too large and there is no log from the arbitrator. Could you please use hb_report to capture the logs and configuration around the time the problem occurred?

dmuhamedagic · 2018-05-01T06:47:09Z

If genuine, this is a serious bug. Using hb_report is not
difficult and helps tremendously with log analysis. Please let
us know if you need help with providing the report.

zuluman100 · 2018-05-01T15:36:39Z

Hi Dejan,

Our challenge has been in reproducing the problem. We did see the issue twice, but have not been able to recreate it. If we do, I will let you know immediately.

Thank you,
Chaka Allen


jnpkrn · 2018-06-25T17:12:36Z

True about the strange cron initiated possible interferences.

Was the purpose to degrade safely to site-only mode of operation for
the ticket-guarded resources (therefore assuming these are capable of
such a mode, I am not very familiar with DRBD) in the split sites
scenario? Perhaps it would be wiser if we came up with something
directly in the main booth logic?

I suspect what happens here is that the polling scheme allows for
a slight context intermixing as the Raft state transitions are not
atomic but phased over response-reply handling that can be, here
unexpectedly, interrupted and mangled with the external ticket
handling requests. I have no proof of that, though.
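To make the suspicion above concrete, here is a toy model of the general class of race: a two-phase transition (send request, later handle reply) that an external request interleaves with. It is only a sketch under stated assumptions. `ToySite`, its fields, and the `checked` flag are all hypothetical and do not reflect booth's actual state machine or code.

```python
class ToySite:
    """Toy model only: NOT booth's actual state machine."""

    def __init__(self, name):
        self.name = name
        self.term = 0
        self.pending_term = None  # term of the election we are waiting on
        self.leader = False

    def start_election(self):
        # Phase 1: bump the term and "send" vote requests (sending not modelled).
        self.term += 1
        self.pending_term = self.term
        return self.term

    def external_grant(self):
        # Hypothetical external request (e.g. a periodic client call) that
        # restarts the election while an older one is still in flight.
        return self.start_election()

    def on_votes(self, reply_term, checked):
        # Phase 2: handle a winning reply. The unchecked variant accepts any
        # reply while an election is pending; the checked variant requires the
        # reply to belong to the election currently pending.
        if self.pending_term is None:
            return
        if checked and reply_term != self.pending_term:
            return  # stale reply from a superseded election
        self.leader = True
        self.pending_term = None


def run(checked):
    site = ToySite("site2")
    old_term = site.start_election()  # phase 1 of election A
    site.external_grant()             # external request interleaves here
    site.on_votes(old_term, checked)  # stale reply for election A arrives
    return site.leader


print(run(checked=False))  # True: stale reply accepted
print(run(checked=True))   # False: stale reply rejected
```

The point is only that once a transition is split across request and reply handling, whatever arrives in between must not let an outdated reply complete a newer transition. Whether anything like this happens in booth is unproven.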

But forbidding users to handle tickets manually altogether is like
offering an autonomous vehicle that just picks a destination
at random :-/

dmuhamedagic · 2018-06-27T08:16:05Z

On Mon, Jun 25, 2018 at 10:12:37AM -0700, Jan Pokorný wrote:

> True about the strange cron initiated interferences.
>
> Was the purpose to degrade safely to site-only mode of operation for the ticket-guarded resources (therefore assuming these are capable of such a mode, I am not very familiar with DRBD) in the split sites scenario? Perhaps it would be wiser if we came up with something directly in the main booth logic?
>
> I suspect what happens here is that the polling scheme allows for a slight context intermixing as the Raft state transitions are not atomic but phased over response-reply handling that can be, here unexpectedly, interrupted and _mangled_ with the external ticket handling requests. I have no proof of that, though.
>
> But forbidding users to handle tickets manually altogether is like offering an autonomous vehicle that just picks a destination at random :-/

Who said that? However, it is arguably poor practice to have the cron job automatically manage tickets, in particular in this manner. If anything, it is going to confuse every human being trying to look into the matter. It reminds me of another installation with a cron job which would start a cluster resource once every minute.

Granting tickets to more than one site is _never_ to occur, regardless of manual requests or whatever else happens, fair use or not.

The problem here is that there is not enough information and the cron job makes it rather hard to follow the logs. I did try to read the logs, but eventually gave up. If somebody has more time and more stamina, please go ahead ;-)

dmuhamedagic · 2021-03-21T11:39:31Z

Did you ever manage to reproduce the issue?

dmuhamedagic added Bug Unconfirmed labels Mar 21, 2021
