Double granting of ticket after reconnection #68

Open
zuluman100 opened this issue Apr 24, 2018 · 6 comments

Comments

@zuluman100

We ran into a problem:

In chaka.txt:
Apr 19 13:12:24 Network failure
Apr 19 13:13:04 New election started on site2-db1 while disconnected
Apr 19 13:13:25 site2-db1 kernel: drbd dforce: role( Primary -> Secondary )

In journalctl-with-split-brain.txt:
Apr 19 13:13:38 Ticket granted to site1-db1
Apr 19 13:13:38 site1-db1 kernel: drbd dforce: role( Secondary -> Primary )

In chaka.txt:
Apr 19 13:19:55 site2-db1 boothd-site[1487]: [info] drbdticket (Lead/20/59999): granted successfully here
Apr 19 13:19:55 site2-db1 kernel: drbd dforce: role( Secondary -> Primary )

Both logs then show split-brain problems, because both nodes believe they are Primary.
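
A minimal sketch of how the dual-Primary window could be confirmed from the role-change lines quoted above. The regular expression, the function names, and the year 2018 (syslog timestamps carry no year) are assumptions made for illustration; none of this comes from booth or DRBD tooling.

```python
import re
from datetime import datetime

# DRBD role-change lines exactly as quoted in this report (both log files).
LOG_LINES = [
    "Apr 19 13:13:25 site2-db1 kernel: drbd dforce: role( Primary -> Secondary )",
    "Apr 19 13:13:38 site1-db1 kernel: drbd dforce: role( Secondary -> Primary )",
    "Apr 19 13:19:55 site2-db1 kernel: drbd dforce: role( Secondary -> Primary )",
]

# Hypothetical pattern, written only to match the line shape shown above.
ROLE_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d\d:\d\d:\d\d) (?P<host>\S+) kernel: drbd \S+: "
    r"role\( (?P<old>\w+) -> (?P<new>\w+) \)"
)

# Assumption: syslog timestamps omit the year; 2018 is taken from the issue date.
ASSUMED_YEAR = 2018


def parse_role_changes(lines):
    """Return (timestamp, host, old_role, new_role) tuples sorted by time."""
    events = []
    for line in lines:
        m = ROLE_RE.match(line)
        if m:
            ts = datetime.strptime(
                f"{ASSUMED_YEAR} {m['ts']}", "%Y %b %d %H:%M:%S"
            )
            events.append((ts, m["host"], m["old"], m["new"]))
    return sorted(events)


def find_dual_primary(events):
    """Replay role changes and record every moment with more than one Primary."""
    roles = {}
    conflicts = []
    for ts, host, _old, new in events:
        roles[host] = new
        primaries = sorted(h for h, r in roles.items() if r == "Primary")
        if len(primaries) > 1:
            conflicts.append((ts, primaries))
    return conflicts


conflicts = find_dual_primary(parse_role_changes(LOG_LINES))
for ts, hosts in conflicts:
    print(ts.strftime("%H:%M:%S"), "dual primary:", ", ".join(hosts))
```

Run against the three quoted lines, this reports a single conflict at 13:19:55, the moment site2-db1 is promoted while site1-db1 is still Primary. To use it on the full attachments, the merged lines of both log files would be passed to `parse_role_changes` instead of `LOG_LINES`.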

The nodes run in CentOS 7 VMs. The network was disconnected by unplugging the network cable on site2-db1 and plugging it back in later.

We're wondering why site2-db1 was able to re-acquire the ticket after the connection was restored.

We've had difficulty reproducing the problem. These are logs and conf files from the reproduction. Please let us know if you need anything else.

chaka.txt
journalctl-with-split-brain.txt

booth.conf.txt
five-server-poc-setup.txt

@dmuhamedagic

A cron job runs "booth grant" once a minute, which makes the logs rather difficult to follow. booth (the client) is really meant to be used by the administrator, not run automatically in a loop.
The logs are too large and there is no log from the arbitrator. Could you please use hb_report to capture the logs and configuration around the time the problem occurred?

@dmuhamedagic

If genuine, this is a serious bug. Using hb_report is not
difficult and helps tremendously with log analysis. Please let
us know if you need help with providing the report.

@zuluman100

zuluman100 commented May 1, 2018 via email

@jnpkrn

jnpkrn commented Jun 25, 2018

Agreed about the strange cron-initiated grants and their possible interference.

Was the purpose to degrade safely to a site-only mode of operation for
the ticket-guarded resources in the split-sites scenario (assuming the
resources are capable of such a mode; I am not very familiar with
DRBD)? Perhaps it would be wiser if we came up with something
directly in the main booth logic?

I suspect what happens here is that the polling scheme allows for
slight context intermixing: the Raft state transitions are not
atomic but phased over response-reply handling, which can be
interrupted (here unexpectedly) and mixed up with the external
ticket handling requests. I have no proof of that, though.

But forbidding users to handle tickets manually altogether is like
offering an autonomous vehicle that just picks a destination
at random :-/

@dmuhamedagic

dmuhamedagic commented Jun 27, 2018 via email

@dmuhamedagic

Did you ever manage to reproduce the issue?

3 participants