Double granting of ticket after reconnection #68
There's a "booth grant" run by a cron job once a minute, which makes it rather difficult to follow the logs. booth (the client) is really supposed to be used by the administrator, not run automatically in a loop.
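The actual crontab entry is not shown in this thread, but the setup described (an unconditional grant attempted every minute) would look roughly like this. The ticket name `drbdticket` is taken from the logs later in the thread; the path and user are assumptions:

```shell
# Hypothetical reconstruction of the per-minute cron job described above.
# This is exactly the antipattern being criticized: booth grant is an
# administrative action, not something to fire unconditionally in a loop.
# /etc/cron.d/booth-grant (assumed location)
* * * * *  root  /usr/sbin/booth grant drbdticket
```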
If genuine, this is a serious bug. Using hb_report is not difficult and helps tremendously with log analysis. Please let us know if you need help with providing the report.
Hi Dejan,
Our challenge has been in reproducing the problem. We did see the issue
twice, but have not been able to recreate it. If we do, I will let you know
immediately.
Thank you,
Chaka Allen
…On Tue, May 1, 2018, 2:47 AM Dejan Muhamedagic ***@***.***> wrote:
If genuine, this is a serious bug. Using hb_report is not
difficult and helps tremendously with log analysis. Please let
us know if you need help with providing the report.
On Mon, Jun 25, 2018 at 10:12:37AM -0700, Jan Pokorný wrote:
True about the strange cron-initiated interferences.
Was the purpose to degrade safely to site-only mode of operation for
the ticket-guarded resources (therefore assuming these are capable of
such a mode, I am not very familiar with DRBD) in the split sites
scenario? Perhaps it would be wiser if we came up with something
directly in the main booth logic?
I suspect what happens here is that the polling scheme allows for
a slight context intermixing as the Raft state transitions are not
atomic but phased over response-reply handling that can be, here
unexpectedly, interrupted and _mangled_ with the external ticket
handling requests. I have no proof of that, though.
But forbidding users to handle tickets manually altogether is like
offering an autonomous vehicle that just picks a destination
at random :-/
Who said that? However, it is arguably poor practice to have the
cron job automatically manage tickets, in particular in this
manner. If anything, it is going to confuse every human being
trying to look into the matter. It reminds me of another
installation with a cron job which would start a cluster resource
once every minute.
Granting tickets to more than one site is _never_ to occur,
regardless of manual requests or whatever else happens, fair use
or not.
The problem here is that there is not enough information and the
cron job makes it rather hard to follow the logs. I did try to
read the logs, but eventually gave up. If somebody has more time
and more stamina, please go ahead ;-)
Did you ever manage to reproduce the issue?
We ran into the problem again. Here is the timeline from the logs:
In chaka.txt:
Apr 19 13:12:24 Network failure
Apr 19 13:13:04 New election started on site2-db1 while disconnected
Apr 19 13:13:25 site2-db1 kernel: drbd dforce: role( Primary -> Secondary )
In journalctl-with-split-brain.txt:
Apr 19 13:13:38 Ticket granted to site1-db1
Apr 19 13:13:38 site1-db1 kernel: drbd dforce: role( Secondary -> Primary )
In chaka.txt:
Apr 19 13:19:55 site2-db1 boothd-site[1487]: [info] drbdticket (Lead/20/59999): granted successfully here
Apr 19 13:19:55 site2-db1 kernel: drbd dforce: role( Secondary -> Primary )
Both logs then show split brain problems, because both believe they are primary.
These are run in CentOS 7 VM boxes. The network was disconnected by unplugging the network cable on site2-db1, then plugging it back in later.
We're wondering why site2-db1 was able to re-acquire the ticket after the connection was restored.
We've had difficulty reproducing the problem. These are logs and conf files from the reproduction. Please let us know if you need anything else.
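For anyone digging through the attached logs, here is a small sketch (hypothetical, not part of the reproduction setup) that scans the DRBD role-transition lines in the format quoted above and flags the moment a second node goes Primary while another one still is, i.e. the split-brain symptom:

```python
# Hypothetical helper: scan merged syslog lines from both sites for DRBD
# role transitions and flag points where more than one node is Primary.
# The hostnames and message format are assumed from the excerpts above.
import re
from datetime import datetime

ROLE_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) kernel: "
    r"drbd \S+: role\( \S+ -> (?P<role>\w+) \)"
)

def find_dual_primary(lines, year=2018):
    """Return (timestamp, host) pairs where a second node became Primary."""
    primaries = set()   # hosts currently believed to be Primary
    conflicts = []
    for line in lines:
        m = ROLE_RE.match(line)
        if not m:
            continue
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        if m["role"] == "Primary":
            if primaries and m["host"] not in primaries:
                conflicts.append((ts, m["host"]))
            primaries.add(m["host"])
        else:
            primaries.discard(m["host"])
    return conflicts

# The three role transitions from the timeline above:
logs = [
    "Apr 19 13:13:25 site2-db1 kernel: drbd dforce: role( Primary -> Secondary )",
    "Apr 19 13:13:38 site1-db1 kernel: drbd dforce: role( Secondary -> Primary )",
    "Apr 19 13:19:55 site2-db1 kernel: drbd dforce: role( Secondary -> Primary )",
]
print(find_dual_primary(logs))
# → flags site2-db1 going Primary at 13:19:55 while site1-db1 is still Primary
```

This only confirms the symptom from the kernel messages; the booth-side question (why the ticket was re-granted) still needs the boothd logs.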
chaka.txt
journalctl-with-split-brain.txt
booth.conf.txt
five-server-poc-setup.txt