Classic queue v2 can crash when used with mirroring #6176

Status: Closed · mkuratczyk opened this issue Oct 19, 2022 · 6 comments

mkuratczyk (Contributor) commented Oct 19, 2022

In 3.11, when classic queues v2 are used with mirroring, they can crash like this:

** Reason for termination ==
** {{badmatch,-1},
    [{rabbit_mirror_queue_slave,update_delta,2,
                                [{file,"rabbit_mirror_queue_slave.erl"},
                                 {line,1081}]},
     {rabbit_mirror_queue_slave,process_instruction,2,
                                [{file,"rabbit_mirror_queue_slave.erl"},
                                 {line,1009}]},
     {rabbit_mirror_queue_slave,handle_cast,2,
                                [{file,"rabbit_mirror_queue_slave.erl"},
                                 {line,299}]},
     {gen_server2,handle_msg,2,[{file,"gen_server2.erl"},{line,1067}]},
     {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}

  crasher:
    initial call: rabbit_prequeue:init/1
    pid: <0.12215.0>
    registered_name: []
    exception exit: {{badmatch,-1},
                     [{rabbit_mirror_queue_slave,update_delta,2,
                          [{file,"rabbit_mirror_queue_slave.erl"},
                           {line,1081}]},
                      {rabbit_mirror_queue_slave,process_instruction,2,
                          [{file,"rabbit_mirror_queue_slave.erl"},
                           {line,1009}]},
                      {rabbit_mirror_queue_slave,handle_cast,2,
                          [{file,"rabbit_mirror_queue_slave.erl"},{line,299}]},
                      {gen_server2,handle_msg,2,
                          [{file,"gen_server2.erl"},{line,1067}]},
                      {proc_lib,init_p_do_apply,3,
                          [{file,"proc_lib.erl"},{line,240}]}]}
      in function  gen_server2:terminate/3 (gen_server2.erl, line 1183)
    ancestors: [<0.12062.0>,<0.1605.0>,<0.1588.0>,<0.1587.0>,
                  rabbit_vhost_sup_sup,rabbit_sup,<0.1313.0>]
    message_queue_len: 0
    messages: []
    links: [<0.12062.0>,<0.12216.0>]
    dictionary: [{credit_blocked,[]},
                  {process_name,
                      {rabbit_mirror_queue_slave,
                          {resource,<<"/">>,queue,
                              <<"cmq-publish_then_consume">>}}},
                  {guid,{{548389537,174457376,3517217523,4084119289},1}},
                  {explicit_gc_run_operation_threshold,1000},
                  {rand_seed,
                      {#{jump => #Fun<rand.3.34006561>,
                         max => 288230376151711743,
                         next => #Fun<rand.5.34006561>,type => exsplus},
                       [247167938309581976|185153570766432085]}},
                  {credit_flow_default_credit,{400,200}}]
    trap_exit: true
    status: running
    heap_size: 18481566
    stack_size: 35
    reductions: 56778258
  neighbours:
    neighbour:
      pid: <0.12216.0>
      registered_name: []
      initial call: gm:init/1
      current_function: {gen_server2,process_next_msg,1}
      ancestors: [<0.12215.0>,<0.12062.0>,<0.1605.0>,<0.1588.0>,<0.1587.0>,
                  rabbit_vhost_sup_sup,rabbit_sup,<0.1313.0>]
      message_queue_len: 0
      links: [<0.12215.0>]
      trap_exit: false
      status: waiting
      heap_size: 610
      stack_size: 11
      reductions: 115417
      current_stacktrace: [{gen_server2,process_next_msg,1,
                               [{file,"gen_server2.erl"},{line,673}]},
                  {proc_lib,init_p_do_apply,3,
                            [{file,"proc_lib.erl"},{line,240}]}]
    supervisor: {<0.12062.0>,rabbit_amqqueue_sup}
    errorContext: child_terminated
    reason: {{badmatch,-1},
             [{rabbit_mirror_queue_slave,update_delta,2,
                  [{file,"rabbit_mirror_queue_slave.erl"},{line,1081}]},
              {rabbit_mirror_queue_slave,process_instruction,2,
                  [{file,"rabbit_mirror_queue_slave.erl"},{line,1009}]},
              {rabbit_mirror_queue_slave,handle_cast,2,
                  [{file,"rabbit_mirror_queue_slave.erl"},{line,299}]},
              {gen_server2,handle_msg,2,
                  [{file,"gen_server2.erl"},{line,1067}]},
              {proc_lib,init_p_do_apply,3,
                  [{file,"proc_lib.erl"},{line,240}]}]}
    offender: [{pid,<0.12215.0>},
               {id,rabbit_amqqueue},
               {mfargs,
                   {rabbit_prequeue,start_link,
                       [{amqqueue,
                            {resource,<<"/">>,queue,
                                <<"cmq-publish_then_consume">>},
                            true,false,none,
                            [{<<"x-max-length">>,long,100000},
                             {<<"x-queue-mode">>,longstr,<<"lazy">>},
                             {<<"x-queue-version">>,long,2}],
                            <48209.8417.0>,[],[],[],
                            [{vhost,<<"/">>},
                             {name,<<"ha">>},
                             {pattern,<<"cmq.*">>},
                             {'apply-to',<<"queues">>},
                             {definition,
                                 [{<<"ha-mode">>,<<"exactly">>},
                                  {<<"ha-params">>,3},
                                  {<<"ha-sync-mode">>,<<"automatic">>}]},
                             {priority,1}],
                            undefined,
                            [{<48209.8420.0>,<48209.8417.0>}],
                            [],live,0,[],<<"/">>,
                            #{user => <<"guest">>},
                            rabbit_classic_queue,#{}},
                        slave,<0.12061.0>]}},
               {restart_type,intrinsic},
               {shutdown,600000},
               {child_type,worker}]

Steps to reproduce:

  1. git checkout v3.11.x
  2. bazel run start-cluster
  3. rabbitmqctl -n rabbit-0 set_policy --apply-to queues --priority 1 ha "cmq.*" '{"ha-mode":"exactly", "ha-params": 3, "ha-sync-mode": "automatic"}'
  4. Publish messages to a queue and then consume them (an equivalent client-side queue declaration is sketched after these steps):
     perf-test -x 1 -y 0 -c 3000 -u cmq-publish_then_consume -qa x-max-length=100000,x-queue-version=2,x-queue-mode=lazy -C 100000 -ad false -f persistent
     perf-test -x 0 -y 1 -u cmq-publish_then_consume -qa x-max-length=100000,x-queue-version=2,x-queue-mode=lazy -ad false -f persistent
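
For readers not using perf-test: the -qa flags above set optional queue arguments at declaration time. A minimal sketch of the equivalent declaration with the RabbitMQ Erlang AMQP client (amqp_client; the default connection parameters here are an assumption, not part of the report):

    %% Needs -include_lib("amqp_client/include/amqp_client.hrl") for the records.
    {ok, Conn} = amqp_connection:start(#amqp_params_network{}),
    {ok, Ch}   = amqp_connection:open_channel(Conn),
    %% Declare the durable classic queue with the same arguments perf-test sets:
    #'queue.declare_ok'{} =
        amqp_channel:call(Ch, #'queue.declare'{
            queue     = <<"cmq-publish_then_consume">>,
            durable   = true,
            arguments = [{<<"x-max-length">>,    long,    100000},
                         {<<"x-queue-version">>, long,    2},
                         {<<"x-queue-mode">>,    longstr, <<"lazy">>}]}).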

The issue doesn't seem to occur with CQv1. With v2, it occurs with both lazy and non-lazy queues, though it seems easier to reproduce with lazy. In both cases, given enough messages in the queue (the 100k above is generally not sufficient; raise it to 5M or so), the crash leads to a memory spike. Especially with non-lazy CQv2, that spike often triggers a memory alarm or even an OOM kill. It looks like this:
[chart: memory usage spike following the queue crash]

I was not able to reproduce using main.

michaelklishin (Member) commented:

main uses a new version of CQv2, so this makes sense.

mkuratczyk (Contributor, Author) commented:

It seems like there is a secret ingredient to this issue: the very latest perf-test. I originally hit this on Kubernetes, where I use the pivotalrabbitmq/perf-test:dev-2.19 image built from main. Locally I also used make run ARGS.... However, I can't reproduce it using a perf-test JAR from two months ago. We are looking into this with Arnaud. A likely cause is this:
rabbitmq/rabbitmq-java-client#814

michaelklishin (Member) commented:

Earlier PerfTest versions were not "concurrent enough" to matter :P

mkuratczyk (Contributor, Author) commented:

My bad, the perf-test version doesn't matter. I was just relying on specifying --queue-args multiple times, which is a new perf-test feature (the arguments now get merged; in the past, only the first occurrence was considered).
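
To illustrate the difference (the arguments are just examples; the flag behaviour is as described above):

    # recent perf-test: repeated -qa flags are merged, so the queue gets both arguments
    perf-test -qa x-max-length=100000 -qa x-queue-version=2 ...
    # older perf-test: only the first --queue-args/-qa occurrence took effect,
    # so x-queue-version=2 was silently ignored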

mkuratczyk (Contributor, Author) commented:

v3.10.x is affected as well.

lhoguin (Contributor) commented Oct 20, 2022

This is caused by v2 not storing the MsgId value when the old message store isn't used. When acking, the SeqId is used against the master, but the MsgId is used against the mirrors. I'm not sure why that is; surely the slaves ought to have messages in the same order as the master... It appears to have been done that way since the very beginning.
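
For context, a simplified sketch of the assertion that fails here, modelled on rabbit_mirror_queue_slave:update_delta/2 (not the exact source):

    %% The mirror tracks depth_delta: how far its queue depth lags the master's.
    %% Once the delta has reached 0, the mirror must never fall behind again,
    %% which is enforced with an assertion-style pattern match:
    update_delta(DeltaChange, State = #state{depth_delta = 0}) ->
        %% If the mirror's MsgId-based acks stop lining up with the master's
        %% SeqId-based acks (e.g. because CQv2 did not store the MsgId),
        %% DeltaChange can arrive as -1 here and this match exits with
        %% {badmatch,-1}, crashing the queue process as in the log above.
        0 = DeltaChange,
        State;
    update_delta(DeltaChange, State = #state{depth_delta = Delta}) ->
        set_delta(Delta + DeltaChange, State).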
