Broker keeps logging "Mark deleting an already mark-deleted position" #13788

Open
Shawyeok opened this issue Jan 17, 2022 · 8 comments
Labels
type/bug

Comments

@Shawyeok (Contributor) commented Jan 17, 2022

Describe the bug
A subscription in our production cluster keeps logging "Mark deleting an already mark-deleted position".

[persistent://op/growth/sms_produce_b_api_0-partition-2][growth] Failed to delete message at 71317494:-1: org.apache.bookkeeper.mledger.ManagedLedgerException: java.lang.IllegalArgumentException: Mark deleting an already mark-deleted position

Topic persistent://op/growth/sms_produce_b_api_0 has 3 partitions, but only partition-2 keeps logging this error.

The internal-stats output shows that the cursor's markDeletePosition is after the firstRange of individuallyDeletedMessages:
[screenshot of topic internal-stats output]

So every time a message ack is handled, the cursor selects the upper endpoint of the firstRange of individuallyDeletedMessages as the new mark-delete position. Since that position is behind markDeletePosition, an IllegalArgumentException is thrown in org.apache.bookkeeper.mledger.impl.ManagedCursorImpl#setAcknowledgedPosition.

    // If the lower bound of the range set is the current mark delete position, then we can trigger a new
    // mark-delete to the upper bound of the first range segment
    Range<PositionImpl> range = individualDeletedMessages.firstRange();

    // If the lowerBound is ahead of MarkDelete, verify if there are any entries in-between
    if (range.lowerEndpoint().compareTo(markDeletePosition) <= 0 || ledger
            .getNumberOfEntries(Range.openClosed(markDeletePosition, range.lowerEndpoint())) <= 0) {
        if (log.isDebugEnabled()) {
            log.debug("[{}] Found a position range to mark delete for cursor {}: {} ", ledger.getName(),
                    name, range);
        }
        newMarkDeletePosition = range.upperEndpoint();
    }

    if (newMarkDeletePosition != null) {
        newMarkDeletePosition = setAcknowledgedPosition(newMarkDeletePosition);
    } else {
        newMarkDeletePosition = markDeletePosition;
    }

    PositionImpl setAcknowledgedPosition(PositionImpl newMarkDeletePosition) {
        if (newMarkDeletePosition.compareTo(markDeletePosition) < 0) {
            throw new IllegalArgumentException("Mark deleting an already mark-deleted position");
        }
        // ...
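
To make the failure mode concrete, here is a minimal, self-contained sketch (a hypothetical illustration, not the broker's code; the range's lower endpoint is invented) that plugs the positions from the Arthas output below into the same two checks:

    import com.google.common.collect.Range;

    public class MarkDeleteSketch {

        // Simplified stand-in for org.apache.bookkeeper.mledger.impl.PositionImpl
        record Position(long ledgerId, long entryId) implements Comparable<Position> {
            @Override
            public int compareTo(Position o) {
                int c = Long.compare(ledgerId, o.ledgerId);
                return c != 0 ? c : Long.compare(entryId, o.entryId);
            }
            @Override
            public String toString() {
                return ledgerId + ":" + entryId;
            }
        }

        public static void main(String[] args) {
            // Current mark-delete position of the cursor (from the Arthas output)
            Position markDelete = new Position(71317769, -1);
            // Stale first range of individuallyDeletedMessages; its upper endpoint
            // (71257569:341) is the position the cursor keeps trying to mark-delete.
            // The lower endpoint here is made up for the example.
            Range<Position> firstRange = Range.openClosed(
                    new Position(71257569, 300), new Position(71257569, 341));

            Position newMarkDelete = null;
            // First check: the lower endpoint is behind markDelete, so the cursor
            // picks the upper endpoint of the stale range as the new mark-delete.
            if (firstRange.lowerEndpoint().compareTo(markDelete) <= 0) {
                newMarkDelete = firstRange.upperEndpoint();
            }

            // Second check mirrors setAcknowledgedPosition(): the chosen position is
            // behind the current mark-delete, so the broker would throw
            // IllegalArgumentException("Mark deleting an already mark-deleted position").
            if (newMarkDelete != null && newMarkDelete.compareTo(markDelete) < 0) {
                System.out.println("Would throw: attempted " + newMarkDelete
                        + " < current mark-delete " + markDelete);
            }
        }
    }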

Arthas watch output

[arthas@261046]$ watch -n 5 -e org.apache.bookkeeper.mledger.impl.ManagedCursorImpl setAcknowledgedPosition {target,params[0],target.markDeletePosition,target.name,target.ledger.name}
Press Q or Ctrl+C to abort.
Affect(class count: 2 , method count: 1) cost in 233 ms, listenerId: 18
method=org.apache.bookkeeper.mledger.impl.ManagedCursorImpl.setAcknowledgedPosition location=AtExceptionExit
ts=2022-01-17 16:54:37; [cost=0.365992ms] result=@ArrayList[
    @ManagedCursorImpl[ManagedCursorImpl{ledger=op/growth/persistent/sms_produce_a_api_0-partition-2, name=growth, ackPos=71317769:-1, readPos=86445249:1465}],
    @PositionImpl[71257569:341],
    @PositionImpl[71317769:-1],
    @String[growth],
    @String[op/growth/persistent/sms_produce_a_api_0-partition-2],
]
method=org.apache.bookkeeper.mledger.impl.ManagedCursorImpl.setAcknowledgedPosition location=AtExceptionExit
ts=2022-01-17 16:54:44; [cost=0.094147ms] result=@ArrayList[
    @ManagedCursorImpl[ManagedCursorImpl{ledger=op/growth/persistent/sms_produce_a_api_0-partition-2, name=growth, ackPos=71317769:-1, readPos=86445249:1466}],
    @PositionImpl[71257569:341],
    @PositionImpl[71317769:-1],
    @String[growth],
    @String[op/growth/persistent/sms_produce_a_api_0-partition-2],
]
method=org.apache.bookkeeper.mledger.impl.ManagedCursorImpl.setAcknowledgedPosition location=AtExceptionExit
ts=2022-01-17 16:54:47; [cost=0.09929ms] result=@ArrayList[
    @ManagedCursorImpl[ManagedCursorImpl{ledger=op/growth/persistent/sms_produce_a_api_0-partition-2, name=growth, ackPos=71317769:-1, readPos=86445249:1467}],
    @PositionImpl[71257569:341],
    @PositionImpl[71317769:-1],
    @String[growth],
    @String[op/growth/persistent/sms_produce_a_api_0-partition-2],
]
method=org.apache.bookkeeper.mledger.impl.ManagedCursorImpl.setAcknowledgedPosition location=AtExceptionExit
ts=2022-01-17 16:54:48; [cost=0.056561ms] result=@ArrayList[
    @ManagedCursorImpl[ManagedCursorImpl{ledger=op/growth/persistent/sms_produce_a_api_0-partition-2, name=growth, ackPos=71317769:-1, readPos=86445249:1468}],
    @PositionImpl[71257569:341],
    @PositionImpl[71317769:-1],
    @String[growth],
    @String[op/growth/persistent/sms_produce_a_api_0-partition-2],
]

I don't know how to reproduce this situation yet; please let me know if you need more details.

Additional context

  • pulsar version: 2.8.1
  • OS: CentOS Linux release 7.9.2009 (Core)
  • kernel: 3.10.0-1160.45.1.el7.x86_64
@codelipenghui (Contributor)

@Shawyeok It looks like the application holds some old messages and tries to ack them. Could you please help check whether ledger ID 71317494 belongs to this topic?

@Shawyeok (Contributor, Author)

> could you please help check whether ledger ID 71317494 belongs to this topic?

@codelipenghui It belongs to partition-1, which is weird.

$ get_ledger_b 71317494
{
  "71317494" : {
    "ledgerId" : 71317494,
    "metadataFormatVersion" : 3,
    "ensembleSize" : 2,
    "writeQuorumSize" : 2,
    "ackQuorumSize" : 2,
    "state" : "CLOSED",
    "length" : 4007390,
    "lastEntryId" : 2733,
    "ctime" : 1637648110933,
    "digestType" : "CRC32C",
    "password" : "",
    "customMetadata" : {
      "component" : "bWFuYWdlZC1sZWRnZXI=",
      "pulsar/managed-ledger" : "b3AvZ3Jvd3RoL3BlcnNpc3RlbnQvc21zX3Byb2R1Y2VfYl9hcGlfMC1wYXJ0aXRpb24tMQ==",
      "application" : "cHVsc2Fy"
    },
    "allEnsembles" : {
      "0" : [ {
        "id" : "aaa:3181"
      }, {
        "id" : "bbb:3181"
      } ]
    },
    "ctoken" : 0,
    "closed" : true
  }
}
$ echo "b3AvZ3Jvd3RoL3BlcnNpc3RlbnQvc21zX3Byb2R1Y2VfYl9hcGlfMC1wYXJ0aXRpb24tMQ==" | base64 -d
op/growth/persistent/sms_produce_b_api_0-partition-1
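
For reference, a tiny Java equivalent of the base64 -d step above, using only java.util.Base64 (the class name is made up):

    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    public class DecodeManagedLedgerName {
        public static void main(String[] args) {
            // "pulsar/managed-ledger" value from the ledger's customMetadata above
            String encoded = "b3AvZ3Jvd3RoL3BlcnNpc3RlbnQvc21zX3Byb2R1Y2VfYl9hcGlfMC1wYXJ0aXRpb24tMQ==";
            String name = new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
            System.out.println(name); // op/growth/persistent/sms_produce_b_api_0-partition-1
        }
    }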

@346359827 mentioned this issue Jan 19, 2022
@michaeljmarshall (Member)

@Shawyeok - have you tested running with a 2.8.2 broker for this cluster? It'd be helpful to know if you're observing the same behavior for that version, too.

@dave2wave (Member)

I'm seeing this error while doing OMB / Offloading tests of Pulsar 2.8.3 rc3.

Mar  2 17:43:46 ip-10-0-0-250 pulsar[20582]: 17:43:46.391 [bookkeeper-ml-scheduler-OrderedScheduler-2-0] WARN  org.apache.bookkeeper.mledger.impl.ManagedCursorImpl - [benchmark/ns-Y4Ft4Qc/persistent/test-6Kvlcnk-0004-partition-2][sub-000-0fjC-_4] Failed to flush mark-delete position
org.apache.bookkeeper.mledger.ManagedLedgerException: java.lang.IllegalArgumentException: Mark deleting an already mark-deleted position. Current mark-delete: 238:7033 -- attempted mark delete: 238:6933
Caused by: java.lang.IllegalArgumentException: Mark deleting an already mark-deleted position. Current mark-delete: 238:7033 -- attempted mark delete: 238:6933
    at org.apache.bookkeeper.mledger.impl.ManagedCursorImpl.setAcknowledgedPosition(ManagedCursorImpl.java:1582) ~[org.apache.pulsar-managed-ledger-2.8.3.jar:2.8.3]
    at org.apache.bookkeeper.mledger.impl.ManagedCursorImpl.asyncMarkDelete(ManagedCursorImpl.java:1722) ~[org.apache.pulsar-managed-ledger-2.8.3.jar:2.8.3]
    at org.apache.bookkeeper.mledger.impl.ManagedCursorImpl.flush(ManagedCursorImpl.java:3044) ~[org.apache.pulsar-managed-ledger-2.8.3.jar:2.8.3]
    at org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl.lambda$flushCursors$1(ManagedLedgerFactoryImpl.java:213) ~[org.apache.pulsar-managed-ledger-2.8.3.jar:2.8.3]
    at java.lang.Iterable.forEach(Iterable.java:75) ~[?:?]
    at org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl.lambda$flushCursors$2(ManagedLedgerFactoryImpl.java:213) ~[org.apache.pulsar-managed-ledger-2.8.3.jar:2.8.3]
    at java.util.concurrent.ConcurrentHashMap$ValuesView.forEach(ConcurrentHashMap.java:4772) ~[?:?]
    at org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl.flushCursors(ManagedLedgerFactoryImpl.java:209) ~[org.apache.pulsar-managed-ledger-2.8.3.jar:2.8.3]
    at org.apache.pulsar.common.util.Runnables$CatchingAndLoggingRunnable.run(Runnables.java:53) [org.apache.pulsar-pulsar-common-2.8.3.jar:2.8.3]
    at org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203) [org.apache.bookkeeper-bookkeeper-common-4.14.4.jar:4.14.4]
    at com.google.common.util.concurrent.MoreExecutors$ScheduledListeningDecorator$NeverSuccessfulListenableFutureTask.run(MoreExecutors.java:705) [com.google.guava-guava-30.1-jre.jar:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) [?:?]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) [?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.74.Final.jar:4.1.74.Final]
    at java.lang.Thread.run(Thread.java:829) [?:?]

The mark-delete seems to be slightly out of sequence.

github-actions bot commented Apr 2, 2022

The issue had no activity for 30 days, mark with Stale label.

@lhotari (Member) commented Apr 6, 2022

#14261 and #15031 might resolve the issue shown in the description.

github-actions bot commented May 7, 2022

The issue had no activity for 30 days, mark with Stale label.

@ethqunzhong (Contributor)

> I'm seeing this error while doing OMB / Offloading tests of Pulsar 2.8.3 rc3.

Yes, I also see this error while offloading with the S3 offloader on Pulsar 2.8.3 in our production environment.
