[improve][ci] Disable test that causes OOME until the problem has been resolved #22586

Merged
merged 1 commit into from
Apr 25, 2024

lhotari commented Apr 25, 2024

Motivation

Unit test group 1 often fails with an OOME. (example)

Modifications

The issue is most likely related to #21495 and org.apache.pulsar.broker.service.ReplicatorSubscriptionTest#testWriteMarkerTaskOfReplicateSubscriptions.
Disable the test until the problem has been resolved.

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

lhotari commented Apr 25, 2024

In one of the heap dumps, there were 251,029 lambda instances, all of which reference a __change_events topic in a namespace whose name starts with testReplicateSubBackLog. This confirms that the test disabled in this PR is causing the issue.

The heap dump was queried using https://github.com/vlsi/mat-calcite-plugin:

select this['arg$2.completeTopicName'], count(*) from "org.apache.pulsar.broker.resources.NamespaceResources$PartitionedTopicResources$$Lambda$1819+0x00007f08a8b65ee8" group by 1
EXPR$0                                                                                          |  EXPR$1
----------------------------------------------------------------------------------------------------------
persistent://pulsar/testReplicateSubBackLog-88c7c05a-0f07-4ed2-b82a-9b911afad922/__change_events| 251,029
----------------------------------------------------------------------------------------------------------

lhotari commented Apr 25, 2024

In another heap dump:

select this['arg$2.completeTopicName'], count(*) from "org.apache.pulsar.broker.resources.NamespaceResources$PartitionedTopicResources$$Lambda$3405+0x00007fae50f7b000"  group by 1
EXPR$0                                                                                              |  EXPR$1
--------------------------------------------------------------------------------------------------------------
persistent://pulsar/testReplicateSubBackLog-acf0e69a-2836-430c-bf1e-c9babd059179/replication-disable|       4
persistent://pulsar/testReplicateSubBackLog-acf0e69a-2836-430c-bf1e-c9babd059179/__change_events    | 277,899
Total: 2 entries                                                                                    | 277,903
--------------------------------------------------------------------------------------------------------------

lhotari commented Apr 25, 2024

There are a few recent replicator-related changes: #21946, #21948 and #22537. @poorbarcode, please check whether one of these changes is triggering the OOME issue, which is possibly related to deletion. There are a lot of entries for the __change_events topic of the replicated namespace.

lhotari commented Apr 25, 2024

Just wondering whether the problem is somehow related to namespace deletion with replication enabled.
The namespace deletion code will need to be refactored in any case to get the concurrency under control.

private void internalRetryableDeleteNamespaceAsync0(boolean force, int retryTimes,
                                                    @Nonnull CompletableFuture<Void> callback) {
    precheckWhenDeleteNamespace(namespaceName, force)
            .thenCompose(policies -> {
                final CompletableFuture<List<String>> topicsFuture;
                if (policies == null || CollectionUtils.isEmpty(policies.replication_clusters)){
                    topicsFuture = pulsar().getNamespaceService().getListOfPersistentTopics(namespaceName);
                } else {
                    topicsFuture = pulsar().getNamespaceService().getFullListOfTopics(namespaceName);
                }
                return topicsFuture.thenCompose(allTopics ->
                        pulsar().getNamespaceService().getFullListOfPartitionedTopic(namespaceName)
                                .thenCompose(allPartitionedTopics -> {
                                    List<List<String>> topicsSum = new ArrayList<>(2);
                                    topicsSum.add(allTopics);
                                    topicsSum.add(allPartitionedTopics);
                                    return CompletableFuture.completedFuture(topicsSum);
                                }))
                        .thenCompose(topics -> {
                            List<String> allTopics = topics.get(0);
                            Set<String> allUserCreatedTopics = new HashSet<>();
                            List<String> allPartitionedTopics = topics.get(1);
                            Set<String> allUserCreatedPartitionTopics = new HashSet<>();
                            boolean hasNonSystemTopic = false;
                            Set<String> allSystemTopics = new HashSet<>();
                            Set<String> allPartitionedSystemTopics = new HashSet<>();
                            Set<String> topicPolicy = new HashSet<>();
                            Set<String> partitionedTopicPolicy = new HashSet<>();
                            for (String topic : allTopics) {
                                if (!pulsar().getBrokerService().isSystemTopic(TopicName.get(topic))) {
                                    hasNonSystemTopic = true;
                                    allUserCreatedTopics.add(topic);
                                } else {
                                    if (SystemTopicNames.isTopicPoliciesSystemTopic(topic)) {
                                        topicPolicy.add(topic);
                                    } else if (!isDeletedAlongWithUserCreatedTopic(topic)) {
                                        allSystemTopics.add(topic);
                                    }
                                }
                            }
                            for (String topic : allPartitionedTopics) {
                                if (!pulsar().getBrokerService().isSystemTopic(TopicName.get(topic))) {
                                    hasNonSystemTopic = true;
                                    allUserCreatedPartitionTopics.add(topic);
                                } else {
                                    if (SystemTopicNames.isTopicPoliciesSystemTopic(topic)) {
                                        partitionedTopicPolicy.add(topic);
                                    } else {
                                        allPartitionedSystemTopics.add(topic);
                                    }
                                }
                            }
                            if (!force) {
                                if (hasNonSystemTopic) {
                                    throw new RestException(Status.CONFLICT, "Cannot delete non empty namespace");
                                }
                            }
                            final CompletableFuture<Void> markDeleteFuture;
                            if (policies != null && policies.deleted) {
                                markDeleteFuture = CompletableFuture.completedFuture(null);
                            } else {
                                markDeleteFuture = namespaceResources().setPoliciesAsync(namespaceName, old -> {
                                    old.deleted = true;
                                    return old;
                                });
                            }
                            return markDeleteFuture.thenCompose(__ ->
                                            internalDeleteTopicsAsync(allUserCreatedTopics))
                                    .thenCompose(ignore ->
                                            internalDeletePartitionedTopicsAsync(allUserCreatedPartitionTopics))
                                    .thenCompose(ignore ->
                                            internalDeleteTopicsAsync(allSystemTopics))
                                    .thenCompose(ignore ->
                                            internalDeletePartitionedTopicsAsync(allPartitionedSystemTopics))
                                    .thenCompose(ignore ->
                                            internalDeleteTopicsAsync(topicPolicy))
                                    .thenCompose(ignore ->
                                            internalDeletePartitionedTopicsAsync(partitionedTopicPolicy));
                        });
            })
            .thenCompose(ignore -> pulsar().getNamespaceService()
                    .getNamespaceBundleFactory().getBundlesAsync(namespaceName))
            .thenCompose(bundles -> FutureUtil.waitForAll(bundles.getBundles().stream()
                    .map(bundle -> pulsar().getNamespaceService().checkOwnershipPresentAsync(bundle)
                            .thenCompose(present -> {
                                // check if the bundle is owned by any broker,
                                // if not then we do not need to delete the bundle
                                if (present) {
                                    PulsarAdmin admin;
                                    try {
                                        admin = pulsar().getAdminClient();
                                    } catch (PulsarServerException ex) {
                                        log.error("[{}] Get admin client error when preparing to delete topics.",
                                                clientAppId(), ex);
                                        return FutureUtil.failedFuture(ex);
                                    }
                                    return admin.namespaces().deleteNamespaceBundleAsync(namespaceName.toString(),
                                            bundle.getBundleRange(), force);
                                }
                                return CompletableFuture.completedFuture(null);
                            })
                    ).collect(Collectors.toList())))
            .thenCompose(ignore -> internalClearZkSources())
            .whenComplete((result, error) -> {
                if (error != null) {
                    final Throwable rc = FutureUtil.unwrapCompletionException(error);
                    if (rc instanceof MetadataStoreException) {
                        if (rc.getCause() != null && rc.getCause() instanceof KeeperException.NotEmptyException) {
                            log.info("[{}] There are in-flight topics created during the namespace deletion, "
                                    + "retry to delete the namespace again.", namespaceName);
                            final int next = retryTimes - 1;
                            if (next > 0) {
                                // async recursive
                                internalRetryableDeleteNamespaceAsync0(force, next, callback);
                            } else {
                                callback.completeExceptionally(
                                        new RestException(Status.CONFLICT, "The broker still have in-flight topics"
                                                + " created during namespace deletion, please try again."));
                                // drop out recursive
                            }
                            return;
                        }
                    }
                    callback.completeExceptionally(error);
                    return;
                }
                callback.complete(result);
            });
}

The concurrency issue is explained in #22541 (comment)
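
One possible way to bring this kind of fan-out under control is to cap the number of asynchronous operations that are in flight at once. The following is only a minimal sketch under that assumption, not Pulsar's implementation and not necessarily what the linked comment proposes: `BoundedAsyncRunner`, `runAll` and `maxInFlight` are hypothetical names, and the delete call is passed in as a generic function.

```java
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

public class BoundedAsyncRunner {

    /**
     * Runs an async operation for every item while keeping at most maxInFlight
     * operations pending. The returned future completes when all operations
     * have finished, or fails with the first error that is observed.
     */
    public static <T> CompletableFuture<Void> runAll(List<T> items, int maxInFlight,
                                                     Function<T, CompletableFuture<Void>> operation) {
        if (maxInFlight < 1) {
            throw new IllegalArgumentException("maxInFlight must be at least 1");
        }
        CompletableFuture<Void> done = new CompletableFuture<>();
        if (items.isEmpty()) {
            done.complete(null);
            return done;
        }
        Iterator<T> iterator = items.iterator();
        AtomicInteger remaining = new AtomicInteger(items.size());
        // Start the initial batch; every completion then starts the next item.
        int initial = Math.min(maxInFlight, items.size());
        for (int i = 0; i < initial; i++) {
            startNext(iterator, remaining, operation, done);
        }
        return done;
    }

    private static <T> void startNext(Iterator<T> iterator, AtomicInteger remaining,
                                      Function<T, CompletableFuture<Void>> operation,
                                      CompletableFuture<Void> done) {
        final T item;
        synchronized (iterator) {
            // Stop scheduling once the overall result is known or nothing is left.
            if (done.isDone() || !iterator.hasNext()) {
                return;
            }
            item = iterator.next();
        }
        final CompletableFuture<Void> pending;
        try {
            pending = operation.apply(item);
        } catch (RuntimeException e) {
            done.completeExceptionally(e);
            return;
        }
        // Caveat: if operations complete synchronously, this recurses once per
        // item; a production version would hop to an executor instead.
        pending.whenComplete((ignore, error) -> {
            if (error != null) {
                done.completeExceptionally(error);
            } else if (remaining.decrementAndGet() == 0) {
                done.complete(null);
            } else {
                startNext(iterator, remaining, operation, done);
            }
        });
    }

    public static void main(String[] args) {
        List<String> topics = List.of("topic-1", "topic-2", "topic-3", "topic-4");
        // Hypothetical stand-in for an asynchronous topic deletion call.
        runAll(topics, 2, topic ->
                CompletableFuture.runAsync(() -> System.out.println("deleting " + topic)))
                .join();
        System.out.println("all done");
    }
}
```

With a cap like this, the number of pending futures and their callbacks is bounded by `maxInFlight` rather than by the number of topics in the namespace.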

lhotari commented Apr 25, 2024

The namespace deletion in the test might be the code that triggers the problem:

// 4. Clear resource.
pulsar1.getConfiguration().setForceDeleteNamespaceAllowed(true);
admin1.namespaces().deleteNamespace(namespace, true);
pulsar1.getConfiguration().setForceDeleteNamespaceAllowed(false);

@poorbarcode do you have a chance to debug this issue?

@lhotari lhotari merged commit 6a94231 into apache:master Apr 25, 2024
52 of 53 checks passed

lhotari commented Apr 25, 2024

There are more problems. The following queries use the heap dump from https://github.com/apache/pulsar/actions/runs/8835173621/attempts/1?pr=22583:

select toString(this['stack.fn.arg$1']), count(*) from java.util.concurrent.CompletableFuture where this['stack.fn'] is not null group by 1 order by 2 desc
EXPR$0                                                                                                         |  EXPR$1
-------------------------------------------------------------------------------------------------------------------------
org.apache.pulsar.broker.service.persistent.SystemTopic @ 0x100031c173f8                                       | 435,961
org.apache.pulsar.broker.resources.NamespaceResources$PartitionedTopicResources @ 0x10002b285920               |  56,647
                                                                                                               |     144
org.apache.pulsar.metadata.bookkeeper.PulsarRegistrationClient @ 0x10000a69cfe0                                |      17
org.apache.pulsar.metadata.bookkeeper.PulsarRegistrationClient @ 0x100011873a80                                |      17
org.apache.pulsar.metadata.bookkeeper.PulsarRegistrationClient @ 0x100007eb5af8                                |      13
org.apache.pulsar.metadata.bookkeeper.PulsarRegistrationClient @ 0x100018f4fd48                                |      13
org.apache.pulsar.metadata.bookkeeper.PulsarRegistrationClient @ 0x10002b0e1900                                |      12
org.apache.pulsar.metadata.bookkeeper.PulsarRegistrationClient @ 0x10001c039270                                |      12
org.apache.pulsar.metadata.bookkeeper.PulsarRegistrationClient @ 0x10003be7e448                                |      11
org.apache.pulsar.metadata.bookkeeper.PulsarRegistrationClient @ 0x10003bfa8e60                                |      11
org.apache.pulsar.metadata.bookkeeper.PulsarRegistrationClient @ 0x10000c4b8f78                                |       9
org.apache.pulsar.metadata.bookkeeper.PulsarRegistrationClient @ 0x10000c439b80                                |       9
org.apache.pulsar.client.impl.BinaryProtoLookupService @ 0x10002d78f820                                        |       7
org.apache.pulsar.broker.service.persistent.PersistentTopic @ 0x1000281e0380                                   |       6
org.apache.pulsar.client.impl.PulsarClientImpl @ 0x10002d78dd90                                                |       5
org.apache.pulsar.client.impl.ConnectionHandler @ 0x10002d60cfc8                                               |       4
org.apache.pulsar.client.impl.ConnectionHandler @ 0x10002d619f28                                               |       4
org.apache.pulsar.client.impl.ConnectionHandler @ 0x100024745fc8                                               |       4
ASSIGN                                                                                                         |       4
org.apache.pulsar.client.impl.ConnectionHandler @ 0x10002a0f8390                                               |       4
org.apache.pulsar.client.impl.ConnectionHandler @ 0x10002d6132f8                                               |       4
org.apache.pulsar.client.impl.ConnectionHandler @ 0x10002d796090                                               |       4
org.apache.pulsar.broker.namespace.OwnershipCache$OwnedServiceUnitCacheLoader @ 0x10001c0314e8                 |       3
org.apache.pulsar.broker.namespace.OwnershipCache$OwnedServiceUnitCacheLoader @ 0x10003be76ef8                 |       2
org.apache.pulsar.broker.service.persistent.GeoPersistentReplicator @ 0x1003f30fc0c8                           |       2
org.apache.pulsar.metadata.impl.ZKMetadataStore @ 0x1003ef17fef0                                               |       2
org.apache.pulsar.client.impl.TableViewImpl @ 0x10002725bb30                                                   |       2
org.apache.pulsar.client.impl.TableViewImpl @ 0x10002a1bf700                                                   |       2
org.apache.pulsar.client.impl.TableViewImpl @ 0x10002d6684a0                                                   |       2
org.apache.pulsar.broker.service.SystemTopicBasedTopicPoliciesService @ 0x1000182bb060                         |       2
org.apache.pulsar.broker.systopic.TopicPoliciesSystemTopicClient @ 0x1003f0a66de0                              |       2
org.apache.pulsar.broker.namespace.OwnershipCache$OwnedServiceUnitCacheLoader @ 0x10000c5040b0                 |       2
org.apache.pulsar.broker.namespace.OwnershipCache$OwnedServiceUnitCacheLoader @ 0x100018f47fb0                 |       2
java.util.concurrent.CompletableFuture @ 0x1003ed849b98                                                        |       1
persistent://pulsar/global/removeClusterTest/__change_events                                                   |       1
org.apache.pulsar.client.impl.ReaderImpl @ 0x10002a0f22f0                                                      |       1
java.util.concurrent.CompletableFuture @ 0x1003ef194a50                                                        |       1
org.apache.pulsar.metadata.bookkeeper.PulsarRegistrationClient$$Lambda$3215+0x00007f588ce12e20 @ 0x1000234ce2b8|       1
org.apache.pulsar.client.impl.ConnectionPool @ 0x10002d78e760                                                  |       1
org.apache.pulsar.metadata.bookkeeper.PulsarRegistrationClient$$Lambda$3215+0x00007f588ce12e20 @ 0x1003eb48e700|       1
org.apache.pulsar.client.impl.ReaderImpl @ 0x10002d613e88                                                      |       1
org.apache.pulsar.metadata.coordination.impl.LockManagerImpl @ 0x100011873a20                                  |       1
org.apache.pulsar.metadata.coordination.impl.LockManagerImpl @ 0x10003be700d0                                  |       1
org.apache.pulsar.metadata.bookkeeper.PulsarRegistrationClient$$Lambda$3215+0x00007f588ce12e20 @ 0x1003fda42068|       1
org.apache.pulsar.metadata.bookkeeper.PulsarRegistrationClient$$Lambda$3215+0x00007f588ce12e20 @ 0x1000234ce410|       1
org.apache.pulsar.metadata.bookkeeper.PulsarRegistrationClient$$Lambda$3215+0x00007f588ce12e20 @ 0x1003eb4dbbc0|       1
org.apache.pulsar.client.impl.ReaderImpl @ 0x1003fbf276b0                                                      |       1
java.util.concurrent.CompletableFuture @ 0x100011edcc18                                                        |       1
org.apache.pulsar.client.impl.ReaderImpl @ 0x10002d60d258                                                      |       1
org.apache.pulsar.broker.service.persistent.SystemTopic @ 0x100011877ef0                                       |       1
org.apache.pulsar.metadata.bookkeeper.PulsarRegistrationClient$$Lambda$3215+0x00007f588ce12e20 @ 0x1003fdae3248|       1
org.apache.pulsar.metadata.coordination.impl.LockManagerImpl @ 0x10001c02a548                                  |       1
org.apache.pulsar.metadata.bookkeeper.PulsarRegistrationClient$$Lambda$3215+0x00007f588ce12e20 @ 0x100023429888|       1
org.apache.pulsar.metadata.bookkeeper.PulsarRegistrationClient$$Lambda$3215+0x00007f588ce12e20 @ 0x1003fda421c0|       1
org.apache.pulsar.compaction.StrategicTwoPhaseCompactor @ 0x1000392ec5a0                                       |       1
java.util.concurrent.CompletableFuture @ 0x1003f0a67038                                                        |       1
org.apache.pulsar.metadata.coordination.impl.LockManagerImpl @ 0x10000c4f6a20                                  |       1
org.apache.pulsar.metadata.bookkeeper.PulsarRegistrationClient$$Lambda$3215+0x00007f588ce12e20 @ 0x1000234fcbe8|       1
org.apache.pulsar.metadata.bookkeeper.PulsarRegistrationClient$$Lambda$3215+0x00007f588ce12e20 @ 0x1000234fca90|       1
org.apache.pulsar.metadata.coordination.impl.LockManagerImpl @ 0x100018f40f28                                  |       1
java.util.concurrent.CompletableFuture @ 0x100039d71458                                                        |       1
org.apache.pulsar.broker.service.SystemTopicBasedTopicPoliciesService @ 0x10002a0a3ae8                         |       1
org.apache.pulsar.metadata.bookkeeper.PulsarRegistrationClient$$Lambda$3215+0x00007f588ce12e20 @ 0x1000234299e0|       1
org.apache.pulsar.metadata.bookkeeper.PulsarRegistrationClient$$Lambda$3215+0x00007f588ce12e20 @ 0x1003fdae39b8|       1
org.apache.pulsar.metadata.bookkeeper.PulsarRegistrationClient$$Lambda$3215+0x00007f588ce12e20 @ 0x1003eb48e858|       1
org.apache.pulsar.metadata.bookkeeper.PulsarRegistrationClient$$Lambda$3215+0x00007f588ce12e20 @ 0x1003eb4dba68|       1
Total: 67 entries                                                                                              | 492,978
-------------------------------------------------------------------------------------------------------------------------
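
The `stack` field queried above is where a `CompletableFuture` links the dependent stages that are waiting for it to complete. The following minimal sketch shows how such entries can pile up. It assumes, which the heap dump alone does not prove, that callbacks keep being attached to a future that never completes; the class and method names are hypothetical.

```java
import java.util.concurrent.CompletableFuture;

public class DependentStagesSketch {

    // Hypothetical helper: attaches `count` callbacks to the given future.
    static void attachCallbacks(CompletableFuture<String> future, int count) {
        for (int i = 0; i < count; i++) {
            final int attempt = i;
            // Each thenApply registers a dependent stage, including its lambda,
            // on the source future. The stage stays reachable from the source
            // until the source completes.
            future.thenApply(value -> value + "-" + attempt);
        }
    }

    public static void main(String[] args) {
        CompletableFuture<String> neverCompleted = new CompletableFuture<>();
        attachCallbacks(neverCompleted, 1000);
        System.out.println("pending dependents: " + neverCompleted.getNumberOfDependents());

        // Completing the source runs the dependent stages and unlinks them.
        neverCompleted.complete("done");
        System.out.println("after completion: " + neverCompleted.getNumberOfDependents());
    }
}
```

If something similar happens here, the objects listed in the first column would be the instances captured by the pending callbacks, which would explain why a single SystemTopic instance accounts for most of the entries.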

lhotari commented Apr 25, 2024

select toString(this['result.ex.detailMessage']), count(*) from java.util.concurrent.CompletableFuture where this['result.ex.detailMessage'] is not null group by 1 order by 2 desc
EXPR$0                                                                                                                                                                                                      |  EXPR$1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
org.apache.pulsar.metadata.api.MetadataStoreException$NotFoundException:                                                                                                                                    | 209,380
Lock was not in valid state: Releasing                                                                                                                                                                      |      23
BookKeeper client is closed                                                                                                                                                                                 |      15
org.apache.bookkeeper.mledger.ManagedLedgerException: java.util.concurrent.CompletionException: org.apache.bookkeeper.mledger.ManagedLedgerException$CursorAlreadyClosedException: Cursor was already closed|       2
Failed to close clients before deleting topic.                                                                                                                                                              |       1
Total: 5 entries                                                                                                                                                                                            | 209,421
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Labels
doc-not-needed Your PR changes do not impact docs ready-to-test