Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Duplicate Unique ID setting in OpenSearch transport #771

Closed
dbwiddis opened this issue May 22, 2023 · 6 comments · Fixed by opensearch-project/OpenSearch#8228
Closed
Assignees
Labels
bug Something isn't working

Comments

@dbwiddis
Copy link
Member

dbwiddis commented May 22, 2023

What is the bug?

OpenSearch's Security Plugin prevents connections that are not identified as a node. As part of implementing security features, a transport header "extension_unique_id" was added to identify extension "nodes" to the plugin.

However, this header is being set twice and causing IllegalArgumentExceptions on initialization:

org.opensearch.bootstrap.StartupException: RemoteTransportException[[ad-extension][127.0.0.1:4532][internal:discovery/extensions]]; nested: IllegalArgumentException[value for key [extension_unique_id] already present];
	at org.opensearch.bootstrap.OpenSearch.init(OpenSearch.java:184) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.bootstrap.OpenSearch.execute(OpenSearch.java:171) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:104) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.cli.Command.mainWithoutErrorHandling(Command.java:138) ~[opensearch-cli-3.0.0.jar:3.0.0]
	at org.opensearch.cli.Command.main(Command.java:101) ~[opensearch-cli-3.0.0.jar:3.0.0]
	at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:137) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:103) ~[opensearch-3.0.0.jar:3.0.0]
Caused by: org.opensearch.transport.RemoteTransportException: [ad-extension][127.0.0.1:4532][internal:discovery/extensions]
Caused by: java.lang.IllegalArgumentException: value for key [extension_unique_id] already present
	at org.opensearch.common.util.concurrent.ThreadContext$ThreadContextStruct.putSingleHeader(ThreadContext.java:590) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.common.util.concurrent.ThreadContext$ThreadContextStruct.putRequest(ThreadContext.java:584) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.common.util.concurrent.ThreadContext.putHeader(ThreadContext.java:401) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.transport.TransportService.lambda$connectionValidatorForExtensionConnectingToNode$15(TransportService.java:547) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.transport.ClusterConnectionManager.lambda$connectToNode$3(ClusterConnectionManager.java:158) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.action.ActionListener$1.onResponse(ActionListener.java:80) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.action.ActionListener$4.onResponse(ActionListener.java:180) ~[opensearch-3.0.0.jar:3.0.0]
	at org.opensearch.action.support.ThreadedActionListener$1.doRun(ThreadedActionListener.java:78) ~[opensearch-3.0.0.jar:3.0.0]

How can one reproduce the bug?

Start up an extension on a non-local node.

Start up OpenSearch and initialize the extension.

What is the expected behavior?

Extension is successfully initialized.

What is your host/environment?

Any remote host initialization: this has been seen in attempts to do performance testing on two servers, and in CI setup. It does not manifest when the extension node is local to the OpenSearch node, as the "localNode" code branches bypass the transport connection validation that throws the error.

Do you have any additional context?

The root cause is in the ExtensionsInitRequestHandler. During the initial handling of a transport request by an extension, the header is added here:

extensionsRunner.getThreadPool().getThreadContext().putHeader("extension_unique_id", extensionNode.getId());

That header is communicated back to OpenSearch with the response in this try block:

try {
return new InitializeExtensionResponse(
extensionsRunner.getSettings().get(NODE_NAME_SETTING),
extensionsRunner.getExtensionImplementedInterfaces()
);

However, because it is a transport connection, the response is asynchronous so the thread context on OpenSearch may not be immediately updated.

In the finally block immediately following this, the Extension creates a separate Transport Connection (if it has the right host and port, see #761, #769, and other related issues):

extensionTransportService.connectToNodeAsExtension(
extensionInitRequest.getSourceNode(),
extensionInitRequest.getExtension().getId()
);

Log files indicate this connection tends to update the header first, so when the earlier response is handled by an ActionListener, it causes the exception.

Possible fix 1:

See #767 to completely eliminate this second connection

Possible fix 2:

Have a different key for outbound vs. inbound Transport connections: the response on line 77 above could have "-outbound" appended to it, while the new connection on lines 98-101 could have "-inbound" appended.

Possible fix 3:

Check whether the header already exists on the connection. Unfortunately given the multithreaded environment I'm not sure a simple conditional would work. It may be better to just catch the IllegalArgumentException and ignore this specific failure.

Possible fix 4:

In the ThreadContext header adding code, allow putting the same header twice. Specifically, change

private static <T> void putSingleHeader(String key, T value, Map<String, T> newHeaders) {
    if (newHeaders.putIfAbsent(key, value) != null) {
        throw new IllegalArgumentException("value for key [" + key + "] already present");
    }
}

to

private static <T> void putSingleHeader(String key, T value, Map<String, T> newHeaders) {
    T oldValue = newHeaders.putIfAbsent(key, value);
    if (oldValue != null && !oldValue.equals(value)) {
        throw new IllegalArgumentException("different value for key [" + key + "] already present");
    }
}

Temporary workaround:

For near term testing "without security" change the connectToNodeAsExtension() back to connectToNode() in the SDK.

@cwperks
Copy link
Member

cwperks commented May 22, 2023

Links back to discussions on the original PR that connectToNodeAsExtension was introduced:

It should not be setting the header twice and initially when implementing the TLS changes I only had putHeader called directly within ExtensionInitiRequestHandler here: https://github.com/opensearch-project/opensearch-sdk-java/blob/main/src/main/java/org/opensearch/sdk/handlers/ExtensionsInitRequestHandler.java#L77

I then faced a problem with connectToNode down below when it was performing a handshake with the cluster the header was not present. To fix that I explicitly set the header in connectToExtensionAsNode. A race condition may explain why it was not set during the call to connectToNode, but I'm still not positive.

@ryanbogan
Copy link
Member

ryanbogan commented May 25, 2023

@dbwiddis @cwperks We have a temporary fix that checks to see if the header is present before putting the header in the thread context. Should I raise a PR with that change for now until we find the root cause?

@cwperks
Copy link
Member

cwperks commented May 25, 2023

@ryanbogan @dbwiddis Can you post steps to reproduce the error? I am not able to reproduce this locally.

@ryanbogan
Copy link
Member

@cwperks It's not consistent, but most of the time it occurs when running an extension, stopping it, and then running it again.

@dbwiddis
Copy link
Member Author

@ryanbogan @dbwiddis Can you post steps to reproduce the error? I am not able to reproduce this locally.

This is 100% reproducible in a cluster environment; set up a 1-CM 1-DataNode cluster using the CDK. Shut down and start up the CM and observe the error.

I can work around the bug by changing the key used in the header in ExtensensionsInitResponseHandler, indicating that somehow this header is getting inserted in the map that later conflicts with the connectToNodeAsExtension call.

I'm really not sure what the purpose of adding the header to the OpenSearch transport connection from the SDK side is. Why don't we add that header on the OpenSearch side before we even establish the connection?

@cwperks
Copy link
Member

cwperks commented Jun 23, 2023

@dbwiddis Assuming that all transport actions are initiated from OpenSearch then it should be possible to populate the header similar to how the cluster name is populated here on all transport requests as long as its possible to determine which extension a transport request is bound to.

I have been thinking about this header as another form of trusted cross cluster communication.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants