Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[loadbalance] remove 'Invalid argument' confusing stacktrace when NIC speed can't be detected #14537

Closed
wants to merge 7 commits into from

Conversation

nicoloboschi
Copy link
Contributor

@nicoloboschi nicoloboschi commented Mar 2, 2022

Motivation

After #14340 and #14252 you may find in the log something like that

2022-03-02T10:49:18,037+0000 [TestNG-method=testBrokerHostUsage-1] ERROR org.apache.pulsar.broker.loadbalance.impl.LinuxBrokerHostUsageImpl - Failed to read speed for nic br-aca0e958a490, maybe you can set broker config [loadBalancerOverrideBrokerNicSpeedGbps] to override it.
java.io.IOException: Invalid argument
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.8.0_292]
	at sun.nio.ch.FileDispatcherImpl.read(FileDispatcherImpl.java:46) ~[?:1.8.0_292]
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[?:1.8.0_292]
	at sun.nio.ch.IOUtil.read(IOUtil.java:197) ~[?:1.8.0_292]
	at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:159) ~[?:1.8.0_292]
	at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65) ~[?:1.8.0_292]
	at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109) ~[?:1.8.0_292]
	at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103) ~[?:1.8.0_292]
	at java.nio.file.Files.read(Files.java:3105) ~[?:1.8.0_292]
	at java.nio.file.Files.readAllBytes(Files.java:3158) ~[?:1.8.0_292]
	at org.apache.pulsar.broker.loadbalance.impl.LinuxBrokerHostUsageImpl.lambda$null$3(LinuxBrokerHostUsageImpl.java:257) ~[classes/:?]
	at java.util.stream.ReferencePipeline$6$1.accept(ReferencePipeline.java:244) ~[?:1.8.0_292]
	at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384) ~[?:1.8.0_292]
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) ~[?:1.8.0_292]
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) ~[?:1.8.0_292]
	at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) ~[?:1.8.0_292]
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_292]
	at java.util.stream.DoublePipeline.collect(DoublePipeline.java:500) ~[?:1.8.0_292]
	at java.util.stream.DoublePipeline.sum(DoublePipeline.java:411) ~[?:1.8.0_292]
	at org.apache.pulsar.broker.loadbalance.impl.LinuxBrokerHostUsageImpl.lambda$getTotalNicLimitKbps$4(LinuxBrokerHostUsageImpl.java:263) ~[classes/:?]
	at java.util.Optional.orElseGet(Optional.java:267) [?:1.8.0_292]
	at org.apache.pulsar.broker.loadbalance.impl.LinuxBrokerHostUsageImpl.getTotalNicLimitKbps(LinuxBrokerHostUsageImpl.java:253) [classes/:?]
	at org.apache.pulsar.broker.loadbalance.impl.LinuxBrokerHostUsageImpl.calculateBrokerHostUsage(LinuxBrokerHostUsageImpl.java:104) [classes/:?]
	at org.apache.pulsar.broker.loadbalance.impl.LinuxBrokerHostUsageImpl.<init>(LinuxBrokerHostUsageImpl.java:90) [classes/:?]
	at org.apache.pulsar.broker.loadbalance.impl.LinuxBrokerHostUsageImpl.<init>(LinuxBrokerHostUsageImpl.java:66) [classes/:?]
	at org.apache.pulsar.broker.loadbalance.SimpleLoadManagerImplTest.testBrokerHostUsage(SimpleLoadManagerImplTest.java:440) [test-classes/:?]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_292]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_292]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_292]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_292]
	at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:132) [testng-7.3.0.jar:?]
	at org.testng.internal.InvokeMethodRunnable.runOne(InvokeMethodRunnable.java:45) [testng-7.3.0.jar:?]
	at org.testng.internal.InvokeMethodRunnable.call(InvokeMethodRunnable.java:73) [testng-7.3.0.jar:?]
	at org.testng.internal.InvokeMethodRunnable.call(InvokeMethodRunnable.java:11) [testng-7.3.0.jar:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_292]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_292]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_292]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_292]

There are some NIC that are of type 1 but that doesn't expose the speed.
The stacktrace is not useful and it's really scary.

Note that since #14252 has been cherry-picked to 2.8, 2.9, 2.10 it's recommended to pick this one as well.

Modifications

  • Check for Invalid argument exception and only log the suggestion message
  • no-need-doc

@nicoloboschi
Copy link
Contributor Author

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Mar 2, 2022
Copy link
Member

@michaeljmarshall michaeljmarshall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should consider failing during broker start up when loadBalancerOverrideBrokerNicSpeedGbps is not configured and the broker cannot determine NIC speed for all NICs (and when load balancing is enabled).

The current design is to frequently log an error while ignoring the NIC. The only real course of action for an operator is to re-configure the broker. It seems like a better course of action to fail on start up so that operators know immediately that they need to fix the configuration instead of finding out some time later when observing the logs.

@mattisonchao
Copy link
Member

@nicoloboschi @michaeljmarshall

I very much agree with Michael, Would you mind thinking about this approach?
If you don't have time, I'd like to do that.

@nicoloboschi
Copy link
Contributor Author

@mattisonchao thanks, please go ahead

@michaeljmarshall
Copy link
Member

@mattisonchao - sure, go ahead. For context, we discussed this issue briefly at the community meeting today and there was consensus that failing on startup is the right behavior.

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm

@nicoloboschi
Copy link
Contributor Author

@mattisonchao @michaeljmarshall please note that some CI environments could have particular NIC and throwing blocking error will make them fail.

@michaeljmarshall
Copy link
Member

@nicoloboschi - that's a great thing to call out. The solution is to configure the loadBalancerOverrideBrokerNicSpeedGbps for the CI env.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
doc-not-needed Your PR changes do not impact docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants