This repository has been archived by the owner on Dec 9, 2021. It is now read-only.

Change AllowMulticast to spdp #34

Merged
merged 1 commit on Aug 12, 2019

Conversation

rotu
Contributor

@rotu rotu commented Jun 27, 2019

In my testing, topics with reliable durability across a wifi network can only sustain about 30-60 messages per second (on a Ubiquiti AC Pro Router). Disabling multicast allows 800-1000 messages per second.
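
For reference, a rough way to reproduce this kind of measurement with the ROS 2 CLI (a sketch only: the topic name, publish rate, and message type are illustrative, and the exact numbers depend on QoS settings and hardware):

# On the publishing machine (over the network under test):
$ export RMW_IMPLEMENTATION=rmw_opensplice_cpp
$ ros2 topic pub --rate 1000 /chatter std_msgs/msg/String "{data: 'test'}"

# On a second machine across the WiFi link, measure the delivered rate:
$ export RMW_IMPLEMENTATION=rmw_opensplice_cpp
$ ros2 topic hz /chatter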

@rotu
Contributor Author

rotu commented Jun 29, 2019

Point of clarification: this change does not prevent multiple listeners from subscribing to the same topic - it just makes OpenSplice send a separate unicast stream to each of them. This can result in more bandwidth use in some cases, but reduces latency by a large multiple when WiFi is involved. https://tools.ietf.org/id/draft-mcbride-mboned-wifi-mcast-problem-statement-01.html

The downside to this change is an increase in bandwidth use over a wired network when you have a topic with lots of subscribers.

@wjwwood
Member

wjwwood commented Jul 11, 2019

As requested in our triage meeting, this is the issue where we discussed (at length, sorry about that) multicast for data transmission with Fast-RTPS:

ros2/rmw_fastrtps#80

Specifically see this comment:

ros2/rmw_fastrtps#81 (comment)

Which seems to imply Fast-RTPS changed its default behavior in a similar fashion to this issue.

@wjwwood wjwwood removed their assignment Jul 11, 2019
@clalancette
Contributor

The change is clearly XML, and valid for Opensplice, so it is fine from that perspective. It also seems to match what Fast-RTPS is doing, if I understand correctly.

@cwyark , could you maybe comment here and tell us what the pros and cons of this change are?

@cwyark

cwyark commented Jul 17, 2019

@clalancette @rotu It is better to keep opensplice_cmake_module/config/ros_ospl.xml for the general case, e.g. a wired network or inter-process communication. For special cases, e.g. WiFi multicasting, it is suggested to create your own .xml and point to it with an environment variable:

OSPL_URI=file://<path-to-your-custom-xml>

For example:

$> export OSPL_URI=file:///opt/ospl/HDE/x86_64.darwin10_clang/etc/config/ospl.xml
$> ros2 run demo_nodes_cpp talker   # OpenSplice in this process will apply the custom .xml

By doing this, you can flexibly tune OpenSplice performance at the development stage. You can even use different *.xml files for different ROS 2 nodes for better system-wide optimization.
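
For instance, a minimal sketch of that workflow (the install path below is an assumption; adjust it to wherever ros_ospl.xml is installed in your workspace):

# Copy the shipped config somewhere writable:
$ cp ~/ros2_ws/install/opensplice_cmake_module/share/opensplice_cmake_module/config/ros_ospl.xml ~/my_ospl.xml
# Turn off data-plane multicast for a WiFi deployment, keeping multicast for discovery:
$ sed -i 's|<AllowMulticast>true</AllowMulticast>|<AllowMulticast>spdp</AllowMulticast>|' ~/my_ospl.xml
# Point OpenSplice at the custom config for this shell only:
$ export OSPL_URI=file://$HOME/my_ospl.xml
$ ros2 run demo_nodes_cpp talker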

@clalancette
Contributor

@clalancette @rotu It is better to keep opensplice_cmake_module/config/ros_ospl.xml for the general case.

I agree that ros_ospl.xml should be for the general case, but I think that in today's world, the general case has to include wireless.

Can you explain the downsides of using spdp on an ethernet network?

@rotu
Contributor Author

rotu commented Jul 17, 2019

It was a shock to me that, out of the box, OpenSplice was choking on as few as 50 messages per second on a network connection capable of 300 Mbps. My response as a user was "OpenSplice is defective", not "OpenSplice is poorly tuned". It makes sense for the out-of-the-box settings to be written for a first-time user of OpenSplice. As it stands, leaving multicast on is a premature optimization with big drawbacks. It also makes sense that there should be documentation somewhere on how to tune it for best performance, including advice to turn multicast back on where appropriate.

@cwyark

cwyark commented Jul 24, 2019

@clalancette If we change from

<AllowMulticast>true</AllowMulticast>

to

<AllowMulticast>spdp</AllowMulticast>

then messages between DataWriters and DataReaders will be sent as unicast, while multicast is still used for discovery.
For example, suppose there are 1 DataWriter and 3 DataReaders, and the DataWriter wants to publish 1 message to those 3 DataReaders.

If AllowMulticast=true:

  • At the discovery stage: use an IGMP join to the multicast group to find DataReaders.
  • At the message transport stage: the DataWriter only needs to send 1 multicast UDP packet to the multicast group (e.g. 239.255.0.x), and the network forwards the packet to the DataReaders.

If AllowMulticast=spdp:

  • At the discovery stage: use an IGMP join to the multicast group to find DataReaders.
  • At the message transport stage: the DataWriter needs to send 3 unicast UDP packets, one to each of the 3 DataReaders. If there are 50 DataReaders, the DataWriter needs to send 50 UDP packets.

Since a wired network (e.g. an Ethernet switch) can perform real multicast, AllowMulticast=true helps reduce the total bandwidth on the network.
But there is no real multicast on WiFi; every multicast packet is still a broadcast, so too many multicast packets between wireless devices will cause a broadcast storm (too many collisions).

In conclusion:

AllowMulticast=true

  • Pros: reduces bandwidth on a wired network.
  • Cons: broadcast storms on a wireless network.

AllowMulticast=spdp

  • Pros: prevents broadcast storms on a wireless network.
  • Cons: bandwidth grows linearly with the number of DataReaders, on both wired and wireless networks (see the rough illustration below).
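
As a rough illustration (hypothetical numbers): for a topic producing 1 Mbit/s of data with 10 DataReaders on the same segment, AllowMulticast=true keeps the DataWriter's outgoing traffic near 1 Mbit/s, while AllowMulticast=spdp makes it send roughly 10 Mbit/s, one copy per DataReader.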

@clalancette
Contributor

@clalancette If we change from

<AllowMulticast>true</AllowMulticast>

to

<AllowMulticast>spdp</AllowMulticast>

@cwyark Thank you, that is a great explanation of the differences between them.

Given that SPDP is already the default in Fast-RTPS, I think we should do the same for Opensplice. I'm going to run CI on this change and merge, assuming it passes.

@clalancette
Contributor

CI:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@clalancette
Contributor

So, a bunch of the CI warnings are known issues. However, there are enough failing tests with Opensplice to make me suspect this change. For instance, https://ci.ros2.org/job/ci_linux/7643/testReport/junit/projectroot/test_pendulum__rmw_opensplice_cpp/test_pendulum__rmw_opensplice_cpp/ is known to be flaky on aarch64 (see ros2/build_farmer#133), but here it failed on amd64 as well. macOS also has a bunch of failures in the opensplice tests which don't seem to be present in the nightly (https://ci.ros2.org/view/nightly/job/nightly_osx_debug/1321/#showFailuresLink).

Therefore, I don't think we can merge this as-is. @rotu Can you take a look locally and see if you can reproduce/fix the failures? Thanks.

@rotu
Contributor Author

rotu commented Jul 24, 2019

@clalancette
I can't repro the issue. Here's the output on my machine:

$ uname -a
Linux rhea 4.15.0-55-generic #60-Ubuntu SMP Tue Jul 2 18:22:20 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
$ colcon test --packages-select-regex ".*pendulum.*"
Starting >>> pendulum_msgs
Finished <<< pendulum_msgs [0.06s]
Starting >>> pendulum_control
Finished <<< pendulum_control [10.3s]            

Summary: 2 packages finished [10.8s]
$ colcon test-result --verbose
Summary: 51 tests, 0 errors, 0 failures, 4 skipped

It looks like the test is failing in its setup, before it even gets to RMW stuff:

[pendulum_demo-2] Couldn't set scheduling priority and policy: Operation not permitted
[pendulum_demo-2] mlockall failed: Cannot allocate memory
[pendulum_demo-2] Couldn't lock all cached virtual memory.
[pendulum_demo-2] Pagefaults from reading pages not yet mapped into RAM will be recorded.
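
The scheduling-priority and mlockall failures above are host permission/limit problems rather than RMW problems. On a typical Linux box they can be raised via pam_limits (a sketch only; the user name, file name, and values are illustrative, and a re-login is needed afterwards):

# Check the current locked-memory and realtime-priority limits:
$ ulimit -l
$ ulimit -r
# Raise them for your user via pam_limits:
$ echo "your_user - memlock unlimited" | sudo tee -a /etc/security/limits.d/95-realtime.conf
$ echo "your_user - rtprio 98" | sudo tee -a /etc/security/limits.d/95-realtime.conf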

Also, this is mathematically impossible, so I'm guessing the test is broken in other ways:

[pendulum_logger-1] Mean latency: 4458224516038787409289673575093816669551304146650037671236339690561781237058564279746291212376660112936812899278369000229134709514347391053183640763362416971774278560649182152566312699581486525144027387385646723574136033038253156096752904762288198647808.000000 ns
[pendulum_logger-1] Min latency: 1835355439 ns
[pendulum_logger-1] Max latency: 1047684450 ns

@rotu
Contributor Author

rotu commented Jul 24, 2019

Identified one issue: ros2/rosidl#395

@rotu
Contributor Author

rotu commented Jul 24, 2019

Alright, I'm able to sometimes produce a failure. I think the issue is that rttest is not thread-safe. Trying to update the process priority might cover up some of these errors, but rttest shares statistics objects between threads without locking, and there is potential for non-atomic changes to them (I think that's the cause of the weirdness with mean latency).
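
To shake out an intermittent failure like this locally, colcon can re-run the tests until one run fails (a sketch, assuming the workspace is already built; the attempt count is arbitrary):

$ colcon test --packages-select pendulum_control --retest-until-fail 20
$ colcon test-result --verbose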

@rotu
Contributor Author

rotu commented Aug 5, 2019

@clalancette Are you still concerned about the CI status?

@clalancette
Contributor

Here's a new CI on Linux amd64, just to see how things go. If this goes well, I'll do another CI run on macOS as well (I'm reluctant to do it at the moment since our macOS capacity is low):

  • Linux Build Status

In my testing, topics with reliable durability across a wifi network can only sustain about 30-60 messages per second (on a Ubiquiti AC Pro Router). Disabling multicast allows 800-1000 messages per second.

Signed-off-by: Dan Rose <dan@digilabs.io>
@rotu
Contributor Author

rotu commented Aug 6, 2019

@clalancette This needed the changes from 7b46150.
Rebased and pushed. Please rerun CI.

@clalancette
Contributor

  • Linux Build Status

@rotu
Contributor Author

rotu commented Aug 6, 2019

Blech. The same darn test is failing. Does this test tend to pass reliably before this PR?

@clalancette
Contributor

Blech. The same darn test is failing. Does this test tend to pass reliably before this PR?

So this is the thing; looking at https://ci.ros2.org/view/nightly/job/nightly_linux_debug/ , I don't see that test failing for at least the last 10 nightlies (I stopped looking after that). So it does suggest to me that this PR is causing it, but I don't really know why.

@rotu
Contributor Author

rotu commented Aug 6, 2019

I'm also seeing mathematically impossible results. The standard deviation can't be higher than the maximum absolute deviation (max - min).

[pendulum_demo-2] - Min: 21879 ns
[pendulum_demo-2] - Max: 18772095 ns
[pendulum_demo-2] - Mean: 2.56384e+06 ns
[pendulum_demo-2] - Standard deviation: 1.35819e+08

@clalancette
Contributor

I'm also seeing mathematically impossible results. The standard deviation can't be higher than the maximum absolute deviation (max - min).

Yeah, those are clearly bogus (probably uninitialized memory). I took a quick look, and it wasn't obvious that it was uninitialized, but I didn't have time today to really dig into it.

@clalancette
Contributor

Yeah, those are clearly bogus (probably uninitialized memory). I took a quick look, and it wasn't obvious that it was uninitialized, but I didn't have time today to really dig into it.

All right. I think that the mathematically impossible results are because of two different bugs in the pendulum stuff.

  1. There's some potential for uninitialized memory. https://github.com/ros2/demos/blob/master/pendulum_control/src/pendulum_demo.cpp#L249 is getting called at some fixed frequency, but the very first time it calls set_rtt_results_message, there's no guarantee that the results structure has any valid data filled out yet. I'll (eventually) open a PR to initialize results to zero.
  2. There's a type issue between the {min,max}_latency as reported by realtime_support, and how we send it over the network. In https://github.com/ros2/realtime_support/blob/master/rttest/include/rttest/rttest.h#L45 , min and max latency are defined as integers. However, when we go to publish them during the pendulum demo, they get shoved into uint64 (https://github.com/ros2/demos/blob/master/pendulum_msgs/msg/RttestResults.msg#L8). Having latency be able to be negative is a bit weird, but seems to be intended by https://github.com/ros2/realtime_support/blob/master/rttest/src/rttest.cpp#L224 . I have another local change to switch that to an int64.

With both of those fixes in place, I now get to a place where I sometimes see min_latency as 2^31-1 and max_latency as -2^31 on the very first iteration, which should be consistent even if it is somewhat weird. We probably still need to rethink what is going on there, since that really shouldn't report anything unless there is something to report.

I'll continue to poke at the bugs in the pendulum test, but I highly doubt the fixes will have any impact on these test failures.

@clalancette
Contributor

See ros2/realtime_support#81 and ros2/demos#385 for some fixes to the pendulum_control demos.

@rotu
Contributor Author

rotu commented Aug 10, 2019

Also, if you know a way to bring up the entire test history for "projectroot.test_pendulum__rmw_opensplice_cpp.test_pendulum__rmw_opensplice_cpp", I'd like to know. I don't understand Jenkins' interface at all

@clalancette
Contributor

Hm, interesting that it is failing on CI jobs, but not on the nightlies. Maybe it is pretty flaky on amd64 as well, and the nightlies are just lucky? I'm not sure.

Also, if you know a way to bring up the entire test history for "projectroot.test_pendulum__rmw_opensplice_cpp.test_pendulum__rmw_opensplice_cpp", I'd like to know. I don't understand Jenkins' interface at all

The Jenkins UI is a mystery to me too, so I doubt I'll be a help here.

That all being said, we can find out for sure whether my changes to the pendulum_control code make a difference here. I'm going to launch CI with this change plus those changes to see what happens.

@clalancette
Contributor

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@rotu
Contributor Author

rotu commented Aug 12, 2019

I’m pleasantly surprised it passed! Judging by the output, it looks like aarch64’s failures are also due to uninitialized memory.

@clalancette
Contributor

I’m pleasantly surprised it passed! Judging by the output, it looks like aarch64’s failures are also due to uninitialized memory.

That....shouldn't have happened with my latest patches. Those patches make it so that no data is published until it has been initialized at least once. That suggests that there is another uninitialized piece of memory in there, but I didn't see any of them when I was looking. Also, I didn't see those problems when I ran CI on my other patches, but maybe it is random.

In any case, I think it is clear that this PR isn't causing the problem (maybe just exacerbating it). I'm going to merge this one, then get my other fixes reviewed and merged. There is probably additional follow-up work to do on it then.

@clalancette
Contributor

clalancette commented Aug 13, 2019

I had a dumb bug in my patches to fix the uninitialized memory. I'm re-running CI there (which will obviously include this change), but I'm fairly sure that will fix the problem now.

rotu added a commit to RoverRobotics-forks/rosidl_typesupport_opensplice that referenced this pull request Aug 30, 2019
In my testing, topics with reliable durability across a wifi network can only sustain about 30-60 messages per second (on a Ubiquiti AC Pro Router). Disabling multicast allows 800-1000 messages per second.

Signed-off-by: Dan Rose <dan@digilabs.io>
@rotu rotu mentioned this pull request Aug 30, 2019
clalancette pushed a commit that referenced this pull request Sep 3, 2019
In my testing, topics with reliable durability across a wifi network can only sustain about 30-60 messages per second (on a Ubiquiti AC Pro Router). Disabling multicast allows 800-1000 messages per second.

Signed-off-by: Dan Rose <dan@digilabs.io>