Add a CPU usage check for HAWK/PUMA #12464

alvarocarvajald · 2021-05-03T15:23:32Z

This PR adds a CPU usage check on the ha/check_hawk test module while a client running the ha/hawk_gui test module is interacting with HAWK. It will soft fail with bsc#1179609 (HAWK/PUMA consume a considerable amount of CPU) if HAWK/PUMA CPU usage is over 50%.

Related ticket: https://jira.suse.com/browse/TEAM-2801
Related bugs: https://bugzilla.suse.com/show_bug.cgi?id=1179609 & https://bugzilla.suse.com/show_bug.cgi?id=1179651
Needles: N/A
Verification runs:
15-SP2: node 1, node 2, client, support server
15-SP3: node 1, node 2, client, support server
(failures in 15-SP3 are due to bsc#1184274 and unrelated to this PR)
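
As a rough illustration of the decision described above (not the code in this PR, which lives in the ha/check_hawk module), the soft-fail rule can be sketched as follows. The function name, the dictionary input, and the choice to sum per-process percentages are all assumptions made for the example; only the 50% limit and the bug reference come from the description.

```python
# Illustrative sketch only: names and structure are assumptions,
# not the implementation added by this PR.
CPU_THRESHOLD = 50.0           # percent; the limit named in the PR description
BUG_REFERENCE = "bsc#1179609"  # bug the soft failure points to


def evaluate_cpu_usage(per_process_cpu, threshold=CPU_THRESHOLD):
    """Sum per-process CPU percentages and decide whether to soft fail.

    per_process_cpu maps a process name to its CPU usage in percent.
    Returns a (verdict, message) tuple.
    """
    total = sum(per_process_cpu.values())
    if total > threshold:
        return ("softfail", f"{BUG_REFERENCE}: HAWK/PUMA CPU usage at {total:.1f}%")
    return ("ok", f"HAWK/PUMA CPU usage at {total:.1f}%")
```

Usage exactly at the threshold is treated as passing here, since the description says "over 50%".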

ricardobranco777

I believe this test should be added to hawk_test instead of openQA. It keeps our openQA code simpler and leaves all the tests to the hawk_test suite instead.

ldevulder · 2021-05-03T15:54:36Z

I believe this test should be added to hawk_test instead of openQA. It keeps our openQA code simpler and leaves all the tests to the hawk_test suite instead.

I don't see why this shouldn't be in openQA. FMPOV the test code is not that complicated and, moreover, test code in openQA doesn't have to be simple, it has to be readable, which is the case here. And as most of our test code is in openQA, it is better to have it here IMHO.

alvarocarvajald · 2021-05-03T16:00:32Z

I believe this test should be added to hawk_test instead of openQA. It keeps our openQA code simpler and leaves all the tests to the hawk_test suite instead.

We need to check the CPU usage in the server. hawk_test runs on the client side.

ricardobranco777 · 2021-05-03T16:00:44Z

I don't see why this shouldn't be in openQA. FMPOV the test code is not that complicated and, moreover, test code in openQA doesn't have to be simple, it has to be readable, which is the case here. And as most of our test code is in openQA, it is better to have it here IMHO.

It's much cleaner and easier to add this test (and, for that matter, any other test) to the hawk_test test suite and let openQA just manage the nodes and run the test suite. Having the tests split between openQA and hawk_test is not desirable. It's much better to run all the tests in the test suite.

I vote for leaving in openQA only the tests that can't be run in hawk_test.

ricardobranco777 · 2021-05-03T16:01:39Z

I believe this test should be added to hawk_test instead of openQA. It keeps our openQA code simpler and leaves all the tests to the hawk_test suite instead.

We need to check the CPU usage in the server. hawk_test runs on the client side.

We can run ps through ssh :)
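
A minimal sketch of what running ps through ssh from the client could look like in Python. This is not code from hawk_test; the function names, the keyword filter, and the sample process names in the usage note are assumptions. The `-o pcpu= -o args=` form is used so that ps prints no header line.

```python
import subprocess


def remote_ps_command(host):
    """Build the ssh command that lists CPU usage and command line per process."""
    # Two separate -o options keep both headers empty across ps implementations.
    return ["ssh", host, "ps", "-e", "-o", "pcpu=", "-o", "args="]


def parse_cpu_total(ps_output, keywords=("hawk", "puma")):
    """Add up the %CPU column for processes whose command line matches a keyword."""
    total = 0.0
    for line in ps_output.splitlines():
        fields = line.split(None, 1)  # first column is %CPU, the rest is the command
        if len(fields) != 2:
            continue
        pcpu, args = fields
        if any(keyword in args.lower() for keyword in keywords):
            total += float(pcpu)
    return total


def remote_cpu_total(host):
    """Run ps on the server over ssh and return the summed HAWK/PUMA %CPU."""
    result = subprocess.run(
        remote_ps_command(host), capture_output=True, text=True, check=True
    )
    return parse_cpu_total(result.stdout)
```

Calling `remote_cpu_total("node1")` would need passwordless ssh to a host named `node1`, which is a hypothetical name here. Note that the `pcpu` value reported by procps ps is CPU time divided by process lifetime, so it reacts slowly to short bursts.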

ldevulder · 2021-05-03T16:02:09Z

We can't run ps through ssh :)

And you call this "simpler"?

ricardobranco777 · 2021-05-03T16:02:53Z

We can't run ps through ssh :)

And you call this "simpler"?

We already run commands in hawk_test through ssh, and in Python it is simpler :)

alvarocarvajald · 2021-05-03T16:02:58Z

We need to check the CPU usage in the server. hawk_test runs on the client side.

We can't run ps through ssh :)

Yeah, that sounds like an over-complication. I would vote against that.

ricardobranco777 · 2021-05-03T16:03:46Z

We need to check the CPU usage in the server. hawk_test runs on the client side.

We can't run ps through ssh :)

Yeah, that sounds like an over-complication. I would vote against that.

Sorry: s/can't/can/

ldevulder

LGTM

The requested change would move commands from the server to the client and unnecessarily complicate the solution.

alvarocarvajald · 2021-05-03T16:21:01Z

and in Python it is simpler :)

That's cute. ;)

ricardobranco777 · 2021-05-03T18:22:40Z

I'm proposing adding this test (and any other test on the server) to hawk_test like this:

ricardobranco777/hawk_test#9

Missing:

Handle check_cpu() return value
soft-fail when cpu_total is greater than a configurable(?) value.

Will provide a verification run once we agree on the method to signal the bug.

Benefits:

Having all Hawk test code in hawk_test
We avoid false positives by checking CPU only after each operation. Otherwise false positives may happen on TLS handshakes, etc.

And finally, when it's merged, we should monitor the CPU usage in every test in past SLES versions to arrive at a decent default value for cpu_total.
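
To make the proposal concrete, here is a hedged sketch of a per-operation check. `check_cpu()` and `cpu_total` are the names used in this comment, but their real definitions in ricardobranco777/hawk_test#9 are not shown in this thread, so the versions below are stand-ins with an assumed signature; `run_with_cpu_checks` and the step names are hypothetical as well.

```python
def check_cpu(read_cpu_total, threshold):
    """Stand-in for check_cpu(): probe once, report whether usage is acceptable.

    read_cpu_total is any callable returning the current CPU total in percent.
    Returns (ok, cpu_total).
    """
    cpu_total = read_cpu_total()
    return cpu_total <= threshold, cpu_total


def run_with_cpu_checks(steps, read_cpu_total, threshold=50.0):
    """Run each (name, callable) step, then probe the CPU once after it.

    Returns the (step_name, cpu_total) pairs that exceeded the threshold,
    so the caller can decide how to signal a soft failure.
    """
    soft_failures = []
    for name, step in steps:
        step()
        ok, cpu_total = check_cpu(read_cpu_total, threshold)
        if not ok:
            soft_failures.append((name, cpu_total))
    return soft_failures
```

Keeping the threshold as a parameter matches the "configurable(?) value" item above; how the result is turned into an openQA soft failure is left open, as in the comment.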

juadk

LGTM

ldevulder · 2021-05-04T07:32:13Z

Having all Hawk test code in hawk_test

Then remove all openQA code and execute the test outside of openQA, you will have ALL test code in hawk_test...

We avoid false positives by checking CPU only after each operation. Otherwise false positives may happen on TLS handshakes, etc.

Does a TLS handshake mean a failure from a customer's point of view?

ricardobranco777 · 2021-05-04T07:40:39Z

Having all Hawk test code in hawk_test

Then remove all openQA code and execute the test outside of openQA, you will have ALL test code in hawk_test...

We avoid false positives by checking CPU only after each operation. Otherwise false positives may happen on TLS handshakes, etc.

Does a TLS handshake mean a failure from a customer's point of view?

CPU utilization above 50% can happen during TLS handshakes or any other operation. That's why we should test CPU utilization after each test in Hawk. It's the best way to test for bsc#1179651.

alvarocarvajald · 2021-05-04T09:37:14Z

Does a TLS handshake mean a failure from a customer's point of view?

CPU utilization above 50% can happen during TLS handshakes or any other operation. That's why we should test CPU utilization after each test in Hawk. It's the best way to test for bsc#1179651.

I respectfully disagree. I think this approach actually increases the chance of both false positives and false negatives, and that we effectively reduce it by measuring CPU usage over a longer period of time and working with averages.

Please read the description of the test case in https://confluence.suse.com/pages/viewpage.action?pageId=634290489 (linked on the Jira ticket). It explicitly recommends against taking a single measurement after "stressing" HAWK, which is what the proposed change in hawk_test would do: log in to HAWK -> interact a bit -> log out -> single CPU usage measurement after each test over SSH.

What I have included in this PR is not exactly what is described in Confluence, as the measurements are taken over a longer period of time (roughly 5 to 10 minutes instead of 1 minute), but it covers a longer interaction with HAWK with more measurements.

I understand you're worried that some of those measurements could pick up expected high CPU usage due to TLS handshakes, and that this could lead to false positives, but as you can see from the verification runs, if this is happening, it is being handled by the averaging. CPU usages recorded so far in the verification runs are in the 7.8% to 13.6% range.

I could increase the comparison threshold from 50% to 60% if we agree on this as a second measure to avoid false positives, but IMHO, if we see a test spiking from 13.6% CPU usage to more than 50%, this would require an investigation ... false positive or not.

As a compromise, we could trigger a verification run with both approaches and check how different the results are.
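
To illustrate the argument about averages, here is a small sketch comparing both approaches on two sample series. The numbers are made up for the example and are not taken from the verification runs; the function names are hypothetical.

```python
from statistics import mean

THRESHOLD = 50.0  # percent; the comparison threshold used in this PR


def exceeds_on_average(samples, threshold=THRESHOLD):
    """Averaged check: many samples collected over a longer period."""
    return mean(samples) > threshold


def exceeds_on_single(sample, threshold=THRESHOLD):
    """Single-shot check: one measurement taken at one point in time."""
    return sample > threshold


# Made-up series: one short spike (e.g. a handshake) among low readings.
spiky = [8.0, 9.0, 72.0, 11.0, 10.0, 9.0, 12.0, 8.0]      # mean is 17.375
# Made-up series: sustained high usage that happens to drop at the last sample.
sustained = [65.0, 70.0, 68.0, 66.0, 12.0]                # mean is 56.2
```

With these series, a single measurement landing on the spike in `spiky` would flag a problem while the average would not, and a single final measurement on `sustained` would miss usage that the average catches.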

alvarocarvajald · 2021-05-04T09:44:18Z

I'm proposing adding this test (and any other test on the server) to hawk_test like this:

ricardobranco777/hawk_test#9

Missing:

Handle check_cpu() return value

soft-fail when cpu_total is greater than a configurable(?) value.
Will provide a verification run once we agree on the method to signal the bug.

If we do not get a verification run for this shortly, I propose merging this PR as is, and then rolling back the changes after we determine that both approaches are the same. If we determine that both approaches are complementary (which is my gut feeling ATM), we keep both.

Benefits:

Having all Hawk test code in hawk_test

This is the benefit, as this test could be used by other teams outside of openQA (for example as a CI/CD tool in ClusterLabs repositories) to catch regressions in HAWK more thoroughly.

We avoid false positives by checking CPU only after each operation. Otherwise false positives may happen on TLS handshakes, etc.

See my other message. I think we avoid false positives related to TLS handshakes only, but we may introduce a new set of both false positives and false negatives by taking fewer measurements.

ricardobranco777 · 2021-05-04T10:12:21Z

Please read the description of the test case in https://confluence.suse.com/pages/viewpage.action?pageId=634290489 (linked on the Jira ticket). It explicitly recommends against taking a single measurement after "stressing" HAWK, which is what the proposed change in hawk_test would do: log in to HAWK -> interact a bit -> log out -> single CPU usage measurement after each test over SSH.

I can add a sleep after the logout. From the description of bsc#1179651, once the CPU problem begins it doesn't go away. So I think that just one final measurement after all the Hawk tests would be enough.

I've been reading bug reports about CPU usage in web frameworks for both Python & Ruby (Django and Puma, respectively) and high CPU utilization can also mean 10% CPU usage when idle. Greater than that may happen when too many persistent connections weren't closed on the server:

ricardobranco777 · 2021-05-04T10:20:21Z

See my other message. I think we avoid false positives related to TLS handshakes only, but we may introduce a new set of both false positives and false negatives by taking fewer measurements.

1

I'm proposing adding this test (and any other test on the server) to hawk_test like this:
ricardobranco777/hawk_test#9
Missing:

Handle check_cpu() return value

soft-fail when cpu_total is greater than a configurable(?) value.
Will provide a verification run once we agree on the method to signal the bug.

If we do not get a verification run for this shortly, I propose merging this PR as is, and then rolling back the changes after we determine that both approaches are the same. If we determine that both approaches are complementary (which is my gut feeling ATM), we keep both.

Benefits:

Having all Hawk test code in hawk_test

This is the benefit, as this test could be used by other teams outside of openQA (for example as a CI/CD tool in ClusterLabs repositories) to catch regressions in HAWK more thoroughly.

We avoid false positives by checking CPU only after each operation. Otherwise false positives may happen on TLS handshakes, etc.

See my other message. I think we avoid false positives related to TLS handshakes only, but we may introduce a new set of both false positives and false negatives by taking fewer measurements.

Consider that, according to the bug description, the CPU problem doesn't go away once it begins, so I don't see the possibility of introducing false positives or false negatives, provided that we check the CPU some time after logout.
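
A sketch of the "check some time after logout" idea, assuming the bug behaves as described (usage stays high once the problem starts). The function name and the 60-second settle time are assumptions for illustration; the sleep function is injectable so the logic can be exercised without actually waiting.

```python
import time


def still_busy_after_logout(read_cpu, threshold=50.0, settle_seconds=60.0,
                            sleep=time.sleep):
    """Wait for the server to settle after logout, then take one measurement.

    read_cpu is any callable returning the current CPU usage in percent.
    Returns True when usage is still above the threshold after the wait.
    """
    sleep(settle_seconds)  # give transient load time to die down
    return read_cpu() > threshold
```

If the settle time is long enough, a reading that is still above the threshold is unlikely to come from a transient such as a handshake, which is the premise of the comment above.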

alvarocarvajald · 2021-05-04T14:28:46Z

I can add a sleep after the logout. From the description of bsc#1179651, once the CPU problem begins it doesn't go away. So I think that just one final measurement after all the Hawk tests would be enough.

To detect this specific bug, I agree. To detect other issues related to CPU usage, it may not be enough.

I've been reading bug reports about CPU usage in web frameworks for both Python & Ruby (Django and Puma, respectively) and high CPU utilization can also mean 10% CPU usage when idle. Greater than that may happen when too many persistent connections weren't closed on the server:

Then it may be relevant to check CPU usage against this 10% threshold even before we have the connection from the client.

After the current verification runs finish, I'll see how to add it in this PR.
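
A possible shape for a two-phase check, sketched under assumptions: the 10% idle figure comes from the discussion above and the 50% figure from the PR description, but the function name, the use of averages in both phases, and the return format are invented for the example and may differ from what ends up in the PR.

```python
from statistics import mean

IDLE_THRESHOLD = 10.0  # percent; idle figure mentioned in the discussion
BUSY_THRESHOLD = 50.0  # percent; threshold from the PR description


def check_phases(idle_samples, busy_samples,
                 idle_threshold=IDLE_THRESHOLD, busy_threshold=BUSY_THRESHOLD):
    """Compare average CPU usage per phase against that phase's threshold.

    idle_samples are taken before the client connects, busy_samples while
    the client interacts with HAWK. Returns the names of the failing phases.
    """
    findings = []
    if idle_samples and mean(idle_samples) > idle_threshold:
        findings.append("idle")
    if busy_samples and mean(busy_samples) > busy_threshold:
        findings.append("busy")
    return findings
```

For example, `check_phases([14.0, 16.0], [20.0])` reports only the idle phase, which is the case a single 50% threshold would not catch.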

alvarocarvajald · 2021-05-05T08:37:47Z

Added code to also check CPU usage while HAWK is idle, as suggested by @ricardobranco777, so please @juadk @ldevulder could you review again?

Verification runs are at: node 1, node 2, client & support server

juadk

Good point about also checking the CPU when HAWK is idle. I will keep an eye on the test results in all of the maintained OS versions. LGTM

ldevulder

LGTM

alvarocarvajald requested review from ldevulder, ricardobranco777 and juadk May 3, 2021 15:23

ricardobranco777 previously requested changes May 3, 2021

alvarocarvajald requested a review from ricardobranco777 May 3, 2021 16:00

ldevulder approved these changes May 3, 2021

alvarocarvajald added the WIP Work in progress label May 3, 2021

ricardobranco777 mentioned this pull request May 3, 2021

Add CPU check after each test ricardobranco777/hawk_test#9

Open

juadk approved these changes May 3, 2021

Add CPU usage check when HAWK is idle

00350f8

alvarocarvajald removed the WIP Work in progress label May 5, 2021

juadk self-requested a review May 5, 2021 08:52

juadk approved these changes May 5, 2021

ldevulder approved these changes May 5, 2021

ldevulder merged commit 4654ede into os-autoinst:master May 5, 2021

alvarocarvajald deleted the hawk-cpu-test branch May 5, 2021 12:12
