Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a CPU usage check for HAWK/PUMA #12464

Merged
merged 2 commits into from May 5, 2021

Conversation

alvarocarvajald
Copy link
Contributor

This PR adds a CPU usage check on the ha/check_hawk test module while a client running the ha/hawk_gui test module is interacting with HAWK. It will soft fail with bsc#1179609 (HAWK/PUMA consume a considerable amount of CPU) if HAWK/PUMA CPU usage is over 50%.

This adds a CPU usage check on the ha/check_hawk test module while a
client running the ha/hawk_gui test module is interacting with HAWK. It
will soft fail with bsc#1179609 (HAWK/PUMA consume a considerable amount
of CPU) if HAWK/PUMA CPU usage is over 50%.
Copy link
Contributor

@ricardobranco777 ricardobranco777 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this test should be added to hawk_test instead of openQA. It keeps our openQA code simpler and leaves all the tests to the hawk_test suite instead.

@ldevulder
Copy link
Member

I believe this test should be added to hawk_test instead of openQA. It keeps our openQA code simpler and leaves all the tests to the hawk_test suite instead.

I don't see why this shouldn't be in openQA. FMPOV the test code is not so complicated and moreover the test code in openQA doesn't have to be simple, it has to be readable which is the case here. And as most of our test codes are in openQA is better to have it here IMHO.

@alvarocarvajald
Copy link
Contributor Author

I believe this test should be added to hawk_test instead of openQA. It keeps our openQA code simpler and leaves all the tests to the hawk_test suite instead.

We need to check the CPU usage in the server. hawk_test runs on the client side.

@ricardobranco777
Copy link
Contributor

I don't see why this shouldn't be in openQA. FMPOV the test code is not so complicated and moreover the test code in openQA doesn't have to be simple, it has to be readable which is the case here. And as most of our test codes are in openQA is better to have it here IMHO.

It's much cleaner and easier to add this test (and, for that matter, any other test) to the hawk_test test suite and let openQA just manage the nodes and run the test suite. Having the tests split between openQA and hawk_test is not desirable. It's much better to run all the tests in the test suite.

I vote for leave to openQA only the tests that can't be run on hawk_test.

@ricardobranco777
Copy link
Contributor

ricardobranco777 commented May 3, 2021

I believe this test should be added to hawk_test instead of openQA. It keeps our openQA code simpler and leaves all the tests to the hawk_test suite instead.

We need to check the CPU usage in the server. hawk_test runs on the client side.

We can run ps through ssh :)

@ldevulder
Copy link
Member

We can't run ps through ssh :)

And you call this "simpler"?

@ricardobranco777
Copy link
Contributor

We can't run ps through ssh :)

And you call this "simpler"?

We already run commands in hawk_test through ssh and in Python is simpler :)

@alvarocarvajald
Copy link
Contributor Author

We need to check the CPU usage in the server. hawk_test runs on the client side.

We can't run ps through ssh :)

Yeah, that sounds like an over-complication. I would vote against that.

@ricardobranco777
Copy link
Contributor

We need to check the CPU usage in the server. hawk_test runs on the client side.

We can't run ps through ssh :)

Yeah, that sounds like an over-complication. I would vote against that.

Sorry: s/can't/can/

Copy link
Member

@ldevulder ldevulder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@alvarocarvajald alvarocarvajald dismissed ricardobranco777’s stale review May 3, 2021 16:09

Change requested would move commands from the server to the client and unnecessarily complicate the solution.

@alvarocarvajald
Copy link
Contributor Author

and in Python is simpler :)

That's cute. ;)

@ricardobranco777
Copy link
Contributor

ricardobranco777 commented May 3, 2021

I'm proposing adding this test (and any other test on the server) to hawk_test like this:

ricardobranco777/hawk_test#9

Missing:

  • Handle check_cpu() return value
  • soft-fail when cpu_total is greater than a configurable(?) value.

Will provide a verification run once we agree on the method to signal the bug.

Benefits:

  • Having all Hawk test code in hawk_test
  • We avoid false positives by checking CPU only after each operation. Otherwise false positives may happen on TLS handshakes, etc.

And finally, when it's merged, we should monitor the CPU usage in every test in past SLES versions to arrive at a decent default value for cpu_total.

Copy link
Contributor

@juadk juadk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ldevulder
Copy link
Member

  • Having all Hawk test code in hawk_test

Then remove all openQA code and execute the test outside of openQA, you will have ALL test code in hawk_test...

  • We avoid false positives by checking CPU only after each operation. Otherwise false positives may happen on TLS handshakes, etc.

A TLS handshakes means a failure from a customer point of view?

@ricardobranco777
Copy link
Contributor

  • Having all Hawk test code in hawk_test

Then remove all openQA code and execute the test outside of openQA, you will have ALL test code in hawk_test...

  • We avoid false positives by checking CPU only after each operation. Otherwise false positives may happen on TLS handshakes, etc.

A TLS handshakes means a failure from a customer point of view?

+50% CPU utilization can happen with TLS handshakes or any other operation. That's why we should test CPU utilization after each test in Hawk. It's the best way to test bsc#1179651

@alvarocarvajald
Copy link
Contributor Author

A TLS handshakes means a failure from a customer point of view?

+50% CPU utilization can happen with TLS handshakes or any other operation. That's why we should test CPU utilization after each test in Hawk. It's the best way to test bsc#1179651

I respectfully disagree. I think in this manner we are actually increasing the possibilities of false positives and negatives both, and we effectively reduce that by measuring the CPU usage during a longer period of time and working on averages.

Please read the description of the test case in https://confluence.suse.com/pages/viewpage.action?pageId=634290489 (linked on the Jira ticket). It explicitly recommends against taking a single measurement after "stressing" HAWK, which is what the proposed change in hawk_test would do: login to HAWK -> interact a bit -> logout -> single CPU usage measurement after each test over SSH.

What I have included in this PR is not exactly the same as described in confluence as the measurements are being taken in a longer period of time (more or less 5 to 10 minutes, instead of 1 minute), but it is covering with more measurements a longer interaction with HAWK.

I understand you're worried that on some of those measurements we could pick expected high CPU usage due to TLS hand-shakes, and that this could lead to false positives, but as you can see from the verification runs, if this is happening, it is being handled by the average operation. Recorded CPU usages seen so far in the verification runs are in the 7.8% to 13.6% range.

I could increase the comparison threshold from 50% to 60% if we agree on this as a second measure to avoid false positives, but IMHO, if we see a test spiking from 13.6% CPU usage to more than 50%, this would require an investigation ... false positive or not.

As a compromise we could trigger a verification run with both approaches and check how different are the results.

@alvarocarvajald
Copy link
Contributor Author

alvarocarvajald commented May 4, 2021

I'm proposing adding this test (and any other test on the server) to hawk_test like this:

ricardobranco777/hawk_test#9

Missing:

  • Handle check_cpu() return value
  • soft-fail when cpu_total is greater than a configurable(?) value.
    Will provide a verification run once we agree on the method to signal the bug.

If we will not get a verification run for this shortly, I propose merging this PR as is, and then rolling back the changes after we determine that both approaches are the same. If we determine that both approaches are complimentary (which is my gut feeling ATM), we keep both.

Benefits:

  • Having all Hawk test code in hawk_test

This is the benefit, as this test could be used by other teams outside of openQA (for example as a CI/CD tool in ClusterLabs repositories) to check regressions to HAWK more thoroughly.

We avoid false positives by checking CPU only after each operation. Otherwise false positives may happen on TLS handshakes, etc.

See my other message. I think we avoid false positives related to TLS hand-shakes only, but we may introduce a new set of both false positives and false negatives by taking fewer measurements.

@ricardobranco777
Copy link
Contributor

Please read the description of the test case in https://confluence.suse.com/pages/viewpage.action?pageId=634290489 (linked on the Jira ticket). It explicitly recommends against taking a single measurement after "stressing" HAWK, which is what the proposed change in hawk_test would do: login to HAWK -> interact a bit -> logout -> single CPU usage measurement after each test over SSH.

I can add a sleep after the logout. From the description of bsc#1179651, once the CPU problem begins it doesn't go away. So I think that just one final measurement after all the Hawk tests would be enough.

I've been reading bug reports about CPU usage in web frameworks for both Python & Ruby (Django and Puma, respectively) and high CPU utilization can also mean 10% CPU usage when idle. Greater than that may happen when too many persistent connections weren't closed on the server:

@ricardobranco777
Copy link
Contributor

See my other message. I think we avoid false positives related to TLS hand-shakes only, but we may introduce a new set of both false positives and false negatives by taking fewer measurements.

 1

I'm proposing adding this test (and any other test on the server) to hawk_test like this:
ricardobranco777/hawk_test#9
Missing:

  • Handle check_cpu() return value
  • soft-fail when cpu_total is greater than a configurable(?) value.
    Will provide a verification run once we agree on the method to signal the bug.

If we will not get a verification run for this shortly, I propose merging this PR as is, and then rolling back the changes after we determine that both approaches are the same. If we determine that both approaches are complimentary (which is my gut feeling ATM), we keep both.

Benefits:

  • Having all Hawk test code in hawk_test

This is the benefit, as this test could be used by other teams outside of openQA (for example as a CI/CD tool in ClusterLabs repositories) to check regressions to HAWK more thoroughly.

We avoid false positives by checking CPU only after each operation. Otherwise false positives may happen on TLS handshakes, etc.

See my other message. I think we avoid false positives related to TLS hand-shakes only, but we may introduce a new set of both false positives and false negatives by taking fewer measurements.

Consider that according to the bug description, the CPU problem doesn't go away once it begins, so I don't see the possibility of introducing false positives or false negatives, provided that we check the CPU some time after logout

@alvarocarvajald
Copy link
Contributor Author

I can add a sleep after the logout. From the description of bsc#1179651, once the CPU problem begins it doesn't go away. So I think that just one final measurement after all the Hawk tests would be enough.

To detect this specific bug, I agree. To detect other issues related to CPU usage it can be lacking.

I've been reading bug reports about CPU usage in web frameworks for both Python & Ruby (Django and Puma, respectively) and high CPU utilization can also mean 10% CPU usage when idle. Greater than that may happen when too many persistent connections weren't closed on the server:

Then it may be relevant to check CPU usage against this 10% threshold even before we have the connection from the client.

After the current verification runs finish, I'll see how to add it in this PR.

@alvarocarvajald
Copy link
Contributor Author

Added code to also check CPU usage while HAWK is idle, as suggested by @ricardobranco777, so please @juadk @ldevulder could you review again?

Verfication runs are at: node 1, node 2, client & support server

@alvarocarvajald alvarocarvajald removed the WIP Work in progress label May 5, 2021
@juadk juadk self-requested a review May 5, 2021 08:52
Copy link
Contributor

@juadk juadk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point also checking the CPU when hawk is idle, I will keep an eye on the test results in all of the maintained OS versions. LGTM

Copy link
Member

@ldevulder ldevulder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ldevulder ldevulder merged commit 4654ede into os-autoinst:master May 5, 2021
@alvarocarvajald alvarocarvajald deleted the hawk-cpu-test branch May 5, 2021 12:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants