Flex - powershell.exe 100% CPU usage #349

Open
froazin opened this issue Mar 31, 2022 · 0 comments
Labels
triage/pending Issue or PR is pending for triage and prioritization.

Comments

froazin commented Mar 31, 2022

Description

We have picked up on an issue whereby the instantiation of powershell.exe in Windows Server environments by nri-flex.exe is consuming 100% CPU for the duration of the powershell.exe process. This is preventing other critical processes from running during this time window as the powershell.exe process seems to be given priority over other processes. However, when inspected, the powershell.exe process is running with 'normal' priority on the affected system.

Note: This behaviour is observed on servers of various performance specifications. However, it is only a problem on servers that are either already under high load or only have one or two CPUs. Allocated memory does not seem to be a significant factor.

Work Around 1

Using the Windows Services Flex example, a quick and dirty work around seems to be adding a “(Get-Process -Id $pid).PriorityClass = 'Idle';” onto the start of the flex commands like so:

integrations:
  - name: nri-flex
    config:
      name: winServiceStatus
      lookup_file: "C:/Program Files/New Relic/newrelic-infra/integrations.d/flexAssets/windows-service-status-filtered-lookup.json"
      apis:
        - event_type: winServiceStatus
          shell: powershell
          commands:
            - run: "(Get-Process -Id $pid).PriorityClass = 'Idle'; Get-Service | Select-Object -Property {..........}"

The CPU usage still spikes to 100% but then gracefully gives way to other processes running on the host when needed. However, this isn’t ideal as a long-term fix. After a few stress tests, it seems that the Flex integration will time out before returning any data when the server is under load. While it might be possible to increase the timeout window on Flex integrations, doing so carries an inherent risk: the necessary increase in the Flex interval may lead to critical monitoring data being missed in more sensitive environments.
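
For reference, a minimal sketch of what raising the timeout might look like. This assumes the agent-level `timeout` setting and the Flex command-level `timeout` (in milliseconds) behave as I understand them from the documentation. All three numeric values are illustrative and untested, and the elided property list is left as in the example above:

```yaml
integrations:
  - name: nri-flex
    interval: 300s     # assumed value: a longer interval leaves room for slow runs
    timeout: 240s      # assumed value: agent-level limit for the whole integration run
    config:
      name: winServiceStatus
      lookup_file: "C:/Program Files/New Relic/newrelic-infra/integrations.d/flexAssets/windows-service-status-filtered-lookup.json"
      apis:
        - event_type: winServiceStatus
          shell: powershell
          commands:
            - run: "(Get-Process -Id $pid).PriorityClass = 'Idle'; Get-Service | Select-Object -Property {..........}"
              timeout: 60000   # assumed value, in milliseconds: per-command limit
```

The trade-off described above still applies: the larger these values get, the longer a stopped service can go unreported.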

Work Around 2

This may be a separate issue in its own right, but it also works as a workaround. nri-flex.exe seems to run multiple consecutive powershell.exe processes, one after the other, which contributes to the issue because starting and stopping many processes in a short window has a cost of its own.

Using the Windows Services example Flex again: it is possible to reference a lookup.json file directly from inside a PowerShell script and then, using a JSON array parsed with ConvertFrom-Json, batch all Windows Service status checks into one powershell.exe process rather than launching several. This can optionally be combined with Work Around 1, like so:


newrelic-infra-flex-windows-services.yml

integrations:
  - name: nri-flex
    interval: 300s
    config:
      name: winServiceStatus
      apis:
        - event_type: winServiceStatus
          shell: powershell
          commands:
            - run: "& \"C:/Program Files/New Relic/newrelic-infra/integrations.d/scripts/Get-WindowsServiceStatus.ps1\" -ConfigFilePath \"C:/Program Files/New Relic/newrelic-infra/integrations.d/json/Get-WindowsServiceStatus-Config.json\""

Get-WindowsServiceStatus.ps1

[CmdletBinding()]
param (
    [Parameter(
        Mandatory = $true,
        ValueFromPipeline = $true,
        ValueFromPipelineByPropertyName = $true
    )]
    [string]
    $ConfigFilePath
)

begin {
    # import config; ConvertFrom-Json returns a PSCustomObject, not a hashtable
    $Config = Get-Content -Path $ConfigFilePath -Raw -ErrorAction Stop | ConvertFrom-Json

    # set script priority and confirm the change took effect (up to 4 attempts)
    $priority = $Config.PriorityClass.ToString()
    for ($i = 0; $i -le 3; $i++) {
        (Get-Process -Id $pid).PriorityClass = $priority
        [System.Threading.Thread]::Sleep(50)

        if ((Get-Process -Id $pid).PriorityClass -like $priority) {
            break
        }
        if ($i -eq 3) {
            # priority could not be applied; exit without emitting any data
            exit $null
        }
    }

    # check servicename is not empty
    if (-not $Config.ServiceName) {
        exit $null
    }

    # prepare output variable (empty ArrayList)
    $out = [System.Collections.ArrayList]@()
}

process {
    # collect information on provided windows service names
    $services = Get-Service -Name $Config.ServiceName -ErrorAction SilentlyContinue

    # construct output for each service collected
    foreach ($service in $services) {
        # Add() appends in place; [void] suppresses the index it returns
        [void]$out.Add([PSCustomObject]@{
            ServiceName = $service.ServiceName.ToString()
            ServiceDisplayName = $service.DisplayName.ToString()
            ServiceStatus = $service.Status.ToString()
        })
    }
}

end {
    return $out | ConvertTo-Json -Compress
} 

Get-WindowsServiceStatus-Config.json

{
    "ServiceName":[
        "Service1",
        "Service2",
        "Service3",
        "Service4"
    ],

    "PriorityClass":"Idle"
}

Though this approach still doesn’t prevent PowerShell from consuming 100% CPU, it does drastically reduce the amount of time spent at 100%. When combined with the first workaround, it also gives way gracefully to other processes running on the host.

It doesn't seem to be possible to reference a JSON array using the usual ${lf:variable} approach. When tested, this seems to concatenate the items listed in the JSON array into one long string. As such, it is necessary in this approach to read the JSON directly within the script.

Risks

For both work arounds, more sensitive environments risk missing data due to increased timeout windows, and subsequent delays in alerting time. This becomes even more mission critical when generated alerts are used as triggers for automated remediation processes or other time sensitive tasks.

Expected Behavior

Instantiated powershell.exe processes triggered by nri-flex.exe should gracefully give way to critical processes running on the host, in such a manner that the Flex integration doesn't risk timing out during periods of prolonged high load and generating false positive alerts.

Troubleshooting or NR Diag results

See workarounds above.

Steps to Reproduce

  1. Install the New Relic Infrastructure agent on a Windows Server 2012/2019 VM with one or two vCPUs. (Confirmed on agent versions 1.23.1 and 1.24.0; older versions have not yet been checked.)
  2. Copy down and configure any of the Windows PowerShell related Flex integrations. (Ones with associated lookup files and multiple listed items in the lookup file are the best illustration of the issue.)
  3. Run and observe CPU spikes of 100% and a significant drop in performance on the host for the duration of the spike.

Your Environment

  • Agent version 1.23.1 & 1.24.0 confirmed.
  • Windows Server VMs running on VMWare Infrastructure.
  • Observed on all CPU core counts, but only really problematic where one or two CPUs are available to the OS. There are exceptions: we have had problems on hosts with as many as 4 vCPUs.
  • Windows Server Datacenter 2012 R2 through Windows Server Datacenter 2019.

Additional context

Was directed to this issue board by a New Relic representative.

@davidgit davidgit added the triage/pending Issue or PR is pending for triage and prioritization. label Apr 7, 2022

2 participants