check-vmware | check_vmware_datastore_performance plugin


Overview

Nagios plugin used to monitor datastore performance.

In addition to reporting current datastore performance details, this plugin also reports which VMs reside on the datastore along with their percentage of the total datastore space used. This is intended to help pinpoint potential causes of high latency at a glance.

Output

The output for these plugins is designed to provide the one-line summary needed by Nagios for quick identification of a problem while providing longer, more detailed information for display within the web UI and for use in email and Teams notifications (atc0005/send2teams).

Requirements

This plugin requires that the Statistics Collection setting (part of Storage I/O Control) for a monitored datastore be enabled. If it is not, this plugin is unable to evaluate performance for a specified datastore. This plugin attempts to detect and report this condition so that vSphere administrators can assist with enabling this feature.

NOTE: Changing this setting requires elevated privileges in the vSphere environment.

The privileges needed to perform normal sysadmin duties (creating VMs, moving VMs, deleting VMs, uploading/downloading files from the datastore, etc.) are not sufficient to change this setting. If you have a dedicated team that manages your virtual environment you will need to contact them to have this setting changed for every datastore you wish to monitor with this plugin.

To help with locating datastores in need of adjustment, the following PowerCLI snippet may be used:

$credential = Get-Credential -Message "Enter your credentials (DOMAIN\ID)"
$server = Connect-VIServer -Server vc1.example.com -Credential $credential

Get-View -ViewType Datastore |
    Where-Object {$_.IormConfiguration.StatsCollectionEnabled -eq $false} |
    Select-Object -Property Name, @{Label="StatsCollectionEnabled"; Expression={$_.IormConfiguration.StatsCollectionEnabled}} |
    Sort-Object -Property Name

Disconnect-VIServer $server

Available settings for Storage I/O Control:

  • Disabled
  • Statistics enabled but Storage I/O disabled
  • Statistics and Storage I/O enabled

Stability of this plugin

NOTE: This plugin uses the QueryDatastorePerformanceSummary() method provided by the StorageResourceManager Managed Object. While available since vSphere API 5.1, this API is marked as experimental (and subject to change/removal):

This is an experimental interface that is not intended for use in production code.

In addition to using the experimental QueryDatastorePerformanceSummary() API, this plugin uses the deprecated statsCollectionEnabled property from the StorageIORMInfo Data Object to determine whether Statistics Collection is enabled for a datastore. Using the prescribed enabled property for that Data Object to determine Statistics Collection does not work.

If you use this plugin, please provide feedback by opening a new discussion thread.

Performance Data

Background

Initial support has been added for emitting Performance Data / Metrics, but refinement suggestions are welcome.

Consult the list below for the metrics implemented thus far, the original discussion thread and the Add Performance Data / Metrics support project board for an index of the initial implementation work.

Please add to an existing Discussion thread or open a new one with any feedback that you may have. Thanks in advance!

How datastore performance metrics are evaluated

Performance metrics are provided by vSphere in aggregated quantiles over a period of time (intervals). Aggregated metrics correspond with a specific percentile. As of this plugin's initial development, vSphere provides metrics associated with these percentiles:

  • 90
  • 80
  • 70
  • 60
  • 50

If not otherwise specified, percentile 90 is used to evaluate datastore performance metrics. While the vSphere API provides metrics in multiple intervals (one active, up to seven historical), only the active interval is used for evaluating current datastore performance.
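The selection described above can be sketched as follows. This is a minimal illustration in Python, not the plugin's implementation: the `active_interval` dictionary layout and the `select_percentile_metrics` helper are hypothetical, and the sample values are made up. The only details taken from this document are the list of percentiles and the default of 90.

```python
# Hypothetical in-memory shape for the active interval; the real vSphere
# response is richer and is not reproduced here.
DEFAULT_PERCENTILE = 90


def select_percentile_metrics(active_interval, percentile=DEFAULT_PERCENTILE):
    """Return the metric values recorded for one percentile.

    Each metric maps to a list of values whose positions line up with
    active_interval["percentiles"].
    """
    percentiles = active_interval["percentiles"]
    if percentile not in percentiles:
        raise ValueError(f"percentile {percentile} not present in interval")
    idx = percentiles.index(percentile)
    return {
        name: values[idx]
        for name, values in active_interval.items()
        if name != "percentiles"
    }


# Made-up sample values for illustration only.
active = {
    "percentiles": [90, 80, 70, 60, 50],
    "read_latency": [12.0, 9.5, 7.0, 5.5, 4.0],
    "write_latency": [18.0, 14.0, 10.0, 8.0, 6.0],
    "vm_latency": [20.0, 15.0, 11.0, 9.0, 7.0],
}

p90 = select_percentile_metrics(active)
```

Historical intervals would simply never be passed to such a helper, since only the active interval takes part in threshold evaluation.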

There is a brief window between the end of the current interval and the start of the new active interval during which no metrics are available for the active interval. Testing shows that this window lasts approximately 30 minutes. By design, the plugin omits performance data latency metrics when no metrics are available, in an attempt to prevent skewing historical data already collected.

This plugin accepts flags to:

  • specify individual latency metric thresholds (e.g., read latency CRITICAL, read latency WARNING, write latency ...)
  • specify percentile sets
    • multiple sets supported, each composed of a percentile and pairs of CRITICAL and WARNING threshold values

If you specify a percentile set, the plugin will not accept individual latency threshold flags. The reverse is also true: specifying one or more latency threshold flags is incompatible with specifying one or more percentile sets.

By specifying multiple percentile sets, you are indicating that crossing the thresholds of any one set is enough to trigger a state change.
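To make the "any one set" rule concrete, here is a small Python sketch. It is an illustration under stated assumptions, not the plugin's code: the `PercentileSet` type and the helper names are hypothetical, the field order follows the `P,RLW,RLC,WLW,WLC,VMLW,VMLC` format documented for the percentile set flag, and treating a value equal to a threshold as crossing it is an assumption.

```python
from dataclasses import dataclass

OK, WARNING, CRITICAL = 0, 1, 2


@dataclass(frozen=True)
class PercentileSet:
    percentile: int
    read_warn: float
    read_crit: float
    write_warn: float
    write_crit: float
    vm_warn: float
    vm_crit: float


def parse_percentile_set(raw):
    """Parse one 'P,RLW,RLC,WLW,WLC,VMLW,VMLC' string; all 7 fields are required."""
    fields = [f.strip() for f in raw.split(",")]
    if len(fields) != 7:
        raise ValueError(f"expected 7 comma-separated fields, got {len(fields)}")
    return PercentileSet(int(fields[0]), *(float(f) for f in fields[1:]))


def evaluate_set(pset, metrics):
    """Return the worst state for one set, given that percentile's latencies."""
    checks = [
        (metrics["read_latency"], pset.read_warn, pset.read_crit),
        (metrics["write_latency"], pset.write_warn, pset.write_crit),
        (metrics["vm_latency"], pset.vm_warn, pset.vm_crit),
    ]
    state = OK
    for value, warn, crit in checks:
        # Assumption: reaching a threshold counts as crossing it.
        if value >= crit:
            state = max(state, CRITICAL)
        elif value >= warn:
            state = max(state, WARNING)
    return state


def evaluate_sets(psets, metrics_by_percentile):
    """Crossing the thresholds of any one set is enough to change state."""
    return max(evaluate_set(p, metrics_by_percentile[p.percentile]) for p in psets)


# Made-up sample values for illustration only.
sets = [
    parse_percentile_set("90,15,30,15,30,15,30"),
    parse_percentile_set("50,5,10,5,10,5,10"),
]
metrics = {
    90: {"read_latency": 12.0, "write_latency": 14.0, "vm_latency": 13.0},
    50: {"read_latency": 4.0, "write_latency": 6.0, "vm_latency": 4.5},
}
overall = evaluate_sets(sets, metrics)
```

In this sample the 90th percentile set stays within bounds, but the write latency at the 50th percentile reaches that set's WARNING threshold, so the overall result is WARNING.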

Omitted metrics

This plugin emits Nagios performance data metrics for each percentile in the active interval whose metrics are not all 0. Any percentile whose metrics are all 0 is omitted from the performance data metrics collected and emitted by the plugin.

Please provide feedback by opening a new issue if you find that this decision causes problems with gathering metrics.
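The omission rule can be sketched in Python as shown below. This is only an illustration: the `build_perfdata` helper and the dictionary layout are hypothetical, the metric labels follow the `p*_...` naming pattern used for this plugin's metrics, and the use of `ms` as the unit suffix in the emitted entries is an assumption based on the standard Nagios `'label'=value[UOM]` perfdata convention.

```python
def build_perfdata(active_interval):
    """Return perfdata entries, skipping percentiles whose metrics are all 0."""
    # Assumption: latency entries carry an "ms" unit suffix, rate entries none.
    metric_units = {
        "read_latency": "ms",
        "write_latency": "ms",
        "vm_latency": "ms",
        "read_iops": "",
        "write_iops": "",
    }
    entries = []
    for idx, percentile in enumerate(active_interval["percentiles"]):
        values = {name: active_interval[name][idx] for name in metric_units}
        if all(v == 0 for v in values.values()):
            continue  # nothing recorded for this percentile, so omit it
        for name, value in values.items():
            entries.append(f"'p{percentile}_{name}'={value}{metric_units[name]}")
    return entries


# Made-up sample values: percentile 80 has no data and is therefore omitted.
interval = {
    "percentiles": [90, 80],
    "read_latency": [12.5, 0],
    "write_latency": [18.0, 0],
    "vm_latency": [20.0, 0],
    "read_iops": [150.0, 0],
    "write_iops": [75.0, 0],
}
perfdata = build_perfdata(interval)
```

A percentile with at least one non-zero metric is emitted in full, including its zero-valued metrics; only percentiles that are entirely 0 are dropped.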

See the main project README for details.

Supported metrics

NOTE: These metrics are based on the visibility of the service account used to login to the target VMware environment. If the service account cannot see a resource, it cannot evaluate the resource.

| Metric | Unit of Measurement | Description |
| --- | --- | --- |
| `time` | milliseconds | plugin runtime |
| `vms` | | all (visible) virtual machines in the inventory |
| `vms_powered_on` | | virtual machines powered on |
| `vms_powered_off` | | virtual machines powered off |
| `p*_read_latency` | milliseconds | aggregated datastore latency for read operations |
| `p*_write_latency` | milliseconds | aggregated datastore latency for write operations |
| `p*_vm_latency` | milliseconds | aggregated datastore latency as observed by VirtualMachines using the datastore |
| `p*_read_iops` | reads per second | aggregated datastore read I/O rate |
| `p*_write_iops` | writes per second | aggregated datastore write I/O rate |

NOTE: * is a placeholder for 90, 80, 70, 60 & 50 percentiles.

Optional evaluation

Some plugins provide optional support to limit evaluation of VMs to specific Resource Pools (explicitly including or excluding) and power states (on or off). Other plugins support similar filtering options (e.g., Acknowledged state of Triggered Alarms). See the configuration options, examples and contrib sections for more information.

Installation

See the main project README for details.

Configuration options

Threshold calculations

TODO: Research & note why metric sets might contain all values of 0.

| Nagios State | Description |
| --- | --- |
| OK | Ideal state. Datastore performance is within bounds for the active interval for the chosen percentile(s). |
| UNKNOWN | Datastore performance metric sets are all value 0 or metrics collection for a datastore is disabled. |
| WARNING | Datastore performance crossed user-specified latency thresholds for this state. |
| CRITICAL | Datastore performance crossed user-specified latency thresholds for this state. |
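The state table can be read as a simple decision order, sketched here in Python. The exit codes are the standard Nagios plugin values (0, 1, 2, 3); everything else is an illustrative assumption: the `determine_state` helper, the dictionary shapes, the order of the checks, and the choice to treat a value equal to a threshold as crossing it. The sketch also ignores the flag that changes how missing metrics are handled.

```python
# Standard Nagios plugin exit codes.
EXIT_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def determine_state(stats_enabled, latencies, thresholds):
    """Map latencies for the chosen percentile to a Nagios state name."""
    # Disabled statistics collection or all-zero metrics cannot be evaluated.
    if not stats_enabled or all(v == 0 for v in latencies.values()):
        return "UNKNOWN"
    # Assumption: reaching a threshold counts as crossing it.
    if any(latencies[m] >= thresholds[m]["critical"] for m in latencies):
        return "CRITICAL"
    if any(latencies[m] >= thresholds[m]["warning"] for m in latencies):
        return "WARNING"
    return "OK"


# Made-up thresholds and latency values (in ms) for illustration only.
thresholds = {
    "read_latency": {"warning": 15, "critical": 30},
    "write_latency": {"warning": 15, "critical": 30},
    "vm_latency": {"warning": 15, "critical": 30},
}
sample = {"read_latency": 10.0, "write_latency": 16.0, "vm_latency": 12.0}

state = determine_state(True, sample, thresholds)
exit_code = EXIT_CODES[state]
```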

Command-line arguments

  • Use the -h or --help flag to display current usage information.
  • Flags marked as required must be set via CLI flag.
  • Flags not marked as required are for settings where a useful default is already defined, but may be overridden if desired.

| Flag | Required | Default | Repeat | Possible | Description |
| --- | --- | --- | --- | --- | --- |
| `branding` | No | `false` | No | `branding` | Toggles emission of branding details with plugin status details. This output is disabled by default. |
| `h`, `help` | No | `false` | No | `h`, `help` | Show Help text along with the list of supported flags. |
| `v`, `version` | No | `false` | No | `v`, `version` | Whether to display application version and then immediately exit application. |
| `ll`, `log-level` | No | `info` | No | `disabled`, `panic`, `fatal`, `error`, `warn`, `info`, `debug`, `trace` | Log message priority filter. Log messages with a lower level are ignored. Log messages are sent to stderr by default. See Output for more information. |
| `p`, `port` | No | `443` | No | positive whole number between 1-65535, inclusive | TCP port of the remote ESXi host or vCenter instance. This is usually 443 (HTTPS). |
| `t`, `timeout` | No | `10` | No | positive whole number of seconds | Timeout value in seconds allowed before a plugin execution attempt is abandoned and an error returned. |
| `s`, `server` | Yes | | No | fully-qualified domain name or IP Address | The fully-qualified domain name or IP Address of the remote ESXi host or vCenter instance. |
| `u`, `username` | Yes | | No | valid username | Username with permission to access specified ESXi host or vCenter instance. |
| `pw`, `password` | Yes | | No | valid password | Password used to login to ESXi host or vCenter instance. |
| `domain` | No | | No | valid user domain | (Optional) domain for user account used to login to ESXi host or vCenter instance. This is needed for user accounts residing in a non-default domain (e.g., SSO specific domain). |
| `trust-cert` | No | `false` | No | `true`, `false` | Whether the certificate should be trusted as-is without validation. WARNING: TLS is susceptible to man-in-the-middle attacks if enabling this option. |
| `dc-name` | No | | No | valid vSphere datacenter name | Specifies the name of a vSphere Datacenter. If not specified, applicable plugins will attempt to use the default datacenter found in the vSphere environment. Not applicable to standalone ESXi hosts. |
| `ds-name` | Yes | | No | valid datastore name | Datastore name as it is found within the vSphere inventory. |
| `dsim`, `ds-ignore-missing-metrics` | No | `false` | No | `true`, `false` | Toggles how missing Datastore Performance metrics will be handled. This is believed to occur when a datastore is newly created and metrics have not yet been collected. |
| `dshhms`, `ds-hide-historical-metric-sets` | No | `false` | No | `true`, `false` | Toggles display of historical Datastore Performance metrics at plugin completion. By default historical metrics are listed. |
| `dsrlc`, `ds-read-latency-critical` | No | `30` | No | positive whole number or float | Specifies the read latency of a datastore's storage (in ms) when a CRITICAL threshold is reached. The default percentile is used (90). |
| `dsrlw`, `ds-read-latency-warning` | No | `15` | No | positive whole number or float | Specifies the read latency of a datastore's storage (in ms) when a WARNING threshold is reached. The default percentile is used (90). |
| `dswlc`, `ds-write-latency-critical` | No | `30` | No | positive whole number or float | Specifies the write latency of a datastore's storage (in ms) when a CRITICAL threshold is reached. The default percentile is used (90). |
| `dswlw`, `ds-write-latency-warning` | No | `15` | No | positive whole number or float | Specifies the write latency of a datastore's storage (in ms) when a WARNING threshold is reached. The default percentile is used (90). |
| `dsvmlc`, `ds-vm-latency-critical` | No | `30` | No | positive whole number or float | Specifies the latency (in ms) as observed by VMs using the datastore when a CRITICAL threshold is reached. The default percentile is used (90). |
| `dsvmlw`, `ds-vm-latency-warning` | No | `15` | No | positive whole number or float | Specifies the latency (in ms) as observed by VMs using the datastore when a WARNING threshold is reached. The default percentile is used (90). |
| `dslps`, `ds-latency-percentile-set` | No | `90,15,30,15,30,15,30` | Yes | complete percentile set in `P,RLW,RLC,WLW,WLC,VMLW,VMLC` format | Specifies the performance percentile set used for threshold calculations. Incompatible with individual latency threshold flags. All comma-separated field values are required for each set. |

Configuration file

Not currently supported. This feature may be added later if there is sufficient interest.

Contrib

See the main project README for details.

Examples

CLI invocation

/usr/lib/nagios/plugins/check_vmware_datastore_performance --server vc1.example.com --username SERVICE_ACCOUNT_NAME --password "SERVICE_ACCOUNT_PASSWORD" --ds-latency-percentile-set '90,15,30,15,30,15,30' --ds-name "HUSVM-DC1-vol6" --trust-cert  --log-level info

See the configuration options section for all command-line settings supported by this plugin along with descriptions of each. See the contrib section for information regarding example command definitions and Nagios configuration files.

Of note:

  • We use a datastore performance percentile set instead of individual latency flags
    • 90th percentile
    • read latency WARNING threshold of 15 ms
    • read latency CRITICAL threshold of 30 ms
    • write latency WARNING threshold of 15 ms
    • write latency CRITICAL threshold of 30 ms
    • vm latency WARNING threshold of 15 ms
    • vm latency CRITICAL threshold of 30 ms
  • Due to plugin design, only the active interval is evaluated for threshold violations
    • historical interval metrics are reported via LongServiceOutput unless the flag to skip emitting those metrics is specified
  • Certificate warnings are ignored.
    • not best practice, but many vCenter instances use self-signed certs per various freely available guides
  • Service Check results output is sent to stdout
  • Logging output is enabled at the info level.
    • logging output is sent to stderr by default
    • logging output is intended to be seen when invoking the plugin directly via CLI (often for troubleshooting)
      • see the Output section of the main README for potential conflicts with some monitoring systems
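The split between service check output (stdout) and logging output (stderr) can be observed from a wrapper script. The Python sketch below is an illustration only: the `run_check` helper is hypothetical, and a small stand-in command is used in place of the real plugin so that the example runs without a vSphere environment. The exit code mapping is the standard Nagios one.

```python
import subprocess
import sys

# Standard Nagios plugin exit codes.
STATE_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_check(argv, timeout=30):
    """Run a plugin command and return (state, service output, log output)."""
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    state = STATE_NAMES.get(proc.returncode, "UNKNOWN")
    return state, proc.stdout, proc.stderr


# Stand-in for the real plugin: writes a result to stdout, a log line to
# stderr, and exits with the CRITICAL code.
fake_plugin = [
    sys.executable,
    "-c",
    "import sys; print('CRITICAL: example'); "
    "print('log line', file=sys.stderr); sys.exit(2)",
]

state, service_output, log_output = run_check(fake_plugin)
```

To try this against the real plugin, replace `fake_plugin` with the argument list from the CLI invocation above, one list element per flag or value.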

Command definition

# /etc/nagios-plugins/config/vmware-datastores-performance.cfg

# Look at specific datastore and explicitly provide custom WARNING and
# CRITICAL latency threshold values via individual flags.
define command{
    command_name    check_vmware_datastore_performance_via_individual_flags
    command_line    $USER1$/check_vmware_datastore_performance --server '$HOSTNAME$' --domain '$ARG1$' --username '$ARG2$' --password '$ARG3$' --ds-read-latency-warning '$ARG4$' --ds-read-latency-critical '$ARG5$' --ds-write-latency-warning '$ARG6$' --ds-write-latency-critical '$ARG7$' --ds-vm-latency-warning '$ARG8$' --ds-vm-latency-critical '$ARG9$' --ds-name '$ARG10$' --trust-cert  --log-level info
    }

# Look at specific datastore and explicitly provide custom WARNING and
# CRITICAL latency threshold values for a single percentile via a percentile
# flag set.
define command{
    command_name    check_vmware_datastore_performance_via_1percentile_set
    command_line    $USER1$/check_vmware_datastore_performance --server '$HOSTNAME$' --domain '$ARG1$' --username '$ARG2$' --password '$ARG3$' --ds-latency-percentile-set '$ARG4$' --ds-name '$ARG5$' --trust-cert  --log-level info
    }

See the configuration options section for all command-line settings supported by this plugin along with descriptions of each. See the contrib section for information regarding example command definitions and Nagios configuration files.

Troubleshooting

Datastore storage I/O statistics collection disabled

If you see an error message like this one:

UNKNOWN: Unable to retrieve performance summary for datastore "DATASTORE_NAME_HERE": datastore storage I/O statistics collection disabled

**ERRORS**

* datastore storage I/O statistics collection disabled: assistance needed from vmware administrators to resolve issue

then it means that the required Statistics Collection setting for the specified datastore is not enabled. See the Requirements section of this documentation for more information.

License

See the main project README for details.

References