This repository has been archived by the owner on Feb 8, 2024. It is now read-only.

QA Jira issues guidelines in application to LDR R1 cluster behavior

Andrei Zheregelia edited this page Nov 2, 2020 · 1 revision

Mandatory information to be written in Jira issue

  1. Write as many details as possible in the "Steps performed" section (also known as "Steps to reproduce").
    Rationale: the observed failure is only a consequence. Context is needed to investigate the root cause.

  2. Write timestamps against "Steps performed" or make notes in the system logs.
    Rationale: timestamps define the boundaries where log analysis should start. This is a huge help if the cluster has been alive for a long time and the logs are full of "interesting" but unrelated events.
    Approach 1 (preferred): use the logger utility to create events in the system log.
    Example: logger 'XXX: test started'
    Approach 2: specify timestamps in the issue description for the main steps of the test procedure.
    Note: this approach is prone to timezone problems, because some support bundles generate logs in UTC.

  3. Specify "Actual behavior" and "Expected behavior".
    Rationale: sometimes the test procedure itself is based on incorrect expectations.

  4. Link the Jira issue to the issue that describes the test procedure, if such an issue exists.
    Rationale: this helps to understand the context better and to validate the test procedure itself.
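The logger approach from item 2 can be wrapped in a small helper so that every test step leaves a searchable marker. This is a minimal sketch: the "QA-TEST" prefix, the function names and the issue key format are illustrative assumptions, not an established project convention.

```shell
#!/bin/sh
# Hypothetical helpers; the "QA-TEST" prefix is an assumption chosen so that
# markers are easy to grep for later.

# Build the marker text from an issue key ($1) and a step description ($2).
format_marker() {
    printf 'QA-TEST %s: %s\n' "$1" "$2"
}

# Approach 1: write the marker to the system log with logger.
# Approach 2: also print a UTC timestamp that can be pasted into the issue,
# which avoids ambiguity when support bundle logs are in UTC.
mark_step() {
    msg=$(format_marker "$1" "$2")
    logger "$msg"
    printf '%s %s\n' "$(date -u '+%Y-%m-%d %H:%M:%S UTC')" "$msg"
}
```

For example, calling mark_step EOS-0000 'test started' (EOS-0000 is a placeholder issue key) before the first step lets an investigator later find the start of the test with journalctl | grep 'QA-TEST EOS-0000'.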

Logs to be attached

Generate and attach a support bundle for both nodes in the cluster.
When in doubt, generate a full support bundle.
Each issue assigned to the EES-HA component shall have at least a Hare support bundle, generated by the hctl reportbug command executed on both nodes.
Rationale: logs are crucial because they contain the real facts to be analyzed and compared against the issue description. Using support bundles also has a large cumulative effect for customer support: it allows component teams to improve logging when something interesting is missing.
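Running hctl reportbug on both nodes can be scripted. The sketch below assumes that the nodes are reachable over passwordless ssh under the names passed as arguments (srvnode-1 and srvnode-2 are used only because they appear in the pillar paths on this page) and that hctl reportbug can be run without arguments, as in the text above. The function name and the DRY_RUN variable are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: run the Hare bundle command on every node given as an
# argument. With DRY_RUN=1 the commands are only printed, not executed.
collect_hare_bundles() {
    for node in "$@"; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "ssh $node hctl reportbug"
        else
            ssh "$node" hctl reportbug
        fi
    done
}
```

A dry run such as DRY_RUN=1 collect_hare_bundles srvnode-1 srvnode-2 shows what would be executed before anything is run on the cluster.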

Basic analysis

Check pcs status command output

Since all software components are executed and controlled by the Pacemaker HA system, any failure has a good chance of being noticed and registered by Pacemaker.
Search for 'Failed Resource Actions:' and 'Failed Fencing Actions:'.
Resources listed in those sections can be searched for (using grep) in the system logs (journalctl) to find out what actually happened to the resource.
A resource can be represented by a systemd unit. To find the systemd unit name, use the following command:

pcs resource show <resource-name>

Knowing the systemd unit name, the following actions can be done:

  • journalctl -u <unit-name> - show the subset of system logs related to the specified unit
  • journalctl -x -u <unit-name> - show the system logs for the unit, complemented by systemd context information. If errors are found there, the issue can be assigned to the respective component.
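The search for the failure sections described above can be automated with a small filter. This is a sketch under one assumption about the pcs status layout: each section starts with its header line and ends at the next blank line. The function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical filter: read pcs status output on stdin and print only the
# "Failed Resource Actions:" and "Failed Fencing Actions:" sections.
# Assumption: a section ends at the first blank line after its header.
failed_sections() {
    awk '
        /^Failed (Resource|Fencing) Actions:/ { show = 1 }  # section header
        /^[ \t]*$/                            { show = 0 }  # blank line ends it
        show                                                # print while inside
    '
}
```

Typical use is pcs status | failed_sections. The resource names printed there can then be passed to pcs resource show and journalctl -u as described above.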

Validate deployment parameters

Applicable when the pcs status output is not OK right after deployment.
Example: disabled stonith resources.

Sometimes an issue is caused by errors in the deploy arguments. Deploy arguments are stored in *.sls files. Currently, the following files are relevant:

/var/lib/cortx/provisioner/srvnode-{1,2}/shared/srv/pillar/groups/all/uu_cluster.sls
/var/lib/cortx/provisioner/srvnode-{1,2}/shared/srv/pillar/groups/all/uu_storage_enclosure.sls

Note: the path or file name can change during product development, so the generic advice is to use the locate tool from the mlocate package to search for the cluster.sls and storage_enclosure.sls files.
Example:

yum install -y mlocate
updatedb
locate cluster.sls

Grep the cortx-ha and cortx-prvsnr source code to find which component is responsible for creating the resource.

Repos:

  • cortx-ha: git@github.com:Seagate/cortx-ha.git
  • cortx-prvsnr: git@github.com:Seagate/cortx-prvsnr.git

Command example:

cd <repo-name>
grep -R <resource-name> ./

Search *.sls files in the Provisioner repo and bash scripts in the Cortx-HA repo. This trick helps to understand which component creates the resource when failures occur during deployment and/or update.
However, when in doubt, just assign the issue to EES-HA because of its Pacemaker expertise.
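The repository search can be narrowed to the relevant file types. The sketch below is an illustration only: the function name is hypothetical, and the '*.sh' pattern in the usage example assumes that the bash scripts in cortx-ha carry a .sh extension, which may not hold for every script.

```shell
#!/bin/sh
# Hypothetical helper: list files under a checked-out repo ($1) whose names
# match a glob ($3) and whose contents mention the resource name ($2).
# grep -F treats the resource name as a fixed string, not a regex.
find_resource_owner() {
    find "$1" -type f -name "$3" -exec grep -lF -- "$2" {} +
}
```

For example, find_resource_owner cortx-prvsnr <resource-name> '*.sls' covers the Provisioner case and find_resource_owner cortx-ha <resource-name> '*.sh' covers the Cortx-HA case.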