Alert Response Guide

Brandon Cruz edited this page Jan 30, 2024 · 2 revisions

Generated for the BFD-2955 Alerts Audit. The following lists include all alerts created for the Production environment.

Splunk Alerts

The following alerts are configured in Splunk's BFD App.

| Name | Description | Alert Type | Trigger | Cause | Action | Status |
|---|---|---|---|---|---|---|
| BFD ETL - Data load longer than 48 hours | Runs every Monday morning and alerts if a data load has taken 48 hours or more to process. | Scheduled. Weekly, Monday at 10:00 AM ET | Data load has taken more than 48 hours to process | CCW file delay, or the pipeline failed to ingest | Check the S3 buckets to see whether data was ingested, and check whether the pipeline instance has issues | Disabled |
| BFD ETL - No data sets found to process | Runs every Monday morning at 6am (Eastern) and alerts if no data sets were found in the previous 2 days, e.g., no datasets were found over the weekend. | Scheduled. Weekly, Monday at 10:00 AM ET | No datasets were found over the weekend | CCW file delay, or the pipeline failed to ingest | Check the S3 buckets to see whether data was ingested, and check whether the pipeline instance has issues | Enabled |
| BFD NO LOG INGESTION FOR BFD APP SERVER 1h | Checks the source bluebutton-server-app-log.json and alerts if this log has a count of 0 in the last hour. | Scheduled. Hourly, at 0 minutes past the hour. | Source bluebutton-server-app-log.json has a count of 0 in the last hour | | | Enabled |
| BFD NO LOG INGESTION FOR BFD WEB SERVER 1h | Checks the source bfd-server/access.json and alerts if this log has a count of 0 in the last hour. | Scheduled. Hourly, at 0 minutes past the hour. | Source bfd-server/access.json has a count of 0 in the last hour | | | Enabled |
| BFD NO LOG INGESTION FOR ETL PIPELINE 1h | Checks the source bluebutton-data-pipeline.log and alerts if this log has a count of 0 in the last hour. | Scheduled. Hourly, at 0 minutes past the hour. | Source bluebutton-data-pipeline.log has a count of 0 in the last hour | | | Enabled |
| BFD NO LOG INGESTION FOR HOST 1h | Checks the source /var/log/messages and alerts if this log has a count of 0 in the last hour. | Scheduled. Hourly, at 0 minutes past the hour. | Source /var/log/messages has a count of 0 in the last hour | | | Enabled |
| BFD-CCW-FOUND-DATASETS-PROD | Checks the source bluebutton-data-pipeline.log for the message Found data set to process IN(BENEFICIARY, CARRIER, DME, HHA, HOSPICE, INPATIENT, OUTPATIENT, PDE, SNF). | Scheduled. */15 * * * * | Source bluebutton-data-pipeline.log contains the message Found data set to process IN(BENEFICIARY, CARRIER, DME, HHA, HOSPICE, INPATIENT, OUTPATIENT, PDE, SNF) | | | Enabled |
| BFD-CCW-JOB-FAILED-PROD | Checks the source bluebutton-data-pipeline.log for the message %PipelineJobFailure%. | Scheduled. */15 * * * * | Source bluebutton-data-pipeline.log contains the message %PipelineJobFailure% | | | Enabled |
| BFD-CCW-LOADED-DATASET-PROD | Checks the source bluebutton-data-pipeline.log for the message Processed type IN(BENEFICIARY, CARRIER, DME, HHA, HOSPICE, INPATIENT, OUTPATIENT, PDE, SNF). | Scheduled. */15 * * * * | Source bluebutton-data-pipeline.log contains the message Processed type IN(BENEFICIARY, CARRIER, DME, HHA, HOSPICE, INPATIENT, OUTPATIENT, PDE, SNF) | | | Enabled |
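
The four "NO LOG INGESTION" alerts above share one rule: a source that produced zero events in the last hour triggers the alert. The following is a minimal sketch of that rule in Python, not the actual Splunk search. The source names come from the table; the function name `sources_without_ingestion` and the `(source, timestamp)` event shape are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Log sources named in the Splunk alerts table.
MONITORED_SOURCES = [
    "bluebutton-server-app-log.json",
    "bfd-server/access.json",
    "bluebutton-data-pipeline.log",
    "/var/log/messages",
]


def sources_without_ingestion(events, now, window=timedelta(hours=1)):
    """Return the monitored sources with zero events in (now - window, now].

    `events` is an iterable of (source, timestamp) pairs. This shape is a
    hypothetical stand-in for real Splunk search results.
    """
    cutoff = now - window
    counts = {source: 0 for source in MONITORED_SOURCES}
    for source, timestamp in events:
        if source in counts and cutoff < timestamp <= now:
            counts[source] += 1
    # A count of 0 in the window is the trigger condition from the table.
    return [source for source, count in counts.items() if count == 0]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    sample = [("/var/log/messages", now - timedelta(minutes=10))]
    for source in sources_without_ingestion(sample, now):
        print(f"No log ingestion in the last hour: {source}")
```

Running the check at the top of each hour would mirror the "Hourly, at 0 minutes past the hour" schedule listed for these alerts.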

New Relic Alerts

The following alerts are configured in New Relic, but none of them are configured to send notifications. We may want to change this so that notifications go to the #bfd-alerts Slack channel.

| Name | Policy | Trigger | Cause | Action | Status |
|---|---|---|---|---|---|
| BFD Endpoints - Error Rate > 10% | BB2 - BFD - Prod | | | | Enabled |
| BFD Endpoints - FHIR | BB2 - BFD - Prod | | | | Disabled |
| BFD Endpoints - Health | BB2 - BFD - Prod | | | | Disabled |
| BFD Endpoints - Latency 50th Percentile > 6 seconds | BB2 - BFD - Prod | | | | Enabled |
| V1 - Anomaly Detected - API Response Time | BFD - Anomaly Detected - PROD | | | | Enabled |
| V1 - Anomaly Detected - Error Percentage | BFD - Anomaly Detected - PROD | | | | Enabled |
| V2 - Anomaly Detected - Avg Response Time | BFD - Anomaly Detected - PROD | | | | Enabled |
| V2 - Anomaly Detected - Error Percentage | BFD - Anomaly Detected - PROD | | | | Enabled |
| Error % | BFD - V1 API - SLO Violations - PROD | | | | Enabled |
| V1 - Coverage (p95) | BFD - V1 API - SLO Violations - PROD | | | | Enabled |
| V1 - EOB _since (p95) | BFD - V1 API - SLO Violations - PROD | | | | Enabled |
| V1 - EOB (p95) | BFD - V1 API - SLO Violations - PROD | | | | Enabled |
| V1 - Patient (p95) | BFD - V1 API - SLO Violations - PROD | | | | Enabled |
| V1 - Patient - by Contract and YearMonth - 4000 (p95) | BFD - V1 API - SLO Violations - PROD | | | | Enabled |
| Error % | BFD - V2 API - SLO Violations - PROD | | | | Disabled |
| V2 - Coverage (p95) | BFD - V2 API - SLO Violations - PROD | | | | Disabled |
| V2 - EOB _since (p95) | BFD - V2 API - SLO Violations - PROD | | | | Disabled |
| V2 - EOB (p95) | BFD - V2 API - SLO Violations - PROD | | | | Disabled |
| V2 - Patient (p95) | BFD - V2 API - SLO Violations - PROD | | | | Disabled |
| V2 - Patient - by Contract and YearMonth - 4000 (p95) | BFD - V2 API - SLO Violations - PROD | | | | Disabled |
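
Two of the conditions above state their thresholds in the alert name: an error rate above 10% and a 50th-percentile latency above 6 seconds. The sketch below shows how those two checks, plus a p95 value like the ones the SLO conditions track, could be computed from a batch of requests. It is an illustration only: the `(latency_seconds, is_error)` request shape and the function names are assumptions, the nearest-rank percentile may differ from how New Relic computes percentiles, and the p95 thresholds are not listed on this page, so p95 is reported without a pass/fail verdict.

```python
import math

ERROR_RATE_THRESHOLD = 0.10    # from "BFD Endpoints - Error Rate > 10%"
P50_LATENCY_THRESHOLD_S = 6.0  # from "Latency 50th Percentile > 6 seconds"


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of
    the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def evaluate(requests):
    """Summarize a batch of (latency_seconds, is_error) tuples.

    The tuple shape is hypothetical; real data would come from New Relic.
    """
    latencies = [latency for latency, _ in requests]
    error_rate = sum(1 for _, is_error in requests if is_error) / len(requests)
    p50 = percentile(latencies, 50)
    return {
        "error_rate": error_rate,
        "p50_s": p50,
        "p95_s": percentile(latencies, 95),
        "error_rate_violation": error_rate > ERROR_RATE_THRESHOLD,
        "p50_latency_violation": p50 > P50_LATENCY_THRESHOLD_S,
    }


if __name__ == "__main__":
    batch = [(0.4, False)] * 18 + [(7.5, True)] * 2
    print(evaluate(batch))
```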

Alert Policies

| Name | # of Alert Conditions | Alert Conditions | Notifications |
|---|---|---|---|
| BB2 - BFD - Prod | 4 | BFD Endpoints - Error Rate > 10%; BFD Endpoints - FHIR; BFD Endpoints - Health; BFD Endpoints - Latency 50th Percentile > 6 seconds | Disabled |
| BFD - Anomaly Detected - PROD | 4 | V1 - Anomaly Detected - API Response Time; V1 - Anomaly Detected - Error Percentage; V2 - Anomaly Detected - Avg Response Time; V2 - Anomaly Detected - Error Percentage | Disabled |
| BFD - V1 API - SLO Violations - PROD | 6 | Error %; V1 - Coverage (p95); V1 - EOB _since (p95); V1 - EOB (p95); V1 - Patient (p95); V1 - Patient - by Contract and YearMonth - 4000 (p95) | Disabled |
| BFD - V2 API - SLO Violations - PROD | 6 | Error %; V2 - Coverage (p95); V2 - EOB _since (p95); V2 - EOB (p95); V2 - Patient (p95); V2 - Patient - by Contract and YearMonth - 4000 (p95) | Disabled |
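
For future audits, the policy table above can be kept as data and checked mechanically. The sketch below transcribes the four policies and flags any policy whose declared condition count does not match its listed conditions, as well as any policy with notifications disabled. The dictionary layout and function names are assumptions made for illustration; nothing here calls the New Relic API.

```python
# Policies transcribed from the Alert Policies table on this page.
POLICIES = {
    "BB2 - BFD - Prod": {
        "declared_count": 4,
        "notifications": "Disabled",
        "conditions": [
            "BFD Endpoints - Error Rate > 10%",
            "BFD Endpoints - FHIR",
            "BFD Endpoints - Health",
            "BFD Endpoints - Latency 50th Percentile > 6 seconds",
        ],
    },
    "BFD - Anomaly Detected - PROD": {
        "declared_count": 4,
        "notifications": "Disabled",
        "conditions": [
            "V1 - Anomaly Detected - API Response Time",
            "V1 - Anomaly Detected - Error Percentage",
            "V2 - Anomaly Detected - Avg Response Time",
            "V2 - Anomaly Detected - Error Percentage",
        ],
    },
    "BFD - V1 API - SLO Violations - PROD": {
        "declared_count": 6,
        "notifications": "Disabled",
        "conditions": [
            "Error %",
            "V1 - Coverage (p95)",
            "V1 - EOB _since (p95)",
            "V1 - EOB (p95)",
            "V1 - Patient (p95)",
            "V1 - Patient - by Contract and YearMonth - 4000 (p95)",
        ],
    },
    "BFD - V2 API - SLO Violations - PROD": {
        "declared_count": 6,
        "notifications": "Disabled",
        "conditions": [
            "Error %",
            "V2 - Coverage (p95)",
            "V2 - EOB _since (p95)",
            "V2 - EOB (p95)",
            "V2 - Patient (p95)",
            "V2 - Patient - by Contract and YearMonth - 4000 (p95)",
        ],
    },
}


def count_mismatches(policies):
    """Policies whose declared condition count differs from the listed conditions."""
    return [
        name
        for name, policy in policies.items()
        if policy["declared_count"] != len(policy["conditions"])
    ]


def policies_without_notifications(policies):
    """Policies whose notifications are not enabled."""
    return [
        name
        for name, policy in policies.items()
        if policy["notifications"] != "Enabled"
    ]


if __name__ == "__main__":
    print("Count mismatches:", count_mismatches(POLICIES))
    print("No notifications:", policies_without_notifications(POLICIES))
```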