Prometheus SNS receiver proposal #2559

Closed
maxbrodin opened this issue Apr 27, 2021 · 13 comments

Comments

@maxbrodin
Contributor

maxbrodin commented Apr 27, 2021

Prometheus SNS receiver

Problem

AlertManager lets you define receivers: notification integrations with email, webhooks, and third-party services such as PagerDuty, OpsGenie, Slack and others.
Currently there is no integration with Amazon Simple Notification Service (SNS), which provides fully managed pub/sub messaging, SMS, email, and mobile push notifications. A webhook receiver can act as a proxy as a workaround, but it lacks support for AlertManager templates and requires setting up and maintaining an additional component.

Proposed solution

This proposal is to add a Prometheus SNS receiver: native support for notification integration with Amazon SNS.

Message destinations

The Prometheus SNS receiver can publish messages to the following destinations:

  • Amazon SNS topic
  • SMS message directly to a phone number
  • notification to a mobile platform endpoint
  • email endpoint

SNS Publish API

To publish a message to an SNS topic, the following HTTP request parameters are required.
Common to all requests:

  • API Regional URL
  • API Version

Specific to each request:

  • SNS Topic ARN (or PhoneNumber for SMS, or TargetArn for mobile notifications)
  • Message
  • Request Signature Version 4

Optional

  • Subject (when the message is delivered to email endpoints)
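
The parameter list above can be sketched as a small Go helper that assembles the query parameters for a Publish call. This is only an illustration, not the proposed implementation: the `publishRequest` type and `buildPublishParams` function are hypothetical names, the version string `2010-03-31` is the published SNS API version rather than something stated in this proposal, and Signature Version 4 signing is left out entirely.

```go
package main

import (
	"errors"
	"net/url"
)

// snsAPIVersion is the SNS Query API version (assumption: not stated in the proposal).
const snsAPIVersion = "2010-03-31"

// publishRequest is a hypothetical container for the per-request parameters.
type publishRequest struct {
	TopicARN    string
	PhoneNumber string
	TargetARN   string
	Subject     string
	Message     string
}

// buildPublishParams returns the unsigned query parameters for a Publish call.
// The first non-empty destination wins; the request still has to be signed
// with Signature Version 4 before it is sent.
func buildPublishParams(r publishRequest) (url.Values, error) {
	if r.Message == "" {
		return nil, errors.New("message is required")
	}
	v := url.Values{}
	v.Set("Action", "Publish")
	v.Set("Version", snsAPIVersion)
	switch {
	case r.TopicARN != "":
		v.Set("TopicArn", r.TopicARN)
	case r.PhoneNumber != "":
		v.Set("PhoneNumber", r.PhoneNumber)
	case r.TargetARN != "":
		v.Set("TargetArn", r.TargetARN)
	default:
		return nil, errors.New("one of topic ARN, phone number or target ARN is required")
	}
	v.Set("Message", r.Message)
	if r.Subject != "" {
		// Subject only matters when the message reaches email endpoints.
		v.Set("Subject", r.Subject)
	}
	return v, nil
}
```

Calling `buildPublishParams` with only `Message` set returns an error, which mirrors the rule that one of the three destinations must be given.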

Prometheus SNS receiver configuration

<sns_config>

# Whether or not to notify about resolved alerts.
[ send_resolved: <boolean> | default = false ]

# The SNS API URL, e.g. https://sns.us-east-2.amazonaws.com
api_url: <tmpl_string>

# The SNS API version i.e. 
[ api_version: <string> | default = sns.default.api_version ]

# Configures the AWS Signature Version 4 signing process used to sign requests.
sigv4:
  [ <sigv4_config> ]
  
# SNS topic ARN, e.g. arn:aws:sns:us-east-2:698519295917:My-Topic
# If you don't specify this value, you must specify a value for the phone_number or target_arn.
[ topic_arn: <tmpl_string> ]

# Subject line when the message is delivered to email endpoints.
[ subject: <tmpl_string> ] 

# Phone number if message is delivered via SMS.
# If you don't specify this value, you must specify a value for the topic_arn or target_arn.
[ phone_number: <tmpl_string> ] 

# The mobile platform endpoint ARN if message is delivered via mobile notifications.
# If you don't specify this value, you must specify a value for the topic_arn or phone_number.
[ target_arn: <tmpl_string> ] 

# The message content of the SNS notification.
[ message: <tmpl_string> | default = '{{ template "sns.default.message" .}}' ] 

# SNS message attributes.
attributes: 
  [ <attribute_config> ]

# The HTTP client's configuration.
[ http_config: <http_config> | default = global.http_config ]

<sigv4_config>

# The AWS region. If blank, the region from the default credentials chain
# is used.
[ region: <string> ]

# The AWS API keys. If blank, the environment variables `AWS_ACCESS_KEY_ID`
# and `AWS_SECRET_ACCESS_KEY` are used.
[ access_key: <string> ]
[ secret_key: <secret> ]

# Named AWS profile used to authenticate.
[ profile: <string> ]

# AWS Role ARN, an alternative to using AWS API keys.
[ role_arn: <string> ]

<attribute_config>

<tmpl_string>: <tmpl_string>

Examples

SNS Topic

sns_configs:
  - api_url: https://sns.us-east-2.amazonaws.com
    topic_arn: arn:aws:sns:us-east-2:123456789012:My-Topic
    sigv4:
      region: us-east-2
      role_arn: arn:aws:iam::123456789012:role/alertmanager_role
    attributes:
      severity: SEV2

SMS

sns_configs:
  - api_url: https://sns.us-east-2.amazonaws.com
    phone_number: '+17785522312'
    message: '{{ template "sns.default.sms_message" . }}'
    sigv4:
      region: us-east-2
      role_arn: arn:aws:iam::123456789012:role/alertmanager_role

Mobile notification

sns_configs:
  - api_url: https://sns.us-east-2.amazonaws.com
    target_arn: arn:aws:sns:us-east-2:123456789012:endpoint/APNS_SANDBOX/pushapp/98e9ced9-f136-3893-9d60-776547eafebb
    message: '{{ template "sns.default.mobile_message" . }}'
    sigv4:
      region: us-east-2
      role_arn: arn:aws:iam::123456789012:role/alertmanager_role

Email

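sns_configs:
  - api_url: https://sns.us-east-2.amazonaws.com
    # Email endpoints are reached through a topic subscription, so a topic_arn is assumed here.
    topic_arn: arn:aws:sns:us-east-2:123456789012:My-Topic
    subject: '{{ template "email.default.subject" . }}'
    message: '{{ template "email.default.html" . }}'
    sigv4:
      region: us-east-2
      role_arn: arn:aws:iam::123456789012:role/alertmanager_role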

Message size

Due to SNS message constraints:

  • With the exception of SMS, messages must be UTF-8 encoded strings and at most 256 KB in size (262,144 bytes, not 262,144 characters).

  • For SMS, each message can contain up to 140 characters. This character limit depends on the encoding schema. For example, an SMS message can contain 160 GSM characters, 140 ASCII characters, or 70 UCS-2 characters.

  • The total size limit for a single SMS Publish action is 1,600 characters.

The Prometheus SNS receiver must truncate messages according to these constraints:

  • 256 KB for SNS topic message
  • 140 characters for SMS messages
  • 1600 characters for total size for a single SMS request

Truncation strategy

If the message doesn’t fit in the 256 KB limit, the SNS receiver will truncate the message content (note that the message body is required and can’t be empty).

If the message still doesn’t fit in the limit, we will truncate message attributes one by one until the message fits the size limit.

If the SNS receiver truncates the message, a new SNS message attribute with key "truncated" and value "true" will be added to the message to indicate that the notification message was truncated.
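
A minimal sketch of the body truncation step, assuming the limit is counted in bytes and that the cut should not split a UTF-8 character. The function name `truncateMessage` and the constant `snsMaxMessageBytes` are hypothetical; attribute truncation and the SMS character limits are not covered here.

```go
package main

import "unicode/utf8"

// snsMaxMessageBytes is the 256 KB limit quoted above (262,144 bytes).
const snsMaxMessageBytes = 256 * 1024

// truncateMessage cuts msg down to at most maxBytes bytes without splitting
// a multi-byte UTF-8 character. When it has to cut, it records that fact in
// attrs under the "truncated" key. attrs must be a non-nil map.
// The caller still has to make sure the result is not empty, because SNS
// requires a message body.
func truncateMessage(msg string, attrs map[string]string, maxBytes int) string {
	if len(msg) <= maxBytes {
		return msg
	}
	cut := maxBytes
	if cut < 0 {
		cut = 0
	}
	// Walk back until the byte at the cut position starts a new character.
	for cut > 0 && !utf8.RuneStart(msg[cut]) {
		cut--
	}
	attrs["truncated"] = "true"
	return msg[:cut]
}
```

With a limit of 2 bytes, the input "héllo" is cut to "h" rather than to "h" plus half of the two-byte "é".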

Deduplication Key

To correlate alarm triggers with alarm resolves, we publish a special “deduplicationKey” attribute whose value is the hash of the GroupKey, similar to PagerDuty, OpsGenie and VictorOps.
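
As a sketch, the attribute value could be computed as below. The proposal does not name a hash function; SHA-256 with hex encoding is an assumption here, and `dedupKey` is a hypothetical name.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// dedupKey returns a hex-encoded SHA-256 hash of the alert group key.
// The choice of SHA-256 is an assumption made for this sketch.
func dedupKey(groupKey string) string {
	sum := sha256.Sum256([]byte(groupKey))
	return hex.EncodeToString(sum[:])
}
```

Because the group key stays the same while an alert group moves from firing to resolved, both notifications carry the same "deduplicationKey" value.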

Default SNS message format

Currently some receivers have default message template like

Default SNS message format will contain the following information (to be confirmed):

{{ define "sns.default.message" }}{{ .CommonAnnotations.SortedPairs.Values | join " " }}
   {{ if gt (len .Alerts.Firing) 0 -}}
   Alerts Firing:
   {{ template "__text_alert_list" .Alerts.Firing }}
   {{- end }}   
   {{ if gt (len .Alerts.Resolved) 0 -}}   
   Alerts Resolved:
   {{ template "__text_alert_list" .Alerts.Resolved }}
   {{- end }}
{{- end }}

Similar issue: #2525

@roidelapluie
Member

Thanks.

I would sponsor this.

We would more likely move sigv4 out of prometheus/prometheus to prometheus/common.

I am willing to accept this in a dedicated go module in prometheus/common, so users depending on common do not depend on aws sdk.

  • Can attribute config be inlined as a map of <tmpl_string>: <tmpl_string> ?
  • Is api_url a secret?

@maxbrodin
Contributor Author

  • Yes, it makes sense to have the attributes as a map of <tmpl_string>: <tmpl_string>.
  • api_url should be a <tmpl_string> as well.

Thank you.

@roidelapluie
Member

Are you willing to work on this?

It seems the first step is to extract the sigv4 code from prometheus in a new go mod in prometheus/common. Happy to help/answer questions.

@kevinayres

This would be amazing. At SUSE we have multiple AWS customers and AWS SAs requesting this connection between AlertManager and SNS for SAP HA environments. We package a number of Prometheus exporters that they use, but all have requested TXT/SMS alerts for these prod environments via a native service (SNS). They try to do this with Lambda and CloudWatch to SNS, but most that I've heard from want to use Prometheus.

@tomwilkie
Member

@treid314 is going to have a stab at this in the coming weeks!

@treid314 treid314 mentioned this issue Jun 10, 2021
5 tasks
@treid314
Contributor

Deduplication Key
In order to correlate alarm triggers and alarm resolves we publish a special “deduplicationKey” attribute with a value of the hash of GroupKey similar to PagerDuty, OpsGenie and VictorOps

While implementing the deduplication key logic, we found that the hashed group key, which we use for other notifiers, is not unique enough: SNS de-duplicates messages from our notifier even when they contain different labels and data. That means a user would not be able to publish to a topic until the SNS de-dupe time limit is up. I'm wary of suggesting a hash of the message itself, since that is the content-based deduplication SNS already does.

Are there any suggestions to create a better deduplication key for this notifier?

@treid314
Contributor

Truncation strategy

If message doesn’t fit in 256Kb limit SNS receiver will truncate message content (Note, that message body is required and can’t be empty).

If message still can’t fit into the limit we will truncate message attributes one by one until message won’t fit the size limit.

@maxbrodin I'm a bit confused by this truncation strategy. From the AWS docs, it seems that message attributes are separate from the message body, so removing message attributes won't affect the total message length. Is there something I'm missing?

@roidelapluie
Member

The content of the message can vary from one minute to the next for the same alert, because annotations can contain changing data like values and query results, even from multiple Prometheus servers. It cannot be used for hashing.

@treid314
Contributor

In the SNS SDK, the API version is hard-coded in the client metadata for the SNS client, which binds the SDK to a specific API version. I propose we remove the api_version option from the AlertManager SNS config, since it is set by the SDK.

@treid314
Contributor

Deduplication Key
In order to correlate alarm triggers and alarm resolves we publish a special “deduplicationKey” attribute with a value of the hash of GroupKey similar to PagerDuty, OpsGenie and VictorOps

While implementing the deduplication key logic, we found that the hashed group key, which we use for other notifiers, is not unique enough: SNS de-duplicates messages from our notifier even when they contain different labels and data. That means a user would not be able to publish to a topic until the SNS de-dupe time limit is up. I'm wary of suggesting a hash of the message itself, since that is the content-based deduplication SNS already does.

Are there any suggestions to create a better deduplication key for this notifier?

I want to bring this back up before we complete this issue. Right now, with a hashed group key, you would not be able to publish another message to a FIFO topic for 5 minutes (the SNS de-dupe time limit). I think we need to consider adding to what we use to compute the hash, so that we get a deduplication key that gives us more control over what is deduplicated on the SNS side.

Are there any suggestions for what to add to our hash to handle this issue better?

@roidelapluie
Member

We could hash all the labels from all the alerts, but not the annotations.
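
One way to read this suggestion, as a rough sketch only: combine the group key with the sorted labels of every alert in the group and leave annotations out. The `alert` type, the `labelsDedupKey` name, the separator bytes and the use of SHA-256 are all assumptions made for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"sort"
	"strings"
)

// alert is a hypothetical, simplified alert with labels and annotations.
type alert struct {
	Labels      map[string]string
	Annotations map[string]string
}

// labelsDedupKey hashes the group key together with the labels of all alerts.
// Annotations are ignored on purpose because they can change between
// notifications for the same alert.
func labelsDedupKey(groupKey string, alerts []alert) string {
	lines := make([]string, 0, len(alerts))
	for _, a := range alerts {
		names := make([]string, 0, len(a.Labels))
		for n := range a.Labels {
			names = append(names, n)
		}
		sort.Strings(names) // map iteration order is random, so sort label names
		var b strings.Builder
		for _, n := range names {
			b.WriteString(n)
			b.WriteByte(0xff) // separator byte that cannot occur in valid UTF-8
			b.WriteString(a.Labels[n])
			b.WriteByte(0xff)
		}
		lines = append(lines, b.String())
	}
	sort.Strings(lines) // make the key independent of alert order

	h := sha256.New()
	h.Write([]byte(groupKey))
	for _, l := range lines {
		h.Write([]byte{0})
		h.Write([]byte(l))
	}
	return hex.EncodeToString(h.Sum(nil))
}
```

With this sketch, a changed annotation value leaves the key untouched, while a new or changed label produces a different key.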

@alvinlin123
Contributor

I think it's fine, initially, to just use the group key hash as the SNS dedupe key and group key. 5 minutes as the minimum value for group_interval seems reasonable if the receiver is SNS FIFO; the official Prometheus AlertManager docs recommend 5 minutes or more too. If there are alerts within the same group that I want to receive less than 5 minutes apart, I can either use a non-FIFO topic or separate them into different groups.

Later on, if there is a strong need, we can introduce a new SNS receiver config option like sns_fifo_topic_dedupe_strategy.

@roidelapluie
Member

Implemented #2615
