Strange CloudWatch alarm behaviour


Question

I have a backup script that runs every 2 hours. I want to use CloudWatch to track the successful executions of this script and CloudWatch's Alarms to get notified whenever the script runs into problems.

The script puts a data point on a CloudWatch metric after every successful backup:

    mon-put-data --namespace Backup --metric-name $metric --unit Count --value 1

I have an alarm that goes to ALARM state whenever the statistic "Sum" on the metric is less than 2 in a 6-hour period.
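
For reference, an alarm like that could have been created with the same legacy CloudWatch command-line tools, along these lines. This is only a sketch: the SNS topic ARN is an illustrative placeholder, $metric stands for the metric name used by the script, and the flag names follow the PutMetricAlarm API as exposed by mon-put-metric-alarm.

    # Sketch: Sum < 2 over one 6-hour (21600s) period triggers the alarm action
    mon-put-metric-alarm \
        --alarm-name alarm-backup-svn \
        --namespace Backup \
        --metric-name $metric \
        --statistic Sum \
        --period 21600 \
        --evaluation-periods 1 \
        --threshold 2 \
        --comparison-operator LessThanThreshold \
        --alarm-actions arn:aws:sns:us-east-1:123456789012:backup-notify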

In order to test this setup, after a day, I stopped putting data in the metric (ie, I commented out the mon-put-data command). Good, eventually the alarm went to ALARM state and I got an email notification, as expected.

The problem is that, some time later, the alarm went back to the OK state, even though no new data was being added to the metric!

The two transitions (OK => ALARM, then ALARM => OK) have been logged and I reproduce the logs in this question. Note that, although both show "period: 21600" (ie, 6h), the second one shows a 12-hour time span between startDate and queryDate; I see that this might explain the transition, but I cannot understand why CloudWatch is considering a 12-hour time span to calculate a statistic with a 6-hour period!

What am I missing here? How can I configure the alarms to achieve what I want (i.e., get notified if backups are not being made)?

The first transition, from OK to ALARM:

{
    "Timestamp": "2013-03-06T15:12:01.069Z",
    "HistoryItemType": "StateUpdate",
    "AlarmName": "alarm-backup-svn",
    "HistoryData": {
        "version": "1.0",
        "oldState": {
            "stateValue": "OK",
            "stateReason": "Threshold Crossed: 1 datapoint (3.0) was not less than the threshold (3.0).",
            "stateReasonData": {
                "version": "1.0",
                "queryDate": "2013-03-05T21:12:44.081+0000",
                "startDate": "2013-03-05T15:12:00.000+0000",
                "statistic": "Sum",
                "period": 21600,
                "recentDatapoints": [
                    3
                ],
                "threshold": 3
            }
        },
        "newState": {
            "stateValue": "ALARM",
            "stateReason": "Threshold Crossed: 1 datapoint (1.0) was less than the threshold (2.0).",
            "stateReasonData": {
                "version": "1.0",
                "queryDate": "2013-03-06T15:12:01.052+0000",
                "startDate": "2013-03-06T09:12:00.000+0000",
                "statistic": "Sum",
                "period": 21600,
                "recentDatapoints": [
                    1
                ],
                "threshold": 2
            }
        }
    },
    "HistorySummary": "Alarm updated from OK to ALARM"
}

The second one, which I simply cannot understand:

{
    "Timestamp": "2013-03-06T17:46:01.063Z",
    "HistoryItemType": "StateUpdate",
    "AlarmName": "alarm-backup-svn",
    "HistoryData": {
        "version": "1.0",
        "oldState": {
            "stateValue": "ALARM",
            "stateReason": "Threshold Crossed: 1 datapoint (1.0) was less than the threshold (2.0).",
            "stateReasonData": {
                "version": "1.0",
                "queryDate": "2013-03-06T15:12:01.052+0000",
                "startDate": "2013-03-06T09:12:00.000+0000",
                "statistic": "Sum",
                "period": 21600,
                "recentDatapoints": [
                    1
                ],
                "threshold": 2
            }
        },
        "newState": {
            "stateValue": "OK",
            "stateReason": "Threshold Crossed: 1 datapoint (3.0) was not less than the threshold (2.0).",
            "stateReasonData": {
                "version": "1.0",
                "queryDate": "2013-03-06T17:46:01.041+0000",
                "startDate": "2013-03-06T05:46:00.000+0000",
                "statistic": "Sum",
                "period": 21600,
                "recentDatapoints": [
                    3
                ],
                "threshold": 2
            }
        }
    },
    "HistorySummary": "Alarm updated from ALARM to OK"
}


Answer

This behaviour (your alarm not transitioning into the INSUFFICIENT_DATA state) happens because CloudWatch accepts 'pre-timestamped' metric data points, so for a 6-hour alarm, if no data exists in the current open 6-hour window it will reach back and take data from the previous 6-hour window (hence the 12-hour time span between startDate and queryDate you see above).

To increase the 'fidelity' of your alarm, reduce the alarm period to 1 hour (3600s) and increase the number of evaluation periods to however many periods you want to fail before being alarmed. That will ensure your alarm transitions into INSUFFICIENT_DATA as you expect.
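
For example, a sketch of that suggestion with the same legacy CLI might look like the following. The choice of three evaluation periods is arbitrary (enough to cover the 2-hour backup schedule with some slack), and the metric name and topic ARN are placeholders as above.

    # Hourly period; alarm after 3 consecutive 1-hour periods with Sum < 1
    mon-put-metric-alarm \
        --alarm-name alarm-backup-svn \
        --namespace Backup \
        --metric-name $metric \
        --statistic Sum \
        --period 3600 \
        --evaluation-periods 3 \
        --threshold 1 \
        --comparison-operator LessThanThreshold \
        --alarm-actions arn:aws:sns:us-east-1:123456789012:backup-notify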


How can I configure the alarms to achieve what I want (i.e., get notified if backups are not being made)?

A possible architecture for your alarm would be to publish 1 if the job succeeds and 0 if it fails. Then create an alarm with a threshold of < 1 over three 3600-second periods, which means the alarm goes into ALARM if the job is failing (i.e. running, but failing). If you also set an INSUFFICIENT_DATA action on that alarm, you will be notified as well if your job is not running at all.
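
A minimal sketch of that approach, assuming the backup runs from a shell script: run_backup is a hypothetical stand-in for the actual backup command, the SNS topic ARN is a placeholder, and --insufficient-data-actions is the flag that wires up the INSUFFICIENT_DATA notification.

    # In the backup script: always publish a value, 1 on success, 0 on failure
    if run_backup; then
        mon-put-data --namespace Backup --metric-name $metric --unit Count --value 1
    else
        mon-put-data --namespace Backup --metric-name $metric --unit Count --value 0
    fi

    # One-off alarm setup: ALARM when the job runs but fails,
    # INSUFFICIENT_DATA action when no data arrives at all
    mon-put-metric-alarm \
        --alarm-name alarm-backup-svn \
        --namespace Backup \
        --metric-name $metric \
        --statistic Sum \
        --period 3600 \
        --evaluation-periods 3 \
        --threshold 1 \
        --comparison-operator LessThanThreshold \
        --alarm-actions arn:aws:sns:us-east-1:123456789012:backup-notify \
        --insufficient-data-actions arn:aws:sns:us-east-1:123456789012:backup-notify

With values published on both success and failure, ALARM means the backup ran and failed, while INSUFFICIENT_DATA means it stopped running entirely, so both failure modes produce a notification.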

Hope this helps.
