Strange CloudWatch alarm behaviour


Question

I have a backup script that runs every 2 hours. I want to use CloudWatch to track the successful executions of this script and CloudWatch's Alarms to get notified whenever the script runs into problems.

The script puts a data point on a CloudWatch metric after every successful backup:

    mon-put-data --namespace Backup --metric-name $metric --unit Count --value 1

I have an alarm that goes to ALARM state whenever the statistic "Sum" on the metric is less than 2 in a 6-hour period.
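
For reference, an alarm like that could have been created with the same legacy CloudWatch command-line tools, along these lines. This is only a sketch: the SNS topic ARN is an illustrative placeholder, $metric stands for the metric name used by the script, and the flag names follow the PutMetricAlarm API as exposed by mon-put-metric-alarm.

    # Sketch: Sum < 2 over one 6-hour (21600s) period triggers the alarm action
    mon-put-metric-alarm \
        --alarm-name alarm-backup-svn \
        --namespace Backup \
        --metric-name $metric \
        --statistic Sum \
        --period 21600 \
        --evaluation-periods 1 \
        --threshold 2 \
        --comparison-operator LessThanThreshold \
        --alarm-actions arn:aws:sns:us-east-1:123456789012:backup-notify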

In order to test this setup, after a day, I stopped putting data in the metric (ie, I commented out the mon-put-data command). Good, eventually the alarm went to ALARM state and I got an email notification, as expected.

The problem is that, some time later, the alarm went back to the OK state, even though no new data was being added to the metric!

The two transitions (OK => ALARM, then ALARM => OK) have been logged and I reproduce the logs in this question. Note that, although both show "period: 21600" (ie, 6h), the second one shows a 12-hour time span between startDate and queryDate; I see that this might explain the transition, but I cannot understand why CloudWatch is considering a 12-hour time span to calculate a statistic with a 6-hour period!

What am I missing here? How can I configure the alarms to achieve what I want (i.e., get notified if backups are not being made)?

The first transition, from OK to ALARM:

{
    "Timestamp": "2013-03-06T15:12:01.069Z",
    "HistoryItemType": "StateUpdate",
    "AlarmName": "alarm-backup-svn",
    "HistoryData": {
        "version": "1.0",
        "oldState": {
            "stateValue": "OK",
            "stateReason": "Threshold Crossed: 1 datapoint (3.0) was not less than the threshold (3.0).",
            "stateReasonData": {
                "version": "1.0",
                "queryDate": "2013-03-05T21:12:44.081+0000",
                "startDate": "2013-03-05T15:12:00.000+0000",
                "statistic": "Sum",
                "period": 21600,
                "recentDatapoints": [
                    3
                ],
                "threshold": 3
            }
        },
        "newState": {
            "stateValue": "ALARM",
            "stateReason": "Threshold Crossed: 1 datapoint (1.0) was less than the threshold (2.0).",
            "stateReasonData": {
                "version": "1.0",
                "queryDate": "2013-03-06T15:12:01.052+0000",
                "startDate": "2013-03-06T09:12:00.000+0000",
                "statistic": "Sum",
                "period": 21600,
                "recentDatapoints": [
                    1
                ],
                "threshold": 2
            }
        }
    },
    "HistorySummary": "Alarm updated from OK to ALARM"
}

The second one, which I simply cannot understand:

{
    "Timestamp": "2013-03-06T17:46:01.063Z",
    "HistoryItemType": "StateUpdate",
    "AlarmName": "alarm-backup-svn",
    "HistoryData": {
        "version": "1.0",
        "oldState": {
            "stateValue": "ALARM",
            "stateReason": "Threshold Crossed: 1 datapoint (1.0) was less than the threshold (2.0).",
            "stateReasonData": {
                "version": "1.0",
                "queryDate": "2013-03-06T15:12:01.052+0000",
                "startDate": "2013-03-06T09:12:00.000+0000",
                "statistic": "Sum",
                "period": 21600,
                "recentDatapoints": [
                    1
                ],
                "threshold": 2
            }
        },
        "newState": {
            "stateValue": "OK",
            "stateReason": "Threshold Crossed: 1 datapoint (3.0) was not less than the threshold (2.0).",
            "stateReasonData": {
                "version": "1.0",
                "queryDate": "2013-03-06T17:46:01.041+0000",
                "startDate": "2013-03-06T05:46:00.000+0000",
                "statistic": "Sum",
                "period": 21600,
                "recentDatapoints": [
                    3
                ],
                "threshold": 2
            }
        }
    },
    "HistorySummary": "Alarm updated from ALARM to OK"
}


Answer

This behaviour (your alarm not transitioning into the INSUFFICIENT_DATA state) happens because CloudWatch accepts 'pre-timestamped' metric data points, so for a 6-hour alarm, if no data exists in the current open 6-hour window it will reach back and take data from the previous 6-hour window (hence the 12-hour time span between startDate and queryDate you see above).

To increase the 'fidelity' of your alarm, reduce the alarm period to 1 hour (3600s) and increase the number of evaluation periods to however many periods you want to fail before being alarmed. That will ensure your alarm transitions into INSUFFICIENT_DATA as you expect.
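
For example, a sketch of that suggestion with the same legacy CLI might look like the following. The choice of three evaluation periods is arbitrary (enough to cover the 2-hour backup schedule with some slack), and the metric name and topic ARN are placeholders as above.

    # Hourly period; alarm after 3 consecutive 1-hour periods with Sum < 1
    mon-put-metric-alarm \
        --alarm-name alarm-backup-svn \
        --namespace Backup \
        --metric-name $metric \
        --statistic Sum \
        --period 3600 \
        --evaluation-periods 3 \
        --threshold 1 \
        --comparison-operator LessThanThreshold \
        --alarm-actions arn:aws:sns:us-east-1:123456789012:backup-notify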


How can I configure the alarms to achieve what I want (i.e., get notified if backups are not being made)?

A possible architecture for your alarm would be to publish 1 if the job succeeds and 0 if it fails. Then create an alarm with a threshold of < 1 over three 3600-second periods, which means the alarm goes into ALARM if the job is failing (i.e. running, but failing). If you also set an INSUFFICIENT_DATA action on that alarm, you will be notified as well if your job is not running at all.
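
A minimal sketch of that approach, assuming the backup runs from a shell script: run_backup is a hypothetical stand-in for the actual backup command, the SNS topic ARN is a placeholder, and --insufficient-data-actions is the flag that wires up the INSUFFICIENT_DATA notification.

    # In the backup script: always publish a value, 1 on success, 0 on failure
    if run_backup; then
        mon-put-data --namespace Backup --metric-name $metric --unit Count --value 1
    else
        mon-put-data --namespace Backup --metric-name $metric --unit Count --value 0
    fi

    # One-off alarm setup: ALARM when the job runs but fails,
    # INSUFFICIENT_DATA action when no data arrives at all
    mon-put-metric-alarm \
        --alarm-name alarm-backup-svn \
        --namespace Backup \
        --metric-name $metric \
        --statistic Sum \
        --period 3600 \
        --evaluation-periods 3 \
        --threshold 1 \
        --comparison-operator LessThanThreshold \
        --alarm-actions arn:aws:sns:us-east-1:123456789012:backup-notify \
        --insufficient-data-actions arn:aws:sns:us-east-1:123456789012:backup-notify

With values published on both success and failure, ALARM means the backup ran and failed, while INSUFFICIENT_DATA means it stopped running entirely, so both failure modes produce a notification.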

Hope this helps.
