向EC2自动伸缩组添加容量的AWS CloudWatch Alarm一直处于警报状态 [英] AWS CloudWatch Alarm to add capacity to EC2 autoscaling group has been in alarm forever

查看:184
本文介绍了向EC2自动伸缩组添加容量的AWS CloudWatch Alarm一直处于警报状态的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我将CloudWatch警报设置为在内存预留大于70%时向EC2自动扩展组添加1个容量单位。警报是在正确的时刻触发的,但是此后警报已超过16小时以上,而EC2自动缩放组中根本没有任何变化。

I set a CloudWatch Alarm to add 1 capacity unit to EC2 autoscaling group when memory reservation is > 70%. The Alarm was triggered at the right moment, but it has since been in alarm for 16 hours+ with no change at all in the EC2 autoscaling group. What could possibly be going wrong?

这是我的ECS CloudFormation模板:

Here's my ECS CloudFormation template:

ECSCluster:
  Type: AWS::ECS::Cluster
  Properties:
    ClusterName: !Ref EnvironmentName

ECSAutoScalingGroup:
  DependsOn: ECSCluster
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    VPCZoneIdentifier: !Ref Subnets
    LaunchConfigurationName: !Ref ECSLaunchConfiguration
    MinSize: !Ref ClusterMinSize
    MaxSize: !Ref ClusterMaxSize
    DesiredCapacity: !Ref ClusterDesiredCapacity
  CreationPolicy:
    ResourceSignal:
      Timeout: PT15M
  UpdatePolicy:
    AutoScalingRollingUpdate:
      MinInstancesInService: 1
      MaxBatchSize: 1
      PauseTime: PT15M
      SuspendProcesses:
        - HealthCheck
        - ReplaceUnhealthy
        - AZRebalance
        - AlarmNotification
        - ScheduledActions
      WaitOnResourceSignals: true

ScaleUpPolicy:
  Type: AWS::AutoScaling::ScalingPolicy
  Properties:
    AdjustmentType: ChangeInCapacity
    AutoScalingGroupName: !Ref ECSAutoScalingGroup
    Cooldown: '1'
    ScalingAdjustment: '1'

MemoryReservationAlarmHigh:
  Type: AWS::CloudWatch::Alarm
  Properties:
    EvaluationPeriods: '2'
    Statistic: Average
    Threshold: '70'
    AlarmDescription: Alarm if Cluster Memory Reservation is too high
    Period: '60'
    AlarmActions:
    - Ref: ScaleUpPolicy
    Namespace: AWS/ECS
    Dimensions:
    - Name: ClusterName
      Value: !Ref ECSCluster
    ComparisonOperator: GreaterThanThreshold
    MetricName: MemoryReservation

ScaleDownPolicy:
  Type: AWS::AutoScaling::ScalingPolicy
  Properties:
    AdjustmentType: ChangeInCapacity
    AutoScalingGroupName: !Ref ECSAutoScalingGroup
    Cooldown: '1'
    ScalingAdjustment: '-1'

MemoryReservationAlarmLow:
  Type: AWS::CloudWatch::Alarm
  Properties:
    EvaluationPeriods: '2'
    Statistic: Average
    Threshold: '30'
    AlarmDescription: Alarm if Cluster Memory Reservation is too Low
    Period: '60'
    AlarmActions:
    - Ref: ScaleDownPolicy
    Namespace: AWS/ECS
    Dimensions:
    - Name: ClusterName
      Value: !Ref ECSCluster
    ComparisonOperator: LessThanThreshold
    MetricName: MemoryReservation

ECSLaunchConfiguration:
  Type: AWS::AutoScaling::LaunchConfiguration
  Properties:
    KeyName: !If [IsProd, !Ref 'AWS::NoValue', !Ref KeyName]
    ImageId: !Ref ECSAMI
    InstanceType: !Ref InstanceType
    SecurityGroups:
      - !Ref SecurityGroup
    IamInstanceProfile: !Ref ECSInstanceProfile
    UserData:
      "Fn::Base64": !Sub |
        #!/bin/bash
        source /etc/profile.d/proxy.sh
        yum install -y https://s3.amazonaws.com/ec2-downloads-windows/SSMAgent/latest/linux_amd64/amazon-ssm-agent.rpm
        yum install -y https://s3.amazonaws.com/amazoncloudwatch-agent/amazon_linux/amd64/latest/amazon-cloudwatch-agent.rpm
        yum install -y aws-cfn-bootstrap hibagent
        cat >> /opt/aws/amazon-cloudwatch-agent/etc/common-config.toml <<EOF
        [proxy]
            http_proxy="${!http_proxy}"
            https_proxy="${!https_proxy}"
            no_proxy="${!no_proxy}"
        EOF
        /opt/aws/bin/cfn-init -v --region ${AWS::Region} --stack ${AWS::StackName} --resource ECSLaunchConfiguration
        /opt/aws/bin/cfn-signal -e $? --region ${AWS::Region} --stack ${AWS::StackName} --resource ECSAutoScalingGroup
        /usr/bin/enable-ec2-spot-hibernation

  Metadata:
    AWS::CloudFormation::Init:
      config:
        packages:
          yum:
            collectd: []

        commands:
          01_add_instance_to_cluster:
            command: !Sub echo ECS_CLUSTER=${ECSCluster} >> /etc/ecs/ecs.config
          02_enable_cloudwatch_agent:
            command: !Sub /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c ssm:${ECSCloudWatchParameter} -s
        files:
          /etc/cfn/cfn-hup.conf:
            mode: 000400
            owner: root
            group: root
            content: !Sub |
              [main]
              stack=${AWS::StackId}
              region=${AWS::Region}

          /etc/cfn/hooks.d/cfn-auto-reloader.conf:
            content: !Sub |
              [cfn-auto-reloader-hook]
              triggers=post.update
              path=Resources.ECSLaunchConfiguration.Metadata.AWS::CloudFormation::Init
              action=/opt/aws/bin/cfn-init -v --region ${AWS::Region} --stack ${AWS::StackName} --resource ECSLaunchConfiguration

        services:
          sysvinit:
            cfn-hup:
              enabled: true
              ensureRunning: true
              files:
                - /etc/cfn/cfn-hup.conf
                - /etc/cfn/hooks.d/cfn-auto-reloader.conf

# This IAM Role is attached to all of the ECS hosts. It is based on the default role
# published here:
# http://docs.aws.amazon.com/AmazonECS/latest/developerguide/instance_IAM_role.html
#
# You can add other IAM policy statements here to allow access from your ECS hosts
# to other AWS services. Please note that this role will be used by ALL containers
# running on the ECS host.

ECSRole:
  Type: AWS::IAM::Role
  Properties:
    Path: /
    RoleName: !Sub ${EnvironmentName}-ECSRole-${AWS::Region}
    AssumeRolePolicyDocument: |
      {
          "Statement": [{
              "Action": "sts:AssumeRole",
              "Effect": "Allow",
              "Principal": {
                  "Service": "ec2.amazonaws.com"
              }
          }]
      }
    ManagedPolicyArns:
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/CSOPSRestrictionPolicy"
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/HIPIAMRestrictionPolicy"
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/HIPBasePolicy"
      - arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM
      - arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
    Policies:
      - PolicyName: ecs-service
        PolicyDocument: |
          {
              "Statement": [{
                  "Effect": "Allow",
                  "Action": [
                      "ecs:CreateCluster",
                      "ecs:DeregisterContainerInstance",
                      "ecs:DiscoverPollEndpoint",
                      "ecs:Poll",
                      "ecs:RegisterContainerInstance",
                      "ecs:StartTelemetrySession",
                      "ecs:Submit*",
                      "ecr:BatchCheckLayerAvailability",
                      "ecr:BatchGetImage",
                      "ecr:GetDownloadUrlForLayer",
                      "ecr:GetAuthorizationToken"
                  ],
                  "Resource": "*"
              }]
          }

ECSInstanceProfile:
  Type: AWS::IAM::InstanceProfile
  Properties:
    Path: /
    Roles:
      - !Ref ECSRole

ECSServiceAutoScalingRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: "2012-10-17"
      Statement:
        Action:
          - "sts:AssumeRole"
        Effect: Allow
        Principal:
          Service:
            - application-autoscaling.amazonaws.com
    Path: /
    ManagedPolicyArns:
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/CSOPSRestrictionPolicy"
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/HIPIAMRestrictionPolicy"
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/HIPBasePolicy"
    Policies:
      - PolicyName: ecs-service-autoscaling
        PolicyDocument:
          Statement:
            Effect: Allow
            Action:
              - application-autoscaling:*
              - cloudwatch:DescribeAlarms
              - cloudwatch:PutMetricAlarm
              - ecs:DescribeServices
              - ecs:UpdateService
            Resource: "*"

ECSCloudWatchParameter:
  Type: AWS::SSM::Parameter
  Properties:
    Description: CloudWatch Log configs for ECS cluster
    Name: !Sub AmazonCloudWatch-${ECSCluster}-ECS
    Type: String
    Value: !Sub |
      {
        "logs": {
          "force_flush_interval": 5,
          "logs_collected": {
            "files": {
              "collect_list": [
                {
                  "file_path": "/var/log/messages",
                  "log_group_name": "${ECSCluster}/var/log/messages",
                  "log_stream_name": "{instance_id}",
                  "timestamp_format": "%b %d %H:%M:%S"
                },
                {
                  "file_path": "/var/log/dmesg",
                  "log_group_name": "${ECSCluster}/var/log/dmesg",
                  "log_stream_name": "{instance_id}"
                },
                {
                  "file_path": "/var/log/docker",
                  "log_group_name": "${ECSCluster}/var/log/docker",
                  "log_stream_name": "{instance_id}",
                  "timestamp_format": "%Y-%m-%dT%H:%M:%S.%f"
                },
                {
                  "file_path": "/var/log/ecs/ecs-init.log",
                  "log_group_name": "${ECSCluster}/var/log/ecs/ecs-init.log",
                  "log_stream_name": "{instance_id}",
                  "timestamp_format": "%Y-%m-%dT%H:%M:%SZ"
                },
                {
                  "file_path": "/var/log/ecs/ecs-agent.log.*",
                  "log_group_name": "${ECSCluster}/var/log/ecs/ecs-agent.log",
                  "log_stream_name": "{instance_id}",
                  "timestamp_format": "%Y-%m-%dT%H:%M:%SZ"
                },
                {
                  "file_path": "/var/log/ecs/audit.log",
                  "log_group_name": "${ECSCluster}/var/log/ecs/audit.log",
                  "log_stream_name": "{instance_id}",
                  "timestamp_format": "%Y-%m-%dT%H:%M:%SZ"
                }
              ]
            }
          }
        },
        "metrics": {
          "append_dimensions": {
            "AutoScalingGroupName": "${!aws:AutoScalingGroupName}",
            "InstanceId": "${!aws:InstanceId}",
            "InstanceType": "${!aws:InstanceType}"
          },
          "metrics_collected": {
            "collectd": {
              "metrics_aggregation_interval": 60
            },
            "disk": {
              "measurement": [
                "used_percent"
              ],
              "metrics_collection_interval": 60,
              "resources": [
                "/"
              ]
            },
            "mem": {
              "measurement": [
                "mem_used_percent"
              ],
              "metrics_collection_interval": 60
            },
            "statsd": {
              "metrics_aggregation_interval": 60,
              "metrics_collection_interval": 10,
              "service_address": ":8125"
            }
          }
        }
      }

ECSClusterParameter:
  Type: AWS::SSM::Parameter
  Properties:
    Description: !Sub ${EnvironmentName} - ECS Cluster
    Name: !Sub /${EnvironmentName}/ecs-cluster
    Type: String
    Value: !Ref ECSCluster

ECSServiceAutoScalingRoleParameter:
  Type: AWS::SSM::Parameter
  Properties:
    Description: !Sub ${EnvironmentName} - ECS Service ASG Role
    Name: !Sub /${EnvironmentName}/ecs-service-asg-role
    Type: String
    Value: !GetAtt ECSServiceAutoScalingRole.Arn

警报活动历史记录:

2019-12-26 11:40:54 Action  Successfully executed action arn:aws:autoscaling:ap-southeast-2:031539715286:scalingPolicy:95e836b6-2f56-498d-b931-7ec4184bedc4:autoScalingGroupName/ECS-UEBZA8GAP8S7-ECSAutoScalingGroup-1BIBTJH5I50W9:policyName/ECS-UEBZA8GAP8S7-ScaleUpPolicy-17LUWE42DC7EO
2019-12-26 11:40:54 State update  Alarm updated from OK to In alarm


推荐答案

确保没有任何进程挂起。警报通知意味着传入的警报不会触发缩放策略。启动意味着即使期望值上升,也不会启动任何东西

Make sure there aren't any processes suspended. Alarm notification means that incoming alarms won't trigger scaling policies. Launch means even if the desired goes up nothing will be launched

其他可能导致此问题的常见问题:

Other common issues that can cause this:


  • 如果使用权重并将期望值增加1,但最低权重不是1,则可能永远无法缩放。

  • If you're using weights and increasing desired by 1, but the lowest weigh isn't 1, then it might never be able to scale.

请确保没有触发任何其他扩展策略,这些策略可能会覆盖此策略。

Make sure there aren't any other scaling policies being triggered that might override this one

检查活动历史记录,以确保没有持续进行任何健康检查替换,因为这将开始5分钟的冷却时间(默认值,因为没有在ASG上设置,只有缩放策略),并且会阻止简单缩放策略

Check the activity history to make sure there aren't any healthcheck replacements constantly happening, since that would start a 5 minute cooldown (default since one isn't set on the ASG, only the scaling policy), and would block simple scaling policies

确保所需的值尚未达到最大值

Make sure the desired isn't already at the Max

In除了触发警报外,请确保在警报历史记录中看到发生了自动缩放的操作(该操作实际上在警报保持在警报状态的每一分钟都发生,没有其他评估设置,但只有第一个得到发布到警报历史记录中)

In addition to the alarm being triggered, make sure you see in the Alarm history that the autoscaling 'action' happened (The action actually happens every minute the alarm stays in the Alarm state, no mater what your evaluation settings, but only the first one gets posted to the Alarm history)

检查ASG活动历史记录是否存在启动失败,这在使用竞价型实例时尤其常见,并且ASG最终会进入退避状态失败后的状态。对群组进行的任何手动更新都会重置此补偿

Check the ASG Activity history for launch failures, this is especially common if using spot instances, and the ASG will eventually enter a backoff state after enough failures. Any manual update to the group will reset this backoff

这篇关于向EC2自动伸缩组添加容量的AWS CloudWatch Alarm一直处于警报状态的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆