我如何找到导致EC2自动扩展组“运行状况检查”的原因?失败? (不涉及负载均衡器) [英] How do I find the cause of an EC2 autoscaling group "health check" failure? (no load balancer involved)

查看:111
本文介绍了我如何找到导致EC2自动扩展组“运行状况检查”的原因?失败? (不涉及负载均衡器)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我的AWS自动伸缩组中的EC2实例在运行1-4小时后都终止。确切的时间会有所不同,但是一旦发生,整个团队就会在几分钟之内崩溃。

The EC2 instances in my AWS autoscaling group all terminate after 1-4 hours hours of running. The exact time varies, but when it happens, the entire group goes down within minutes of each other.

每个组的扩展历史描述很简单:

The scaling history description for each is simply:


在2016-08-26T05:21:04Z,实例因响应EC2运行状况检查而被终止服务,表明该实例已终止或停止。 / p>

At 2016-08-26T05:21:04Z an instance was taken out of service in response to a EC2 health check indicating it has been terminated or stopped.

但是我还没有添加任何健康检查。

But I haven't added any health checks. And the EC2 status checks all pass for the life of the instance.

我如何确定此运行状况检查失败的实际含义是什么?

How do I determine what this "health check" failure actually means?

有关ASG终止的大多数问题都引回到负载均衡器,但是我没有负载均衡器。该集群处理批处理作业,并且最小/最大/期望值由软件根据系统中其他位置的工作量积压进行控制。

Most questions around ASG termination all lead back to the load balancer, but I have no load balancer. This cluster processes batch jobs, and min/max/desired values are controlled by software based on workload backlog elsewhere in the system.

ASG历史记录并不表明发生了放大事件,并且实例也受到了明确保护,无法进行放大。

The ASG history does not indicate a scale-in event, AND the instances are also all protected from scale-in explicitly.

我尝试将运行状况检查宽限期设置为20小时,以查看是否至少可以使实例停止运行,以便我可以对其进行检查,但它们仍然会终止。

I tried setting the health check grace period to 20 hours to see if that at least leaves the instance up so I can inspect it, but they all still terminate.

实例正在运行ECS AMI,并且ECS在容器中运行启动时启动的单个任务。该任务的日志看起来很正常,并且看起来运行良好,直到实例消失前几分钟。

The instances are running an ECS AMI, and ECS is running a single task, started at bootup, in a container. The logs from that task look normal, and things seem to be running happily until a few minutes before the instance vanishes.

该任务占用大量CPU,但是当该任务仍然发生时,仍会发生错误我只睡了六个小时。

The task is CPU intensive, but error occurs still when I just have it sleep for six hours.

推荐答案

以下是一些建议:


  • 要查看实例为何终止,请在EC2的实例列表中选择 terminated 实例,然后选择 Get System Log 实例设置(菜单)中的em>,然后向下滚动到底部以查看任何明显的问题。实例终止后,日志会保留一段时间。

  • 在活动服务中的ECS群集中,检查 Events 选项卡中是否有任何消息。

  • 目标组部分中,验证健康检查目标已注册目标及其可用区域状态健康

  • To see why instance was terminated, in EC2's Instance list select terminated instance, and select Get System Log in Instance Settings (menu), then scroll down to the bottom to see any obvious issues. The logs are kept for a while after instance is terminated.
  • In ECS cluster within your active service, check Events tab for any messages.
  • In Target Group section, verify Health checks and Targets (Registered targets and their Status, and Health of the Availability Zones.

修改健康使用AWS控制台检查目标组的设置,选择目标组,然后编辑运行状况检查

To modify health check settings for a target group using the AWS Console, choose Target Groups, and edit Health checks.

在ASG(EC2的 Auto Scaling组)中,检查 Details (用于 Termination Policies ),活动历史记录( (用于终止消息),实例(用于其健康状态),计划的操作扩展策略

In ASG (EC2's Auto Scaling group), check Details (for Termination Policies), Activity History (for termination messages), Instances (for their Health Status), Scheduled Actions and Scaling Policies.

这篇关于我如何找到导致EC2自动扩展组“运行状况检查”的原因?失败? (不涉及负载均衡器)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆