How do I find the cause of an EC2 autoscaling group "health check" failure? (no load balancer involved)

Views: 111 | Posted: 2020/6/4 0:44:40 | Tags: amazon-web-services, amazon-ec2


Problem description

The EC2 instances in my AWS autoscaling group all terminate after 1-4 hours of running. The exact time varies, but when it happens, all instances in the group go down within minutes of each other.

The scaling history description for each is simply:

At 2016-08-26T05:21:04Z an instance was taken out of service in response to a EC2 health check indicating it has been terminated or stopped.
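The scaling history shown in the console can also be pulled programmatically, which makes it easier to compare the records for several instances side by side. The following is a minimal sketch, not a definitive tool: it assumes a client object that behaves like `boto3.client("autoscaling")`, and the function name `summarize_activities` is made up for illustration. The `Cause` field typically carries text like the message quoted above, and `Details` is usually a JSON-encoded string.

```python
import json


def summarize_activities(autoscaling_client, group_name, max_records=20):
    """Return one summary dict per recent scaling activity of the group.

    `autoscaling_client` is assumed to behave like boto3.client("autoscaling").
    """
    response = autoscaling_client.describe_scaling_activities(
        AutoScalingGroupName=group_name,
        MaxRecords=max_records,
    )
    summaries = []
    for activity in response.get("Activities", []):
        # "Details" is usually a JSON-encoded string; keep the raw text if not.
        raw_details = activity.get("Details")
        try:
            details = json.loads(raw_details) if raw_details else {}
        except ValueError:
            details = {"raw": raw_details}
        summaries.append({
            "start": activity.get("StartTime"),
            "status": activity.get("StatusCode"),
            "description": activity.get("Description"),
            "cause": activity.get("Cause"),
            "details": details,
        })
    return summaries
```

With boto3 installed and credentials configured, a call might look like `summarize_activities(boto3.client("autoscaling"), "my-batch-asg")`, where `my-batch-asg` is a placeholder group name.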

But I haven't added any health checks. And the EC2 status checks all pass for the life of the instance.
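Since the message says the instance "has been terminated or stopped", one thing worth checking is the state change that EC2 itself recorded, rather than the status checks. The sketch below assumes a client that behaves like `boto3.client("ec2")` and reads the `StateReason` and `StateTransitionReason` fields returned by `describe_instances`; the function name `termination_reasons` is hypothetical. Terminated instances stay visible to this call only for a limited time, so it would need to run soon after the instances disappear.

```python
def termination_reasons(ec2_client, instance_ids):
    """Map each instance ID to its state and the recorded state-change reason.

    `ec2_client` is assumed to behave like boto3.client("ec2").
    """
    response = ec2_client.describe_instances(InstanceIds=list(instance_ids))
    reasons = {}
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            # StateReason is absent for instances that never changed state.
            state_reason = instance.get("StateReason") or {}
            reasons[instance["InstanceId"]] = {
                "state": instance.get("State", {}).get("Name"),
                "transition_reason": instance.get("StateTransitionReason"),
                "reason_code": state_reason.get("Code"),
                "reason_message": state_reason.get("Message"),
            }
    return reasons
```

The reason code may help distinguish a shutdown started from inside the instance from one requested through the API, which narrows down where to look next.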

How do I determine what this "health check" failure actually means?

Most questions about ASG termination lead back to the load balancer, but I have no load balancer. This cluster processes batch jobs, and the min/max/desired values are controlled by software based on the workload backlog elsewhere in the system.

The ASG history does not indicate a scale-in event, AND the instances are also all protected from scale-in explicitly.

I tried setting the health check grace period to 20 hours to see if that at least leaves the instance up so I can inspect it, but they all still terminate.
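To confirm what the group is actually configured with, the health check type, the grace period, the size limits, and the per-instance protection flags can be read back in one call. This is a rough sketch under the assumption that the client behaves like `boto3.client("autoscaling")`; `group_health_settings` is an illustrative name. Note that a group uses the `EC2` health check type by default even when no load balancer is attached, and that the grace period is expressed in seconds, so 20 hours would appear as 72000.

```python
def group_health_settings(autoscaling_client, group_name):
    """Return health-check settings and per-instance status for one group.

    `autoscaling_client` is assumed to behave like boto3.client("autoscaling").
    """
    response = autoscaling_client.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name]
    )
    groups = response.get("AutoScalingGroups", [])
    if not groups:
        raise ValueError(f"no Auto Scaling group named {group_name!r}")
    group = groups[0]
    return {
        "health_check_type": group.get("HealthCheckType"),
        "grace_period_seconds": group.get("HealthCheckGracePeriod"),
        "min": group.get("MinSize"),
        "max": group.get("MaxSize"),
        "desired": group.get("DesiredCapacity"),
        "instances": [
            {
                "id": inst.get("InstanceId"),
                "lifecycle": inst.get("LifecycleState"),
                "health": inst.get("HealthStatus"),
                "protected_from_scale_in": inst.get("ProtectedFromScaleIn"),
            }
            for inst in group.get("Instances", [])
        ],
    }
```

Polling this periodically and logging the result could show whether the software that adjusts min/max/desired changed anything shortly before the instances were replaced.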

The instances are running an ECS AMI, and ECS is running a single task, started at bootup, in a container. The logs from that task look normal, and things seem to be running happily until a few minutes before the instance vanishes.

The task is CPU-intensive, but the error still occurs when I just have it sleep for six hours.
