AWS ECS service Tasks getting replaced with (reason Request timed out)

Problem description

We have been running ECS as our container orchestration layer for more than 2 years, but there is one problem we have not been able to figure out the reason for: in a few of our (Node.js) services we have started observing errors in ECS events such as

service example-service (instance i-016b0a460d9974567) (port 1047) is unhealthy in target-group example-service due to (reason Request timed out)

This causes our dependent services to start experiencing 504 gateway timeouts, which impacts them in a big way.

Things we have tried so far:

  1. Upgraded the Docker storage driver from devicemapper to overlay2.
  2. Increased the resources for all ECS instances, including CPU, RAM and EBS storage, based on what we saw in a few containers.
  3. Increased the health check grace period for the service from 0 to 240 seconds.
  4. Increased KeepAliveTimeout and SocketTimeout to 180 seconds (see the keep-alive sketch after this list).
  5. Enabled awslogs on the containers instead of stdout, but there was no unusual behavior.
  6. Enabled ECS metadata on the containers and pipelined all of that information into our application logs. This helped us look at all the logs for the problematic container only (a sketch of reading this metadata also follows the list).
  7. Enabled Container Insights for better container-level debugging.
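For step 4, these timeouts are usually raised on the Node.js HTTP server itself. Below is a minimal sketch, assuming a plain Node.js http server and the 180-second values mentioned above; the exact numbers and the PORT variable are illustrative:

// Minimal sketch, assuming a plain Node.js http server (adjust for Express/Koa).
// Raises the server-side keep-alive and socket timeouts to the 180s mentioned above,
// so the load balancer's idle timeout is never longer than the container's.
const http = require('http');

const server = http.createServer((req, res) => {
  res.writeHead(200, { 'Content-Type': 'application/json' });
  res.end(JSON.stringify({ status: 'ok' }));
});

// How long an idle keep-alive connection is kept open by Node.
server.keepAliveTimeout = 180 * 1000;
// headersTimeout should stay slightly above keepAliveTimeout, otherwise Node can
// still close the socket first, which shows up as 502/504s behind an ALB.
server.headersTimeout = 185 * 1000;
// Per-socket inactivity timeout, matching the SocketTimeout value from step 4.
server.setTimeout(180 * 1000);

server.listen(process.env.PORT || 3000);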
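For step 6, here is a sketch of how the ECS metadata can be pulled into application logs. It assumes ECS_ENABLE_CONTAINER_METADATA=true is set on the ECS agent, which exposes a metadata file path through the ECS_CONTAINER_METADATA_FILE environment variable inside the container; the logWithMeta helper is purely illustrative:

// Sketch only: read the ECS container metadata file and attach its identifiers
// to every application log line, so a single problematic container can be filtered.
const fs = require('fs');

let ecsMeta = {};
if (process.env.ECS_CONTAINER_METADATA_FILE) {
  ecsMeta = JSON.parse(fs.readFileSync(process.env.ECS_CONTAINER_METADATA_FILE, 'utf8'));
}

function logWithMeta(message) {
  console.log(JSON.stringify({
    message,
    taskArn: ecsMeta.TaskARN,
    containerId: ecsMeta.ContainerID,
    containerInstanceArn: ecsMeta.ContainerInstanceARN,
  }));
}

logWithMeta('service started');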

Of all these, the things that helped the most were upgrading from devicemapper to the overlay2 storage driver and increasing the health check grace period.

The number of errors has come down dramatically with these two changes, but we still hit this issue once in a while.

We have looked at all the graphs related to the instance and the container that went down; below are the logs for it.

ECS Container Insights logs for the victim container:

Query:

fields CpuUtilized, MemoryUtilized, @message
| filter Type = "Container" and EC2InstanceId = "i-016b0a460d9974567" and TaskId = "dac7a872-5536-482f-a2f8-d2234f9db6df"

Sample log from the query response:

{
"Version":"0",
"Type":"Container",
"ContainerName":"example-service",
"TaskId":"dac7a872-5536-482f-a2f8-d2234f9db6df",
"TaskDefinitionFamily":"example-service",
"TaskDefinitionRevision":"2048",
"ContainerInstanceId":"74306e00-e32a-4287-a201-72084d3364f6",
"EC2InstanceId":"i-016b0a460d9974567",
"ServiceName":"example-service",
"ClusterName":"example-service-cluster",
"Timestamp":1569227760000,
"CpuUtilized":1024.144923245614,
"CpuReserved":1347.0,
"MemoryUtilized":871,
"MemoryReserved":1857,
"StorageReadBytes":0,
"StorageWriteBytes":577536,
"NetworkRxBytes":14441583,
"NetworkRxDropped":0,
"NetworkRxErrors":0,
"NetworkRxPackets":17324,
"NetworkTxBytes":6136916,
"NetworkTxDropped":0,
"NetworkTxErrors":0,
"NetworkTxPackets":16989
}

None of the logs showed CPU or memory utilization that was unreasonably high.

We stopped getting responses from the victim container at, say, t1; we got errors in the dependent services at t1+2 minutes, and the container was taken away by ECS at t1+3 minutes.

Our health check configuration is below:

Protocol HTTP
Path  /healthcheck
Port traffic port
Healthy threshold  10
Unhealthy threshold 2
Timeout  5
Interval 10
Success codes 200
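For reference, the same settings expressed through the AWS SDK for JavaScript; this is only a sketch, the target group ARN is a placeholder, and the values simply mirror the table above. Note that with an unhealthy threshold of 2 and an interval of 10 seconds, a target that stops answering is marked unhealthy after roughly 20 seconds:

// Sketch only: mirrors the health check settings above via the AWS SDK for JavaScript (v2).
// The TargetGroupArn below is a placeholder, not a real resource.
const AWS = require('aws-sdk');
const elbv2 = new AWS.ELBv2({ region: 'us-east-1' });

elbv2.modifyTargetGroup({
  TargetGroupArn: 'arn:aws:elasticloadbalancing:region:account:targetgroup/example-service/id',
  HealthCheckProtocol: 'HTTP',
  HealthCheckPath: '/healthcheck',
  HealthCheckPort: 'traffic-port',
  HealthyThresholdCount: 10,
  UnhealthyThresholdCount: 2,
  HealthCheckTimeoutSeconds: 5,
  HealthCheckIntervalSeconds: 10,
  Matcher: { HttpCode: '200' },
}).promise().then(console.log).catch(console.error);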

Let me know if you need any more information; I will be happy to provide it. The configuration we are running on is:

docker info
Containers: 11
 Running: 11
 Paused: 0
 Stopped: 0
Images: 6
Server Version: 18.06.1-ce
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 468a545b9edcd5932818eb9de8e72413e616e86e
runc version: 69663f0bd4b60df09991c08812a60108003fa340
init version: fec3683
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.14.138-89.102.amzn1.x86_64
Operating System: Amazon Linux AMI 2018.03
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 30.41GiB
Name: ip-172-32-6-105
ID: IV65:3LKL:JESM:UFA4:X5RZ:M4NZ:O3BY:IZ2T:UDFW:XCGW:55PW:D7JH
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

There should be some indication of resource contention, a service crash, or a genuine network failure to explain all this. But as mentioned, nothing we found pointed to the cause of the issue.

Recommended answer

Your steps 1 to 7 have almost nothing to do with the error.

service example-service (instance i-016b0a460d9974567) (port 1047) is unhealthy in target-group example-service due to (reason Request timed out)

The error is very clear: your ECS service is not reachable by the load balancer health check.

Target group unhealthy

When this is the case, check the container security group, the port, the application status, and the health check status code directly.

Possible reasons

  • There might be no route for the path /healthcheck in the backend service
  • The status code returned from /healthcheck is not 200
  • The target port might be configured incorrectly; if the application runs on port 8080 or 3000, the target port should be 8080 or 3000 accordingly
  • The security group is not allowing traffic to the target group
  • The application is not running in the container

These are the possible reasons for a timeout from the health check. A minimal sketch of a /healthcheck endpoint that satisfies these checks is shown below.
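The sketch assumes an Express app listening on the container's traffic port; the route path and status code come from the configuration in the question, and everything else is illustrative:

// Minimal sketch of a /healthcheck route that satisfies the target group checks above:
// it must exist on the traffic port, answer within the 5-second timeout, and return 200.
const express = require('express');
const app = express();

app.get('/healthcheck', (req, res) => {
  // Keep this handler cheap: a downstream call that blocks longer than the
  // 5-second health check timeout will get the target marked unhealthy.
  res.status(200).json({ status: 'ok' });
});

// The container must actually listen on the port registered with the target group
// (e.g. 3000 or 8080), and its security group must allow traffic from the ALB.
app.listen(process.env.PORT || 3000);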
