AWS ECS service Tasks getting replaced with (reason Request timed out)

Problem description

We have been running ECS as our container orchestration layer for more than 2 years, but there is one problem we have not been able to figure out the reason for: in a few of our (Node.js) services we have started observing errors in ECS events such as

service example-service (instance i-016b0a460d9974567) (port 1047) is unhealthy in target-group example-service due to (reason Request timed out)

This causes our dependent services to start experiencing 504 gateway timeouts, which impacts them in a big way.

Things we have tried so far:

  1. Upgraded the Docker storage driver from devicemapper to overlay2.
  2. Increased the resources for all ECS instances, including CPU, RAM and EBS storage, based on what we saw in a few containers.
  3. Increased the health check grace period for the service from 0 to 240 seconds.
  4. Increased KeepAliveTimeout and SocketTimeout to 180 seconds (see the keep-alive sketch after this list).
  5. Enabled awslogs on the containers instead of stdout, but there was no unusual behavior.
  6. Enabled ECS metadata on the containers and pipelined all of that information into our application logs. This helped us look at all the logs for the problematic container only (a sketch of reading this metadata also follows the list).
  7. Enabled Container Insights for better container-level debugging.
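For step 4, these timeouts are usually raised on the Node.js HTTP server itself. Below is a minimal sketch, assuming a plain Node.js http server and the 180-second values mentioned above; the exact numbers and the PORT variable are illustrative:

// Minimal sketch, assuming a plain Node.js http server (adjust for Express/Koa).
// Raises the server-side keep-alive and socket timeouts to the 180s mentioned above,
// so the load balancer's idle timeout is never longer than the container's.
const http = require('http');

const server = http.createServer((req, res) => {
  res.writeHead(200, { 'Content-Type': 'application/json' });
  res.end(JSON.stringify({ status: 'ok' }));
});

// How long an idle keep-alive connection is kept open by Node.
server.keepAliveTimeout = 180 * 1000;
// headersTimeout should stay slightly above keepAliveTimeout, otherwise Node can
// still close the socket first, which shows up as 502/504s behind an ALB.
server.headersTimeout = 185 * 1000;
// Per-socket inactivity timeout, matching the SocketTimeout value from step 4.
server.setTimeout(180 * 1000);

server.listen(process.env.PORT || 3000);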
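For step 6, here is a sketch of how the ECS metadata can be pulled into application logs. It assumes ECS_ENABLE_CONTAINER_METADATA=true is set on the ECS agent, which exposes a metadata file path through the ECS_CONTAINER_METADATA_FILE environment variable inside the container; the logWithMeta helper is purely illustrative:

// Sketch only: read the ECS container metadata file and attach its identifiers
// to every application log line, so a single problematic container can be filtered.
const fs = require('fs');

let ecsMeta = {};
if (process.env.ECS_CONTAINER_METADATA_FILE) {
  ecsMeta = JSON.parse(fs.readFileSync(process.env.ECS_CONTAINER_METADATA_FILE, 'utf8'));
}

function logWithMeta(message) {
  console.log(JSON.stringify({
    message,
    taskArn: ecsMeta.TaskARN,
    containerId: ecsMeta.ContainerID,
    containerInstanceArn: ecsMeta.ContainerInstanceARN,
  }));
}

logWithMeta('service started');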

Of all these, the things that helped the most were upgrading from devicemapper to the overlay2 storage driver and increasing the health check grace period.

The number of errors has come down dramatically with these two changes, but we still hit this issue once in a while.

We have looked at all the graphs related to the instance and the container that went down; below are the logs for it.

ECS Container Insights logs for the victim container:

Query:

fields CpuUtilized, MemoryUtilized, @message
| filter Type = "Container" and EC2InstanceId = "i-016b0a460d9974567" and TaskId = "dac7a872-5536-482f-a2f8-d2234f9db6df"

Sample log from the query response:

{
"Version":"0",
"Type":"Container",
"ContainerName":"example-service",
"TaskId":"dac7a872-5536-482f-a2f8-d2234f9db6df",
"TaskDefinitionFamily":"example-service",
"TaskDefinitionRevision":"2048",
"ContainerInstanceId":"74306e00-e32a-4287-a201-72084d3364f6",
"EC2InstanceId":"i-016b0a460d9974567",
"ServiceName":"example-service",
"ClusterName":"example-service-cluster",
"Timestamp":1569227760000,
"CpuUtilized":1024.144923245614,
"CpuReserved":1347.0,
"MemoryUtilized":871,
"MemoryReserved":1857,
"StorageReadBytes":0,
"StorageWriteBytes":577536,
"NetworkRxBytes":14441583,
"NetworkRxDropped":0,
"NetworkRxErrors":0,
"NetworkRxPackets":17324,
"NetworkTxBytes":6136916,
"NetworkTxDropped":0,
"NetworkTxErrors":0,
"NetworkTxPackets":16989
}

None of the logs showed CPU or memory utilization that was unreasonably high.

We stopped getting responses from the victim container at, say, t1; we got errors in the dependent services at t1+2 minutes, and the container was taken away by ECS at t1+3 minutes.

Our health check configuration is below:

Protocol HTTP
Path  /healthcheck
Port traffic port
Healthy threshold  10
Unhealthy threshold 2
Timeout  5
Interval 10
Success codes 200
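For reference, the same settings expressed through the AWS SDK for JavaScript; this is only a sketch, the target group ARN is a placeholder, and the values simply mirror the table above. Note that with an unhealthy threshold of 2 and an interval of 10 seconds, a target that stops answering is marked unhealthy after roughly 20 seconds:

// Sketch only: mirrors the health check settings above via the AWS SDK for JavaScript (v2).
// The TargetGroupArn below is a placeholder, not a real resource.
const AWS = require('aws-sdk');
const elbv2 = new AWS.ELBv2({ region: 'us-east-1' });

elbv2.modifyTargetGroup({
  TargetGroupArn: 'arn:aws:elasticloadbalancing:region:account:targetgroup/example-service/id',
  HealthCheckProtocol: 'HTTP',
  HealthCheckPath: '/healthcheck',
  HealthCheckPort: 'traffic-port',
  HealthyThresholdCount: 10,
  UnhealthyThresholdCount: 2,
  HealthCheckTimeoutSeconds: 5,
  HealthCheckIntervalSeconds: 10,
  Matcher: { HttpCode: '200' },
}).promise().then(console.log).catch(console.error);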

Let me know if you need any more information; I will be happy to provide it. The configuration we are running on is:

docker info
Containers: 11
 Running: 11
 Paused: 0
 Stopped: 0
Images: 6
Server Version: 18.06.1-ce
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 468a545b9edcd5932818eb9de8e72413e616e86e
runc version: 69663f0bd4b60df09991c08812a60108003fa340
init version: fec3683
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.14.138-89.102.amzn1.x86_64
Operating System: Amazon Linux AMI 2018.03
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 30.41GiB
Name: ip-172-32-6-105
ID: IV65:3LKL:JESM:UFA4:X5RZ:M4NZ:O3BY:IZ2T:UDFW:XCGW:55PW:D7JH
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

There should be some indication of resource contention, a service crash, or a genuine network failure to explain all this. But as mentioned, nothing we found pointed to the cause of the issue.

Recommended answer

Your steps 1 to 7 have almost nothing to do with the error.

service example-service (instance i-016b0a460d9974567) (port 1047) is unhealthy in target-group example-service due to (reason Request timed out)

The error is very clear: your ECS service is not reachable by the load balancer health check.

Target group unhealthy

When this is the case, check the container security group, the port, the application status, and the health check status code directly.

Possible reasons

  • There might be no route for the path /healthcheck in the backend service
  • The status code returned from /healthcheck is not 200
  • The target port might be configured incorrectly; if the application runs on port 8080 or 3000, the target port should be 8080 or 3000 accordingly
  • The security group is not allowing traffic to the target group
  • The application is not running in the container

These are the possible reasons for a timeout from the health check. A minimal sketch of a /healthcheck endpoint that satisfies these checks is shown below.
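The sketch assumes an Express app listening on the container's traffic port; the route path and status code come from the configuration in the question, and everything else is illustrative:

// Minimal sketch of a /healthcheck route that satisfies the target group checks above:
// it must exist on the traffic port, answer within the 5-second timeout, and return 200.
const express = require('express');
const app = express();

app.get('/healthcheck', (req, res) => {
  // Keep this handler cheap: a downstream call that blocks longer than the
  // 5-second health check timeout will get the target marked unhealthy.
  res.status(200).json({ status: 'ok' });
});

// The container must actually listen on the port registered with the target group
// (e.g. 3000 or 8080), and its security group must allow traffic from the ALB.
app.listen(process.env.PORT || 3000);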
