Missing log lines when writing to CloudWatch from ECS Docker containers


Problem Description

(Docker container on AWS ECS exits before all the logs are printed to CloudWatch Logs.) Why are some streams of a CloudWatch Logs group incomplete (i.e., the Fargate Docker container exits successfully but the logs abruptly stop being updated)? I'm seeing this intermittently in almost all log groups, but not on every log stream/task run. I'm running on version 1.3.0.


Description:
A Dockerfile runs Node.js or Python scripts using the CMD instruction.

These are not servers/long-running processes, and my use case requires the containers to exit when the task completes.

Sample Dockerfile:

FROM node:6
WORKDIR /path/to/app/
COPY package*.json ./
RUN npm install
COPY . .
CMD [ "node", "run-this-script.js" ]


All the logs are printed correctly to my terminal's stdout/stderr when this command is run on the terminal locally with docker run.
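
(For reference, the local run is just a plain docker build and docker run; the image tag below is a placeholder, not the real name:)

docker build -t run-this-script .
docker run --rm run-this-script
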
To run these as ECS tasks on Fargate, the log driver is set to awslogs in a CloudFormation template.

...
LogConfiguration:
  LogDriver: 'awslogs'
  Options:
    awslogs-group: !Sub '/ecs/ecs-tasks-${TaskName}'
    awslogs-region: !Ref AWS::Region
    awslogs-stream-prefix: ecs
...
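
(For context, that block sits inside a container definition of the Fargate task definition resource. The following is only a trimmed sketch of that shape; the resource names, execution role, image reference, and CPU/memory sizes are placeholders rather than values from the actual template, and the referenced log group must also exist, e.g. as an AWS::Logs::LogGroup resource:)

TaskDefinition:
  Type: AWS::ECS::TaskDefinition
  Properties:
    Family: !Sub 'ecs-tasks-${TaskName}'
    RequiresCompatibilities:
      - FARGATE
    NetworkMode: awsvpc
    Cpu: '256'
    Memory: '512'
    ExecutionRoleArn: !GetAtt TaskExecutionRole.Arn   # role needs permission to write to CloudWatch Logs
    ContainerDefinitions:
      - Name: !Ref TaskName
        Image: !Ref ImageUri
        LogConfiguration:
          LogDriver: 'awslogs'
          Options:
            awslogs-group: !Sub '/ecs/ecs-tasks-${TaskName}'
            awslogs-region: !Ref AWS::Region
            awslogs-stream-prefix: ecs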

Seeing that sometimes the CloudWatch Logs output is incomplete, I have run tests and checked every limit from CW Logs Limits and am certain the problem is not there.
I initially thought this was an issue with Node.js exiting asynchronously before console.log() is flushed, or that the process was exiting too soon, but the same problem occurs when I use a different language as well, which makes me believe this is not an issue with the code but rather with CloudWatch specifically.
Inducing delays in the code by adding a sleep timer has not worked for me.

It's possible that since the Docker container exits immediately after the task completes, the logs don't get enough time to be written to CloudWatch Logs, but there must be a way to ensure that this doesn't happen?

Sample logs. Incomplete stream:

{ "message": "configs to run", "data": {"dailyConfigs":"filename.json"]}}
running for filename

Complete log stream:

{ "message": "configs to run", "data": {"dailyConfigs":"filename.json"]}}
running for filename

stdout: entered query_script
... <more log lines>
stderr:
real 0m23.394s
user 0m0.008s
sys 0m0.004s
(node:1) DeprecationWarning: PG.end is deprecated - please see the upgrade guide at https://node-postgres.com/guides/upgrading

Solution

UPDATE: This now appears to be fixed, so there is no need to implement the workaround described below.


I've seen the same behaviour when using ECS Fargate containers to run Python scripts - and had the same resulting frustration!

I think it's due to CloudWatch Logs Agent publishing log events in batches:

How are log events batched?

A batch becomes full and is published when any of the following conditions are met:

  1. The buffer_duration amount of time has passed since the first log event was added.

  2. Less than batch_size of log events have been accumulated but adding the new log event exceeds the batch_size.

  3. The number of log events has reached batch_count.

  4. Log events from the batch don't span more than 24 hours, but adding the new log event exceeds the 24 hours constraint.

(Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AgentReference.html)
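
(To make those parameters concrete: in the EC2 CloudWatch Logs agent's config file they look roughly like the sketch below, with the documented defaults. Note this is the agent described on that reference page; the awslogs log driver used by Fargate does not expose these knobs, so on Fargate you can only work around the buffering, not tune it. Group and stream names here are placeholders:)

[/var/log/app.log]
file = /var/log/app.log
log_group_name = /ecs/ecs-tasks-example
log_stream_name = {hostname}
# Defaults: flush a partial batch after 5000 ms, or as soon as it reaches
# 10000 events or 1048576 bytes, whichever comes first.
buffer_duration = 5000
batch_count = 10000
batch_size = 1048576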

So a possible explanation is that log events are buffered by the agent but not yet published when the ECS task is stopped. (And if so, that seems like an ECS issue - any AWS ECS engineers willing to give their perspective on this...?)

There doesn't seem to be a direct way to ensure the logs are published, but it does suggest one could wait at least buffer_duration seconds (by default, 5 seconds), and any prior logs should be published.

With a bit of testing that I'll describe below, here's the workaround I landed on: a shell script, run_then_wait.sh, wraps the command that triggers the Python script and adds a sleep after the script completes.

Dockerfile

FROM python:3.7-alpine
ADD run_then_wait.sh .
ADD main.py .

# The original command
# ENTRYPOINT ["python", "main.py"]

# To run the original command and then wait
ENTRYPOINT ["sh", "run_then_wait.sh", "python", "main.py"]

run_then_wait.sh

#!/bin/sh
set -e

# Wait 10 seconds on exit: twice the `buffer_duration` default of 5 seconds
trap 'echo "Waiting for logs to flush to CloudWatch Logs..."; sleep 10' EXIT

# Run the given command
"$@"

main.py

import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()

if __name__ == "__main__":
    # After testing some random values, had most luck to induce the
    # issue by sleeping 9 seconds here; would occur ~30% of the time
    time.sleep(9)
    logger.info("Hello world")

Hopefully the approach can be adapted to your situation. You could also implement the sleep inside your script, but it can be trickier to ensure it happens regardless of how it terminates.
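
(For completeness, a minimal sketch of the in-script variant, assuming Python: it registers the delay with atexit, which fires on normal interpreter exit and after unhandled exceptions, but not if the process is killed by a signal or calls os._exit(), which is why the shell wrapper above is the more robust option.)

import atexit
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()

# Give the log driver time to flush buffered events before the container
# stops; 10 seconds is twice the default buffer_duration of 5 seconds.
atexit.register(lambda: time.sleep(10))

if __name__ == "__main__":
    logger.info("Hello world")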

It's hard to prove that the proposed explanation is accurate, so I used the above code to test whether the workaround was effective. The test was the original command vs. with run_then_wait.sh, 30 runs each. The results were that the issue was observed 30% of the time, vs 0% of the time, respectively. Hope this is similarly effective for you!
