AWS EMR Spark Python Logging


Question


I'm running a very simple Spark job on AWS EMR and can't seem to get any log output from my script.

I've tried printing to stderr:

from pyspark import SparkContext
import sys

if __name__ == '__main__':
    sc = SparkContext(appName="HelloWorld")
    print('Hello, world!', file=sys.stderr)
    sc.stop()

And I've tried using the Spark logger, as shown here:

from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext(appName="HelloWorld")

    log4jLogger = sc._jvm.org.apache.log4j
    logger = log4jLogger.LogManager.getLogger(__name__)
    logger.error('Hello, world!')

    sc.stop()

EMR gives me two log files after the job runs: controller and stderr. Neither log contains the "Hello, world!" string. It's my understanding that stdout is redirected to stderr in Spark. The stderr log shows that the job was accepted, run, and completed successfully.

So my question is, where can I view my script's log output? Or what should I change in my script to log correctly?

Edit: I used this command to submit the step:

aws emr add-steps --region us-west-2 --cluster-id x-XXXXXXXXXXXXX --steps Type=spark,Name=HelloWorld,Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=true,s3a://path/to/simplejob.py],ActionOnFailure=CONTINUE

Solution

I've found that EMR's logging for particular steps almost never winds up in the controller or stderr logs that get pulled alongside the step in the AWS console.
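
Before digging through buckets, it can help to confirm where (or whether) the cluster ships its logs to S3 at all, since that destination is fixed when the cluster is created. A minimal sketch with the AWS CLI, assuming your credentials are configured; the region and cluster ID below are placeholders:

# Print the S3 log URI the cluster was created with; if S3 logging wasn't
# enabled at creation time, step and container logs never get archived to S3
aws emr describe-cluster --region us-west-2 --cluster-id j-XXXXXXXXXXXXX --query Cluster.LogUri --output text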

Usually I find what I want in the job's container logs (and usually it's in stdout).

These are typically at a path like s3://mybucket/logs/emr/spark/j-XXXXXX/containers/application_XXXXXXXXX/container_XXXXXXX/.... You might need to poke around within the various application_... and container_... directories within containers.

That last container directory should have a stdout.log and stderr.log.
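
From there, a couple of AWS CLI calls will list and pull those container logs without clicking through the S3 console. This is a minimal sketch, assuming the same s3://mybucket/logs/emr/spark/ log URI as above; the cluster and application IDs are placeholders to replace with your own:

# List every container log object uploaded for the cluster (EMR pushes logs to
# S3 periodically, so new directories can take a few minutes to appear)
aws s3 ls --recursive s3://mybucket/logs/emr/spark/j-XXXXXX/containers/

# Download one application's container logs to inspect locally; the files may be
# uploaded gzipped depending on the EMR release, so gunzip them if needed
aws s3 cp --recursive s3://mybucket/logs/emr/spark/j-XXXXXX/containers/application_XXXXXXXXX/ ./container-logs/

# Alternatively, if the cluster is still up and YARN log aggregation is enabled,
# SSH to the master node and dump the same logs straight to the terminal
yarn logs -applicationId application_XXXXXXXXX

Either route ends at the same per-container stdout/stderr that the print(..., file=sys.stderr) and logger.error(...) calls from the question write to when the job runs in cluster deploy mode.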
