What is the most elegant and robust way on dataproc to adjust log levels for Spark?


Question

As explained in previous answers, the ideal way to change the verbosity of a Spark cluster is to change the corresponding log4j.properties. However, on dataproc Spark runs on YARN, so we have to adjust the global configuration and not /usr/lib/spark/conf.

Some suggestions:

On dataproc we have several gcloud commands and properties we can pass during cluster creation (see the documentation). Is it possible to change the log4j.properties under /etc/hadoop/conf by specifying

--properties 'log4j:hadoop.root.logger=WARN,console'

Maybe not, as from the docs:

The --properties command cannot modify configuration files not shown above.
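
For reference, properties that do belong to the supported files are set with a file-prefix syntax. A minimal sketch, with a hypothetical cluster name and property values (the spark: and yarn: prefixes come from the Dataproc docs):

# Hypothetical example: --properties can only reach the documented config
# files (spark-defaults.conf, yarn-site.xml, etc.), addressed by file prefix.
gcloud dataproc clusters create my-cluster \
    --properties 'spark:spark.executor.memory=4g,yarn:yarn.log-aggregation-enable=true'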

Another way would be to use a shell script during cluster init and run sed:

# change log level for each node to WARN
sudo sed -i -- 's/log4j.rootCategory=INFO, console/log4j.rootCategory=WARN, console/g' \
    /etc/spark/conf/log4j.properties
sudo sed -i -- 's/hadoop.root.logger=INFO,console/hadoop.root.logger=WARN,console/g' \
    /etc/hadoop/conf/log4j.properties
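
A minimal sketch of wiring such a script up as an initialization action (the bucket and script names are hypothetical):

# Hypothetical: upload the script to GCS and reference it at cluster creation;
# Dataproc runs initialization actions on every node during startup.
gsutil cp set-log-levels.sh gs://my-bucket/set-log-levels.sh
gcloud dataproc clusters create my-cluster \
    --initialization-actions gs://my-bucket/set-log-levels.sh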

But is that enough, or do we need to change the env variable hadoop.root.logger as well?

Answer

At the moment, you're right that --properties doesn't support extra log4j settings, but it's certainly something we've talked about adding; some considerations include how to balance fine-grained control over the logging configs of Spark vs. YARN vs. other long-running daemons (hiveserver2, HDFS daemons, etc.) against keeping a minimal/simple setting that is plumbed through to everything in a shared way.

At least for Spark driver logs, you can use the --driver-log-levels setting at job submission time, which should take precedence over any of the /etc/*/conf settings. Otherwise, as you describe, init actions are a reasonable way to edit the files on cluster startup for now, keeping in mind that they may change over time and across releases.
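
A minimal sketch of that flag at submission time (the cluster, class, and jar names are hypothetical):

# Hypothetical job submission: --driver-log-levels takes comma-separated
# package=level pairs; 'root' sets the root logger.
gcloud dataproc jobs submit spark \
    --cluster my-cluster \
    --class org.example.MyJob \
    --jars gs://my-bucket/my-job.jar \
    --driver-log-levels root=WARN,org.apache.spark=INFO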
