What is the most elegant and robust way on dataproc to adjust log levels for Spark?
Question
As explained in previous answers, the ideal way to change the verbosity of a Spark cluster is to change the corresponding log4j.properties. However, on dataproc Spark runs on Yarn, so we have to adjust the global configuration and not just /usr/lib/spark/conf.
Some suggestions:
On dataproc we have several gcloud commands and properties we can pass during cluster creation (see the documentation). Is it possible to change the log4j.properties under /etc/hadoop/conf by specifying
--properties 'log4j:hadoop.root.logger=WARN,console'
Maybe not, as the docs state:
The --properties command cannot modify configuration files not shown above.
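For comparison, the properties that --properties does support use a file prefix (such as spark: or yarn:) and can only touch the configuration files listed in the documentation. A sketch of that supported usage at cluster creation (the cluster name and memory value here are purely illustrative):

```shell
# Supported cluster properties are prefixed by the target config file
# (spark:, yarn:, core:, hdfs:, ...); "my-cluster" and the value below
# are illustrative placeholders, not values from the question.
gcloud dataproc clusters create my-cluster \
    --properties 'spark:spark.executor.memory=4g'
```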
Another way would be to use a shell script during cluster init and run sed:
# change log level for each node to WARN
sudo sed -i -- 's/log4j.rootCategory=INFO, console/log4j.rootCategory=WARN, console/g' \
  /etc/spark/conf/log4j.properties
sudo sed -i -- 's/hadoop.root.logger=INFO,console/hadoop.root.logger=WARN,console/g' \
  /etc/hadoop/conf/log4j.properties
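Before baking this into an init action, the substitution can be sanity-checked locally on a throwaway file whose contents mirror the stock default (the temp file is a stand-in for /etc/spark/conf/log4j.properties):

```shell
# Create a sample file with the stock INFO default
tmpfile=$(mktemp)
echo 'log4j.rootCategory=INFO, console' > "$tmpfile"

# Same substitution the init action applies on each node
sed -i -- 's/log4j.rootCategory=INFO, console/log4j.rootCategory=WARN, console/g' "$tmpfile"

# Inspect the result
result=$(cat "$tmpfile")
echo "$result"
rm -f "$tmpfile"
```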
But is it enough or do we need to change the env variable hadoop.root.logger as well?
Answer
At the moment, you're right that --properties doesn't support extra log4j settings, but it's certainly something we've talked about adding; some considerations include how much to balance the ability to do fine-grained control over Spark vs Yarn vs other long-running daemons' logging configs (hiveserver2, HDFS daemons, etc.) compared to keeping a minimal/simple setting which is plumbed through to everything in a shared way.
At least for Spark driver logs, you can use the --driver-log-levels setting at job-submission time, which should take precedence over any of the /etc/*/conf settings; but otherwise, as you describe, init actions are a reasonable way to edit the files on cluster startup for now, keeping in mind that they may change over time and across releases.
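As a sketch of the per-job override (the cluster name and job file are placeholders, not taken from the question):

```shell
# Override driver log levels for a single job submission;
# "my-cluster" and "my_job.py" are illustrative placeholders.
gcloud dataproc jobs submit pyspark my_job.py \
    --cluster my-cluster \
    --driver-log-levels root=WARN
```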