PySpark logging from the executor
Problem description
What is the correct way to access Spark's log4j logger using PySpark on an executor?
It is easy to do so in the driver, but I cannot seem to figure out how to access the logging functionality on the executors, so that I can log locally and let YARN collect the local logs.
Is there any way to access the local logger?
The standard logging setup is not enough, because I cannot access the Spark context from the executor.
Recommended answer
You cannot use the local log4j logger on executors. The Python workers spawned by the executor JVMs have no "callback" connection to the Java side; they only receive commands. There is, however, a way to log from executors using standard Python logging and have the logs captured by YARN.
Place a Python module on your HDFS that configures logging once per Python worker and proxies the logging functions (name it logger.py):
```python
import os
import logging
import sys

class YarnLogger:
    @staticmethod
    def setup_logger():
        if 'LOG_DIRS' not in os.environ:
            sys.stderr.write('Missing LOG_DIRS environment variable, pyspark logging disabled')
            return
        file = os.environ['LOG_DIRS'].split(',')[0] + '/pyspark.log'
        logging.basicConfig(filename=file, level=logging.INFO,
                format='%(asctime)s.%(msecs)03d %(levelname)s %(module)s - %(funcName)s: %(message)s')

    def __getattr__(self, key):
        return getattr(logging, key)

YarnLogger.setup_logger()
```
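One detail worth noting: logging.basicConfig is a no-op when the root logger already has handlers, which is why running the setup once at module import is sufficient and safe even if the module happens to be imported more than once in a worker. A minimal local demonstration (standalone, not part of the original answer):

```python
import logging

# First call configures the root logger (adds one handler).
logging.basicConfig(level=logging.INFO)
n_before = len(logging.getLogger().handlers)

# Second call is silently ignored because handlers already exist;
# the filename and level arguments here have no effect.
logging.basicConfig(level=logging.DEBUG, filename='/tmp/other.log')
n_after = len(logging.getLogger().handlers)

print(n_before == n_after)  # True: the second call changed nothing
```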
Then import this module inside your application:
```python
spark.sparkContext.addPyFile('hdfs:///path/to/logger.py')
import logger
logger = logger.YarnLogger()
```
You can then use it inside your PySpark functions like the normal logging library:
```python
def map_sth(s):
    logger.info("Mapping " + str(s))
    return s

spark.range(10).rdd.map(map_sth).count()
```
The pyspark.log file will be visible in the resource manager and will be collected when the application finishes, so you can access these logs later with yarn logs -applicationId .....