PySpark logging from the executor


Problem description

What is the correct way to access the log4j logger of Spark using pyspark on an executor?

It's easy to do so in the driver but I cannot seem to understand how to access the logging functionalities on the executor so that I can log locally and let YARN collect the local logs.

Is there any way to access the local logger?

The standard logging procedure is not enough because I cannot access the spark context from the executor.
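For context, the driver-side pattern the question refers to usually looks like the minimal sketch below. Note that sc._jvm is a Py4J internal handle rather than a public API, and it is only available on the driver:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Reach Spark's log4j logger through the JVM gateway.
# This only works on the driver, where the gateway exists.
log4j = sc._jvm.org.apache.log4j
driver_logger = log4j.LogManager.getLogger(__name__)
driver_logger.info("Logging from the driver via log4j")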

Answer

You cannot use the local log4j logger on executors. The Python workers spawned by the executor JVMs have no "callback" connection to the Java side; they only receive commands. However, there is a way to log from executors using standard Python logging and have YARN capture those logs.

Place a Python module on your HDFS that configures logging once per Python worker and proxies the logging functions (name it logger.py):

import os
import logging
import sys

class YarnLogger:
    @staticmethod
    def setup_logger():
        # YARN sets LOG_DIRS on every container; without it there is
        # nowhere to write the per-executor log file.
        if 'LOG_DIRS' not in os.environ:
            sys.stderr.write('Missing LOG_DIRS environment variable, pyspark logging disabled\n')
            return

        # Write to the first container log directory so YARN's log
        # aggregation picks the file up alongside stdout/stderr.
        log_file = os.environ['LOG_DIRS'].split(',')[0] + '/pyspark.log'
        logging.basicConfig(filename=log_file, level=logging.INFO,
                format='%(asctime)s.%(msecs)03d %(levelname)s %(module)s - %(funcName)s: %(message)s',
                datefmt='%Y-%m-%d %H:%M:%S')

    def __getattr__(self, key):
        # Proxy attribute access (info, warning, error, ...) to the
        # standard logging module.
        return getattr(logging, key)

YarnLogger.setup_logger()

Then import this module inside your application:

spark.sparkContext.addPyFile('hdfs:///path/to/logger.py')
import logger
logger = logger.YarnLogger()

You can then use it inside your PySpark functions like the normal logging library:

def map_sth(s):
    logger.info("Mapping " + str(s))
    return s

spark.range(10).rdd.map(map_sth).count()
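The same proxy also works for partition-level operations. Below is a minimal sketch that reuses the spark session and the logger proxy created above; the log_partition_size function is illustrative and not part of the original answer. Logging once per partition instead of once per element keeps the log volume down:

def log_partition_size(rows):
    # Materialize the iterator so we can count the rows (fine for a small demo).
    rows = list(rows)
    logger.info("Processing a partition with %d rows", len(rows))
    return rows

spark.range(100).rdd.mapPartitions(log_partition_size).count()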

The pyspark.log file will be visible in the resource manager and will be collected when the application finishes, so you can access these logs later with yarn logs -applicationId ....
