How can PySpark be called in debug mode?

    I have IntelliJ IDEA set up with Apache Spark 1.4.

    I want to be able to add debug points to my Spark Python scripts so that I can debug them easily.

    I am currently running this bit of Python to initialise the Spark process:

    import subprocess

    # Launch spark-submit as a child process; stderr must also be piped,
    # otherwise proc.stderr is None and the read below fails.
    proc = subprocess.Popen([SPARK_SUBMIT_PATH, scriptFile, inputFile],
                            shell=SHELL_OUTPUT,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)

    if VERBOSE:
        print proc.stdout.read()
        print proc.stderr.read()
    

    When spark-submit eventually calls myFirstSparkScript.py, the debug mode is not engaged and it executes as normal. Unfortunately, editing the Apache Spark source code and running a customised copy is not an acceptable solution.

    Does anyone know if it is possible to have spark-submit call the Apache Spark script in debug mode? If so, how?

    Solution

    As far as I understand your intentions, what you want is not directly possible given the Spark architecture. Even without the subprocess call, the only part of your program that is accessible directly on the driver is the SparkContext. From the rest you are effectively isolated by different layers of communication, including at least one (in local mode) JVM instance. To illustrate that, consider the diagram from the PySpark Internals documentation (not reproduced here): on the left is the local driver side, a Python interpreter talking to a JVM through Py4J; on the right is the remote side, where the executors start the Python worker processes.

    What is in the left box is the part that is accessible locally and could be used to attach a debugger. Since it is mostly limited to JVM calls, there is really nothing there that should be of interest to you, unless you are actually modifying PySpark itself.

    What is on the right happens remotely and, depending on the cluster manager you use, is pretty much a black box from a user perspective. Moreover, there are many situations where the Python code on the right does nothing more than call the JVM API.

    That was the bad part. The good part is that most of the time there should be no need for remote debugging. Excluding access to objects like TaskContext, which can be easily mocked, every part of your code should be easily runnable / testable locally without using a Spark instance whatsoever.

    Functions you pass to actions / transformations take standard and predictable Python objects and are expected to return standard Python objects as well. What is also important is that these should be free of side effects.

    So at the end of the day you have two parts of your program: a thin layer that can be accessed interactively and tested purely in terms of inputs / outputs, and a "computational core" which doesn't require Spark for testing / debugging.
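
    To make that split concrete, here is a toy illustration (not part of the original answer): the "computational core" is a plain Python function that can be tested without any Spark instance, and the thin Spark layer just wires it into a transformation.

      from pyspark import SparkContext

      # Pure "computational core": standard Python objects in and out, no side
      # effects, so it can be run, tested and debugged locally without Spark.
      def parse_line(line):
          key, _, value = line.partition(",")
          return key.strip(), float(value)

      # Plain local test; no Spark required.
      assert parse_line("a, 1.5") == ("a", 1.5)

      # Thin Spark layer: only this part needs a SparkContext at all.
      sc = SparkContext("local[2]", "debug-example")
      print(sc.parallelize(["a, 1.5", "b, 2.0"]).map(parse_line).collect())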

    Other options

    That being said, you're not completely out of options here.

    Local mode

    (passively attach a debugger to a running interpreter)

    Both plain GDB and the PySpark debugger can be attached to a running process. This can be done only once the PySpark daemon and/or worker processes have been started. In local mode you can force that by executing a dummy action, for example:

    sc.parallelize([], n).count()
    

    where n is the number of "cores" available in local mode (local[n]). An example procedure, step by step, on Unix-like systems:

    • Start PySpark shell:

      $SPARK_HOME/bin/pyspark 
      

    • Use pgrep to check there is no daemon process running:

      ➜  spark-2.1.0-bin-hadoop2.7$ pgrep -f pyspark.daemon
      ➜  spark-2.1.0-bin-hadoop2.7$
      

    • The same thing can be determined in PyCharm by:

      alt+shift+a and choosing Attach to Local Process,

      or Run -> Attach to Local Process.

      At this point you should see only the PySpark shell (and possibly some unrelated processes).

    • Execute dummy action:

      sc.parallelize([], 1).count()

    • Now you should see both daemon and worker (here only one):

      ➜  spark-2.1.0-bin-hadoop2.7$ pgrep -f pyspark.daemon
      13990
      14046
      ➜  spark-2.1.0-bin-hadoop2.7$
      


      The process with the lower pid is the daemon, the one with the higher pid is the (possibly) ephemeral worker.

    • At this point you can attach a debugger to the process of interest:

      • In PyCharm by choosing the process to connect to.
      • With plain GDB by calling:

        gdb python <pid of running process>
        

    The biggest disadvantage of this approach is that you have to find the right interpreter at the right moment.
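
    If finding the right pid by hand gets tedious, the pgrep check above can be wrapped in a few lines of Python. This is just a convenience sketch (not part of the original answer); it assumes a Unix-like system with pgrep on the PATH and at least one daemon already running:

      import subprocess

      # Same as running `pgrep -f pyspark.daemon` manually; check_output raises
      # CalledProcessError if no matching process exists yet.
      out = subprocess.check_output(["pgrep", "-f", "pyspark.daemon"])
      pids = sorted(int(pid) for pid in out.split())

      # As noted above, the lowest pid is the daemon, the others are workers.
      print("daemon pid: %d" % pids[0])
      print("worker pids: %s" % pids[1:])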

    Distributed mode

    (using an active component which connects to the debugger server)

    With PyCharm

    PyCharm provides a Python Debug Server which can be used with PySpark jobs.

    First of all, you should add a configuration for the remote debugger:

    • alt+shift+a and choose Edit Configurations or Run -> Edit Configurations.
    • Click on Add new configuration (green plus) and choose Python Remote Debug.
    • Configure host and port according to your own configuration (make sure that the port can be reached from a remote machine).

    • Start debug server:

      shift+F9

      You should see the debugger console.

    • Make sure that pydevd is accessible on the worker nodes, either by installing it or by distributing the egg file.

    • pydevd uses an active component which has to be included in your code:

      import pydevd
      pydevd.settrace(<host name>, port=<port number>)
      

      The tricky part is to find the right place to include it; unless you are debugging batch operations (like functions passed to mapPartitions, a case sketched just after this list), it may require patching the PySpark source itself, for example pyspark.daemon.worker or RDD methods like RDD.mapPartitions. Let's say we are interested in debugging worker behavior. A possible patch can look like this:

      diff --git a/python/pyspark/daemon.py b/python/pyspark/daemon.py
      index 7f06d4288c..6cff353795 100644
      --- a/python/pyspark/daemon.py
      +++ b/python/pyspark/daemon.py
      @@ -44,6 +44,9 @@ def worker(sock):
           """
           Called by a worker process after the fork().
           """
      +    import pydevd
      +    pydevd.settrace('foobar', port=9999, stdoutToServer=True, stderrToServer=True)
      +
           signal.signal(SIGHUP, SIG_DFL)
           signal.signal(SIGCHLD, SIG_DFL)
           signal.signal(SIGTERM, SIG_DFL)
      

      If you decide to patch the Spark source, be sure to use the patched source, not the packaged version, which is located in $SPARK_HOME/python/lib.

    • Execute the PySpark code. Go back to the debugger console and have fun.
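
    As noted above, when the code you want to debug is your own batch logic, you do not have to patch Spark at all: you can call pydevd.settrace from inside the function you pass to mapPartitions. A minimal sketch follows; the host, port and sample data are placeholders for your own setup rather than anything prescribed by the original answer:

      from pyspark import SparkContext

      sc = SparkContext("local[2]", "pydevd-example")

      def process_partition(records):
          # Runs inside the Python worker; connect its interpreter back to the
          # PyCharm debug server started earlier (adjust host/port to your setup).
          import pydevd
          pydevd.settrace("localhost", port=9999,
                          stdoutToServer=True, stderrToServer=True)
          for x in records:
              yield x * 2  # put a breakpoint on this line

      print(sc.parallelize(range(8), 2).mapPartitions(process_partition).collect())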

    Other tools

    There are a number of tools, including python-manhole and pyrasite, which can be used, with some effort, to work with PySpark.

    Note:

    Of course, you can use "remote" (active) methods in local mode and, to some extent, "local" methods in distributed mode (you can connect to the worker node and follow the same steps as in local mode).
