How to debug Spark application locally?


Problem description

I would like to learn Spark step by step and wonder how to debug a Spark application locally? Could anyone please detail the steps needed to do this?

I can run the SimpleApp example from the Spark website locally from the command line, but I just need to step through the code and see how it works.
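(For context, the application being debugged could be something as small as the following sketch; the object name SimpleDebugApp is illustrative, not from the question, and Scala is assumed. A local[*] master keeps the driver and executors in one JVM, which makes stepping simpler.)

import org.apache.spark.{SparkConf, SparkContext}

// A hypothetical minimal app to step through; not from the original question.
object SimpleDebugApp {
  def main(args: Array[String]): Unit = {
    // local[*] runs driver and executors in a single JVM, so all breakpoints land in one process.
    // (When using spark-submit you could pass --master local[*] instead of setting it here.)
    val conf = new SparkConf().setAppName("SimpleDebugApp").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Driver-side code: a natural first breakpoint.
    val data = sc.parallelize(1 to 100)

    // Code inside this closure runs in tasks; in local mode it is still the same JVM.
    val doubled = data.map(_ * 2)

    println(s"Sum of doubled values: ${doubled.sum()}")
    sc.stop()
  }
}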

Answer

Here's how I do it using IntelliJ.

First, make sure you can run your spark application locally using spark-submit, e.g. something like:

spark-submit --class MyMainClass myapplication.jar

Then, tell your local spark driver to pause and wait for a connection from a debugger when it starts up, by adding an option like the following:

--conf spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005

where agentlib:jdwp is the Java Debug Wire Protocol (JDWP) option, followed by a comma-separated list of sub-options:

  • transport defines the connection protocol used between debugger and debuggee -- either socket or "shared memory" -- you almost always want socket (dt_socket) except I believe in some cases on Microsoft Windows
  • server whether this process should be the server when talking to the debugger (or conversely, the client) -- you always need one server and one client. In this case, we're going to be the server and wait for a connection from the debugger
  • suspend whether to pause execution until a debugger has successfully connected. We turn this on so the driver won't start until the debugger connects
  • address here, this is the port to listen on (for incoming debugger connection requests). You can set it to any available port (you just have to make sure the debugger is configured to connect to this same port)
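(As an aside, if you would rather not retype the long --conf on every run, the same setting can live in Spark's conf/spark-defaults.conf, which is the standard place for default configuration; just remember to remove it afterwards, since suspend=y makes every subsequent driver wait for a debugger:)

# conf/spark-defaults.conf -- remove this entry when you are done debugging
spark.driver.extraJavaOptions  -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005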

So now, your spark-submit command line should look something like:

spark-submit \
  --name MyApp \
  --class MyMainClass \
  --conf spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005 \
  myapplication.jar

Now if you run the above, you should see something like

Listening for transport dt_socket at address: 5005

and your spark application is waiting for the debugger to attach.

Next, open the IntelliJ project containing your Spark application, and then open "Run -> Edit Configurations..." Then click the "+" to add a new run/debug configuration, and select "Remote". Give it a name, e.g. "SparkLocal", and select "Socket" for Transport, "Attach" for Debugger mode, and type in "localhost" for Host and the port you used above for Port, in this case, "5005". Click "OK" to save.
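(Any JDWP-capable debugger can make this attach, not just IntelliJ; for instance, the jdb command-line debugger bundled with the JDK can connect to the same port, assuming the host and port above:)

jdb -attach localhost:5005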

In my version of IntelliJ it gives you suggestions for the debug command line to use for the debugged process, and it uses "suspend=n" -- we're ignoring that and using "suspend=y" (as above) because we want the application to wait until we connect to start.

Now you should be ready to debug. Simply start Spark with the above command, then select the IntelliJ run configuration you just created and click Debug. IntelliJ should connect to your Spark application, which should now start running. You can set breakpoints, inspect variables, etc.
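One thing worth knowing when placing breakpoints (my own note, not part of the original answer): code inside RDD transformations runs in tasks rather than on the driver, but with a local master both run in the same JVM, so both kinds of breakpoints will hit. For example, typed into spark-shell, where sc is predefined:

val rdd = sc.parallelize(1 to 10)  // driver-side: a breakpoint here pauses the main thread
val result = rdd.map { x =>
  x + 1                            // task-side: with a local[*] master this still hits,
}.collect()                        // because the executors share the driver's JVM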

With spark-shell simply export SPARK_SUBMIT_OPTS as follows:

export SPARK_SUBMIT_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005

Attach to localhost:5005 using your debugger (e.g. IntelliJ IDEA) and with the Spark sources imported, you should be able to step through the code just fine.
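For example, with the Spark sources attached you can set a breakpoint in a method such as org.apache.spark.rdd.RDD#collect and then trigger it from the shell:

scala> sc.parallelize(1 to 10).map(_ * 2).collect()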

