Running Spark jobs on a YARN cluster with additional files


Problem description


I'm writing a simple Spark application that takes an input RDD, sends it to an external script via pipe, and writes the output of that script to a file. The driver code looks like this:

val input = args(0)
val scriptPath = args(1)
val output = args(2)
val sc = getSparkContext
if (args.length == 4) {
  //Here I pass an additional argument which contains an absolute path to a script on my local machine, only for local testing
  sc.addFile(args(3))
}

sc.textFile(input).pipe(Seq("python2", SparkFiles.get(scriptPath))).saveAsTextFile(output)

When I run it on my local machine, it works fine. But when I submit it to a YARN cluster via

spark-submit --master yarn --deploy-mode cluster --files /absolute/path/to/local/test.py --class somepackage.PythonLauncher path/to/driver.jar path/to/input/part-* test.py path/to/output

it fails with an exception.

Lost task 1.0 in stage 0.0 (TID 1, rwds2.1dmp.ru): java.lang.Exception: Subprocess exited with status 2

I've tried different variations of the pipe command. For instance, .pipe("cat") works fine and behaves as expected, but .pipe(Seq("cat", scriptPath)) also fails, with error code 1, so it seems that Spark can't figure out the path to the script on a cluster node.
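Concretely, those variants look roughly like this, reusing sc, input, scriptPath and output from the driver code above (each variant was run on its own, not all together):

import org.apache.spark.SparkFiles

// Variants tried one at a time, each followed by an action so the job actually runs:
val data = sc.textFile(input)

data.pipe("cat").saveAsTextFile(output)                            // works and behaves as expected
// data.pipe(Seq("cat", scriptPath)).saveAsTextFile(output)        // fails with error code 1
// data.pipe(Seq("python2", SparkFiles.get(scriptPath)))
//   .saveAsTextFile(output)                                       // fails with "Subprocess exited with status 2"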

Any suggestions?

Solution

I don't use Python myself, but I found some clues that may be useful for you (in the source code of Spark 1.3's SparkSubmitArguments):

  • --py-files PY_FILES, Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps.

  • --files FILES, Comma-separated list of files to be placed in the working directory of each executor (see the sketch after this list).

  • --archives ARCHIVES, Comma-separated list of archives to be extracted into the working directory of each executor.
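To illustrate the --files point (the sketch mentioned above): since --files places a copy of test.py in the working directory of each executor, the piped command can refer to the script by a path relative to that directory rather than by an absolute path from the submitting machine. A minimal sketch reusing sc, input and output from the question's driver code; it is based on the option description above, not a verified fix for this particular job:

// Sketch only: assumes the job is submitted with
//   --files /absolute/path/to/local/test.py
// so that each executor finds ./test.py in its working directory.
sc.textFile(input)
  .pipe(Seq("python2", "./test.py"))
  .saveAsTextFile(output)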

Also, your arguments to spark-submit should follow this style:

Usage: spark-submit [options] <app jar | python file> [app arguments]
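
For example, the command from the question breaks down against that usage pattern as follows (same command, just grouped for clarity):

[options]        --master yarn --deploy-mode cluster --files /absolute/path/to/local/test.py --class somepackage.PythonLauncher
<app jar>        path/to/driver.jar
[app arguments]  path/to/input/part-* test.py path/to/output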
