Spark spark-submit --jars argument wants a comma-separated list, how to declare a directory of jars?

Problem Description

In Submitting Applications in the Spark docs, as of 1.6.0 and earlier, it's not clear how to specify the --jars argument, as it is apparently neither a colon-separated classpath nor a directory expansion.

The docs say "Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes."

Question: What are all the options for submitting a classpath with --jars in the spark-submit script in $SPARK_HOME/bin? Anything undocumented that could be submitted as an improvement to the docs?

I ask because when I was testing --jars today, we had to explicitly provide a path to each jar:

/usr/local/spark/bin/spark-submit --class jpsgcs.thold.PipeLinkageData --jars=local:/usr/local/spark/jars/groovy-all-2.3.3.jar,local:/usr/local/spark/jars/guava-14.0.1.jar,local:/usr/local/spark/jars/jopt-simple-4.6.jar,local:/usr/local/spark/jars/jpsgcs-core-1.0.8-2.jar,local:/usr/local/spark/jars/jpsgcs-pipe-1.0.6-7.jar /usr/local/spark/jars/thold-0.0.1-1.jar

We are choosing to pre-populate the cluster with all the jars in /usr/local/spark/jars on each worker. It seemed that if no local:/, file:/, or hdfs: prefix was supplied, the default is file:/ and the driver makes the jars available on a webserver run by the driver. I chose local:, as above.
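For illustration, here is a hedged sketch of what each URL scheme implies for --jars; the class name com.example.Main and the dep*.jar paths are hypothetical, but the scheme behavior is as described in the Spark docs (file:/ is served by the driver's HTTP file server, hdfs: is pulled down by each executor, local:/ must already exist on every node):

    # jars already present at the same path on every worker node (nothing is copied)
    spark-submit --class com.example.Main \
      --jars local:/usr/local/spark/jars/dep1.jar,local:/usr/local/spark/jars/dep2.jar \
      app.jar

    # jars on the submitting machine, served to executors by the driver's file server
    spark-submit --class com.example.Main \
      --jars file:/home/me/libs/dep1.jar,file:/home/me/libs/dep2.jar \
      app.jar

    # jars on HDFS, pulled down by each executor
    spark-submit --class com.example.Main \
      --jars hdfs:///libs/dep1.jar,hdfs:///libs/dep2.jar \
      app.jar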

And it seems that we do not need to put the main jar in the --jars argument. I have not yet tested whether other classes in the final argument (the application-jar arg per the docs, i.e. /usr/local/spark/jars/thold-0.0.1-1.jar) are shipped to workers, or whether I need to put the application-jar in the --jars path to get classes not named by --class to be seen.

(And granted, with Spark standalone mode using --deploy-mode client, you also have to put a copy of the driver on each worker, but you don't know up front which worker will run the driver.)

Recommended Answer

This way it worked easily, instead of specifying each jar (with its version) separately:

#!/bin/sh
# Build the comma-separated list of all dependent jars in OTHER_JARS,
# leaving the application jar out so it can be added separately.

JARS=$(find ../lib -name '*.jar')
OTHER_JARS=""
for eachjarinlib in $JARS ; do
    if [ "$eachjarinlib" != "APPLICATIONJARTOBEADDEDSEPERATELY.JAR" ]; then
        OTHER_JARS=$eachjarinlib,$OTHER_JARS
    fi
done
OTHER_JARS=${OTHER_JARS%,}   # strip the trailing comma left by the loop
echo "--- final list of jars: $OTHER_JARS"
echo "$CLASSPATH"

spark-submit --verbose --class <yourclass> \
    <other options> \
    --jars $OTHER_JARS,APPLICATIONJARTOBEADDEDSEPERATELY.JAR

  • Using the tr unix command can also help, as in the example below:

    --jars $(echo /dir_of_jars/*.jar | tr ' ' ',')
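For example, a hedged sketch of using that expansion inline with the question's class and application jar, assuming the directory holds only the dependency jars:

    spark-submit --class jpsgcs.thold.PipeLinkageData \
      --jars $(echo /usr/local/spark/jars/*.jar | tr ' ' ',') \
      /usr/local/spark/jars/thold-0.0.1-1.jar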
