Spark-Submit: --packages vs --jars
Problem Description
Can someone explain the differences between --packages and --jars in a spark-submit script?
nohup ./bin/spark-submit --jars ./xxx/extrajars/stanford-corenlp-3.8.0.jar,./xxx/extrajars/stanford-parser-3.8.0.jar \
--packages datastax:spark-cassandra-connector_2.11:2.0.7 \
--class xxx.mlserver.Application \
--conf spark.cassandra.connection.host=192.168.0.33 \
--conf spark.cores.max=4 \
--master spark://192.168.0.141:7077 ./xxx/xxxanalysis-mlserver-0.1.0.jar 1000 > ./logs/nohup.out &
Also, do I require the --packages configuration if the dependency is in my application's pom.xml? (I ask because I just blew up my application by changing the version in --packages while forgetting to change it in the pom.xml.)
I am currently using --jars because the jars are massive (over 100GB) and thus slow down the shaded-jar compilation. I admit I am not sure why I am using --packages, other than that I am following the datastax documentation.
Recommended Answer
If you execute spark-submit --help, it will show:
--jars JARS Comma-separated list of jars to include on the driver
and executor classpaths.
--packages Comma-separated list of maven coordinates of jars to include
on the driver and executor classpaths. Will search the local
maven repo, then maven central and any additional remote
repositories given by --repositories. The format for the
coordinates should be groupId:artifactId:version.
In the case of --jars, Spark doesn't hit Maven; it searches for the specified jars on the local file system. It also supports the following URL schemes: hdfs/http/https/ftp.
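As a small sketch of this (the jar names, paths, and URLs below are hypothetical), --jars takes a single comma-separated list in which each entry may be a local path or an hdfs/http/https/ftp URL:

```shell
# Build the comma-separated --jars value; entries can mix local paths
# and remote URL schemes (hdfs/http/https/ftp).
JARS="./extrajars/stanford-corenlp-3.8.0.jar"
JARS="$JARS,hdfs:///libs/stanford-parser-3.8.0.jar"
JARS="$JARS,https://example.com/libs/extra.jar"

echo "$JARS"
# The actual submission would then be (not run here):
#   spark-submit --jars "$JARS" --class com.example.App app.jar
```

Note that the list is comma-separated, not space-separated; a space would make spark-submit treat the second jar as the application jar.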
In the case of --packages, Spark searches for the specified package in the local Maven repository, then in Maven Central or any repository provided by --repositories, and then downloads it.
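A minimal sketch of the coordinate format --packages expects, using the coordinate from the question (the --repositories URL in the comment is illustrative). The coordinate is always groupId:artifactId:version:

```shell
# Maven coordinate as passed to --packages (from the question's command).
COORD="datastax:spark-cassandra-connector_2.11:2.0.7"

# Split it into its three parts (POSIX parameter expansion).
groupId=${COORD%%:*}
version=${COORD##*:}
rest=${COORD#*:}
artifactId=${rest%:*}

echo "$groupId $artifactId $version"
# → datastax spark-cassandra-connector_2.11 2.0.7

# The actual submission would then be (not run here):
#   spark-submit --packages "$COORD" \
#     --repositories https://repos.example.org/maven2 ...
```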
Now coming back to your questions:

Also, do I require the --packages configuration if the dependency is in my application's pom.xml?
Ans: No, if you are not importing/using classes from the jar directly but need them loaded by some class loader or service loader (e.g. JDBC drivers); yes otherwise.
BTW, if you are using a specific version of a specific jar in your pom.xml, then why don't you make an uber/fat jar of your application, or provide the dependency jar in the --jars argument, instead of using --packages?
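For the uber/fat-jar route mentioned above, a minimal maven-shade-plugin sketch for the pom.xml (the plugin version and configuration are illustrative; adjust them to your build):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.2.4</version>
  <executions>
    <execution>
      <!-- Bind the shade goal to package, so `mvn package`
           produces a single jar containing the dependencies. -->
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```

The shaded jar can then be passed directly to spark-submit with no --jars or --packages for those dependencies; Spark's own artifacts should be marked `provided` so they are not bundled in.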
Link to refer:
add-jars-to-a-spark-job-spark-submit