spark-submit classpath issue with --repositories --packages options
Question
I'm running Spark in a standalone cluster where the Spark master, the worker, and spark-submit each run in their own Docker container.
When I spark-submit my Java app with the --repositories and --packages options, I can see that it successfully downloads the app's required dependencies. However, the stderr log in the Spark worker's web UI reports a java.lang.ClassNotFoundException: kafka.serializer.StringDecoder. This class is available in one of the dependencies downloaded by spark-submit, but it doesn't seem to be available on the worker classpath.
16/02/22 16:17:09 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.NoClassDefFoundError: kafka/serializer/StringDecoder
at com.my.spark.app.JavaDirectKafkaWordCount.main(JavaDirectKafkaWordCount.java:71)
... 6 more
Caused by: java.lang.ClassNotFoundException: kafka.serializer.StringDecoder
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 7 more
The spark-submit invocation:
${SPARK_HOME}/bin/spark-submit --deploy-mode cluster \
--master spark://spark-master:7077 \
--repositories https://oss.sonatype.org/content/groups/public/ \
--packages org.apache.spark:spark-streaming-kafka_2.10:1.6.0,org.elasticsearch:elasticsearch-spark_2.10:2.2.0 \
--class com.my.spark.app.JavaDirectKafkaWordCount \
/app/spark-app.jar kafka-server:9092 mytopic
Recommended answer
I was working with Spark 2.4.0 when I ran into this problem. I don't have a solution yet, only some observations based on experimentation and on reading around for solutions. I'm noting them here in case they help someone in their investigation. I will update this answer if I find more information later.
- The --repositories option is required only if a custom repository has to be referenced.
- By default, the Maven Central repository is used if the --repositories option is not provided.
- When the --packages option is specified, the submit operation looks for the packages and their dependencies in the ~/.ivy2/cache, ~/.ivy2/jars, and ~/.m2/repository directories.
- If they are not found, they are downloaded from Maven Central using Ivy and stored under the ~/.ivy2 directory.
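To see which of those locations already holds a given package on the machine that runs spark-submit, a small check like the following may help. It is only a sketch: the file-name layouts inside each directory are my assumptions based on the usual Ivy and Maven conventions, and the function names are made up for illustration.

```shell
#!/bin/sh
# Print the three places a resolved jar is conventionally stored.
# ASSUMPTION: the layouts below follow default Ivy/Maven conventions;
# they are not taken from the answer itself.
candidate_jar_paths() {
  group="$1"; artifact="$2"; version="$3"
  # Maven stores groups as nested directories: org.apache.spark -> org/apache/spark
  group_path=$(printf '%s' "$group" | tr '.' '/')
  printf '%s\n' \
    "$HOME/.ivy2/jars/${group}_${artifact}-${version}.jar" \
    "$HOME/.ivy2/cache/${group}/${artifact}/jars/${artifact}-${version}.jar" \
    "$HOME/.m2/repository/${group_path}/${artifact}/${version}/${artifact}-${version}.jar"
}

# Report which of the candidate paths exist on this machine.
report_cached_jars() {
  candidate_jar_paths "$@" | while IFS= read -r path; do
    if [ -f "$path" ]; then
      echo "found:   $path"
    else
      echo "missing: $path"
    fi
  done
}

# Coordinate taken from the --packages option in the question.
report_cached_jars org.apache.spark spark-streaming-kafka_2.10 1.6.0
```

Running this inside the container that performs the submit, and again inside the worker container, should show whether the jars exist only where they were downloaded.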
In my case I observed that:
- spark-shell worked perfectly with the --packages option.
- spark-submit would fail to do the same. It would download the dependencies correctly but fail to pass the jars on to the driver and worker nodes.
- spark-submit worked with the --packages option if I ran the driver locally using --deploy-mode client instead of cluster.
- This would run the driver locally in the command shell where I ran the spark-submit command, while the workers would run on the cluster with the appropriate dependency jars.
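For reference, the client-mode variant of the command from the question could be generated as below. This is a sketch rather than a verified fix: it reuses the master URL, packages, and jar path from the question unchanged, the helper name submit_command is hypothetical, and it only prints the command instead of running it.

```shell
#!/bin/sh
# Print the spark-submit command for the given deploy mode ("client" or "cluster").
# Nothing is executed here; pipe the output to `sh` to actually run it.
# All values are copied from the question; only --deploy-mode varies.
submit_command() {
  mode="$1"
  case "$mode" in
    client|cluster) ;;
    *) echo "unknown deploy mode: $mode" >&2; return 1 ;;
  esac
  cat <<EOF
\${SPARK_HOME}/bin/spark-submit --deploy-mode $mode \\
  --master spark://spark-master:7077 \\
  --repositories https://oss.sonatype.org/content/groups/public/ \\
  --packages org.apache.spark:spark-streaming-kafka_2.10:1.6.0,org.elasticsearch:elasticsearch-spark_2.10:2.2.0 \\
  --class com.my.spark.app.JavaDirectKafkaWordCount \\
  /app/spark-app.jar kafka-server:9092 mytopic
EOF
}

submit_command client
```

Comparing the output of `submit_command client` and `submit_command cluster` makes it clear that the deploy mode is the only difference between the working and the failing invocation.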
I found the following discussion useful, but I still have to nail down this problem: https://github.com/databricks/spark-redshift/issues/244#issuecomment-347082455
Most people just use an uber jar to avoid running into this problem, and also to avoid conflicting jar versions when the platform provides a different version of the same dependency jar.
But I don't like that idea as anything more than a stopgap, and I am still looking for a solution.
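If you do fall back to an uber jar in the meantime, it may be worth confirming that the missing class really ended up inside it before submitting. The sketch below assumes `unzip` is installed and uses a hypothetical jar path (/app/spark-app-uber.jar); the helper names are also mine.

```shell
#!/bin/sh
# Convert a fully qualified class name to its entry path inside a jar,
# e.g. kafka.serializer.StringDecoder -> kafka/serializer/StringDecoder.class
class_to_entry() {
  printf '%s.class\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# Succeed if the jar lists an entry for the given class.
# Relies on `unzip -l`; `jar tf` from a JDK lists entries in a similar way.
jar_contains_class() {
  jar_file="$1"
  entry=$(class_to_entry "$2")
  unzip -l "$jar_file" 2>/dev/null | grep -q -F "$entry"
}

# ASSUMPTION: hypothetical uber-jar path; adjust to your build output.
UBER_JAR="${UBER_JAR:-/app/spark-app-uber.jar}"
if jar_contains_class "$UBER_JAR" kafka.serializer.StringDecoder; then
  echo "StringDecoder is bundled in $UBER_JAR"
else
  echo "StringDecoder not found in $UBER_JAR (or the jar is missing)"
fi
```

The same check can be pointed at any of the jars that spark-submit downloaded, to find out which one actually provides the class.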