spark-submit classpath issue with --repositories --packages options

Question

I'm running Spark in a standalone cluster where the Spark master, worker and submit each run in their own Docker container.

When I spark-submit my Java app with the --repositories and --packages options, I can see that it successfully downloads the app's required dependencies. However, the stderr logs on the Spark workers' web UI report a java.lang.ClassNotFoundException: kafka.serializer.StringDecoder. This class is available in one of the dependencies downloaded by spark-submit, but it doesn't seem to be available on the worker classpath.

16/02/22 16:17:09 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
Exception in thread "main" java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
    at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.NoClassDefFoundError: kafka/serializer/StringDecoder
    at com.my.spark.app.JavaDirectKafkaWordCount.main(JavaDirectKafkaWordCount.java:71)
    ... 6 more
Caused by: java.lang.ClassNotFoundException: kafka.serializer.StringDecoder
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 7 more

The spark-submit call:

${SPARK_HOME}/bin/spark-submit --deploy-mode cluster \
--master spark://spark-master:7077 \
--repositories https://oss.sonatype.org/content/groups/public/ \
--packages org.apache.spark:spark-streaming-kafka_2.10:1.6.0,org.elasticsearch:elasticsearch-spark_2.10:2.2.0 \
--class com.my.spark.app.JavaDirectKafkaWordCount \
/app/spark-app.jar kafka-server:9092 mytopic
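
For reference, one quick way to confirm that the missing class really is inside one of the jars spark-submit downloaded; the ~/.ivy2/jars location and the jar name pattern below are assumptions about the local ivy cache layout, not something reported by spark-submit itself:

# Assumption: spark-submit copies the resolved dependency jars into ~/.ivy2/jars.
# List the kafka-related jars and check whether StringDecoder is in one of them.
for jar in ~/.ivy2/jars/*kafka*.jar; do
  echo "== $jar"
  unzip -l "$jar" | grep "kafka/serializer/StringDecoder"
done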

Recommended answer

I was working with Spark 2.4.0 when I ran into this problem. I don't have a solution yet, just some observations based on experimentation and reading around for solutions. I am noting them down here in case they help someone in their investigation. I will update this answer if I find more information later.

  • The --repositories option is required only if some custom repository has to be referenced
  • By default the Maven Central repository is used if the --repositories option is not provided
  • When the --packages option is specified, the submit operation tries to look for the packages and their dependencies in the ~/.ivy2/cache, ~/.ivy2/jars and ~/.m2/repository directories
  • If they are not found there, they are downloaded from Maven Central using ivy and stored under the ~/.ivy2 directory (see the sketch after this list)
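
A minimal sketch of where that resolution happens on the machine running spark-submit, based on the directories named in the list above:

# Local caches consulted by the --packages resolution before anything is downloaded:
ls ~/.ivy2/cache        # ivy metadata and cached artifacts
ls ~/.ivy2/jars         # flat copies of the jars resolved for --packages
ls ~/.m2/repository     # local Maven repository

# Anything not found locally is fetched from Maven Central (or from the
# repositories named via --repositories) and stored under ~/.ivy2.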

In my case I observed the following:

  • spark-shell worked perfectly with the --packages option
  • spark-submit would fail to do the same. It would download the dependencies correctly but fail to pass the jars on to the driver and worker nodes
  • spark-submit worked with the --packages option if I ran the driver locally using --deploy-mode client instead of cluster (see the sketch after this list)
  • This runs the driver locally in the command shell where the spark-submit command is issued, but the workers run on the cluster with the appropriate dependency jars
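
For comparison, the client-mode variant looks like the question's command with only --deploy-mode changed. This is shown as an illustration of the observation above, not as a verified fix for cluster mode:

${SPARK_HOME}/bin/spark-submit --deploy-mode client \
--master spark://spark-master:7077 \
--repositories https://oss.sonatype.org/content/groups/public/ \
--packages org.apache.spark:spark-streaming-kafka_2.10:1.6.0,org.elasticsearch:elasticsearch-spark_2.10:2.2.0 \
--class com.my.spark.app.JavaDirectKafkaWordCount \
/app/spark-app.jar kafka-server:9092 mytopic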

I found the following discussion useful, but I still have to nail down this problem: https://github.com/databricks/spark-redshift/issues/244#issuecomment-347082455

Most people just use an uber jar to avoid running into this problem, and even to avoid the problem of conflicting jar versions where the platform provides a different version of the same dependency jar.
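
As a rough sketch of that workaround, assuming the application build is already configured to produce a shaded/assembly jar (for example with the Maven Shade or sbt-assembly plugin), and with a hypothetical jar name:

# Build a single jar that bundles the kafka/elasticsearch dependencies
# (assumes the shade/assembly plugin is configured in the project build).
mvn -DskipTests package

# Submit it without --packages / --repositories, so nothing has to be
# resolved at runtime. The jar name below is hypothetical.
${SPARK_HOME}/bin/spark-submit --deploy-mode cluster \
--master spark://spark-master:7077 \
--class com.my.spark.app.JavaDirectKafkaWordCount \
/app/spark-app-uber.jar kafka-server:9092 mytopic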

But I don't like that idea as anything more than a stop-gap arrangement and am still looking for a solution.
