Resolving dependency problems in Apache Spark


Problem description

The common problems when building and deploying Spark applications are:

  • java.lang.ClassNotFoundException
  • object x is not a member of package y compilation errors.
  • java.lang.NoSuchMethodError

How can these problems be resolved?

Recommended answer

When building and deploying Spark applications, all dependencies require compatible versions.


  • Scala version. All packages have to use the same major (2.10, 2.11, 2.12) Scala version.

Consider the following (incorrect) build.sbt:

name := "Simple Project"

version := "1.0"

libraryDependencies ++= Seq(
   "org.apache.spark" % "spark-core_2.11" % "2.0.1",
   "org.apache.spark" % "spark-streaming_2.10" % "2.0.1",
   "org.apache.bahir" % "spark-streaming-twitter_2.11" % "2.0.1"
)

Here spark-streaming is built for Scala 2.10 while the remaining packages are for Scala 2.11. A valid file could be:

name := "Simple Project"

version := "1.0"

libraryDependencies ++= Seq(
   "org.apache.spark" % "spark-core_2.11" % "2.0.1",
   "org.apache.spark" % "spark-streaming_2.11" % "2.0.1",
   "org.apache.bahir" % "spark-streaming-twitter_2.11" % "2.0.1"
)

but it is better to specify the version globally and use %%:

name := "Simple Project"

version := "1.0"

scalaVersion := "2.11.7"

libraryDependencies ++= Seq(
   "org.apache.spark" %% "spark-core" % "2.0.1",
   "org.apache.spark" %% "spark-streaming" % "2.0.1",
   "org.apache.bahir" %% "spark-streaming-twitter" % "2.0.1"
)
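For reference, with scalaVersion set, the %% operator appends the binary Scala suffix for you, so the spark-core line above resolves to the same artifact as the explicitly suffixed form (a minimal sketch of sbt's standard cross-version handling):

// With scalaVersion := "2.11.7", this line:
"org.apache.spark" %% "spark-core" % "2.0.1"
// resolves to the explicitly suffixed artifact:
"org.apache.spark" % "spark-core_2.11" % "2.0.1"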

Similarly in Maven:

<project>
  <groupId>com.example</groupId>
  <artifactId>simple-project</artifactId>
  <modelVersion>4.0.0</modelVersion>
  <name>Simple Project</name>
  <packaging>jar</packaging>
  <version>1.0</version>
  <properties>
    <spark.version>2.0.1</spark.version>
  </properties> 
  <dependencies>
    <dependency> <!-- Spark dependency -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency> 
    <dependency>
      <groupId>org.apache.bahir</groupId>
      <artifactId>spark-streaming-twitter_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>
  </dependencies>
</project>


  • Spark version. All packages have to use the same major Spark version (1.6, 2.0, 2.1, ...).

    Consider the following (incorrect) build.sbt:

    name := "Simple Project"
    
    version := "1.0"
    
    libraryDependencies ++= Seq(
       "org.apache.spark" % "spark-core_2.11" % "1.6.1",
       "org.apache.spark" % "spark-streaming_2.10" % "2.0.1",
       "org.apache.bahir" % "spark-streaming-twitter_2.11" % "2.0.1"
    )
    

    We use spark-core 1.6 while the remaining components are on Spark 2.0. A valid file could be:

    name := "Simple Project"
    
    version := "1.0"
    
    libraryDependencies ++= Seq(
       "org.apache.spark" % "spark-core_2.11" % "2.0.1",
       "org.apache.spark" % "spark-streaming_2.10" % "2.0.1",
       "org.apache.bahir" % "spark-streaming-twitter_2.11" % "2.0.1"
    )
    

    but it is better to use a variable:

    name := "Simple Project"
    
    version := "1.0"
    
    val sparkVersion = "2.0.1"
    
    libraryDependencies ++= Seq(
       "org.apache.spark" % "spark-core_2.11" % sparkVersion,
       "org.apache.spark" % "spark-streaming_2.10" % sparkVersion,
       "org.apache.bahir" % "spark-streaming-twitter_2.11" % sparkVersion
    )
    

    Similarly in Maven:

    <project>
      <groupId>com.example</groupId>
      <artifactId>simple-project</artifactId>
      <modelVersion>4.0.0</modelVersion>
      <name>Simple Project</name>
      <packaging>jar</packaging>
      <version>1.0</version>
      <properties>
        <spark.version>2.0.1</spark.version>
        <scala.version>2.11</scala.version>
      </properties> 
      <dependencies>
        <dependency> <!-- Spark dependency -->
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-core_${scala.version}</artifactId>
          <version>${spark.version}</version>
        </dependency>
        <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-streaming_${scala.version}</artifactId>
          <version>${spark.version}</version>
        </dependency> 
        <dependency>
          <groupId>org.apache.bahir</groupId>
          <artifactId>spark-streaming-twitter_${scala.version}</artifactId>
          <version>${spark.version}</version>
        </dependency>
      </dependencies>
    </project>
    


  • The Spark version used in the Spark dependencies has to match the Spark version of the Spark installation. For example, if you use 1.6.1 on the cluster, you have to use 1.6.1 to build your jars. Minor version mismatches are not always accepted.

    The Scala version used to build the jar has to match the Scala version of the deployed Spark build. By default (downloadable binaries and default builds) the mapping is as follows; a quick way to verify both versions is sketched after the list:


    • Spark 1.x -> Scala 2.10

    • Spark 2.x -> Scala 2.11
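    A minimal sketch for confirming what the cluster installation was built with (assuming a Spark 2.x spark-shell, where the spark session object is predefined):

    // Run inside spark-shell on a cluster node
    println(s"Spark version: ${spark.version}")                        // e.g. 2.0.1
    println(s"Scala version: ${scala.util.Properties.versionString}")  // e.g. version 2.11.8

    Running spark-submit --version prints similar information.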

    Additional packages should be accessible on the worker nodes. If they are not included in the fat jar, there are a number of options, including:

    • --jars argument for spark-submit - to distribute local jar files.
    • --packages argument for spark-submit - to fetch dependencies from a Maven repository.

    When submitting in cluster mode you should include the application jar in --jars (see the sketch below).
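    A minimal sketch of such a submission (the class name and paths below are hypothetical placeholders; the package coordinate is the Bahir dependency used above):

    spark-submit \
      --class com.example.SimpleApp \
      --master yarn \
      --deploy-mode cluster \
      --packages org.apache.bahir:spark-streaming-twitter_2.11:2.0.1 \
      --jars target/scala-2.11/simple-project_2.11-1.0.jar \
      target/scala-2.11/simple-project_2.11-1.0.jar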

