Resolving dependency problems in Apache Spark


Problem Description

The common problems when building and deploying Spark applications are:

  • java.lang.ClassNotFoundException
  • object x is not a member of package y compilation errors
  • java.lang.NoSuchMethodError

How can these problems be resolved?

Recommended Answer

When building and deploying Spark applications, all dependencies require compatible versions.

  • Scala version. All packages have to use the same major Scala version (2.10, 2.11, 2.12).

Consider the following (incorrect) build.sbt:

name := "Simple Project"

version := "1.0"

libraryDependencies ++= Seq(
   "org.apache.spark" % "spark-core_2.11" % "2.0.1",
   "org.apache.spark" % "spark-streaming_2.10" % "2.0.1",
   "org.apache.bahir" % "spark-streaming-twitter_2.11" % "2.0.1"
)

We use spark-streaming for Scala 2.10 while the remaining packages are for Scala 2.11. A valid file could be:

name := "Simple Project"

version := "1.0"

libraryDependencies ++= Seq(
   "org.apache.spark" % "spark-core_2.11" % "2.0.1",
   "org.apache.spark" % "spark-streaming_2.11" % "2.0.1",
   "org.apache.bahir" % "spark-streaming-twitter_2.11" % "2.0.1"
)

but it is better to specify the Scala version globally and use %% (which appends the Scala version suffix for you):

name := "Simple Project"

version := "1.0"

scalaVersion := "2.11.7"

libraryDependencies ++= Seq(
   "org.apache.spark" %% "spark-core" % "2.0.1",
   "org.apache.spark" %% "spark-streaming" % "2.0.1",
   "org.apache.bahir" %% "spark-streaming-twitter" % "2.0.1"
)

    <!-- The same dependencies expressed as a Maven pom.xml, with a consistent _2.11 suffix -->
    <project>
      <groupId>com.example</groupId>
      <artifactId>simple-project</artifactId>
      <modelVersion>4.0.0</modelVersion>
      <name>Simple Project</name>
      <packaging>jar</packaging>
      <version>1.0</version>
      <properties>
        <spark.version>2.0.1</spark.version>
      </properties> 
      <dependencies>
        <dependency> <!-- Spark dependency -->
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-core_2.11</artifactId>
          <version>${spark.version}</version>
        </dependency>
        <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-streaming_2.11</artifactId>
          <version>${spark.version}</version>
        </dependency> 
        <dependency>
          <groupId>org.apache.bahir</groupId>
          <artifactId>spark-streaming-twitter_2.11</artifactId>
          <version>${spark.version}</version>
        </dependency>
      </dependencies>
    </project>

  • Spark version. All packages have to use the same major Spark version (1.6, 2.0, 2.1, ...).

      Consider the following (incorrect) build.sbt:

      name := "Simple Project"
      
      version := "1.0"
      
      libraryDependencies ++= Seq(
         "org.apache.spark" % "spark-core_2.11" % "1.6.1",
         "org.apache.spark" % "spark-streaming_2.10" % "2.0.1",
         "org.apache.bahir" % "spark-streaming-twitter_2.11" % "2.0.1"
      )
      

      We use spark-core 1.6 while the remaining components are for Spark 2.0. A valid file could be:

      name := "Simple Project"
      
      version := "1.0"
      
      libraryDependencies ++= Seq(
         "org.apache.spark" % "spark-core_2.11" % "2.0.1",
         "org.apache.spark" % "spark-streaming_2.10" % "2.0.1",
         "org.apache.bahir" % "spark-streaming-twitter_2.11" % "2.0.1"
      )
      

      but it is better to use a variable (still incorrect, because the Scala suffixes are still mixed):

      name := "Simple Project"
      
      version := "1.0"
      
      val sparkVersion = "2.0.1"
      
      libraryDependencies ++= Seq(
         "org.apache.spark" % "spark-core_2.11" % sparkVersion,
         "org.apache.spark" % "spark-streaming_2.10" % sparkVersion,
         "org.apache.bahir" % "spark-streaming-twitter_2.11" % sparkVersion
      )
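
      For completeness, a sketch of a file that is consistent in both dimensions, combining the single version variable with a global scalaVersion and %% suffixes (same versions as above):

      name := "Simple Project"

      version := "1.0"

      scalaVersion := "2.11.7"

      val sparkVersion = "2.0.1"

      libraryDependencies ++= Seq(
         "org.apache.spark" %% "spark-core" % sparkVersion,
         "org.apache.spark" %% "spark-streaming" % sparkVersion,
         "org.apache.bahir" %% "spark-streaming-twitter" % sparkVersion
      )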
      

          <!-- The Maven equivalent, with the Scala suffix and Spark version factored into properties -->
          <project>
            <groupId>com.example</groupId>
            <artifactId>simple-project</artifactId>
            <modelVersion>4.0.0</modelVersion>
            <name>Simple Project</name>
            <packaging>jar</packaging>
            <version>1.0</version>
            <properties>
              <spark.version>2.0.1</spark.version>
              <scala.version>2.11</scala.version>
            </properties> 
            <dependencies>
              <dependency> <!-- Spark dependency -->
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-core_${scala.version}</artifactId>
                <version>${spark.version}</version>
              </dependency>
              <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-streaming_${scala.version}</artifactId>
                <version>${spark.version}</version>
              </dependency> 
              <dependency>
                <groupId>org.apache.bahir</groupId>
                <artifactId>spark-streaming-twitter_${scala.version}</artifactId>
                <version>${spark.version}</version>
              </dependency>
            </dependencies>
          </project>
      

        • The Spark version used in the Spark dependencies has to match the Spark version of the Spark installation. For example, if you use 1.6.1 on the cluster, you have to use 1.6.1 to build the jars. Minor version mismatches are not always accepted.

          The Scala version used to build the jar has to match the Scala version used to build the deployed Spark. By default (downloadable binaries and default builds):

          • Spark 1.x -> Scala 2.10
          • Spark 2.x -> Scala 2.11
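
          One quick way to confirm which Spark and Scala versions a given installation was actually built with is to ask spark-submit itself, for example:

          # Prints the version banner, including the Scala version Spark was built with
          spark-submit --version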

          Additional packages are accessible on the worker nodes if they are included in the fat jar. Otherwise there are a number of options, including:

          • --jars argument for spark-submit - to distribute local jar files.
          • --packages argument for spark-submit - to fetch dependencies from a Maven repository.

          When submitting in cluster mode you should include the application jar in --jars.
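
          As a sketch of how these options might look on the command line (the jar path and the main class com.example.SimpleApp are hypothetical, assuming an sbt build like the one above):

          # Fetch the Bahir connector and its transitive dependencies from a Maven repository at submit time
          spark-submit \
            --class com.example.SimpleApp \
            --packages org.apache.bahir:spark-streaming-twitter_2.11:2.0.1 \
            target/scala-2.11/simple-project_2.11-1.0.jar

          # Or distribute locally available jar files to the cluster instead
          spark-submit \
            --class com.example.SimpleApp \
            --jars /path/to/spark-streaming-twitter_2.11-2.0.1.jar \
            target/scala-2.11/simple-project_2.11-1.0.jar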
