Exclusion of dependency of spark-core in CDH


Question

I'm using Structured Spark Streaming to write data coming from Kafka to HBase.

My cluster distribution is Hadoop 3.0.0-cdh6.2.0, and I'm using Spark 2.4.0.

My code is the following:

val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("subscribe", topic)
  .option("failOnDataLoss", false)
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as(Encoders.STRING)

df.writeStream
  .foreachBatch { (batchDF: Dataset[Row], batchId: Long) =>
    batchDF.write
      .options(Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "6"))
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .save()
  }
  .option("checkpointLocation", checkpointDirectory)
  .start()
  .awaitTermination()
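
The catalog value referenced above is not shown here; with the shc connector it is a JSON table definition passed to HBaseTableCatalog. A minimal sketch, assuming a hypothetical HBase table my_table with a single column family cf:

// Hypothetical shc catalog: maps the Dataset's "key"/"value" columns to an
// HBase table named "my_table" with one column family "cf".
val catalog: String =
  """{
    |  "table":   {"namespace": "default", "name": "my_table"},
    |  "rowkey":  "key",
    |  "columns": {
    |    "key":   {"cf": "rowkey", "col": "key",   "type": "string"},
    |    "value": {"cf": "cf",     "col": "value", "type": "string"}
    |  }
    |}""".stripMargin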

HBaseTableCatalog uses the json4s-jackson_2.11 library. This library is included in Spark Core, but in the wrong version, which creates conflicts...

To remedy this problem, I exclude the json4s-jackson_2.11 library from spark-core and add a downgraded version in the pom:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.4.0-cdh6.2.0</version>
  <exclusions>
    <exclusion>
      <groupId>org.json4s</groupId>
      <artifactId>json4s-jackson_2.11</artifactId>
    </exclusion>
  </exclusions>
</dependency>
<dependency>
  <groupId>org.json4s</groupId>
  <artifactId>json4s-jackson_2.11</artifactId>
  <version>3.2.11</version>
</dependency>
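
One way to confirm that the exclusion really removes the Spark-provided json4s from the compile classpath is to inspect Maven's resolved dependency tree, for example:

mvn dependency:tree -Dincludes=org.json4s

Only the downgraded 3.2.11 artifact should remain in the output.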

When I execute the code on my local machine, it works perfectly, but when I submit it to the Cloudera cluster, I get the first error, coming from the library conflict:

Caused by: java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.parse(Lorg/json4s/JsonInput;Z)Lorg/json4s/JsonAST$JValue;
        at org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog$.apply(HBaseTableCatalog.scala:257)
        at org.apache.spark.sql.execution.datasources.hbase.HBaseRelation.<init>(HBaseRelation.scala:80)
        at org.apache.spark.sql.execution.datasources.hbase.DefaultSource.createRelation(HBaseRelation.scala:59)
        at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
        at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
        at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
        at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
        at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
        at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
        at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
        at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
        at com.App$$anonfun$main$1.apply(App.scala:129)
        at com.App$$anonfun$main$1.apply(App.scala:126)

I know that the cluster has its own Hadoop and Spark libraries and that it uses them, so in the spark-submit I set the confs spark.driver.userClassPathFirst and spark.executor.userClassPathFirst to true.
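
A sketch of what that spark-submit looks like (the application jar name is a placeholder; the main class matches the com.App seen in the trace below):

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --class com.App \
  my-streaming-app.jar

But with these confs I get another error, and I don't understand it: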

Exception in thread "main" java.lang.ExceptionInInitializerError
        at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.<init>(YarnSparkHadoopUtil.scala:48)
        at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.<clinit>(YarnSparkHadoopUtil.scala)
        at org.apache.spark.deploy.yarn.Client$$anonfun$1.apply$mcJ$sp(Client.scala:83)
        at org.apache.spark.deploy.yarn.Client$$anonfun$1.apply(Client.scala:83)
        at org.apache.spark.deploy.yarn.Client$$anonfun$1.apply(Client.scala:83)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.deploy.yarn.Client.<init>(Client.scala:82)
        at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1603)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:851)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:926)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:935)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassCastException: org.apache.hadoop.yarn.api.records.impl.pb.PriorityPBImpl cannot be cast to org.apache.hadoop.yarn.api.records.Priority
        at org.apache.hadoop.yarn.api.records.Priority.newInstance(Priority.java:39)
        at org.apache.hadoop.yarn.api.records.Priority.<clinit>(Priority.java:34)
        ... 15 more

Finally, what I want is for Spark to use the json4s-jackson_2.11 from my pom, and not the one in Spark core.

Answer

To solve this, do not use spark.driver.userClassPathFirst and spark.executor.userClassPathFirst; instead, use spark.driver.extraClassPath and spark.executor.extraClassPath.

Definition from the official documentation: "Extra classpath entries to prepend to the classpath of the driver."

  • "prepend", i.e. put in front of Spark's core classpath.

Example:

--conf spark.driver.extraClassPath=C:\Users\Khalid\Documents\Projects\libs\jackson-annotations-2.6.0.jar;C:\Users\Khalid\Documents\Projects\libs\jackson-core-2.6.0.jar;C:\Users\Khalid\Documents\Projects\libs\jackson-databind-2.6.0.jar

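The example above uses a Windows-style, semicolon-separated path list on the driver. On a Linux/YARN cluster the same approach applies with colon-separated paths, and the executor side usually needs the same setting; a sketch for the json4s case from the question, assuming the downgraded jar is available at the same (placeholder) path on every node:

--conf spark.driver.extraClassPath=/path/to/json4s-jackson_2.11-3.2.11.jar \
--conf spark.executor.extraClassPath=/path/to/json4s-jackson_2.11-3.2.11.jar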

This solved my problem (a conflict between the version of Jackson I want to use and the one Spark is using).

Hope it helps.
