排除CDH中spark-core的依赖 [英] Exclusion of dependency of spark-core in CDH

查看：54 发布时间：2021/11/12 2:06:36 apache-spark hadoop apache-kafka hbase cloudera-cdh

本文介绍了排除CDH中spark-core的依赖的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我正在使用 Structured Spark Streaming 写入来自 Kafka 的 HBase 数据.

I'm using Structured Spark Streaming to write to HBase data coming from Kafka.

我的集群分布是:Hadoop 3.0.0-cdh6.2.0，我使用的是 Spark 2.4.0

My cluster distribution is : Hadoop 3.0.0-cdh6.2.0, and i'm using Spark 2.4.0

我的代码如下:

val df = spark
 .readStream
 .format("kafka")
 .option("kafka.bootstrap.servers", bootstrapServers)
 .option("subscribe", topic)
 .option("failOnDataLoss", false)
 .load()
 .selectExpr("CAST(key AS STRING)" , "CAST(value AS STRING)")
 .as(Encoders.STRING)

df.writeStream
  .foreachBatch { (batchDF: Dataset[Row], batchId: Long) =>
     batchDF.write
           .options(Map(HBaseTableCatalog.tableCatalog->catalog, HBaseTableCatalog.newTable -> "6"))
          .format("org.apache.spark.sql.execution.datasources.hbase").save()
     }
     .option("checkpointLocation", checkpointDirectory)
     .start()
     .awaitTermination()

HBaseTableCatalog 使用 json4s-jackson_2.11 库.这个库包含在 Spark Core 中，但版本错误，会产生冲突......

The HBaseTableCatalog use json4s-jackson_2.11 library. This library is included in Spark Core, but with a bad version, which creates conflicts...

为了解决这个问题，我在spark核心中排除了json4s-jackson_2.11库，并在pom中添加了一个降级版本:

To remedy to this problem, I do an exclusion of the json4s-jackson_2.11 library in the spark core, and I add a downgraded version in the pom :

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.4.0-cdh6.2.0</version>
  <exclusions>
    <exclusion>
      <groupId>org.json4s</groupId>
      <artifactId>json4s-jackson_2.11</artifactId>
    </exclusion>
  </exclusions>
</dependency>
<dependency>
  <groupId>org.json4s</groupId>
  <artifactId>json4s-jackson_2.11</artifactId>
  <version>3.2.11</version>
</dependency>

当我在我的语言环境机器中执行代码时，它运行得很好，但问题是，当我在cloudera集群中提交它时，出现了库冲突的第一个错误:

When I execute the code in my locale machine, it works perfectly, but the problem, is when I submit it in the cloudera cluster, I have the first error of the conflict of librairies :

Caused by: java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.parse(Lorg/json4s/JsonInput;Z)Lorg/json4s/JsonAST$JValue;
        at org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog$.apply(HBaseTableCatalog.scala:257)
        at org.apache.spark.sql.execution.datasources.hbase.HBaseRelation.<init>(HBaseRelation.scala:80)
        at org.apache.spark.sql.execution.datasources.hbase.DefaultSource.createRelation(HBaseRelation.scala:59)
        at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
        at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
        at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
        at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
        at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
        at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
        at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
        at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
        at com.App$$anonfun$main$1.apply(App.scala:129)
        at com.App$$anonfun$main$1.apply(App.scala:126)

我知道集群有自己的 hadoop 和 spark 库并且它使用它们，因此，在 spark 提交中，我将 confs spark.driver.userClassPathFirst 和 spark.executor.userClassPathFirst 设为 true，但我有另一个错误，我不明白:

I know that the cluster have its own libraries of hadoop and spark and that it use them, so, in the spark submit, i make the confs spark.driver.userClassPathFirst and spark.executor.userClassPathFirst at true, but I have another error and I don't understand it :

Exception in thread "main" java.lang.ExceptionInInitializerError
        at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.<init>(YarnSparkHadoopUtil.scala:48)
        at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.<clinit>(YarnSparkHadoopUtil.scala)
        at org.apache.spark.deploy.yarn.Client$$anonfun$1.apply$mcJ$sp(Client.scala:83)
        at org.apache.spark.deploy.yarn.Client$$anonfun$1.apply(Client.scala:83)
        at org.apache.spark.deploy.yarn.Client$$anonfun$1.apply(Client.scala:83)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.deploy.yarn.Client.<init>(Client.scala:82)
        at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1603)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:851)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:926)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:935)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassCastException: org.apache.hadoop.yarn.api.records.impl.pb.PriorityPBImpl cannot be cast to org.apache.hadoop.yarn.api.records.Priority
        at org.apache.hadoop.yarn.api.records.Priority.newInstance(Priority.java:39)
        at org.apache.hadoop.yarn.api.records.Priority.<clinit>(Priority.java:34)
        ... 15 more

最后，我想要的是在我的 pom 中使用 json4s-jackson_2.11 而不是 Spark 核心中的那个来制作 Spark

Finally, what I want, is to make Spark using the json4s-jackson_2.11 in my pom and not the one in the Spark core

排除CDH中spark-core的依赖 [英] Exclusion of dependency of spark-core in CDH

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

排除CDH中spark-core的依赖 [英] Exclusion of dependency of spark-core in CDH

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭