Exclusion of dependency of spark-core in CDH
Question
I'm using Spark Structured Streaming to write data coming from Kafka to HBase.
My cluster distribution is Hadoop 3.0.0-cdh6.2.0, and I'm using Spark 2.4.0.
My code is as follows:
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("subscribe", topic)
  .option("failOnDataLoss", false)
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as(Encoders.STRING)

df.writeStream
  .foreachBatch { (batchDF: Dataset[Row], batchId: Long) =>
    batchDF.write
      .options(Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "6"))
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .save()
  }
  .option("checkpointLocation", checkpointDirectory)
  .start()
  .awaitTermination()
HBaseTableCatalog uses the json4s-jackson_2.11 library. This library is also included in Spark Core, but in an incompatible version, which creates conflicts.
To remedy this, I exclude the json4s-jackson_2.11 library from spark-core and add a downgraded version in the pom:
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.4.0-cdh6.2.0</version>
  <exclusions>
    <exclusion>
      <groupId>org.json4s</groupId>
      <artifactId>json4s-jackson_2.11</artifactId>
    </exclusion>
  </exclusions>
</dependency>
<dependency>
  <groupId>org.json4s</groupId>
  <artifactId>json4s-jackson_2.11</artifactId>
  <version>3.2.11</version>
</dependency>
When I run the code on my local machine, it works perfectly. The problem is that when I submit it to the Cloudera cluster, I get the original library-conflict error again:
Caused by: java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.parse(Lorg/json4s/JsonInput;Z)Lorg/json4s/JsonAST$JValue;
at org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog$.apply(HBaseTableCatalog.scala:257)
at org.apache.spark.sql.execution.datasources.hbase.HBaseRelation.<init>(HBaseRelation.scala:80)
at org.apache.spark.sql.execution.datasources.hbase.DefaultSource.createRelation(HBaseRelation.scala:59)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
at com.App$$anonfun$main$1.apply(App.scala:129)
at com.App$$anonfun$main$1.apply(App.scala:126)
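A NoSuchMethodError like the one above usually means the class was loaded from a different jar than the one the code was compiled against. One way to check is to print where the JVM actually loaded the class from. The following is a minimal sketch, not part of the original post: the object name WhichJar and its helpers are hypothetical, and only standard JDK and Scala library calls are used. The class name is the one from the stack trace.

```scala
import scala.util.Try

// Hypothetical diagnostic helper (not from the original post).
object WhichJar {
  // Location (jar or directory) a class was loaded from, or None when the
  // JVM reports no code source (e.g. classes from the bootstrap loader).
  def locationOf(cls: Class[_]): Option[String] =
    Option(cls.getProtectionDomain)
      .flatMap(pd => Option(pd.getCodeSource))
      .flatMap(cs => Option(cs.getLocation))
      .map(_.toString)

  // Looks the class up by name, without initializing it, so this compiles
  // and runs even when the library is not on the classpath.
  def describe(className: String): String = {
    val loader = Thread.currentThread().getContextClassLoader
    Try(Class.forName(className, false, loader))
      .map(c => s"$className <- ${locationOf(c).getOrElse("no code source")}")
      .getOrElse(s"$className is not on the classpath")
  }

  def main(args: Array[String]): Unit = {
    // Class name taken from the NoSuchMethodError in the stack trace.
    println(describe("org.json4s.jackson.JsonMethods$"))
    println(describe("scala.Option"))
  }
}
```

Calling WhichJar.describe from inside the Spark application, on the driver, should show whether json4s comes from the application jar or from the cluster's Spark installation.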
I know that the cluster has its own Hadoop and Spark libraries and uses them, so in spark-submit I set spark.driver.userClassPathFirst and spark.executor.userClassPathFirst to true. That produces another error, which I don't understand:
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.<init>(YarnSparkHadoopUtil.scala:48)
at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.<clinit>(YarnSparkHadoopUtil.scala)
at org.apache.spark.deploy.yarn.Client$$anonfun$1.apply$mcJ$sp(Client.scala:83)
at org.apache.spark.deploy.yarn.Client$$anonfun$1.apply(Client.scala:83)
at org.apache.spark.deploy.yarn.Client$$anonfun$1.apply(Client.scala:83)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.deploy.yarn.Client.<init>(Client.scala:82)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1603)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:851)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:926)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:935)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassCastException: org.apache.hadoop.yarn.api.records.impl.pb.PriorityPBImpl cannot be cast to org.apache.hadoop.yarn.api.records.Priority
at org.apache.hadoop.yarn.api.records.Priority.newInstance(Priority.java:39)
at org.apache.hadoop.yarn.api.records.Priority.<clinit>(Priority.java:34)
... 15 more
In the end, what I want is to make Spark use the json4s-jackson_2.11 from my pom rather than the one bundled with Spark Core.
Recommended answer
To solve this, do not use spark.driver.userClassPathFirst and spark.executor.userClassPathFirst. Instead, use spark.driver.extraClassPath and spark.executor.extraClassPath.
Definition from the official documentation: "Extra classpath entries to prepend to the classpath of the driver."
"Prepend" here means the entries are put in front of Spark's core classpath.
Example:
--conf spark.driver.extraClassPath=C:\Users\Khalid\Documents\Projects\libs\jackson-annotations-2.6.0.jar;C:\Users\Khalid\Documents\Projects\libs\jackson-core-2.6.0.jar;C:\Users\Khalid\Documents\Projects\libs\jackson-databind-2.6.0.jar
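The example above sets only the driver option and uses Windows paths joined with ";". On a Linux cluster such as the CDH one in the question, the separator would be ":" and the executors would need the same setting. As a rough sketch under those assumptions, the helper below builds both --conf arguments from a list of jars. The object name ExtraClassPathFlags and the /opt/libs paths are hypothetical, and it assumes the jars already exist at those paths on the machines that run the driver and the executors.

```scala
// Hypothetical helper (not from the original answer) that assembles the
// spark-submit flags for both extraClassPath settings.
object ExtraClassPathFlags {
  // Joins jar paths with the separator of the target machine:
  // ":" on Linux/macOS, ";" on Windows.
  def classPath(jars: Seq[String], separator: String): String =
    jars.mkString(separator)

  // Produces the --conf arguments for the driver and the executors.
  def flags(jars: Seq[String], separator: String): Seq[String] = {
    val cp = classPath(jars, separator)
    Seq(
      s"--conf spark.driver.extraClassPath=$cp",
      s"--conf spark.executor.extraClassPath=$cp"
    )
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical jar locations for the json4s 3.2.11 artifacts.
    val jars = Seq(
      "/opt/libs/json4s-jackson_2.11-3.2.11.jar",
      "/opt/libs/json4s-core_2.11-3.2.11.jar",
      "/opt/libs/json4s-ast_2.11-3.2.11.jar"
    )
    flags(jars, ":").foreach(println)
  }
}
```

The printed lines can be pasted into the spark-submit command. The separator is passed explicitly rather than read from java.io.File.pathSeparator because the machine that builds the command is not necessarily the one that runs the JVM.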
This solved my problem (a conflict between the version of Jackson I wanted to use and the one Spark was using).
Hope it helps.