How to configure Spark 2.4 correctly with user-provided Hadoop
Question
I'd like to use Spark 2.4.5 (the current stable Spark version) and Hadoop 2.10 (the current stable Hadoop version in the 2.x series). Furthermore, I need to access HDFS, Hive, S3, and Kafka.
http://spark.apache.org provides Spark 2.4.5 pre-built and bundled with either Hadoop 2.6 or Hadoop 2.7. Another option is to use Spark with user-provided Hadoop, so I tried that one.
As a consequence of using the build with user-provided Hadoop, Spark does not include the Hive libraries either. This leads to an error, as described here: How to create SparkSession with Hive support (fails with "Hive classes are not found")?
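For context, a minimal sketch of the call that fails in this situation (the app name is just an illustrative placeholder):

import org.apache.spark.sql.SparkSession

// Without spark-hive on the classpath, enableHiveSupport() throws
// java.lang.IllegalArgumentException: Unable to instantiate SparkSession
// with Hive support because Hive classes are not found.
val spark = SparkSession.builder()
  .appName("hive-support-check")
  .enableHiveSupport()
  .getOrCreate()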
When I add the spark-hive dependency to the spark-shell (spark-submit is affected as well) by using
spark.jars.packages=org.apache.spark:spark-hive_2.11:2.4.5
in spark-defaults.conf, I get this error:
20/02/26 11:20:45 ERROR spark.SparkContext:
Failed to add file:/root/.ivy2/jars/org.apache.avro_avro-mapred-1.8.2.jar to Spark environment
java.io.FileNotFoundException: Jar /root/.ivy2/jars/org.apache.avro_avro-mapred-1.8.2.jar not found
at org.apache.spark.SparkContext.addJarFile$1(SparkContext.scala:1838)
at org.apache.spark.SparkContext.addJar(SparkContext.scala:1868)
at org.apache.spark.SparkContext.$anonfun$new$11(SparkContext.scala:458)
at org.apache.spark.SparkContext.$anonfun$new$11$adapted(SparkContext.scala:458)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:458)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$5(SparkSession.scala:935)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
at org.apache.spark.repl.Main$.createSparkSession(Main.scala:106)
because spark-shell cannot handle classifiers together with bundle dependencies, see https://github.com/apache/spark/pull/21339 and https://github.com/apache/spark/pull/17416
A workaround for the classifier problem looks like this:
$ cp .../.ivy2/jars/org.apache.avro_avro-mapred-1.8.2-hadoop2.jar .../.ivy2/jars/org.apache.avro_avro-mapred-1.8.2.jar
but DevOps won't accept this.
The complete list of dependencies looks like this (I have added line breaks for better readability):
root@a5a04d888f85:/opt/spark-2.4.5/conf# cat spark-defaults.conf
spark.jars.packages=com.fasterxml.jackson.datatype:jackson-datatype-jdk8:2.9.10,
com.fasterxml.jackson.datatype:jackson-datatype-jsr310:2.9.10,
org.apache.spark:spark-hive_2.11:2.4.5,
org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5,
org.apache.hadoop:hadoop-aws:2.10.0,
io.delta:delta-core_2.11:0.5.0,
org.postgresql:postgresql:42.2.5,
mysql:mysql-connector-java:8.0.18,
com.datastax.spark:spark-cassandra-connector_2.11:2.4.3,
io.prestosql:presto-jdbc:307
(everything works - except for Hive)
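As an illustration of the non-Hive part working, a quick sanity check of the Kafka connector pulled in via spark.jars.packages could look like the sketch below (broker address and topic name are placeholders):

// Batch read via spark-sql-kafka-0-10; broker and topic are illustrative only
val kafkaDf = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "test-topic")
  .load()

kafkaDf.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").show(5, false)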
- Is the combination of Spark 2.4.5 and Hadoop 2.10 used anywhere? How?
- How to combine Spark 2.4.5 with user-provided Hadoop and Hadoop 2.9 or 2.10?
- Is it necessary to build Spark to get around the Hive dependency problem?
Answer
There does not seem to be an easy way to configure Spark 2.4.5 with user-provided Hadoop to use Hadoop 2.10.0.
As my task actually was to minimize dependency problems, I have chosen to compile Spark 2.4.5 against Hadoop 2.10.0.
./dev/make-distribution.sh \
--name hadoop-2.10.0 \
--tgz \
-Phadoop-2.7 -Dhadoop.version=2.10.0 \
-Phive -Phive-thriftserver \
-Pyarn
Now Maven deals with the Hive dependencies/classifiers, and the resulting package is ready to be used.
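One possible sanity check in spark-shell from the rebuilt package, for example:

// The default session of a Hive-enabled build should use the Hive catalog
println(spark.conf.get("spark.sql.catalogImplementation"))  // expected: hive

// Listing databases goes through the Hive catalog support
spark.sql("SHOW DATABASES").show()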
In my personal opinion, compiling Spark is actually easier than configuring Spark with user-provided Hadoop.
Integration tests so far have not shown any problems; Spark can access both HDFS and S3 (MinIO).
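For the S3 (MinIO) part, a sketch of the S3A settings such a test can use; the endpoint, credentials, and bucket path below are placeholders:

// S3A configuration for a MinIO endpoint (all values are illustrative placeholders)
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.endpoint", "http://minio:9000")
hadoopConf.set("fs.s3a.access.key", "ACCESS_KEY")
hadoopConf.set("fs.s3a.secret.key", "SECRET_KEY")
hadoopConf.set("fs.s3a.path.style.access", "true")

// Simple round trip: write a small dataset and read it back
spark.range(10).write.mode("overwrite").parquet("s3a://test-bucket/spark-check")
spark.read.parquet("s3a://test-bucket/spark-check").count()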
Update 2021-04-08
If you want to add support for Kubernetes, just add -Pkubernetes to the list of arguments.