How to configure Spark 2.4 correctly with user-provided Hadoop

Problem description

I'd like to use Spark 2.4.5 (the current stable Spark version) and Hadoop 2.10 (the current stable Hadoop version in the 2.x series). Further I need to access HDFS, Hive, S3, and Kafka.

http://spark.apache.org provides Spark 2.4.5 pre-built and bundled with either Hadoop 2.6 or Hadoop 2.7. Another option is to use Spark with user-provided Hadoop, so I tried that one.

As a consequence of using the build with user-provided Hadoop, Spark does not include the Hive libraries either. This leads to an error, as described here: How to create SparkSession with Hive support (fails with "Hive classes are not found")?
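A minimal sketch of the call that triggers that error, assuming a plain spark-shell or application without spark-hive on the classpath (the app name is just illustrative):

import org.apache.spark.sql.SparkSession

// enableHiveSupport() throws IllegalArgumentException("Unable to instantiate
// SparkSession with Hive support because Hive classes are not found") when the
// Hive classes are missing from the classpath.
val spark = SparkSession.builder()
  .appName("hive-support-check")
  .enableHiveSupport()
  .getOrCreate()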

When I add the spark-hive dependency to the spark-shell (spark-submit is affected as well) by using

spark.jars.packages=org.apache.spark:spark-hive_2.11:2.4.5

in spark-defaults.conf, I get this error:

20/02/26 11:20:45 ERROR spark.SparkContext: 
Failed to add file:/root/.ivy2/jars/org.apache.avro_avro-mapred-1.8.2.jar to Spark environment
java.io.FileNotFoundException: Jar /root/.ivy2/jars/org.apache.avro_avro-mapred-1.8.2.jar not found
at org.apache.spark.SparkContext.addJarFile$1(SparkContext.scala:1838)
at org.apache.spark.SparkContext.addJar(SparkContext.scala:1868)
at org.apache.spark.SparkContext.$anonfun$new$11(SparkContext.scala:458)
at org.apache.spark.SparkContext.$anonfun$new$11$adapted(SparkContext.scala:458)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:458)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$5(SparkSession.scala:935)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
at org.apache.spark.repl.Main$.createSparkSession(Main.scala:106)

because spark-shell cannot handle classifiers together with bundle dependencies, see https://github.com/apache/spark/pull/21339 and https://github.com/apache/spark/pull/17416.

A workaround for the classifier problem looks like this:

$ cp .../.ivy2/jars/org.apache.avro_avro-mapred-1.8.2-hadoop2.jar .../.ivy2/jars/org.apache.avro_avro-mapred-1.8.2.jar

but DevOps won't accept this.

The complete list of dependencies looks like this (I have added line breaks for better readability):

root@a5a04d888f85:/opt/spark-2.4.5/conf# cat spark-defaults.conf
spark.jars.packages=com.fasterxml.jackson.datatype:jackson-datatype-jdk8:2.9.10,
com.fasterxml.jackson.datatype:jackson-datatype-jsr310:2.9.10,
org.apache.spark:spark-hive_2.11:2.4.5,
org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5,
org.apache.hadoop:hadoop-aws:2.10.0,
io.delta:delta-core_2.11:0.5.0,
org.postgresql:postgresql:42.2.5,
mysql:mysql-connector-java:8.0.18,
com.datastax.spark:spark-cassandra-connector_2.11:2.4.3,
io.prestosql:presto-jdbc:307

(everything works - except for Hive)

  • Is the combination of Spark 2.4.5 and Hadoop 2.10 used anywhere? How?
  • How to combine Spark 2.4.5 with user-provided Hadoop and Hadoop 2.9 or 2.10?
  • Is it necessary to build Spark to get around the Hive dependency problem?

Answer

There does not seem to be an easy way to configure Spark 2.4.5 with user-provided Hadoop to use Hadoop 2.10.0.

As my task actually was to minimize dependency problems, I chose to compile Spark 2.4.5 against Hadoop 2.10.0:

./dev/make-distribution.sh \
  --name hadoop-2.10.0 \
  --tgz \
  -Phadoop-2.7 -Dhadoop.version=2.10.0 \
  -Phive -Phive-thriftserver \
  -Pyarn

Now Maven deals with the Hive dependencies/classifiers, and the resulting package is ready to be used.
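To double-check Hive support in the self-built distribution, a short smoke test like the following sketch can be pasted into spark-shell (the table name is illustrative only):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-smoke-test")
  .enableHiveSupport()   // no longer fails with "Hive classes are not found"
  .getOrCreate()

spark.sql("SHOW DATABASES").show()
spark.sql("CREATE TABLE IF NOT EXISTS hive_smoke_test (id INT) USING hive")  // illustrative Hive table
spark.sql("SHOW TABLES").show()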

In my personal opinion, compiling Spark is actually easier than configuring the Spark build with user-provided Hadoop.

Integration tests so far have not shown any problems; Spark can access both HDFS and S3 (MinIO).
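As a rough sketch, such a check can look like this; the MinIO endpoint, credentials, bucket and paths below are placeholders, not values from the original setup:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("storage-smoke-test")
  .config("spark.hadoop.fs.s3a.endpoint", "http://minio.example:9000")  // placeholder MinIO endpoint
  .config("spark.hadoop.fs.s3a.access.key", sys.env.getOrElse("AWS_ACCESS_KEY_ID", ""))
  .config("spark.hadoop.fs.s3a.secret.key", sys.env.getOrElse("AWS_SECRET_ACCESS_KEY", ""))
  .config("spark.hadoop.fs.s3a.path.style.access", "true")  // usually required for MinIO
  .getOrCreate()

// Read from HDFS and write the same data back out to S3 (placeholder paths)
val df = spark.read.parquet("hdfs:///data/events")
df.write.mode("overwrite").parquet("s3a://test-bucket/events-copy")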

Update 2021-04-08

If you want to add support for Kubernetes, just add -Pkubernetes to the list of arguments.
