Accessing HDFS from Spark gives TokenCache error: Can't get Master Kerberos principal for use as renewer
Question
I'm trying to run a test Spark script in order to connect Spark to Hadoop. The script is the following:
from pyspark import SparkContext
sc = SparkContext("local", "Simple App")
file = sc.textFile("hdfs://hadoop_node.place:9000/errs.txt")
errors = file.filter(lambda line: "ERROR" in line)
errors.count()
When I run it with pyspark, I get:
py4j.protocol.Py4JJavaError: An error occurred while calling o21.collect.
: java.io.IOException: Can't get Master Kerberos principal for use as renewer
    at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:116)
    at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
    at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:187)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:251)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:140)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
    at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
    at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:46)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:898)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:608)
    at org.apache.spark.api.java.JavaRDDLike$class.collect(JavaRDDLike.scala:243)
    at org.apache.spark.api.java.JavaRDD.collect(JavaRDD.scala:27)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:744)
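This happens despite the fact that:
- I have run kinit, and klist shows that I have the correct ticket
- when I issue ./bin/hadoop fs -ls hdfs://hadoop_node.place:9000/errs.txt it shows the file
- both the local Hadoop client and Spark have the same configuration file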
The core-site.xml in the spark/conf and hadoop/conf folders is the following (I got it from one of the Hadoop nodes):
<configuration>
<property>
<name>hadoop.security.auth_to_local</name>
<value>
RULE:[1:$1](.*@place)s/@place//
RULE:[2:$1/$2@$0](.*/node1.place@place)s/^([a-zA-Z]*).*/$1/
RULE:[2:$1/$2@$0](.*/node2.place@place)s/^([a-zA-Z]*).*/$1/
RULE:[2:$1/$2@$0](.*/node3.place@place)s/^([a-zA-Z]*).*/$1/
RULE:[2:$1/$2@$0](.*/node4.place@place)s/^([a-zA-Z]*).*/$1/
RULE:[2:$1/$2@$0](.*/node5.place@place)s/^([a-zA-Z]*).*/$1/
RULE:[2:$1/$2@$0](.*/node6.place@place)s/^([a-zA-Z]*).*/$1/
RULE:[2:$1/$2@$0](.*/node7.place@place)s/^([a-zA-Z]*).*/$1/
RULE:[2:nobody]
DEFAULT
</value>
</property>
<property>
<name>net.topology.node.switch.mapping.impl</name>
<value>org.apache.hadoop.net.TableMapping</value>
</property>
<property>
<name>net.topology.table.file.name</name>
<value>/etc/hadoop/conf/topology.table.file</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://server.place:9000/</value>
</property>
<property>
<name>hadoop.security.authentication</name>
<value>kerberos</value>
</property>
<property>
<name>hadoop.security.authorization</name>
<value>true</value>
</property>
<property>
<name>hadoop.proxyuser.hive.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hive.groups</name>
<value>*</value>
</property>
</configuration>
Can someone point out what I am missing?
Recommended answer
I fixed it after creating my own Hadoop cluster in order to better understand how Hadoop works.
You have to provide Spark with a valid .keytab file that has been generated for an account with at least read access to the Hadoop cluster.
Also, you have to provide Spark with the hdfs-site.xml of your HDFS cluster.
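One possible way to put a keytab to use before starting the SparkContext is to obtain a ticket from it with kinit. The sketch below is only an illustration under that assumption, not necessarily what was done here; the helper names, the keytab path, and the principal in the comment are hypothetical.

```python
import subprocess


def kinit_command(keytab_path, principal):
    """Return the argv for obtaining a Kerberos ticket from a keytab."""
    # -k: authenticate with a keytab instead of prompting for a password.
    # -t: location of the keytab file to use.
    return ["kinit", "-k", "-t", keytab_path, principal]


def login_from_keytab(keytab_path, principal):
    """Run kinit; raises CalledProcessError if the login fails."""
    subprocess.run(kinit_command(keytab_path, principal), check=True)


# Hypothetical usage, with a placeholder path and principal:
# login_from_keytab("/path/to/your.keytab",
#                   "host/fully.qualified.domain.name@REALM.COM")
```

Calling login_from_keytab before creating the SparkContext would make the ticket available to the local JVM that pyspark starts, assuming kinit is installed and on the PATH.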
So in my case, I had to create a keytab file which, when you run
klist -k -e -t
on it, gives you entries like the following:
host/fully.qualified.domain.name@REALM.COM
In my case, "host" was the literal word host and not a variable. Also, in your hdfs-site.xml you have to provide the path of the keytab file and specify that
host/_HOST@REALM.COM
will be your account.
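As a rough sanity check for the hdfs-site.xml settings described above, the following sketch parses a Hadoop site file and reports which Kerberos entries are missing. The property names dfs.namenode.kerberos.principal and dfs.namenode.keytab.file are assumptions about which settings apply (your cluster may use other or additional ones), and the keytab path in the sample is a placeholder.

```python
import xml.etree.ElementTree as ET

# Assumed property names; check which ones your cluster's hdfs-site.xml uses.
REQUIRED = ("dfs.namenode.kerberos.principal", "dfs.namenode.keytab.file")


def read_hadoop_properties(xml_text):
    """Parse a Hadoop *-site.xml document into a {name: value} dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def missing_kerberos_settings(props, required=REQUIRED):
    """Return the required property names that are absent or empty."""
    return [name for name in required if not props.get(name)]


# Sample file content; the keytab path is a hypothetical placeholder.
SAMPLE = """
<configuration>
  <property>
    <name>dfs.namenode.kerberos.principal</name>
    <value>host/_HOST@REALM.COM</value>
  </property>
  <property>
    <name>dfs.namenode.keytab.file</name>
    <value>/path/to/your.keytab</value>
  </property>
</configuration>
"""

props = read_hadoop_properties(SAMPLE)
print(missing_kerberos_settings(props))
```

Hadoop replaces _HOST in the principal with the fully qualified hostname of the node at runtime, which is why the file can be shared unchanged across nodes.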
Cloudera has a pretty detailed writeup on how to do it.
Edit: After experimenting a little with different configurations, I think the following should be noted. You have to provide Spark with the exact hdfs-site.xml and core-site.xml of your Hadoop cluster. Otherwise it won't work.
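To check that Spark really has the exact same files as the cluster, the two configuration directories can be compared byte for byte. This is a minimal sketch; the function names are hypothetical and the directory arguments are whatever paths your installation uses.

```python
import hashlib
from pathlib import Path

CONFIG_FILES = ("core-site.xml", "hdfs-site.xml")


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def mismatched_configs(cluster_conf_dir, spark_conf_dir, names=CONFIG_FILES):
    """Return the file names that are missing from either directory
    or whose contents differ between the two directories."""
    mismatched = []
    for name in names:
        cluster_file = Path(cluster_conf_dir) / name
        spark_file = Path(spark_conf_dir) / name
        if not (cluster_file.is_file() and spark_file.is_file()):
            mismatched.append(name)
        elif file_digest(cluster_file) != file_digest(spark_file):
            mismatched.append(name)
    return mismatched
```

An empty list from mismatched_configs, called with a copy of the cluster's conf directory and Spark's conf directory, means both files are present and identical in the two locations.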