Spark on YARN + Secured HBase
Problem description

I am submitting a job to YARN (on Spark 2.1.1 + Kafka 0.10.2.1) which connects to a secured HBase cluster. This job performs just fine when I am running in "local" mode (spark.master=local[*]).
However, as soon as I submit the job with master as YARN (and deploy mode as client), I see the following error message -
Caused by: javax.security.auth.login.LoginException: Unable to obtain password from user
I am following the Hortonworks recommendations for providing the YARN cluster with information about HBase, the keytab, etc. I followed this KB article - https://community.hortonworks.com/content/supportkb/48988/how-to-run-spark-job-to-interact-with-secured-hbas.html
Any pointers on what could be going on?
The mechanism for logging into HBase:
UserGroupInformation.setConfiguration(hbaseConf)
val keyTab = "keytab-location"
val principal = "kerberos-principal"
val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keyTab)
UserGroupInformation.setLoginUser(ugi)
ugi.doAs(new PrivilegedExceptionAction[Void]() {
  override def run: Void = {
    hbaseCon = Some(ConnectionFactory.createConnection(hbaseConf))
    null
  }
})
Also, I tried the alternative mechanism for logging in:
UserGroupInformation.loginUserFromKeytab(principal, keyTab)
connection = ConnectionFactory.createConnection(hbaseConf)
Please advise.

Answer
You are not alone in the quest for Kerberos auth to HBase from Spark, cf. SPARK-12279
A little-known fact is that Spark now generates Hadoop "auth tokens" for YARN, HDFS, Hive and HBase on startup. These tokens are then broadcast to the executors, so that they don't have to mess again with Kerberos auth, keytabs, etc.
The first problem is that it's not explicitly documented, and in case of failure the errors are hidden by default (i.e. most people don't connect to HBase with Kerberos, so it's usually pointless to state that the HBase JARs are not in the CLASSPATH and that no HBase token was created... usually.)
To log all details about these tokens, you have to set the log level for org.apache.spark.deploy.yarn.Client to DEBUG.
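For instance, assuming a standard Log4J 1.x setup, one line in the log4j.properties picked up by the launcher is enough to surface the token-creation details:

```properties
# Surface the details of Hadoop "auth token" creation at submit time
log4j.logger.org.apache.spark.deploy.yarn.Client=DEBUG
```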
The second problem is that, beyond the properties, Spark supports many env variables, some documented, some not documented, and some actually deprecated.
For instance, SPARK_CLASSPATH is now deprecated, and its content is actually injected into the Spark properties spark.driver.extraClassPath / spark.executor.extraClassPath.
But SPARK_DIST_CLASSPATH is still in use; in the Cloudera distro, for example, it is used to inject the core Hadoop libs & config into the Spark "launcher", so that it can bootstrap a YARN-cluster execution before the driver is started (i.e. before spark.driver.extraClassPath is evaluated).
Other variables of interest are
HADOOP_CONF_DIR
SPARK_CONF_DIR
SPARK_EXTRA_LIB_PATH
SPARK_SUBMIT_OPTS
SPARK_PRINT_LAUNCH_COMMAND
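As an illustration (the paths below are hypothetical, adjust them to your own distro layout), these are typically set in spark-env.sh or exported before calling spark-submit:

```shell
# Hypothetical example values -- adjust to your own installation
export HADOOP_CONF_DIR=/etc/hadoop/conf      # where core-site.xml, hdfs-site.xml live
export SPARK_CONF_DIR=/etc/spark/conf        # where spark-defaults.conf, spark-env.sh live
export SPARK_PRINT_LAUNCH_COMMAND=True       # dump the full java command before launch
```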
The third problem is that, in some specific cases (e.g. YARN-cluster mode in the Cloudera distro), the Spark property spark.yarn.tokens.hbase.enabled is silently set to false -- which makes absolutely no sense, since that default is hard-coded to true in the Spark source code...!
So you are advised to force it explicitly to true in your job config.
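In practice that means either a --conf flag on spark-submit, or a line in spark-defaults.conf:

```properties
# Force HBase delegation-token creation even if the distro default disables it
spark.yarn.tokens.hbase.enabled  true
```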
The fourth problem is that, even if the HBase token has been created at startup, the executors must explicitly use it to authenticate. Fortunately, Cloudera has contributed a "Spark connector" to HBase, to take care of this kind of nasty stuff automatically. It's now part of the HBase client by default (cf. hbase-spark*.jar).
The fifth problem is that, AFAIK, if you don't have metrics-core*.jar in the CLASSPATH then the HBase connections will fail with puzzling (and unrelated) ZooKeeper errors.
¤¤¤¤¤ How to make that stuff work, with debug traces
# we assume that spark-env.sh and spark-default.conf are already Hadoop-ready,
# and also *almost* HBase-ready (as in a CDH distro);
# especially HADOOP_CONF_DIR and SPARK_DIST_CLASSPATH are expected to be set
# but spark.*.extraClassPath / .extraJavaOptions are expected to be unset
KRB_DEBUG_OPTS="-Dlog4j.logger.org.apache.spark.deploy.yarn.Client=DEBUG -Dlog4j.logger.org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper=DEBUG -Dlog4j.logger.org.apache.hadoop.hbase.client.ConnectionManager\$HConnectionImplementation=DEBUG -Dlog4j.logger.org.apache.hadoop.hbase.spark.HBaseContext=DEBUG -Dsun.security.krb5.debug=true -Djava.security.debug=gssloginconfig,configfile,configparser,logincontext"
EXTRA_HBASE_CP=/etc/hbase/conf/:/opt/cloudera/parcels/CDH/lib/hbase/hbase-spark.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/metrics-core-2.2.0.jar
export SPARK_SUBMIT_OPTS="$KRB_DEBUG_OPTS"
export HADOOP_JAAS_DEBUG=true
export SPARK_PRINT_LAUNCH_COMMAND=True
spark-submit --master yarn-client \
--files "/etc/spark/conf/log4j.properties#yarn-log4j.properties" \
--principal XX@Z.NET --keytab /a/b/XX.keytab \
--conf spark.yarn.tokens.hbase.enabled=true \
--conf spark.driver.extraClassPath=$EXTRA_HBASE_CP \
--conf spark.executor.extraClassPath=$EXTRA_HBASE_CP \
--conf "spark.executor.extraJavaOptions=$KRB_DEBUG_OPTS -Dlog4j.configuration=yarn-log4j.properties" \
--conf spark.executorEnv.HADOOP_JAAS_DEBUG=true \
--class TestSparkHBase TestSparkHBase.jar
spark-submit --master yarn-cluster --conf spark.yarn.report.interval=4000 \
--files "/etc/spark/conf/log4j.properties#yarn-log4j.properties" \
--principal XX@Z.NET --keytab /a/b/XX.keytab \
--conf spark.yarn.tokens.hbase.enabled=true \
--conf spark.driver.extraClassPath=$EXTRA_HBASE_CP \
--conf "spark.driver.extraJavaOptions=$KRB_DEBUG_OPTS -Dlog4j.configuration=yarn-log4j.properties" \
--conf spark.driverEnv.HADOOP_JAAS_DEBUG=true \
--conf spark.executor.extraClassPath=$EXTRA_HBASE_CP \
--conf "spark.executor.extraJavaOptions=$KRB_DEBUG_OPTS -Dlog4j.configuration=yarn-log4j.properties" \
--conf spark.executorEnv.HADOOP_JAAS_DEBUG=true \
--class TestSparkHBase TestSparkHBase.jar
PS: when using a HBaseContext you don't need /etc/hbase/conf/ in the executor's CLASSPATH; the conf is propagated automatically.
PPS: I advise you to set log4j.logger.org.apache.zookeeper.ZooKeeper=WARN in log4j.properties because it's verbose, useless, and even confusing (all the interesting stuff is logged at the HBase level).
PPS: instead of that verbose SPARK_SUBMIT_OPTS var, you could also list the Log4J options statically in $SPARK_CONF_DIR/log4j.properties and the rest in $SPARK_CONF_DIR/java-opts; the same goes for the Spark properties in $SPARK_CONF_DIR/spark-defaults.conf and the env variables in $SPARK_CONF_DIR/spark-env.sh.
¤¤¤¤¤ About the "Spark connector" to HBase
Excerpt from the official HBase documentation, chapter 83 Basic Spark
At the root of all Spark and HBase integration is the HBaseContext. The HBaseContext takes in HBase configurations and pushes them to the Spark executors. This allows us to have an HBase Connection per Spark Executor in a static location.
What is not mentioned in the doc is that the HBaseContext automatically uses the HBase "auth token" (when present) to authenticate the executors.
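As a minimal sketch of the driver-side setup (class and package names as in the hbase-spark module; the app name is a placeholder, and hbase-site.xml is assumed to be on the driver CLASSPATH):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("TestSparkHBase"))

// Picks up hbase-site.xml from the driver CLASSPATH
val hbaseConf = HBaseConfiguration.create()

// Serializes the config and ships it to the executors,
// along with the HBase auth token when one is present
val hbaseContext = new HBaseContext(sc, hbaseConf)
```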
Note also that the doc has an example (in Scala, then in Java) of a Spark foreachPartition operation on an RDD, using a BufferedMutator for async bulk load into HBase.
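In the spirit of that documentation example (a sketch, not a verbatim copy: the table name, column family and RDD contents here are made up, and hbaseContext is an already-constructed HBaseContext):

```scala
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.util.Bytes

// rdd: RDD[(String, String)] of (rowkey, value) pairs -- hypothetical
hbaseContext.foreachPartition(rdd, (it, connection) => {
  // One BufferedMutator per partition: mutations are buffered
  // and flushed in batches instead of one RPC per Put
  val mutator = connection.getBufferedMutator(TableName.valueOf("t1"))
  it.foreach { case (rowKey, value) =>
    val put = new Put(Bytes.toBytes(rowKey))
    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
    mutator.mutate(put)
  }
  mutator.flush()
  mutator.close()
})
```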