Spark on YARN + Secured hbase

Problem Description

I am submitting a job to YARN (on Spark 2.1.1 + Kafka 0.10.2.1) which connects to a secured HBase cluster. This job performs just fine when I am running in "local" mode (spark.master=local[*]).

However, as soon as I submit the job with master as YARN (and deploy mode as client), I see the following error message -

Caused by: javax.security.auth.login.LoginException: Unable to obtain password from user

I am following the Hortonworks recommendations for providing the YARN cluster with information about HBase, the keytab, etc. I followed this KB article: https://community.hortonworks.com/content/supportkb/48988/how-to-run-spark-job-to-interact-with-secured-hbas.html

Any pointers on what could be going on?

The mechanism for logging into HBase:

UserGroupInformation.setConfiguration(hbaseConf)
val keyTab = "keytab-location"
val principal = "kerberos-principal"
// log in from the keytab and make this the current login user
val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keyTab)
UserGroupInformation.setLoginUser(ugi)
// open the HBase connection inside the authenticated context
ugi.doAs(new PrivilegedExceptionAction[Void]() {
  override def run: Void = {
    hbaseCon = Some(ConnectionFactory.createConnection(hbaseConf))
    null
  }
})

Also, I tried an alternative mechanism for logging in:

UserGroupInformation.loginUserFromKeytab(principal, keyTab)
connection = ConnectionFactory.createConnection(hbaseConf)

Please advise.

Recommended Answer

You are not alone in the quest for Kerberos auth to HBase from Spark, cf. SPARK-12279

A little-known fact is that Spark now generates Hadoop "auth tokens" for Yarn, HDFS, Hive, and HBase on startup. These tokens are then broadcast to the executors, so that they don't have to mess again with Kerberos auth, keytabs, etc.

The first problem is that it's not explicitly documented, and in case of failure the errors are hidden by default (i.e. most people don't connect to HBase with Kerberos, so it's usually pointless to state that the HBase JARs are not in the CLASSPATH and that no HBase token was created... usually).
To log all details about these tokens, you have to set the log level for org.apache.spark.deploy.yarn.Client to DEBUG.
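
As a minimal sketch of that setting (assuming the stock Log4J 1.x setup that Spark ships with), it is a single line in your log4j.properties:

# enable DEBUG traces for the code that acquires the Hadoop/HBase tokens
log4j.logger.org.apache.spark.deploy.yarn.Client=DEBUG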

The second problem is that beyond the properties, Spark supports many env variables, some documented, some not documented, and some actually deprecated.
For instance, SPARK_CLASSPATH is now deprecated, and its content is actually injected into the Spark properties spark.driver.extraClassPath / spark.executor.extraClassPath.
But SPARK_DIST_CLASSPATH is still in use; in the Cloudera distro, for example, it is used to inject the core Hadoop libs & config into the Spark "launcher", so that it can bootstrap a YARN-cluster execution before the driver is started (i.e. before spark.driver.extraClassPath is evaluated).
Other variables of interest are

  • HADOOP_CONF_DIR
  • SPARK_CONF_DIR
  • SPARK_EXTRA_LIB_PATH
  • SPARK_SUBMIT_OPTS
  • SPARK_PRINT_LAUNCH_COMMAND
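
A quick way to check what the launcher actually builds (effective classpath included) is to print the launch command; this is just a hedged sketch, reusing the TestSparkHBase job from the scripts below as a placeholder:

export SPARK_PRINT_LAUNCH_COMMAND=True
spark-submit --master yarn-client --class TestSparkHBase TestSparkHBase.jar
# the full "java ..." command line is printed before the launcher starts the JVM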

The third problem is that, in some specific cases (e.g. YARN-cluster mode in the Cloudera distro), the Spark property spark.yarn.tokens.hbase.enabled is silently set to false -- which makes absolutely no sense, since that default is hard-coded to true in the Spark source code...!
So you are advised to force it explicitly to true in your job config.
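
For example, either on the command line (as in the scripts below) or statically in the defaults file:

# $SPARK_CONF_DIR/spark-defaults.conf
spark.yarn.tokens.hbase.enabled   true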

The fourth problem is that, even if the HBase token has been created at startup, the executors must explicitly use it to authenticate. Fortunately, Cloudera has contributed a "Spark connector" to HBase, to take care of this kind of nasty stuff automatically. It's now part of the HBase client, by default (cf. hbase-spark*.jar).
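
As a minimal sketch of using it (assuming hbase-site.xml and the hbase-spark jar are on the driver classpath, and sc is your existing SparkContext):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.spark.HBaseContext

// build the HBase configuration from the hbase-site.xml found on the classpath
val hbaseConf = HBaseConfiguration.create()
// the HBaseContext ships that configuration (and the HBase auth token, when present)
// to the executors, so that they can open authenticated connections
val hbaseContext = new HBaseContext(sc, hbaseConf)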

The fifth problem is that, AFAIK, if you don't have metrics-core*.jar in the CLASSPATH then the HBase connections will fail with puzzling (and unrelated) ZooKeeper errors.


¤¤¤¤¤ How to make that stuff work, with debug traces

# we assume that spark-env.sh and spark-default.conf are already Hadoop-ready,
# and also *almost* HBase-ready (as in a CDH distro);
# especially HADOOP_CONF_DIR and SPARK_DIST_CLASSPATH are expected to be set
# but spark.*.extraClassPath / .extraJavaOptions are expected to be unset

KRB_DEBUG_OPTS="-Dlog4j.logger.org.apache.spark.deploy.yarn.Client=DEBUG -Dlog4j.logger.org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper=DEBUG -Dlog4j.logger.org.apache.hadoop.hbase.client.ConnectionManager\$HConnectionImplementation=DEBUG -Dlog4j.logger.org.apache.hadoop.hbase.spark.HBaseContext=DEBUG -Dsun.security.krb5.debug=true -Djava.security.debug=gssloginconfig,configfile,configparser,logincontext"
EXTRA_HBASE_CP=/etc/hbase/conf/:/opt/cloudera/parcels/CDH/lib/hbase/hbase-spark.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/metrics-core-2.2.0.jar

export SPARK_SUBMIT_OPTS="$KRB_DEBUG_OPTS"
export HADOOP_JAAS_DEBUG=true
export SPARK_PRINT_LAUNCH_COMMAND=True

spark-submit --master yarn-client \
  --files "/etc/spark/conf/log4j.properties#yarn-log4j.properties" \
  --principal XX@Z.NET --keytab /a/b/XX.keytab \
  --conf spark.yarn.tokens.hbase.enabled=true \
  --conf spark.driver.extraClassPath=$EXTRA_HBASE_CP \
  --conf spark.executor.extraClassPath=$EXTRA_HBASE_CP \
  --conf "spark.executor.extraJavaOptions=$KRB_DEBUG_OPTS -Dlog4j.configuration=yarn-log4j.properties" \
  --conf spark.executorEnv.HADOOP_JAAS_DEBUG=true \
  --class TestSparkHBase  TestSparkHBase.jar

spark-submit --master yarn-cluster --conf spark.yarn.report.interval=4000 \
  --files "/etc/spark/conf/log4j.properties#yarn-log4j.properties" \
  --principal XX@Z.NET --keytab /a/b/XX.keytab \
  --conf spark.yarn.tokens.hbase.enabled=true \
  --conf spark.driver.extraClassPath=$EXTRA_HBASE_CP \
  --conf "spark.driver.extraJavaOptions=$KRB_DEBUG_OPTS -Dlog4j.configuration=yarn-log4j.properties" \
  --conf spark.yarn.appMasterEnv.HADOOP_JAAS_DEBUG=true \
  --conf spark.executor.extraClassPath=$EXTRA_HBASE_CP \
  --conf "spark.executor.extraJavaOptions=$KRB_DEBUG_OPTS -Dlog4j.configuration=yarn-log4j.properties" \
  --conf spark.executorEnv.HADOOP_JAAS_DEBUG=true \
  --class TestSparkHBase  TestSparkHBase.jar

PS: when using an HBaseContext you don't need /etc/hbase/conf/ in the executor's CLASSPATH; the conf is propagated automatically.

PPS: I advise you to set log4j.logger.org.apache.zookeeper.ZooKeeper=WARN in log4j.properties, because that logger is verbose, useless, and even confusing (all the interesting stuff is logged at the HBase level).

PPPS: instead of that verbose SPARK_SUBMIT_OPTS var, you could also list the Log4J options statically in $SPARK_CONF_DIR/log4j.properties and the rest in $SPARK_CONF_DIR/java-opts; the same goes for the Spark properties in $SPARK_CONF_DIR/spark-defaults.conf and the env variables in $SPARK_CONF_DIR/spark-env.sh.
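
As an illustration, a possible static split of the settings used in the debug scripts above (contents copied from those scripts; adjust to your own setup):

$SPARK_CONF_DIR/log4j.properties:
  log4j.logger.org.apache.spark.deploy.yarn.Client=DEBUG
  log4j.logger.org.apache.hadoop.hbase.spark.HBaseContext=DEBUG
  log4j.logger.org.apache.zookeeper.ZooKeeper=WARN

$SPARK_CONF_DIR/java-opts:
  -Dsun.security.krb5.debug=true -Djava.security.debug=gssloginconfig,configfile,configparser,logincontext

$SPARK_CONF_DIR/spark-defaults.conf:
  spark.yarn.tokens.hbase.enabled   true

$SPARK_CONF_DIR/spark-env.sh:
  export HADOOP_JAAS_DEBUG=true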


¤¤¤¤¤ About the "Spark connector" for HBase

Excerpt from the official HBase documentation, chapter 83 Basic Spark

At the root of all Spark and HBase integration is the HBaseContext. The HBaseContext takes in HBase configurations and pushes them to the Spark executors. This allows us to have an HBase Connection per Spark Executor in a static location.

What is not mentioned in the doc is that the HBaseContext automatically uses the HBase "auth token" (when present) to authenticate the executors.

Note also that the doc has an example (in Scala then in Java) of a Spark foreachPartition operation on an RDD, using a BufferedMutator for an async bulk load into HBase.
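
In the spirit of that example, here is a hedged Scala sketch; the table name "t1", the row layout, and the RDD element type are made up for illustration, and it assumes an HBaseContext named hbaseContext has already been created (as sketched earlier):

import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.Put

// rdd elements are assumed to be (rowKey, cells) pairs,
// where cells is a sequence of (family, qualifier, value) byte arrays
hbaseContext.foreachPartition(rdd, (it, conn) => {
  // one BufferedMutator per partition, flushed and closed when the partition is done
  val mutator = conn.getBufferedMutator(TableName.valueOf("t1"))
  it.foreach { case (rowKey, cells) =>
    val put = new Put(rowKey)
    cells.foreach { case (family, qualifier, value) => put.addColumn(family, qualifier, value) }
    mutator.mutate(put)
  }
  mutator.flush()
  mutator.close()
})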
