Spark on YARN + Secured hbase

Problem description

I am submitting a job to YARN (Spark 2.1.1 + Kafka 0.10.2.1) which connects to a secured HBase cluster. This job performs just fine when I am running in "local" mode (spark.master=local[*]).

However, as soon as I submit the job with master as YARN (and deploy mode as client), I see the following error message:

    Caused by: javax.security.auth.login.LoginException: Unable to obtain password from user
    

I am following the Hortonworks recommendations for providing information to the YARN cluster regarding HBase, the keytab, etc. I followed this KB article - https://community.hortonworks.com/content/supportkb/48988/how-to-run-spark-job-to-interact-with-secured-hbas.html

Any pointers on what could be going on?

The mechanism for logging into HBase:

    import java.security.PrivilegedExceptionAction
    import org.apache.hadoop.hbase.client.ConnectionFactory
    import org.apache.hadoop.security.UserGroupInformation

    // Log in from the keytab, make that UGI the login user, then open the
    // HBase connection inside a doAs() block so it runs as that principal
    UserGroupInformation.setConfiguration(hbaseConf)
    val keyTab = "keytab-location"
    val principal = "kerberos-principal"
    val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keyTab)
    UserGroupInformation.setLoginUser(ugi)
    ugi.doAs(new PrivilegedExceptionAction[Void]() {
      override def run(): Void = {
        hbaseCon = Some(ConnectionFactory.createConnection(hbaseConf))
        null
      }
    })
    

Also, I tried the alternative mechanism for logging in:

    UserGroupInformation.loginUserFromKeytab(principal, keyTab)
    connection=ConnectionFactory.createConnection(hbaseConf)
    

Please suggest.

Solution

You are not alone in the quest for Kerberos auth to HBase from Spark, cf. SPARK-12279.

A little-known fact is that Spark now generates Hadoop "auth tokens" for YARN, HDFS, Hive and HBase on startup. These tokens are then broadcast to the executors, so that they don't have to mess again with Kerberos auth, keytabs, etc.
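
If you want to check from the driver whether such a token was actually obtained, here is a minimal sketch, assuming only the standard Hadoop UserGroupInformation API (the HBase token normally reports the kind "HBASE_AUTH_TOKEN"):

    import org.apache.hadoop.security.UserGroupInformation
    import scala.collection.JavaConverters._

    // List every delegation token attached to the current user; a successful
    // HBase token fetch should show up with kind "HBASE_AUTH_TOKEN"
    UserGroupInformation.getCurrentUser.getCredentials.getAllTokens.asScala
      .foreach(t => println(s"token kind=${t.getKind} service=${t.getService}"))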

The first problem is that it's not explicitly documented, and in case of failure the errors are hidden by default (i.e. most people don't connect to HBase with Kerberos, so it's usually pointless to state that the HBase JARs are not in the CLASSPATH and that no HBase token was created... usually).
To log all details about these tokens, you have to set the log level for org.apache.spark.deploy.yarn.Client to DEBUG.
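
If editing log4j.properties is not convenient, one possible alternative (a sketch assuming the log4j 1.x API bundled with Spark 2.x, and yarn-client mode, where the YARN Client runs inside your driver JVM) is to raise the level programmatically before creating the SparkContext:

    import org.apache.log4j.{Level, Logger}

    // Log the token-acquisition steps performed by the YARN client
    Logger.getLogger("org.apache.spark.deploy.yarn.Client").setLevel(Level.DEBUG)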

The second problem is that, beyond the properties, Spark supports many env variables, some documented, some undocumented, and some actually deprecated.
For instance, SPARK_CLASSPATH is now deprecated, and its content is actually injected into the Spark properties spark.driver.extraClassPath / spark.executor.extraClassPath.
But SPARK_DIST_CLASSPATH is still in use; in the Cloudera distro, for example, it is used to inject the core Hadoop libs & config into the Spark "launcher" so that it can bootstrap a YARN-cluster execution before the driver is started (i.e. before spark.driver.extraClassPath is evaluated).
Other variables of interest are:

• HADOOP_CONF_DIR
• SPARK_CONF_DIR
• SPARK_EXTRA_LIB_PATH
• SPARK_SUBMIT_OPTS
• SPARK_PRINT_LAUNCH_COMMAND

The third problem is that, in some specific cases (e.g. YARN-cluster mode in the Cloudera distro), the Spark property spark.yarn.tokens.hbase.enabled is silently set to false -- which makes absolutely no sense, since that default is hard-coded to true in the Spark source code...!
So you are advised to force it explicitly to true in your job config.
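
For example, a minimal sketch of forcing it from the job itself (note that in yarn-cluster mode the property must instead be supplied at submit time, e.g. with --conf as in the scripts below, because the token is fetched by the launcher before your code runs):

    import org.apache.spark.SparkConf

    // Force HBase token acquisition even if the distro's defaults turned it off
    val sparkConf = new SparkConf()
      .setAppName("TestSparkHBase")
      .set("spark.yarn.tokens.hbase.enabled", "true")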

The fourth problem is that, even if the HBase token has been created at startup, the executors must explicitly use it to authenticate. Fortunately, Cloudera has contributed a "Spark connector" to HBase that takes care of this kind of nasty stuff automatically. It's now part of the HBase client by default (cf. hbase-spark*.jar).
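
A minimal sketch of wiring it up (class and package names as in the hbase-spark module; the SparkContext setup, table name "t1" and column family are placeholders):

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.Put
    import org.apache.hadoop.hbase.spark.HBaseContext
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("TestSparkHBase"))
    val hbaseConf = HBaseConfiguration.create()  // reads hbase-site.xml from the driver CLASSPATH

    // HBaseContext pushes the HBase config (and the auth token) to the executors
    val hbaseContext = new HBaseContext(sc, hbaseConf)

    // Example write path: one Put per RDD element, no explicit Kerberos login in executor code
    val rdd = sc.parallelize(Seq("row1", "row2"))
    hbaseContext.bulkPut[String](rdd, TableName.valueOf("t1"),
      (rowKey: String) => new Put(Bytes.toBytes(rowKey))
        .addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v")))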

The fifth problem is that, AFAIK, if you don't have metrics-core*.jar in the CLASSPATH, then the HBase connections will fail with puzzling (and unrelated) ZooKeeper errors.


¤¤¤¤¤ How to make that stuff work, with debug traces

    # we assume that spark-env.sh and spark-defaults.conf are already Hadoop-ready,
    # and also *almost* HBase-ready (as in a CDH distro);
    # especially HADOOP_CONF_DIR and SPARK_DIST_CLASSPATH are expected to be set
    # but spark.*.extraClassPath / .extraJavaOptions are expected to be unset
    
    KRB_DEBUG_OPTS="-Dlog4j.logger.org.apache.spark.deploy.yarn.Client=DEBUG -Dlog4j.logger.org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper=DEBUG -Dlog4j.logger.org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation=DEBUG -Dlog4j.logger.org.apache.hadoop.hbase.spark.HBaseContext=DEBUG -Dsun.security.krb5.debug=true -Djava.security.debug=gssloginconfig,configfile,configparser,logincontext"
    EXTRA_HBASE_CP=/etc/hbase/conf/:/opt/cloudera/parcels/CDH/lib/hbase/hbase-spark.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/metrics-core-2.2.0.jar
    
    export SPARK_SUBMIT_OPTS="$KRB_DEBUG_OPTS"
    export HADOOP_JAAS_DEBUG=true
    export SPARK_PRINT_LAUNCH_COMMAND=True
    
    spark-submit --master yarn-client \
      --files "/etc/spark/conf/log4j.properties#yarn-log4j.properties" \
      --principal XX@Z.NET --keytab /a/b/XX.keytab \
      --conf spark.yarn.tokens.hbase.enabled=true \
      --conf spark.driver.extraClassPath=$EXTRA_HBASE_CP \
      --conf spark.executor.extraClassPath=$EXTRA_HBASE_CP \
      --conf "spark.executor.extraJavaOptions=$KRB_DEBUG_OPTS -Dlog4j.configuration=yarn-log4j.properties" \
      --conf spark.executorEnv.HADOOP_JAAS_DEBUG=true \
      --class TestSparkHBase  TestSparkHBase.jar
    
    spark-submit --master yarn-cluster --conf spark.yarn.report.interval=4000 \
      --files "/etc/spark/conf/log4j.properties#yarn-log4j.properties" \
      --principal XX@Z.NET --keytab /a/b/XX.keytab \
      --conf spark.yarn.tokens.hbase.enabled=true \
      --conf spark.driver.extraClassPath=$EXTRA_HBASE_CP \
      --conf "spark.driver.extraJavaOptions=$KRB_DEBUG_OPTS -Dlog4j.configuration=yarn-log4j.properties" \
      --conf spark.driverEnv.HADOOP_JAAS_DEBUG=true \
      --conf spark.executor.extraClassPath=$EXTRA_HBASE_CP \
      --conf "spark.executor.extraJavaOptions=$KRB_DEBUG_OPTS -Dlog4j.configuration=yarn-log4j.properties" \
      --conf spark.executorEnv.HADOOP_JAAS_DEBUG=true \
      --class TestSparkHBase  TestSparkHBase.jar
    

PS: when using an HBaseContext, you don't need /etc/hbase/conf/ in the executors' CLASSPATH; the conf is propagated automatically.

PPS: I advise you to set log4j.logger.org.apache.zookeeper.ZooKeeper=WARN in log4j.properties, because it's verbose, useless, and even confusing (all the interesting stuff is logged at the HBase level).

PPPS: instead of that verbose SPARK_SUBMIT_OPTS var, you could also list the Log4J options statically in $SPARK_CONF_DIR/log4j.properties and the rest in $SPARK_CONF_DIR/java-opts; the same goes for the Spark properties in $SPARK_CONF_DIR/spark-defaults.conf and the env variables in $SPARK_CONF_DIR/spark-env.sh.


¤¤¤¤¤ About the "Spark connector" to HBase

Excerpt from the official HBase documentation, chapter 83, Basic Spark:

    At the root of all Spark and HBase integration is the HBaseContext. The HBaseContext takes in HBase configurations and pushes them to the Spark executors. This allows us to have an HBase Connection per Spark Executor in a static location.

What is not mentioned in the doc is that the HBaseContext automatically uses the HBase "auth token" (when present) to authenticate the executors.

Note also that the doc has an example (in Scala, then in Java) of a Spark foreachPartition operation on an RDD, using a BufferedMutator for async bulk load into HBase.
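
Roughly along these lines (a condensed sketch of that documented pattern rather than a verbatim copy, reusing the hbaseContext and rdd from the earlier sketch; "t1" and the column family are again placeholders):

    import org.apache.hadoop.hbase.TableName
    import org.apache.hadoop.hbase.client.{Connection, Put}
    import org.apache.hadoop.hbase.util.Bytes

    // HBaseContext hands each partition a ready, token-authenticated Connection
    hbaseContext.foreachPartition(rdd, (it: Iterator[String], connection: Connection) => {
      val mutator = connection.getBufferedMutator(TableName.valueOf("t1"))
      it.foreach { rowKey =>
        mutator.mutate(new Put(Bytes.toBytes(rowKey))
          .addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v")))
      }
      mutator.flush()
      mutator.close()
    })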
