Which HBase connector for Spark 2.0 should I use?


Question

Our stack is composed of Google Dataproc (Spark 2.0) and Google BigTable (HBase 1.2.0), and I am looking for a connector that works with these versions.

It is not clear to me which of the connectors I have found support Spark 2.0 and the new DataSet API:


• spark-hbase: https://github.com/apache/hbase/tree/master/hbase-spark

• spark-hbase-connector: https://github.com/nerdammer/spark-hbase-connector

• hortonworks-spark/shc: https://github.com/hortonworks-spark/shc



The project is written in Scala 2.11 and built with SBT.



Thanks for your help.

Solution

Update: SHC now seems to work with Spark 2 and the Table API. See https://github.com/GoogleCloudPlatform/cloud-bigtable-examples/tree/master/scala/bigtable-shc
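For reference, a read through SHC looks roughly like the sketch below, based on SHC's documented catalog-based data source. The table name table1 and the column mappings are hypothetical placeholders, not something from the question:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

    object ShcReadExample {
      // Hypothetical catalog: maps the HBase table "table1" onto DataFrame
      // columns. The row key becomes column "id"; cf1:col1 becomes "value".
      val catalog: String =
        """{
          |"table":{"namespace":"default", "name":"table1"},
          |"rowkey":"key",
          |"columns":{
          |  "id":{"cf":"rowkey", "col":"key", "type":"string"},
          |  "value":{"cf":"cf1", "col":"col1", "type":"string"}
          |}
          |}""".stripMargin

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("shc-read").getOrCreate()

        // SHC registers itself as a Spark SQL data source; the catalog option
        // tells it how to map HBase cells onto DataFrame columns.
        val df = spark.read
          .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
          .format("org.apache.spark.sql.execution.datasources.hbase")
          .load()

        df.show()
      }
    }

Because SHC plugs into the data source API, Spark can push column pruning and simple filters down to HBase rather than scanning whole rows back.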



Original answer:

I don't believe any of these (or any other existing connector) will do everything you want today.

• spark-hbase will probably be the right solution when it is released (HBase 1.4?), but it currently only builds at head and Spark 2 support is still a work in progress.

• spark-hbase-connector only seems to support the RDD APIs, but since those are more stable, it might be somewhat helpful.

• hortonworks-spark/shc probably won't work, because I believe it only supports Spark 1 and uses the older HTable APIs, which do not work with BigTable.


I would recommend just using the HBase MapReduce APIs with RDD methods like newAPIHadoopRDD (or possibly the spark-hbase-connector?), and then manually converting the RDDs into DataSets. This approach is a lot easier in Scala or Java than in Python.
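A minimal sketch of that approach, assuming a table named my-table with a column family cf1 and qualifier col1 (all placeholder names). Against BigTable, the Configuration would instead be set up through the bigtable-hbase client rather than a plain HBaseConfiguration:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.sql.SparkSession

    object HBaseRddToDataset {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("hbase-rdd").getOrCreate()
        import spark.implicits._

        // Standard HBase MapReduce input configuration; "my-table" is a
        // placeholder. For BigTable this would come from the bigtable-hbase
        // client configuration instead.
        val conf = HBaseConfiguration.create()
        conf.set(TableInputFormat.INPUT_TABLE, "my-table")

        // Read (rowkey, Result) pairs through the MapReduce InputFormat.
        val hbaseRdd = spark.sparkContext.newAPIHadoopRDD(
          conf,
          classOf[TableInputFormat],
          classOf[ImmutableBytesWritable],
          classOf[Result])

        // Manually convert the RDD into a typed Dataset, extracting plain
        // strings inside the map so no non-serializable HBase types escape.
        val ds = hbaseRdd.map { case (key, result) =>
          val rowKey = Bytes.toString(key.get())
          val value  = Bytes.toString(
            result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("col1")))
          (rowKey, value)
        }.toDS()

        ds.show()
      }
    }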



This is an area that the HBase community is working to improve, and Google Cloud Dataproc will incorporate those improvements as they happen.



