How to connect HBase and Spark using Python?


Problem description



I have an embarrassingly parallel task for which I use Spark to distribute the computations. These computations are in Python, and I use PySpark to read and preprocess the data. The input data to my task is stored in HBase. Unfortunately, I've yet to find a satisfactory (i.e., easy to use and scalable) way to read/write HBase data from/to Spark using Python.

What I've explored previously:

  • Connecting from within my Python processes using happybase. This package allows connecting to HBase from Python by using HBase's Thrift API. This way, I basically skip Spark for data reading/writing and am missing out on potential HBase-Spark optimizations. Read speeds seem reasonably fast, but write speeds are slow. This is currently my best solution (see the sketch after this list).

  • Using SparkContext's newAPIHadoopRDD and saveAsNewAPIHadoopDataset, which make use of HBase's MapReduce interface. Examples for this were once included in the Spark code base (see here). However, these are now considered outdated in favor of HBase's Spark bindings (see here). I've also found this method to be slow and cumbersome for reading (writing worked well); for example, the strings returned from newAPIHadoopRDD had to be parsed and transformed in various ways to end up with the Python objects I wanted. The read side is also sketched below.
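
For illustration, here is roughly what the happybase approach looks like (a minimal sketch; the Thrift host, table name, and column family are placeholders, not taken from my actual setup):

import happybase

# Placeholders: point this at your Thrift server, table and column family.
connection = happybase.Connection('thrift-server-host')
table = connection.table('testtable')

# Reading: scan yields (row_key, {b'cf:qualifier': b'value'}) pairs;
# decoding the raw bytes into Python objects is left entirely to you.
rows = list(table.scan(limit=10))

# Writing: batched puts help somewhat, but writes remain the slow part.
with table.batch(batch_size=1000) as batch:
    batch.put(b'row-1', {b'cf:col1': b'1.0'})

And a sketch of the read side of the newAPIHadoopRDD approach, roughly as it appeared in the old Spark examples (assumptions: an HBase table 'testtable', a reachable ZooKeeper quorum, and the HBase jars plus the Spark examples jar, which provides the converter classes, on the classpath):

from pyspark import SparkContext

sc = SparkContext()

conf = {
    'hbase.zookeeper.quorum': 'zookeeper-host',  # placeholder
    'hbase.mapreduce.inputtable': 'testtable',   # placeholder
}
key_conv = 'org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter'
value_conv = 'org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter'

hbase_rdd = sc.newAPIHadoopRDD(
    'org.apache.hadoop.hbase.mapreduce.TableInputFormat',
    'org.apache.hadoop.hbase.io.ImmutableBytesWritable',
    'org.apache.hadoop.hbase.client.Result',
    keyConverter=key_conv,
    valueConverter=value_conv,
    conf=conf)
# The values come back as strings that still have to be parsed and
# converted into the Python objects you actually want.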

Alternatives that I'm aware of:

  • I'm currently using Cloudera's CDH and version 5.7.0 offers hbase-spark (CDH release notes, and a detailed blog post). This module (formerly known as SparkOnHBase) will officially be a part of HBase 2.0. Unfortunately, this wonderful solution seems to work only with Scala/Java.

  • Huawei's Spark-SQL-on-HBase / Astro (I don't see a difference between the two...). It does not look as robust and well-supported as I'd like my solution to be.

Solution

I found this comment by one of the makers of hbase-spark, which seems to suggest there is a way to use PySpark to query HBase using Spark SQL.

And indeed, the pattern described here can be applied to query HBase with Spark SQL using PySpark, as the following example shows:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlc = SQLContext(sc)

data_source_format = 'org.apache.hadoop.hbase.spark'

df = sc.parallelize([('a', '1.0'), ('b', '2.0')]).toDF(schema=['col0', 'col1'])

# ''.join(string.split()) in order to write a multi-line JSON string here.
catalog = ''.join("""{
    "table":{"namespace":"default", "name":"testtable"},
    "rowkey":"key",
    "columns":{
        "col0":{"cf":"rowkey", "col":"key", "type":"string"},
        "col1":{"cf":"cf", "col":"col1", "type":"string"}
    }
}""".split())


# Writing (.options(catalog=catalog) can also be written as .option('catalog', catalog))
df.write\
.options(catalog=catalog)\
.format(data_source_format)\
.save()

# Reading
df = sqlc.read\
.options(catalog=catalog)\
.format(data_source_format)\
.load()

I've tried hbase-spark-1.2.0-cdh5.7.0.jar (as distributed by Cloudera) for this, but ran into trouble (org.apache.hadoop.hbase.spark.DefaultSource does not allow create table as select when writing, and java.util.NoSuchElementException: None.get when reading). As it turns out, the present version of CDH does not include the changes to hbase-spark that allow Spark SQL-HBase integration.

What does work for me is the shc Spark package, found here. The only change I had to make to the above script was to set:

data_source_format = 'org.apache.spark.sql.execution.datasources.hbase'

Here's how I submit the above script on my CDH cluster, following the example from the shc README:

spark-submit --packages com.hortonworks:shc:1.0.0-1.6-s_2.10 --repositories http://repo.hortonworks.com/content/groups/public/ --files /opt/cloudera/parcels/CDH/lib/hbase/conf/hbase-site.xml example.py

Most of the work on shc seems to already be merged into the hbase-spark module of HBase, for release in version 2.0. With that, Spark SQL querying of HBase is possible using the above-mentioned pattern (see: https://hbase.apache.org/book.html#_sparksql_dataframes for details). My example above shows what it looks like for PySpark users.

Finally, a caveat: my example data above has only strings. Python data conversion is not supported by shc, so I ran into problems with integers and floats either not showing up in HBase at all or showing up as strange values.
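
One workaround sketch (plain Spark SQL, not an shc feature; df here refers to the example DataFrame above): cast every column to a string before writing, so HBase only ever receives string-encoded values, and cast back after reading as needed.

from pyspark.sql.functions import col

# Cast all columns to strings before handing the DataFrame to shc.
df_str = df.select([col(c).cast('string').alias(c) for c in df.columns])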
