RDD has only the first column value: HBase, PySpark

Problem Description

We are reading an HBase table with PySpark using the following commands.

from pyspark.sql.types import *
host=<Host Name>
port=<Port Number>

keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"

cmdata_conf = {"hbase.zookeeper.property.clientPort": port,
               "hbase.zookeeper.quorum": host,
               "hbase.mapreduce.inputtable": "CMData",
               "hbase.mapreduce.scan.columns": "info:Tenure info:Age"}

cmdata_rdd = sc.newAPIHadoopRDD("org.apache.hadoop.hbase.mapreduce.TableInputFormat",
                                "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
                                "org.apache.hadoop.hbase.client.Result",
                                keyConverter=keyConv,
                                valueConverter=valueConv,
                                conf=cmdata_conf)

output = cmdata_rdd.collect()

output

I am getting the result below (Key and Age):

[(u'123', u'5'), (u'234', u'4'), (u'345', u'3'), (u'456', u'4'), (u'567', u'7'), (u'678', u'7'), (u'789', u'8')]

Instead, I expect Key, Tenure, and Age. If I have only the Tenure column, then it returns Key and Tenure. But if I add more columns, the result always contains only the Key and Age columns.

Can anyone help us solve this?

Note: We are new to these tools.

Thank you in advance.

Solution
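
A likely cause first: in older Spark releases, the bundled HBaseResultToStringConverter stringifies only the first cell of each Result (it calls HBase's Result.value()), and since cells within a row are sorted, "Age" comes before "Tenure", so Age is the only value you get back. Newer Spark releases ship a converter that emits one JSON document per cell; on such a build you could recover every column along these lines (a sketch reusing cmdata_rdd from the question; verify the converter's behavior against your Spark version):

import json

# A sketch, assuming a Spark build whose HBaseResultToStringConverter emits
# one newline-separated JSON document per cell (verify for your version).
parsed_rdd = cmdata_rdd.flatMapValues(lambda v: v.split("\n")) \
                       .mapValues(json.loads)
parsed_rdd.collect()  # one (key, cell-dict) pair per column value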

If you're prototyping and don't want to update your cluster, it can be useful to have a look at happybase (https://happybase.readthedocs.org/en/latest/).

The following code does the trick to get my small (9Gig) HBase table 'name_Hbase_Table' from my cluster in under a second.

import happybase

connection = happybase.Connection(host='your.ip.cluster')  # don't specify :port
table = connection.table('name_Hbase_Table')

def hbaseAccelerationParser(table):  # UDF to format the data
    finalTable = []
    for key, data in table.scan():  # don't need the key in my case
        line = []
        for values in data.itervalues():  # Python 2; use data.values() on Python 3
            line.append(values)
        finalTable.append(line)
    return finalTable

parsed = hbaseAccelerationParser(table)  # capture data in desired format
rdd = sc.parallelize(parsed, 4)  # put it in an RDD
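
If you only need a few columns, happybase's scan() also accepts a columns argument, so you can pull just the fields from the question instead of every cell. A small sketch, with the column names assumed from the question's schema:

# Restrict the scan to the question's two columns
# ('info:Tenure' and 'info:Age' are assumed from the question).
rows = []
for key, data in table.scan(columns=['info:Tenure', 'info:Age']):
    rows.append((key, data.get('info:Tenure'), data.get('info:Age')))
rows_rdd = sc.parallelize(rows, 4)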
