RDD is having only first column value: HBase, PySpark

Problem Description
We are reading an HBase table with PySpark using the following commands.
from pyspark.sql.types import *
host=<Host Name>
port=<Port Number>
keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"
cmdata_conf = {
    "hbase.zookeeper.property.clientPort": port,
    "hbase.zookeeper.quorum": host,
    "hbase.mapreduce.inputtable": "CMData",
    "hbase.mapreduce.scan.columns": "info:Tenure info:Age"
}
cmdata_rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter=keyConv,
    valueConverter=valueConv,
    conf=cmdata_conf
)
output = cmdata_rdd.collect()
output
I am getting the result below (Key and Age only):
[(u'123', u'5'), (u'234', u'4'), (u'345', u'3'), (u'456', u'4'), (u'567', u'7'), (u'678', u'7'), (u'789', u'8')]
Instead, I am expecting Key, Tenure and Age. If I scan only the Tenure column, it returns Key and Tenure. But if I add more columns, the result always contains only the Key and Age columns.
Can anyone help us solve this?
Note: We are new to these tools.
Thank you in advance.
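For reference, a workaround often suggested for this symptom (assuming a Spark build whose HBaseResultToStringConverter emits one JSON string per cell, joined by newlines) is to split each row's value string back into individual cells, e.g. with `cmdata_rdd.flatMapValues(lambda v: v.split("\n"))`. The helper below is a hypothetical sketch of that parsing step in plain Python; the JSON field names `qualifier` and `value` are assumptions about the converter's output format, not something confirmed by this question:

```python
import json

def split_cells(value_string):
    """Split a newline-separated string of per-cell JSON objects into
    (qualifier, value) pairs. Assumes each cell is serialized as a JSON
    object carrying 'qualifier' and 'value' fields."""
    cells = []
    for cell_json in value_string.split("\n"):
        cell = json.loads(cell_json)
        cells.append((cell["qualifier"], cell["value"]))
    return cells

# Example: one row whose value string holds two cells (Tenure and Age)
row_value = ('{"qualifier": "Tenure", "value": "10"}\n'
             '{"qualifier": "Age", "value": "5"}')
```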
If you're prototyping and don't want to update your cluster, it can be useful to have a look at happybase (https://happybase.readthedocs.org/en/latest/).
The following code does the trick to get my small (9Gig) Hbase table 'name_Hbase_Table' from my cluster in under a second.
import happybase
connection = happybase.Connection(host='your.ip.cluster')  # don't specify :port
table = connection.table('name_Hbase_Table')

def hbaseAccelerationParser(table):  # UDF to format the scanned data
    finalTable = []
    for key, data in table.scan():  # the key is not needed in my case
        line = []
        for values in data.itervalues():  # Python 2; use data.values() on Python 3
            line.append(values)
        finalTable.append(line)
    return finalTable

table = hbaseAccelerationParser(table)  # capture data in the desired format
table = sc.parallelize(table, 4)  # put it in an RDD
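To illustrate the reshaping that hbaseAccelerationParser performs, here is a self-contained sketch. The FakeTable class is a hypothetical stand-in with just enough behavior to mimic happybase's `Table.scan()`, which yields `(row_key, {column: value})` pairs:

```python
class FakeTable:
    """Hypothetical stand-in mimicking happybase.Table.scan() output."""
    def scan(self):
        yield (b'row1', {b'info:Tenure': b'10', b'info:Age': b'5'})
        yield (b'row2', {b'info:Tenure': b'20', b'info:Age': b'7'})

def hbase_rows_to_lists(table):
    """Same reshaping as hbaseAccelerationParser: drop the row key,
    keep only the cell values of each row as a list."""
    final_table = []
    for key, data in table.scan():
        final_table.append(list(data.values()))
    return final_table

rows = hbase_rows_to_lists(FakeTable())
```

On Python 3.7+ dicts preserve insertion order, so each inner list keeps the Tenure value before the Age value; the result can then be handed to `sc.parallelize` as in the answer above.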