Apache Pig: Dynamic columns

Question

I have a dataset (CSV) with three value columns (v1, v2 and v3), each holding one value. The description of each value is stored as a comma-separated string in the column 'keys'.

| v1 | v2 | v3 | keys  |
| A  | C  | E  | X,Y,Z |

Using Pig, I would like to load this information into an HBase table where the column family is C and the column qualifier is the key.

| C:X | C:Y | C:Z |
| A   | C   | E   |

Has anyone done this before and would be willing to share how?

Another option is to store a map (key#value) in an HBase column, but I'm not sure whether that is flexible enough for querying the data.

Recommended answer

I found a solution to my problem:

test.pig:

REGISTER 'data.py' USING jython AS myfuncs;

A = LOAD 'data' USING PigStorage('|') AS (
    id:chararray,
    date:chararray,
    v1:chararray,
    v2:chararray,
    v3:chararray,
    keys:chararray
);

B = FOREACH A GENERATE
    id,
    date,
    myfuncs.dataToMap(STRSPLIT(keys, ','), TOTUPLE(v1, v2, v3)) AS kv;

STORE B INTO 'pig_table' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage( 'e:date kv:*' );

data.py:

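import org.apache.pig.data.DataType as DataType
import org.apache.pig.impl.logicalLayer.schema.SchemaUtil as SchemaUtil

# Pair each key from the 'keys' column with the value in the same position.
@outputSchema("ud:map[]")
def dataToMap(keys, values):
    result = dict()
    keys = list(keys)
    values = list(values)

    # Drop empty (None) values before pairing.
    try:
        while True:
            values.remove(None)
    except ValueError:
        pass

    # Pair by position; stop at the shorter list to avoid an IndexError.
    for idx in range(min(len(keys), len(values))):
        result[keys[idx]] = values[idx]

    return result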
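
To check the pairing logic outside Pig, here is a minimal plain-Python sketch applied to the sample row from the question. The names data_to_map and to_hbase_cells are hypothetical helpers and not part of the original answer. The "family:qualifier" layout is an assumption about how HBaseStorage expands a map stored under 'kv:*', so verify it against your Pig and HBase versions.

```python
def data_to_map(keys_csv, values):
    """Pair comma-separated keys with values by position, skipping None values."""
    keys = keys_csv.split(",")
    present = [v for v in values if v is not None]
    # zip stops at the shorter sequence, so a missing value cannot raise an error.
    return dict(zip(keys, present))


def to_hbase_cells(family, kv):
    """Render each map entry as a 'family:qualifier' -> value pair (assumed layout)."""
    return {"%s:%s" % (family, key): value for key, value in kv.items()}


# Sample row from the question: v1=A, v2=C, v3=E, keys=X,Y,Z
row = data_to_map("X,Y,Z", ("A", "C", "E"))
cells = to_hbase_cells("kv", row)
print(row)
print(cells)
```

Because None values are dropped before pairing, a row with a gap in the middle (for example v2 missing) shifts the later values onto earlier keys. If the keys string always lists a key for every column, it may be safer to pair first and filter afterwards.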
