将键值元组包转换为在 Apache Pig 中映射 [英] Transform bag of key-value tuples to map in Apache Pig

查看:20
本文介绍了将键值元组包转换为在 Apache Pig 中映射的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我是 Pig 的新手,我想将一包元组转换为一个映射,每个元组中的特定值作为键.基本上我想改变:

I am new to Pig and I want to convert a bag of tuples to a map with specific value in each tuple as key. Basically I want to change:

{(id1, value1),(id2, value2), ...} 变成 [id1#value1, id2#value2]

我在网上找了一段时间,但似乎找不到解决方案.我试过了:

I've been looking around online for a while, but I can't seem to find a solution. I've tried:

bigQMap = FOREACH bigQFields GENERATE TOMAP(queryId, queryStart);

但我最终得到了一袋地图(例如 {[id1#value1], [id2#value2], ...}),这不是我想要的.如何从一袋键值元组中构建映射?

but I end up with a bag of maps (e.g. {[id1#value1], [id2#value2], ...}), which is not what I want. How can I build up a map out of a bag of key-value tuple?

以下是我尝试运行的特定脚本,以防万一

Below is the specific script I'm trying to run, in case it's relevant

rawlines = LOAD '...' USING PigStorage('`');
bigQFields = FOREACH bigQLogs GENERATE GFV(*,'queryId')
   as queryId, GFV(*, 'queryStart')
   as queryStart;
bigQMap = ?? how to make a map with queryId as key and queryStart as value ?? ;

推荐答案

TOMAP 获取一系列对并将它们转换为映射,因此它的使用方式如下:

TOMAP takes a series of pairs and converts them into the map, so it is meant to be used like:

-- Schema: A:{foo:chararray, bar:int, bing:chararray, bang:int}
-- Data:     (John,          27,      Joe,            30)
B = FOREACH A GENERATE TOMAP(foo, bar, bing, bang) AS m ;
-- Schema: B:{m: map[]}
-- Data:     (John#27,Joe#30)

如您所见,语法不支持将包转换为地图.据我所知,无法以纯猪映射的格式转换包.但是,您可以明确地编写一个 java UDF这样做.

So as you can see the syntax does not support converting a bag to a map. As far as I know there is no way to convert a bag in the format you have to map in pure pig. However, you can definitively write a java UDF to do this.

注意:我对 Java 没有太多经验,因此可以轻松改进此 UDF(添加异常处理,如果键添加两次会发生什么等).但是,它确实可以完成您需要的工作.

NOTE: I'm not too experienced with java, so this UDF can easily be improved on (adding exception handling, what happens if a key added twice etc.). However, it does accomplish what you need it to.

package myudfs;
import java.io.IOException;
import org.apache.pig.EvalFunc;

import java.util.Map;
import java.util.HashMap;
import java.util.Iterator;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.DataBag;

public class ConvertToMap extends EvalFunc<Map>
{
    public Map exec(Tuple input) throws IOException {
        DataBag values = (DataBag)input.get(0);
        Map<Object, Object> m = new HashMap<Object, Object>();
        for (Iterator<Tuple> it = values.iterator(); it.hasNext();) {
            Tuple t = it.next();
            m.put(t.get(0), t.get(1));
        }
        return m;
    }
}

一旦你将脚本编译成 jar,它就可以像这样使用:

Once you compile the script into a jar, it can be used like:

REGISTER myudfs.jar ;
-- A is loading some sample data I made
A = LOAD 'foo.in' AS (foo:{T:(id:chararray, value:chararray)}) ;
B = FOREACH A GENERATE myudfs.ConvertToMap(foo) AS bar;

foo.in 的内容:

{(open,apache),(apache,hadoop)}
{(foo,bar),(bar,foo),(open,what)}

来自B的输出:

([open#apache,apache#hadoop])
([bar#foo,open#what,foo#bar])

<小时>

另一种方法是使用 python 创建 UDF:

#!/usr/bin/python

@outputSchema("foo:map[]")
def BagtoMap(bag):
    d = {}
    for key, value in bag:
        d[key] = value
    return d

像这样使用:

Register 'myudfs.py' using jython as myfuncs;
-- A is still just loading some of my test data
A = LOAD 'foo.in' AS (foo:{T:(key:chararray, value:chararray)}) ;
B = FOREACH A GENERATE myfuncs.BagtoMap(foo) ;

并产生与 Java UDF 相同的输出.

And produces the same output as the Java UDF.

奖励:由于我不太喜欢地图,这里是一个解释功能如何的链接可以只用键值对复制地图的.由于您的键值对在一个包中,您需要在嵌套的 FOREACH 中执行类似地图的操作:

BONUS: Since I don't like maps very much, here is a link explaining how the functionality of a map can be replicated with just key value pairs. Since your key value pairs are in a bag, you'll need to do the map-like operations in a nested FOREACH:

-- A is a schema that contains kv_pairs, a bag in the form {(id, value)}
B = FOREACH A {
    temp = FOREACH kv_pairs GENERATE (key=='foo'?value:NULL) ;
    -- Output is like: ({(),(thevalue),(),()})

    -- MAX will pull the maximum value from the filtered bag, which is 
    -- value (the chararray) if the key matched. Otherwise it will return NULL.
    GENERATE MAX(temp) as kv_pairs_filtered ;
}

这篇关于将键值元组包转换为在 Apache Pig 中映射的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆