Transform bag of key-value tuples to map in Apache Pig


Problem description


I am new to Pig and I want to convert a bag of tuples to a map with specific value in each tuple as key. Basically I want to change:

{(id1, value1),(id2, value2), ...} into [id1#value1, id2#value2]

I've been looking around online for a while, but I can't seem to find a solution. I've tried:

bigQMap = FOREACH bigQFields GENERATE TOMAP(queryId, queryStart);

but I end up with a bag of maps (e.g. {[id1#value1], [id2#value2], ...}), which is not what I want. How can I build up a map out of a bag of key-value tuples?

Below is the specific script I'm trying to run, in case it's relevant:

rawlines = LOAD '...' USING PigStorage('`');
bigQFields = FOREACH bigQLogs GENERATE GFV(*,'queryId')
   as queryId, GFV(*, 'queryStart')
   as queryStart;
bigQMap = ?? how to make a map with queryId as key and queryStart as value ?? ;

Solution

TOMAP takes a series of key-value pairs and converts them into a map, so it is meant to be used like this:

-- Schema: A:{foo:chararray, bar:int, bing:chararray, bang:int}
-- Data:     (John,          27,      Joe,            30)
B = FOREACH A GENERATE TOMAP(foo, bar, bing, bang) AS m ;
-- Schema: B:{m: map[]}
-- Data:     (John#27,Joe#30)
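This is also why the attempt in the question produced a bag of maps: TOMAP is evaluated once per input tuple, so every record gets its own single-entry map rather than one combined map. A minimal sketch with the question's field names (the alias m is mine):

-- One single-entry map per record, not one map overall
bigQMap = FOREACH bigQFields GENERATE TOMAP(queryId, queryStart) AS m ;
-- Record 1 -> [id1#value1], record 2 -> [id2#value2], ...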

So as you can see, the syntax does not support converting a bag to a map. As far as I know there is no way to convert a bag in the format you have into a map in pure Pig. However, you can definitely write a Java UDF to do this.

NOTE: I'm not too experienced with Java, so this UDF can easily be improved on (adding exception handling, deciding what happens if a key is added twice, etc.). However, it does accomplish what you need it to.

package myudfs;

import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class ConvertToMap extends EvalFunc<Map>
{
    public Map exec(Tuple input) throws IOException {
        // The UDF's single argument is the bag of (key, value) tuples.
        DataBag values = (DataBag)input.get(0);
        Map<Object, Object> m = new HashMap<Object, Object>();
        // Each tuple's first field becomes the key and its second field the value.
        for (Iterator<Tuple> it = values.iterator(); it.hasNext();) {
            Tuple t = it.next();
            m.put(t.get(0), t.get(1));
        }
        return m;
    }
}

Once you compile the UDF into a jar, it can be used like this:

REGISTER myudfs.jar ;
-- A is loading some sample data I made
A = LOAD 'foo.in' AS (foo:{T:(id:chararray, value:chararray)}) ;
B = FOREACH A GENERATE myudfs.ConvertToMap(foo) AS bar;

Contents of foo.in:

{(open,apache),(apache,hadoop)}
{(foo,bar),(bar,foo),(open,what)}

Output from B:

([open#apache,apache#hadoop])
([bar#foo,open#what,foo#bar])
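To get from the question's per-record (queryId, queryStart) pairs to one map, the pairs first have to be collected into a single bag before the UDF is applied. A hedged sketch of how that might look (the GROUP ... ALL step and the aliases grouped and qmap are my additions, not part of the original script; note that GROUP ... ALL funnels everything through a single reducer):

-- Gather every (queryId, queryStart) record into one bag, then build a single map from it
grouped = GROUP bigQFields ALL ;
bigQMap = FOREACH grouped GENERATE myudfs.ConvertToMap(bigQFields) AS qmap ;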


Another approach is to use Python to create the UDF:

myudfs.py

#!/usr/bin/python

@outputSchema("foo:map[]")
def BagtoMap(bag):
    # Each element of the bag arrives as a (key, value) pair.
    d = {}
    for key, value in bag:
        d[key] = value
    return d

Which is used like this:

Register 'myudfs.py' using jython as myfuncs;
-- A is still just loading some of my test data
A = LOAD 'foo.in' AS (foo:{T:(key:chararray, value:chararray)}) ;
B = FOREACH A GENERATE myfuncs.BagtoMap(foo) ;

And produces the same output as the Java UDF.
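Once the map has been built (by either UDF), individual values can be read back with Pig's # dereference operator. A small sketch against the sample data above (the aliases M, C, kv and open_value are mine):

M = FOREACH A GENERATE myfuncs.BagtoMap(foo) AS kv ;
-- '#' looks up a single key in each record's map
C = FOREACH M GENERATE kv#'open' AS open_value ;
-- For the sample input this would give (apache) and (what)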


BONUS: Since I don't like maps very much, here is how the functionality of a map can be replicated with just key-value pairs. Since your key-value pairs are in a bag, you'll need to do the map-like operations in a nested FOREACH:

-- A is a relation whose schema contains kv_pairs, a bag in the form {(key, value)}
B = FOREACH A {
    temp = FOREACH kv_pairs GENERATE (key == 'foo' ? value : NULL) ;
    -- temp looks like: ({(),(thevalue),(),()})

    -- MAX will pull the maximum value from the filtered bag, which is
    -- the value (a chararray) if the key matched. Otherwise it will return NULL.
    GENERATE MAX(temp) AS kv_pairs_filtered ;
}
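If the NULL trick feels opaque, roughly the same thing can be written with a FILTER inside the nested block; a hedged alternative sketch under the same assumed schema (this variant is mine, not from the original answer):

B = FOREACH A {
    -- Keep only the tuples whose key matches, then collapse the bag of matching
    -- values to a single value; MAX returns NULL if nothing matched
    matching = FILTER kv_pairs BY key == 'foo' ;
    GENERATE MAX(matching.value) AS kv_pairs_filtered ;
}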
