Transform bag of key-value tuples to map in Apache Pig
I am new to Pig and I want to convert a bag of tuples to a map, using a specific value in each tuple as the key. Basically I want to change:
{(id1, value1),(id2, value2), ...}
into [id1#value1, id2#value2]
I've been looking around online for a while, but I can't seem to find a solution. I've tried:
bigQMap = FOREACH bigQFields GENERATE TOMAP(queryId, queryStart);
but I end up with a bag of maps (e.g. {[id1#value1], [id2#value2], ...}), which is not what I want. How can I build up a map out of a bag of key-value tuples?
Below is the specific script I'm trying to run, in case it's relevant:
rawlines = LOAD '...' USING PigStorage('`');
bigQFields = FOREACH bigQLogs GENERATE GFV(*,'queryId')
as queryId, GFV(*, 'queryStart')
as queryStart;
bigQMap = ?? how to make a map with queryId as key and queryStart as value ?? ;
TOMAP takes a series of key/value pairs as arguments and converts them into a map, so it is meant to be used like:
-- Schema: A:{foo:chararray, bar:int, bing:chararray, bang:int}
-- Data: (John, 27, Joe, 30)
B = FOREACH A GENERATE TOMAP(foo, bar, bing, bang) AS m ;
-- Schema: B:{m: map[]}
-- Data: (John#27,Joe#30)
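To make the row-by-row behavior concrete, here is a rough plain-Python model of what happens. The names `tomap` and `foreach_generate` are hypothetical stand-ins for illustration, not Pig APIs, and rejecting an odd number of arguments is this sketch's own choice.

```python
def tomap(*fields):
    """Model of TOMAP: consecutive (key, value) arguments become one map."""
    if len(fields) % 2 != 0:
        # Assumption of this sketch: an unpaired argument is an error.
        raise ValueError("expected an even number of arguments")
    return dict(zip(fields[0::2], fields[1::2]))


def foreach_generate(rows, func):
    """Model of FOREACH ... GENERATE: func is applied to each row on its own."""
    return [func(*row) for row in rows]


# One row with four fields yields one map with two entries.
wide = [("John", 27, "Joe", 30)]
one_map = foreach_generate(wide, tomap)

# Many rows with two fields each yield one single-entry map per row,
# which is the "bag of maps" result described in the question.
pairs = [("id1", "value1"), ("id2", "value2")]
many_maps = foreach_generate(pairs, tomap)
```

Because each call only sees the fields of a single row, nothing in this model ever merges entries across rows.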
So as you can see, the syntax does not support converting a bag to a map. As far as I know, there is no way to convert a bag in the format you have into a map in pure Pig. However, you can definitely write a Java UDF to do this.
NOTE: I'm not too experienced with Java, so this UDF can easily be improved on (adding exception handling, deciding what happens if a key is added twice, etc.). However, it does accomplish what you need it to.
package myudfs;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

// Converts a bag of (key, value) tuples into a single map.
public class ConvertToMap extends EvalFunc<Map>
{
    public Map exec(Tuple input) throws IOException {
        // The first (and only) argument is the bag of pairs.
        DataBag values = (DataBag) input.get(0);
        Map<Object, Object> m = new HashMap<Object, Object>();
        for (Tuple t : values) {
            // Field 0 is the key, field 1 is the value; a repeated key overwrites.
            m.put(t.get(0), t.get(1));
        }
        return m;
    }
}
Once you compile the class into a jar, it can be used like:
REGISTER myudfs.jar ;
-- A is loading some sample data I made
A = LOAD 'foo.in' AS (foo:{T:(id:chararray, value:chararray)}) ;
B = FOREACH A GENERATE myudfs.ConvertToMap(foo) AS bar;
Contents of foo.in:
{(open,apache),(apache,hadoop)}
{(foo,bar),(bar,foo),(open,what)}
Output from B:
([open#apache,apache#hadoop])
([bar#foo,open#what,foo#bar])
Another approach is to use Python to create the UDF:
myudfs.py
#!/usr/bin/python
@outputSchema("foo:map[]")
def BagtoMap(bag):
    d = {}
    for key, value in bag:
        d[key] = value
    return d
Which is used like this:
Register 'myudfs.py' using jython as myfuncs;
-- A is still just loading some of my test data
A = LOAD 'foo.in' AS (foo:{T:(key:chararray, value:chararray)}) ;
B = FOREACH A GENERATE myfuncs.BagtoMap(foo) ;
And produces the same output as the Java UDF.
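Both UDFs silently overwrite a value when a key appears twice. As a sketch of one way to make that policy explicit, the variant below keeps the first value seen for each key and skips null keys. The function name `bag_to_map_first_wins` and the first-wins rule are assumptions made for illustration. The fallback `outputSchema` is only a stand-in so the file can also be run outside Pig, since Pig's Jython runtime normally supplies that decorator.

```python
# Stand-in for running outside Pig; under Pig the real decorator is used.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap


@outputSchema("m:map[]")
def bag_to_map_first_wins(bag):
    """Hypothetical variant of BagtoMap: first value per key wins."""
    result = {}
    if bag is None:
        return result
    for key, value in bag:
        if key is None:
            # Skip pairs without a usable key.
            continue
        if key not in result:
            result[key] = value
    return result
```

Swapping the `if key not in result` check for an unconditional assignment gives back the last-wins behavior of the original `BagtoMap`.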
BONUS:
Since I don't like maps very much, here is a link explaining how the functionality of a map can be replicated with just key value pairs. Since your key value pairs are in a bag, you'll need to do the map-like operations in a nested FOREACH:
-- A is a relation whose schema contains kv_pairs, a bag in the form {(key, value)}
B = FOREACH A {
    temp = FOREACH kv_pairs GENERATE (key == 'foo' ? value : NULL);
    -- Output is like: ({(),(thevalue),(),()})
    -- MAX will pull the maximum value from the filtered bag, which is
    -- value (the chararray) if the key matched. Otherwise it will return NULL.
    GENERATE MAX(temp) AS kv_pairs_filtered;
}
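The same lookup can be modeled in plain Python to show why MAX recovers the value. This is only an illustrative sketch: `lookup` is a hypothetical helper, and it assumes, as the comments above describe, that null entries are ignored when taking the maximum.

```python
def lookup(kv_pairs, wanted):
    """Model of the nested FOREACH: project matches, then take the max."""
    # Step 1: keep the value where the key matches, otherwise None.
    temp = [value if key == wanted else None for key, value in kv_pairs]
    # Step 2: take the maximum of the non-null entries, or None if there are none.
    matches = [value for value in temp if value is not None]
    return max(matches) if matches else None


pairs = [("bar", "x"), ("foo", "thevalue"), ("open", "what")]
found = lookup(pairs, "foo")
missing = lookup(pairs, "nope")
```

If a key can occur more than once in the bag, this approach returns the largest of its values rather than any particular occurrence.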