Selecting random tuple from bag
Question
Is it possible to (efficiently) select a random tuple from a bag in Pig? I can just take the first result of a bag (as it is unordered), but in my case I need a proper random selection. One (inefficient) solution is to count the number of tuples in the bag, take a random number within that range, loop through the bag, and stop when the number of iterations matches my random number. Does anyone know of faster/better ways to do this?
Recommended answer
You could use RANDOM(), ORDER and LIMIT in a nested FOREACH statement to select one element with the smallest random number:
inpt = load 'group.txt' as (id:int, c1:bytearray, c2:bytearray);
groups = group inpt by id;
randoms = foreach groups {
    rnds = foreach inpt generate *, RANDOM() as rnd; -- assign a random number to each row in the bag
    ordered_rnds = order rnds by rnd;
    one_tuple = limit ordered_rnds 1; -- select the tuple with the smallest random number
    generate group as id, one_tuple;
};
dump randoms;
Input:
1 a r
1 a t
1 b r
1 b 4
1 e 4
1 h 4
1 k t
2 k k
2 j j
3 a r
3 e l
3 j l
4 a r
4 b t
4 b g
4 h b
4 j d
5 h k
Output:
(1,{(1,b,r,0.05172709255901231)})
(2,{(2,k,k,0.14351660053632986)})
(3,{(3,e,l,0.0854104195792681)})
(4,{(4,h,b,8.906013598960483E-4)})
(5,{(5,h,k,0.6219490873384448)})
If you run "dump randoms;" multiple times, you should get different results for each run.
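The idea behind the script can be mimicked outside Pig. The following is a minimal sketch in plain Python, not Pig code; `pick_by_random_key` and the sample bag are hypothetical names used only for illustration. Every tuple gets an independent random key and the tuple with the smallest key is returned, which mirrors the nested RANDOM(), ORDER and LIMIT steps. It assumes a non-empty bag.

```python
import random

def pick_by_random_key(bag, rng=random):
    """Tag every tuple with a random key and return the tuple with the smallest key."""
    keyed = [(rng.random(), t) for t in bag]  # like: generate *, RANDOM() as rnd
    keyed.sort(key=lambda pair: pair[0])      # like: order rnds by rnd
    return keyed[0][1]                        # like: limit ordered_rnds 1

# Group 3 from the input above (hypothetical in-memory copy of the bag).
bag = [(3, "a", "r"), (3, "e", "l"), (3, "j", "l")]
choice = pick_by_random_key(bag)
```

Since all keys are drawn independently from the same distribution, each tuple is equally likely to end up with the smallest one.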
Writing a UDF might give you better performance, as you would not need to do a secondary sort on the random value within the bag.
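One way such a UDF could avoid both the sort and the up-front count is reservoir sampling with a reservoir of size one. The sketch below is plain Python showing only the selection logic; it is not wired into Pig's UDF API, and `reservoir_pick` is a hypothetical name. It walks the bag once and replaces the current choice with probability 1/n at the n-th tuple, which gives every tuple the same chance of being returned.

```python
import random

def reservoir_pick(tuples, rng=random):
    """Return one uniformly random item from an iterable in a single pass.

    Returns None if the iterable is empty.
    """
    chosen = None
    for seen, item in enumerate(tuples, start=1):
        # Keep the new item with probability 1/seen; the first item is always kept.
        if rng.randrange(seen) == 0:
            chosen = item
    return chosen

# Group 2 from the input above (hypothetical in-memory copy of the bag).
sample = reservoir_pick([(2, "k", "k"), (2, "j", "j")])
```

The pass is linear in the size of the bag and needs no extra storage beyond the current choice, so it also works when the bag is only available as a stream.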