Generating bigram combinations from grouped data in Pig

Question

Given my input data in userid,itemid format:

raw: {userid: bytearray,itemid: bytearray}

dump raw;
(A,1)
(A,2)
(A,4)
(A,5)
(B,2)
(B,3)
(B,5)
(C,1)
(C,5)

grpd = GROUP raw BY userid;

dump grpd;

(A,{(A,1),(A,2),(A,4),(A,5)})
(B,{(B,2),(B,3),(B,5)})
(C,{(C,1),(C,5)})

I'd like to generate all combinations (order not important) of items within each group. I eventually intend to compute Jaccard similarity on the items in my group.

Ideally the bigrams would be generated and then I'd FLATTEN the output to look like:

(A, (1,2))
(A, (1,4))
(A, (1,5))
(A, (2,4))
(A, (2,5))
(A, (4,5))
(B, (2,3))
(B, (2,5))
(B, (3,5))
(C, (1,5))

The letters A, B, and C, which represent the userid, are not really necessary for the output; I'm just showing them for illustrative purposes. From there, I would count the number of occurrences of each bigram in order to compute Jaccard. I'd love to know if anyone else is using Pig for similar similarity calcs (sorry!) and has encountered this already.

I've looked at the NGramGenerator that's supplied with the Pig tutorials, but it doesn't really match what I'm trying to accomplish. I'm wondering if perhaps a Python streaming UDF is the way to go.

Recommended answer

You are definitely going to have to write a UDF (in Python or Java, either would be fine). You would want it to take a bag as input and output a bag: if you FLATTEN a bag of tuples, you get one output row per tuple, which gives you the output that you want.

The UDF itself would not be terribly difficult. Something like:

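# input_tuples: the bag for one group, e.g. [("A", 1), ("A", 2), ...]
letters, numbers = zip(*input_tuples)
numbers = sorted(set(numbers))  # drop duplicates and fix an order

res = []
for i in range(len(numbers)):
    for j in range(i + 1, len(numbers)):  # start at i + 1 to skip (x, x) pairs
        res.append((numbers[i], numbers[j]))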

and then just cast things and return them appropriately.
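
For reference, the same pairing logic can be written as a small standalone function with `itertools.combinations`. This is only a sketch: the function name `item_pairs` is made up for illustration, and it assumes the bag reaches Python as a list of (userid, itemid) tuples whose item ids are already cast to a comparable type.

```python
from itertools import combinations


def item_pairs(bag):
    """Return all unordered pairs of distinct item ids in one group's bag.

    `bag` is assumed to be a list of (userid, itemid) tuples.
    """
    # Deduplicate and sort so each pair appears once, smaller id first.
    items = sorted(set(itemid for _userid, itemid in bag))
    return list(combinations(items, 2))


group_a = [("A", 1), ("A", 2), ("A", 4), ("A", 5)]
print(item_pairs(group_a))
# [(1, 2), (1, 4), (1, 5), (2, 4), (2, 5), (4, 5)]
```

Pig's Python UDF support also expects an output schema to be declared for the function; that declaration is left out here so the sketch runs as plain Python.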

Writing a simple Python UDF is not too bad. If you need help, check here: http://pig.apache.org/docs/r0.8.0/udf.html
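
As for the counting step mentioned in the question, here is a rough in-memory sketch of how pair counts turn into Jaccard scores. It assumes Jaccard is taken between items over the sets of users who have them; the function name `jaccard_by_pair` and the `groups` dictionary are hypothetical stand-ins for the grouped Pig relation, not part of Pig itself.

```python
from collections import Counter
from itertools import combinations


def jaccard_by_pair(groups):
    """Compute item-item Jaccard similarity from per-user item lists.

    `groups` maps each userid to an iterable of itemids.
    """
    item_counts = Counter()  # number of users who have each item
    pair_counts = Counter()  # number of users who have both items of a pair
    for items in groups.values():
        unique_items = sorted(set(items))
        item_counts.update(unique_items)
        pair_counts.update(combinations(unique_items, 2))
    # Jaccard = intersection / union, where union = |a| + |b| - intersection.
    return {
        (a, b): float(both) / (item_counts[a] + item_counts[b] - both)
        for (a, b), both in pair_counts.items()
    }


groups = {"A": [1, 2, 4, 5], "B": [2, 3, 5], "C": [1, 5]}
scores = jaccard_by_pair(groups)
print(scores[(1, 5)])
```

Items 1 and 5 are shared by users A and C, while three users have item 5, so their score is 2 / (2 + 3 - 2). Pairs that never occur together, such as (1, 3), simply do not appear in the result.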

And of course, feel free to ask for more help here.
