Generating bigram combinations from grouped data in Pig


Problem description



Given my input data in userid,itemid format:

raw: {userid: bytearray,itemid: bytearray}

dump raw;
(A,1)
(A,2)
(A,4)
(A,5)
(B,2)
(B,3)
(B,5)
(C,1)
(C,5)

grpd = GROUP raw BY userid;

dump grpd;

(A,{(A,1),(A,2),(A,4),(A,5)})
(B,{(B,2),(B,3),(B,5)})
(C,{(C,1),(C,5)})

I'd like to generate all combinations (order not important) of the items within each group. I eventually intend to compute Jaccard similarity on the items in my groups.

Ideally the bigrams would be generated, and then I'd FLATTEN the output to look like:

(A, (1,2))
(A, (1,4))
(A, (1,5))
(A, (2,4))
(A, (2,5))
(A, (4,5))
(B, (2,3))
(B, (2,5))
(B, (3,5))
(C, (1,5))

The letters A, B, and C, which represent the userid, are not really necessary in the output; I'm just showing them for illustrative purposes. From there, I would count the number of occurrences of each bigram in order to compute Jaccard. I'd love to know if anyone else is using Pig for similar similarity calcs (sorry!) and has encountered this already.
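As a rough illustration of that counting step, here is a plain-Python sketch (run outside Pig) that goes from (userid, itemid) rows to item-item Jaccard scores. The function name `jaccard_from_rows` and the in-memory dictionaries are assumptions made for the example; in Pig the same counts would come from GROUP and COUNT over the flattened pairs.

```python
from collections import Counter
from itertools import combinations


def jaccard_from_rows(rows):
    """Item-item Jaccard similarity from (userid, itemid) rows (illustrative helper)."""
    # Collect the distinct items seen for each user.
    items_by_user = {}
    for userid, itemid in rows:
        items_by_user.setdefault(userid, set()).add(itemid)

    item_counts = Counter()  # number of users who have each item
    pair_counts = Counter()  # number of users who have both items of a pair
    for items in items_by_user.values():
        item_counts.update(items)
        pair_counts.update(combinations(sorted(items), 2))

    # Jaccard = |intersection| / |union|, where |union| = |a| + |b| - |intersection|.
    return {
        (a, b): float(both) / (item_counts[a] + item_counts[b] - both)
        for (a, b), both in pair_counts.items()
    }


if __name__ == "__main__":
    raw = [("A", 1), ("A", 2), ("A", 4), ("A", 5),
           ("B", 2), ("B", 3), ("B", 5),
           ("C", 1), ("C", 5)]
    for pair, score in sorted(jaccard_from_rows(raw).items()):
        print(pair, round(score, 3))
```

Pairs that never co-occur for any user simply do not appear in the result, which corresponds to a similarity of 0.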

I've looked at the NGramGenerator that's supplied with the Pig tutorials, but it doesn't really match what I'm trying to accomplish. I'm wondering if perhaps a Python streaming UDF is the way to go.

Solution

You are definitely going to have to write a UDF (in Python or Java; either would be fine). You would want it to work on a bag and then output a bag (if you flatten a bag of tuples, you will get output rows, so it will give you the output that you want).

The UDF itself would not be terribly difficult... something like

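# input_tuples is the bag for one group, e.g. [("A", 1), ("A", 2), ...]
letters, numbers = zip(*input_tuples)
numbers = sorted(set(numbers))  # drop duplicates and fix an order

res = []
for i in range(len(numbers)):
    for j in range(i + 1, len(numbers)):  # start at i + 1 so an item is never paired with itself
        res.append((numbers[i], numbers[j]))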

and then just cast things and return them appropriately.
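For what it's worth, here is a minimal self-contained sketch of the same idea using `itertools.combinations`, which yields each unordered pair once and never pairs an item with itself. The function name `item_pairs` and the list-of-tuples input are assumptions for illustration; registering it with Pig (output schema, REGISTER statement) is not shown and should follow the UDF documentation.

```python
from itertools import combinations


def item_pairs(bag):
    """Return all unordered pairs of distinct item ids in one group's bag.

    `bag` is assumed to be a list of (userid, itemid) tuples, mirroring
    how each group appears in the grouped relation.
    """
    # De-duplicate the item ids and sort them so each pair has a stable order.
    items = sorted(set(itemid for _userid, itemid in bag))
    # combinations(items, 2) produces every 2-element subset exactly once.
    return list(combinations(items, 2))


if __name__ == "__main__":
    group_a = [("A", 1), ("A", 2), ("A", 4), ("A", 5)]
    print(item_pairs(group_a))
    # [(1, 2), (1, 4), (1, 5), (2, 4), (2, 5), (4, 5)]
```

Note that `itertools.combinations` needs Python 2.6 or later, so under an older Jython runtime the explicit nested loops are the safer choice.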

Writing a simple Python UDF is not too bad; if you need help getting started, check here: http://pig.apache.org/docs/r0.8.0/udf.html

And of course, feel free to ask for more help here.
