PIG convert text lines to sparse vector


Question

I have two files that I need to combine using Apache PIG. The first file contains a list of book titles, like this, with each title on a line by itself:

Ted Dunning,  Mahout in Action
Leo Tolstoy,  War and Peace
Douglas Adams, The hitchhiker guide to the galaxy.
James Sununu,  galaxy III for Dummies
Tom McArthur,  The War we went to

The second file is a list of words and their IDs, like this:

ted, 12
tom, 13
douglas, 14
galaxy, 15
war, 16
leo, 17
peace, 18

I need to join these two files to produce output like this:

For the line 'Leo Tolstoy, War and Peace' it should produce

17:1,16:1,18:1

For the line 'Tom McArthur, The War we went to' it should produce

13:1,16:1

In other words, I need to perform the join using the word as a key. So far I've written the following code in Pig:

titles = LOAD 'Titles' AS (title:chararray);
termIDs = LOAD 'TermIDs' AS (term:chararray, id:int);

A = SAMPLE titles 0.01; -- work on a 1% sample while testing
X = FOREACH A GENERATE STRSPLIT(title, '[ _\\[\\]\\/,\\.\\(\\)]+');

This gets both files loaded, and X contains a list of bags, each bag containing the terms that occur on the corresponding line. Like this:

((ted,dunning,mahout,in,action))
((leo,tolstoy,war,and,peace))

Since it's late on a Saturday night, I can't figure out how to do the JOIN step without writing a UDF or using streaming. Is it even possible using only PIG primitives?

Answer

You can FLATTEN the results of TOKENIZE, so all of the bags become rows, and then you can join the X relation with termIDs:

X = foreach A generate title, flatten(TOKENIZE(LOWER(title))) as term; -- lowercase so terms match the all-lowercase word list
J = join X by term, termIDs by term;
G = group J by title;
Result = foreach G generate group as title, J.id; -- project the ids out of the grouped bag

The above code was typed on my mobile phone, so it has not been debugged.
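For reference, the flatten-join-group pipeline above can be sketched in plain Python (using the sample data from the question; this is an illustrative equivalent of the Pig steps, not Pig itself):

```python
# Sketch of the Pig pipeline: flatten each title into (title, term) rows,
# join on term against the word list, then group the matched ids by title.
import re

titles = [
    "Leo Tolstoy,  War and Peace",
    "Tom McArthur,  The War we went to",
]
term_ids = {"ted": 12, "tom": 13, "douglas": 14,
            "galaxy": 15, "war": 16, "leo": 17, "peace": 18}

# FLATTEN(TOKENIZE(...)): one (title, term) row per word,
# lowercased so it matches the all-lowercase word list
rows = [(t, w.lower())
        for t in titles
        for w in re.split(r"[ _\[\]/,.()]+", t) if w]

# JOIN by term, then GROUP by title
grouped = {}
for title, term in rows:
    if term in term_ids:
        grouped.setdefault(title, []).append(term_ids[term])

for title, ids in grouped.items():
    print(title, "->", ",".join(f"{i}:1" for i in ids))
```

Running this reproduces the sparse vectors from the question: `17:1,16:1,18:1` for the Tolstoy line and `13:1,16:1` for the McArthur line.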

Update 1:

For cases where it is preferable to use STRSPLIT instead of TOKENIZE, you can combine FLATTEN and TOBAG to achieve the same effect as TOKENIZE, i.e. getting a bag of words from the tuple returned by STRSPLIT:

SPLT = foreach A generate title, FLATTEN(STRSPLIT(title, '[ _\\[\\]\\/,\\.\\(\\)]+'));
X_tmp = foreach SPLT generate $0 as title, FLATTEN(TOBAG($1..$20)) as term; -- pivots each row into (title, term) pairs
X = filter X_tmp by term is not null; -- drops the extra rows when a title split into fewer than 20 terms
J = join X by term, termIDs by term using 'replicated';
G = group J by title;
Result = foreach G generate group as title, J.id;

If any title exceeds 20 terms, increase the number in the TOBAG.
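The TOBAG pivot can be pictured as padding each split result out to a fixed width with nulls, emitting one row per slot, and then filtering the nulls away. A minimal Python sketch of that idea (the width of 20 and the regex are taken from the answer above):

```python
# Sketch of the STRSPLIT + TOBAG($1..$20) pivot: pad each split to a fixed
# width with None (Pig's null), emit one (title, term) row per slot, then
# filter out the None rows, mirroring "filter X_tmp by term is not null".
import re

MAX_TERMS = 20  # increase if any title has more terms, as the answer notes

def pivot(title):
    terms = [w for w in re.split(r"[ _\[\]/,.()]+", title) if w]
    padded = terms + [None] * (MAX_TERMS - len(terms))   # TOBAG over $1..$20
    return [(title, t) for t in padded if t is not None]  # null filter

rows = pivot("Leo Tolstoy,  War and Peace")
print(rows)  # five (title, term) pairs, one per word
```

The null filter matters because every row is padded to the same width; without it, short titles would contribute spurious empty terms to the join.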
