Apache Pig: convert text lines to a sparse vector

Question

I have two files that I need to combine using Apache Pig. The first file contains a list of book titles, one title per line, like this:

Ted Dunning,  Mahout in Action
Leo Tolstoy,  War and Peace
Douglas Adams, The hitchhiker guide to the galaxy.
James Sununu,  galaxy III for Dummies
Tom McArthur,  The War we went to



The second file is a list of words and their IDs, like this:

ted, 12
tom, 13
douglas, 14
galaxy, 15
war, 16
leo, 17
peace, 18

I need to join these two files to produce output like this:

For the line 'Leo Tolstoy, War and Peace' it should produce

17:1,16:1,18:1

For the line 'Tom McArthur, The War we went to' it should produce

13:1,16:1

In other words, I need to perform the join using the word as the key. So far I've written the following code in Pig:

titles = LOAD 'Titles' AS ( title : chararray );  
termIDs = LOAD  'TermIDs' AS (  term:chararray,id:int);

A = SAMPLE titles 0.01;
X = FOREACH A GENERATE STRSPLIT(title,'[ _\\[\\]\\/,\\.\\(\\)]+');

This loads both files, and X contains one tuple per line (STRSPLIT returns a tuple), each holding the terms that occur on the corresponding line. Like this:

((ted,dunning,mahout,in,action))
((leo,tolstoy,war,and,peace))

Because it's late on a Saturday night, I can't figure out how to do the JOIN step without writing a UDF or using streaming. Is it even possible using only Pig primitives?

Recommended answer

You can FLATTEN the result of TOKENIZE, so that every term in the bag becomes its own row; you can then join the X relation with termIDs.

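-- one row per (title, term)
X = foreach A generate title, flatten(TOKENIZE(title)) as term;
J = join X by (term),  termIDs by (term);
G = group J by title;
-- G holds (group, J), so the ids are projected from the grouped bag J
Result = foreach G generate group as title, J.termIDs::id;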

The above code was typed on my mobile phone, so it has not been debugged.
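To make the dataflow easier to follow, here is a minimal plain-Python sketch of the same three steps (flatten to one term per row, inner join on the term, group by title) applied to the sample data from the question. It is a model of the logic, not Pig code. Lower-casing the title, the simplified delimiter set, and the `sparse_vector` helper name are assumptions added for illustration; the answer's Pig script does not lower-case, and it returns a bag of IDs rather than the `id:count` string.

```python
import re
from collections import Counter

# Sample data copied from the question.
titles = [
    "Leo Tolstoy,  War and Peace",
    "Tom McArthur,  The War we went to",
]
term_ids = {"ted": 12, "tom": 13, "douglas": 14, "galaxy": 15,
            "war": 16, "leo": 17, "peace": 18}


def sparse_vector(title):
    """Return comma-separated 'id:count' pairs for the known terms of one title."""
    # Models FLATTEN(TOKENIZE(title)): one term per row.
    # Lower-casing is an assumption so that 'Leo' matches 'leo'.
    terms = [t for t in re.split(r"[ ,.]+", title.lower()) if t]
    # Models the inner JOIN on term: terms without an ID are dropped.
    ids = [term_ids[t] for t in terms if t in term_ids]
    # Models GROUP BY title, then counts each ID (first-seen order is kept).
    counts = Counter(ids)
    return ",".join(f"{i}:{n}" for i, n in counts.items())


for title in titles:
    print(sparse_vector(title))
```

For the two sample titles this prints `17:1,16:1,18:1` and `13:1,16:1`, matching the output requested in the question. Unlike this sketch, Pig does not guarantee the order of elements inside a grouped bag, so the IDs may come out in a different order there.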

Update 1:

For cases where STRSPLIT is preferable to TOKENIZE, you can combine FLATTEN and TOBAG to achieve the same effect as TOKENIZE, that is, to get a bag of words from the tuple returned by STRSPLIT.

SPLT = foreach A generate title, FLATTEN(STRSPLIT(title,'[ _\\[\\]\\/,\\.\\(\\)]+'));
X_tmp = foreach SPLT generate $0 as title, FLATTEN(TOBAG($1..$20)) as term; -- pivots the row
X = filter X_tmp by term is not null; -- this removes the extra bag rows when title was split in less than 20 terms
J = join X by (term),  termIDs by (term) using 'replicated';
G = group J by title;
Result = foreach G generate group as title, J.termIDs::id; -- project the ids from the grouped bag J

If any title exceeds 20 terms, increase the upper bound of the range passed to TOBAG.
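The role of the fixed range and the null filter can be illustrated with a small plain-Python sketch. It models the behaviour the answer describes (a fixed number of columns, with missing ones showing up as nulls that must be filtered out) rather than reproducing Pig's exact semantics. The `pivot` helper and `MAX_TERMS` constant are hypothetical names, and dropping empty fragments after the split is a simplification.

```python
import re

MAX_TERMS = 20  # mirrors the $1..$20 range; an arbitrary cap, as in the answer


def pivot(title, max_terms=MAX_TERMS):
    """Return (title, term) rows, modelling STRSPLIT + TOBAG + the null filter."""
    # Same delimiter set as the STRSPLIT pattern, written as a Python regex.
    parts = [p for p in re.split(r"[ _\[\]/,.()]+", title) if p]
    # A fixed-width projection gives exactly max_terms columns:
    # extra terms are dropped, missing ones become None (null).
    padded = (parts + [None] * max_terms)[:max_terms]
    # FLATTEN(TOBAG(...)) turns each column into a row;
    # the filter then removes the rows that are only padding.
    return [(title, term) for term in padded if term is not None]


rows = pivot("Leo Tolstoy,  War and Peace")
print(rows)
```

Calling `pivot("a b c", max_terms=2)` returns only the first two terms, which is why the cap has to be raised when titles are longer than the chosen range.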
