What can be the procedure/code to remove "string expression" from a file using Apache Pig?


Question

A = load '/home/wrdtest.txt';

B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;

C = filter B by word != 'the';

D = group C by word;

E = foreach D generate COUNT(C) as count, group as word;

F = order E by count desc;

store F into '/tmp/sample_data20';

I just want to filter the text. The third step filters the text and removes 'the' from the text file, but I want to remove a set of 499 words (stop words) from the text. I tried to use '|' (as OR), like this:

C = filter B by word != 'the|and|or'

but it didn't work.

Can you please advise on this? May I also include a text file (such as stopwords.txt) in order to remove the stop words?

I am a naive user of Pig.

Recommended answer

Something like removing stop words is complicated enough that it is not going to be in the built-in functions. You'll need to write a user-defined function (UDF), which is quite simple to do.

-- load the data line by line
lines = LOAD 'datafile.txt' USING TextLoader() AS (line:chararray);

-- apply some sort of UDF that returns the exact line without the stop words
nostop = FOREACH lines GENERATE myudfs.removestop(line);

-- store the data out
STORE nostop INTO 'datafile_nostop.txt';

Pushing that list of yours out to the tasks is another story. If the list is relatively small, on the order of thousands of words, you can bake the stop words into your code (i.e., hardcode the list) so that the UDF always has it available. Otherwise, you could use the Distributed Cache to push the file out.
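The answer leaves myudfs.removestop unspecified. As a minimal sketch of the hardcoded-list option, here is one way it might look if written in Python, which Pig can run as a UDF through Jython. The three-word list is a placeholder for the real 499 words, and splitting on whitespace is an assumption: it collapses repeated spaces and does not strip punctuation.

```python
# Placeholder stop word list (assumption); the real ~499 words would go here.
STOP_WORDS = frozenset(["the", "and", "or"])

# Pig's Jython engine is expected to supply the outputSchema decorator.
# This fallback exists only so the file also runs under plain Python.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        return lambda func: func


@outputSchema("line:chararray")
def removestop(line):
    """Return the line with stop words dropped (case-insensitive match)."""
    if line is None:
        return None
    kept = [word for word in line.split() if word.lower() not in STOP_WORDS]
    return " ".join(kept)
```

Assuming the file is saved as myudfs.py (a hypothetical name), it could be made available to the script above with REGISTER 'myudfs.py' USING jython AS myudfs; so that myudfs.removestop(line) resolves.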

With the additional information you provided, I can suggest an alternative approach. My approach above of using a UDF is still valid, though.

This new approach involves loading your other file, then effectively doing an anti-join to remove the words that match the list. You need to make sure stopwords.txt has one word per line in order for this to work. To do the anti-join (i.e., keep the things in one list that do not match the other list), I'll do a left outer join (using replicated), then keep only the rows where the stop word column is null (i.e., the word did not have a matching stop word).

A = load '/home/wrdtest.txt';

-- load the stop words list
SW = load '/home/stopwords.txt' as (stopword:chararray);    

B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;

-- join the data with a left outer join
-- using replicated should be done when the right relation (SW) is small
SW2 = join B by word LEFT OUTER, SW by stopword USING 'replicated';

-- keep only rows where stopword is null, meaning the word is not in the stopword list
C = filter SW2 by stopword IS NULL;

-- remove the stopword column that we don't need.
C = foreach C generate word;

D = group C by word;

E = foreach D generate COUNT(C) as count, group as word;

F = order E by count desc;

store F into '/tmp/sample_data20';
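To make the anti-join semantics concrete, here is a small in-memory Python sketch of the same steps. It is an illustration only, not Pig code: the function name is hypothetical, and plain whitespace splitting stands in for TOKENIZE, which uses a slightly different set of delimiters.

```python
from collections import Counter


def count_non_stop_words(lines, stop_words):
    """Mimic the script: tokenize, anti-join against stop words, count, sort."""
    stop = set(stop_words)

    # B: one row per word (stand-in for flatten(TOKENIZE(...)))
    words = [word for line in lines for word in line.split()]

    # SW2: left outer join, pairing each word with its stop word or None
    joined = [(word, word if word in stop else None) for word in words]

    # C: keep only rows whose stop word column is null
    kept = [word for word, stopword in joined if stopword is None]

    # D, E, F: group by word, count, order by count descending
    return Counter(kept).most_common()


result = count_non_stop_words(["the cat and the dog", "the cat"], ["the", "and"])
print(result)
```

Every word survives the left outer join, and only the ones with no partner in the stop word list pass the null filter, which is exactly what makes this an anti-join.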
