How can I remove a set of words ("string expressions") from a file using Apache Pig?


Question

A = load '/home/wrdtest.txt';

B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;

C = filter B by word != 'the';

D = group C by word;

E = foreach D generate COUNT(C) as count, group as word;

F = order E by count desc;

store F into '/tmp/sample_data20';

I just want to filter the text. The third statement (C) filters the tokens and removes 'the' from the text file's contents. However, I want to remove a set of 499 words (stop words) from the text. I tried using '|' as an OR, like this:

C = filter B by word != 'the|and|or';

but it didn't work.

Can you suggest how to do this? Could I include a text file such as stopwords.txt in order to remove the stop words?

I am a novice Pig user.

Recommended answer

Removing stop words is specialized enough that it is not going to be among the built-in functions. You'll need to write a user-defined function (UDF), which is quite simple to do.

-- load the data line by line
lines = LOAD 'datafile.txt' USING TextLoader() AS (line:chararray);

-- apply some sort of UDF that returns the exact line without the stop words
nostop = FOREACH lines GENERATE myudfs.removestop(line);

-- store the data out
STORE nostop INTO 'datafile_nostop.txt';

Pushing your list out to the tasks is another story. If the list is relatively small, on the order of thousands of words, you can bake the stop words into your code (i.e., hardcode the list) so that every task has it available. Otherwise, you could use the Distributed Cache to push the file out.
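
As a rough illustration of the hardcoded-list variant, here is a minimal sketch of what removestop could look like as a Python (Jython) UDF. Everything in it is an assumption rather than part of the original answer: the file name myudfs.py, the three placeholder stop words, the case-insensitive comparison, and the whitespace-only tokenization (Pig's TOKENIZE also splits on a few punctuation characters). The examples in the Pig documentation use the outputSchema decorator without importing it when a script is registered with jython, so the fallback below only exists to let the file run outside Pig.

```python
# myudfs.py (hypothetical file name): sketch of a stop word removal UDF.

try:
    outputSchema  # expected to be supplied by Pig for Jython UDFs
except NameError:
    # Fallback so the file can also be run and tested outside Pig.
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator

# Placeholder list; the real one would hold all 499 stop words.
STOP_WORDS = frozenset(["the", "and", "or"])


@outputSchema("line:chararray")
def removestop(line):
    """Return the line with stop words dropped; a null line stays null."""
    if line is None:
        return None
    # Assumption: compare case-insensitively and split on whitespace only.
    # Joining with single spaces collapses runs of whitespace.
    kept = [w for w in line.split() if w.lower() not in STOP_WORDS]
    return " ".join(kept)
```

Under these assumptions, the script would be registered before use with a statement such as REGISTER 'myudfs.py' USING jython AS myudfs; so that myudfs.removestop(line) resolves in the FOREACH.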

With the additional information you provided, I can suggest an alternative approach. The UDF approach above is still valid, though.

This new approach involves loading your other file, then effectively doing an anti-join to remove the words that match the list. You need to make sure stopwords.txt has one word per line for this to work. To do the anti-join (i.e., keep the items in one list that do not match the other list), I'll do a left outer join (using replicated), then keep only the rows where the stop word column is null (i.e., the word had no matching stop word).

A = load '/home/wrdtest.txt';

-- load the stop words list
SW = load '/home/stopwords.txt' as (stopword:chararray);    

B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;

-- join the data with a left outer join
-- using replicated should be done when the right relation (SW) is small
SW2 = join B by word LEFT OUTER, SW by stopword USING 'replicated';

-- keep only rows where stopword is null, meaning the word is not in the stopword list
C = filter SW2 by stopword IS NULL;

-- remove the stopword column that we don't need.
C = foreach C generate word;

D = group C by word;

E = foreach D generate COUNT(C) as count, group as word;

F = order E by count desc;

store F into '/tmp/sample_data20';
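
To sanity-check the script's output on a small sample without running Pig, the same anti-join logic can be mirrored in a few lines of plain Python. This is only an illustrative sketch: the function name count_non_stopwords and the alphabetical tie-break are assumptions added here, and Pig itself does not guarantee any particular order among words with equal counts.

```python
from collections import Counter


def count_non_stopwords(words, stopwords):
    """Mirror the Pig script: anti-join, group, count, order by count desc."""
    stop = set(stopwords)
    # Left outer join: pair each word with its matching stop word, or None.
    joined = [(w, w if w in stop else None) for w in words]
    # Anti-join: keep only the rows whose stop word column is null.
    kept = [w for w, sw in joined if sw is None]
    counts = Counter(kept)
    # (count, word) pairs, highest count first; ties sorted alphabetically
    # (an assumption made here so the result is deterministic).
    return sorted(((c, w) for w, c in counts.items()),
                  key=lambda pair: (-pair[0], pair[1]))
```

Feeding it the tokens of a short test file and the contents of stopwords.txt gives (count, word) pairs that can be compared against what the STORE statement writes out.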
