Pig count occurrence of strings in text messages

Problem description

I've got two files, venues.csv and tweets.csv. For each venue, I want to count the number of times its name occurs in the tweet messages from the tweets file.

I've imported the CSV files into HCatalog.

What I've done so far:

I know how to filter on the text field to get the tuples whose tweet messages contain 'Shell'. I want to do the same thing, but for each name in the venuesNames bag rather than a hard-coded 'Shell'. How can I do that? Also, how do I use the generate command properly to produce a new bag that pairs each count with the name of its venue?

a = LOAD 'venues_test_1' USING org.apache.hcatalog.pig.HCatLoader();
b = LOAD 'tweets_test_1' USING org.apache.hcatalog.pig.HCatLoader();

venuesNames = foreach a generate name;

countX = FILTER b BY (text matches '.*Shell.*');

venueToCount = generate ('Shell' as venue, COUNT(countX) as countVenues); 

DUMP venueToCount;

The files I'm using are:

tweets.csv

created_at,text,location
Sat Nov 03 13:31:07 +0000 2012, Sugar rush dfsudfhsu, Glasgow
Sat Nov 03 13:31:07 +0000 2012, Sugar rush ;dfsosjfd HAHAHHAHA, London
Sat Apr 25 04:08:47 +0000 2009, at Sugar rush dfjiushfudshf, Glasgow
Thu Feb 07 21:32:21 +0000 2013, Shell gggg, Glasgow
Tue Oct 30 17:34:41 +0000 2012, Shell dsiodshfdsf, Edinburgh
Sun Mar 03 14:37:14 +0000 2013, Shell wowowoo, Glasgow
Mon Jun 18 07:57:23 +0000 2012, Shell dsfdsfds, Glasgow
Tue Jun 25 16:52:33 +0000 2013, Shell dsfdsfdsfdsf, Glasgow

venues.csv

city,name
Glasgow, Sugar rush
Glasgow, ABC
Glasgow, University of Glasgow
Edinburgh, Shell
London, Big Ben

I know that these are basic questions, but I'm just getting started with Pig and any help will be appreciated!

Recommended answer

I presume that your list of venue names is unique. If not, you have a bigger problem anyway, because you will need to disambiguate which venue is being talked about (perhaps by reference to the city field). Disregarding that potential complication, here is what you can do:

You have described a fuzzy join. In Pig, if there is no way to coerce your records to contain standard values (and in this case there isn't, short of resorting to a UDF), you need to use the CROSS operator. Use it with caution: if you cross two relations with M and N records, the result is a relation with M*N records, which might be more than your system can handle.

The general strategy is 1) CROSS the two relations, 2) create a custom regex for each record*, and 3) keep the records that match their regex.

venues = LOAD 'venues_test_1' USING org.apache.hcatalog.pig.HCatLoader();
tweets = LOAD 'tweets_test_1' USING org.apache.hcatalog.pig.HCatLoader();

/* Create the Cartesian product of venues and tweets */
crossed = CROSS venues, tweets;
/* For each record, create a regex like '.*name.*' */
regexes = FOREACH crossed GENERATE *, CONCAT('.*', CONCAT(venues::name, '.*')) AS regex;
/* Keep tweet-venue pairs where the tweet contains the venue name */
venueMentions = FILTER regexes BY text MATCHES regex;

/* Count the matching tweets for each venue name */
venueCounts = FOREACH (GROUP venueMentions BY venues::name) GENERATE group AS venue, COUNT(venueMentions) AS mentions;

The sum of all the counts in venueCounts can be greater than the number of tweets if some tweets mention multiple venues.

*Note that you have to be a little careful with this technique: if a venue name contains characters that have a special meaning in Java regular expressions, you'll need to escape them.
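
If escaping the names is awkward, one possible way around it is to skip the regex and do a plain substring test. The sketch below is a variant of the script above, not part of the original answer. It assumes that text and name are both chararray columns and relies on Pig's built-in INDEXOF, which returns -1 when the search string is not found.

```pig
venues = LOAD 'venues_test_1' USING org.apache.hcatalog.pig.HCatLoader();
tweets = LOAD 'tweets_test_1' USING org.apache.hcatalog.pig.HCatLoader();

/* Same Cartesian product as before, so the M*N caveat still applies */
crossed = CROSS venues, tweets;

/* Literal substring match: no regex is built, so nothing needs escaping */
venueMentions = FILTER crossed BY INDEXOF(tweets::text, venues::name, 0) >= 0;

/* Count the matching tweets for each venue name */
venueCounts = FOREACH (GROUP venueMentions BY venues::name)
              GENERATE group AS venue, COUNT(venueMentions) AS mentions;

DUMP venueCounts;
```

As with the regex version, the match is case-sensitive, and a venue that no tweet mentions produces no row at all rather than a count of 0.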
