Pig: count occurrences of strings in text messages


Problem description

I've got two files, venues.csv and tweets.csv. For each of the venues, I want to count the number of times its name occurs in the tweet messages from the tweets file.

I've imported the csv files into HCatalog.

What I've done so far:

I know how to filter the text field and get the tuples whose tweet messages contain 'Shell'. I want to do the same, but not with a hard-coded 'Shell': rather, for each name from the venuesNames bag. How can I do that? And how can I then use the GENERATE command properly to produce a new bag that matches the counts from the filter with the names of the venues?

a = LOAD 'venues_test_1' USING org.apache.hcatalog.pig.HCatLoader();
b = LOAD 'tweets_test_1' USING org.apache.hcatalog.pig.HCatLoader();

venuesNames = foreach a generate name;

countX = FILTER b BY (text matches '.*Shell.*');

venueToCount = generate ('Shell' as venue, COUNT(countX) as countVenues); 

DUMP venueToCount;

The files I'm using are:

tweets.csv

created_at,text,location
Sat Nov 03 13:31:07 +0000 2012, Sugar rush dfsudfhsu, Glasgow
Sat Nov 03 13:31:07 +0000 2012, Sugar rush ;dfsosjfd HAHAHHAHA, London
Sat Apr 25 04:08:47 +0000 2009, at Sugar rush dfjiushfudshf, Glasgow
Thu Feb 07 21:32:21 +0000 2013, Shell gggg, Glasgow
Tue Oct 30 17:34:41 +0000 2012, Shell dsiodshfdsf, Edinburgh
Sun Mar 03 14:37:14 +0000 2013, Shell wowowoo, Glasgow
Mon Jun 18 07:57:23 +0000 2012, Shell dsfdsfds, Glasgow
Tue Jun 25 16:52:33 +0000 2013, Shell dsfdsfdsfdsf, Glasgow

venues.csv

city,name
Glasgow, Sugar rush
Glasgow, ABC
Glasgow, University of Glasgow
Edinburgh, Shell
London, Big Ben

I know that these are basic questions, but I'm just getting started with Pig and any help will be appreciated!

Answer

I presume that your list of venue names is unique. If not, then you have bigger problems anyway, because you will need to disambiguate which venue is being talked about (perhaps by reference to the city fields). But disregarding that potential complication, here is what you can do:

You have described a fuzzy join. In Pig, if there is no way to coerce your records to contain standard values (and in this case there isn't, short of writing a UDF), you need to use the CROSS operator. Use it with caution: if you cross two relations with M and N records, the result is a relation with M*N records, which may be more than your system can handle.

The general strategy is: 1) CROSS the two relations, 2) create a custom regex for each record*, and 3) keep the records whose tweet text passes the regex.

venues = LOAD 'venues_test_1' USING org.apache.hcatalog.pig.HCatLoader();
tweets = LOAD 'tweets_test_1' USING org.apache.hcatalog.pig.HCatLoader();

/* Create the Cartesian product of venues and tweets */
crossed = CROSS venues, tweets;
/* For each record, create a regex like '.*name.*' */
regexes = FOREACH crossed GENERATE *, CONCAT('.*', CONCAT(venues::name, '.*')) AS regex;
/* Keep tweet-venue pairs where the tweet contains the venue name */
venueMentions = FILTER regexes BY text MATCHES regex;

/* Count the matching pairs per venue name */
venueCounts = FOREACH (GROUP venueMentions BY venues::name) GENERATE group, COUNT($1);
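
For a quick sanity check, you can DUMP the result. With the sample files above (and allowing for the stray leading spaces that a naive CSV load leaves in the fields), something like the following should come out; venues such as ABC or Big Ben that match no tweet are removed by the FILTER, so they do not appear at all:

DUMP venueCounts;

(Shell,5)
(Sugar rush,3)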

Note that the sum of all the venueCounts may be more than the number of tweets, since a single tweet that mentions several venues is counted once for each of them.
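
If you did need the city-based disambiguation mentioned at the start, a minimal sketch would be to tighten the filter so that the tweet's location must also equal the venue's city. The TRIM calls are an assumption here, guarding against the stray spaces in the sample CSV:

/* Assumption: tweets::location and venues::city use the same spellings once trimmed */
venueMentions = FILTER regexes BY (text MATCHES regex) AND (TRIM(tweets::location) == TRIM(venues::city));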

*Note that you have to be a little careful with this technique: if a venue name contains characters that have special interpretations in Java regular expressions, you will need to escape them.
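
One way around manual escaping, sketched under the assumption that no venue name contains the literal sequence \E, is to wrap the name in the Java regex quoting markers \Q and \E when building the pattern:

/* Assumption: \Q...\E quotes the venue name literally, so regex metacharacters in it are ignored */
regexes = FOREACH crossed GENERATE *, CONCAT('.*\\Q', CONCAT(venues::name, '\\E.*')) AS regex;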
