Using Regex in Pig in hadoop

Views: 1089 Published: 2017/2/24 19:17:28 Tags: regex csv hadoop apache-pig


Problem description

I have a CSV file containing user tweets with the fields (tweetid, tweets, userid):

396124436476092416,"Think about the life you livin but don't think so hard it hurts Life is truly a gift, but at the same it is a curse",Obey_Jony09
396124436740317184,""@BleacherReport: Halloween has given us this amazing Derrick Rose photo (via @amandakaschube, @ScottStrazzante) http://t.co/tM0wEugZR1" yes",Colten_stamkos
396124436845178880,"When's 12.4k gonna roll around",Matty_T_03

Now I need to write a Pig query that returns all tweets containing the word 'favorite', ordered by tweet id.

For this I have the following code:
A = load '/user/pig/tweets' as (line);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)[,":-](.*)[",:-](.*)'))  AS (tweetid:long,msg:chararray,userid:chararray);
C = filter B by msg matches  '.*favorite.*';
D = order C by tweetid;
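
As a side note on the filter in C: Pig's `matches` operator takes a Java-format regular expression and, as far as I understand it, succeeds only when the expression covers the whole string, which is why the word is wrapped in `.*` on both sides. The sketch below models that with `String.matches` in plain Java. The class name, the helper `pigMatches`, and the sample messages are hypothetical and only there for illustration.

```java
final class MatchesDemo {
    // Hypothetical helper modelling Pig's `msg matches '<regex>'`:
    // String.matches is true only if the regex matches the entire string.
    static boolean pigMatches(String msg, String regex) {
        return msg.matches(regex);
    }

    public static void main(String[] args) {
        String msg = "this is my favorite song"; // made-up sample message

        // Wrapped in .* on both sides: the word may appear anywhere.
        System.out.println(pigMatches(msg, ".*favorite.*"));

        // Bare word: fails, because the regex must cover the whole message.
        System.out.println(pigMatches(msg, "favorite"));

        // The match is case-sensitive unless the regex says otherwise.
        System.out.println(pigMatches("My Favorite song", ".*favorite.*"));
    }
}
```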

How does the regular expression split each line into the desired fields here?
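
One way to see what the pattern does is to run it outside Pig. The sketch below applies the same expression to the third sample line with `java.util.regex`, assuming `REGEX_EXTRACT_ALL` requires the whole line to match (modelled here with `Matcher.matches()`). The class name, the `split` helper, and the `LAZY_FIRST` variant are hypothetical and not part of the original script.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexSplitDemo {
    // Pattern copied from the Pig script (Pig uses Java regex syntax).
    static final Pattern GREEDY =
            Pattern.compile("(.*)[,\":-](.*)[\",:-](.*)");

    // Hypothetical variant for comparison: the first group is lazy (".*?").
    static final Pattern LAZY_FIRST =
            Pattern.compile("(.*?)[,\":-](.*)[\",:-](.*)");

    // Returns the three captured groups, or null if the whole line does not match.
    static String[] split(Pattern p, String line) {
        Matcher m = p.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String line =
                "396124436845178880,\"When's 12.4k gonna roll around\",Matty_T_03";
        for (Pattern p : new Pattern[] { GREEDY, LAZY_FIRST }) {
            String[] groups = split(p, line);
            System.out.println(p.pattern());
            for (int i = 0; i < groups.length; i++) {
                System.out.println("  group " + (i + 1) + ": [" + groups[i] + "]");
            }
        }
    }
}
```

Each `[,":-]` class consumes exactly one delimiter character, and each `(.*)` captures whatever lies between. Because `.*` is greedy, the leading group in the pattern as written grabs as much as it can: on this line it runs up to just before the closing quote, leaving group 2 empty and group 3 as `Matty_T_03`. With the lazy first group, the captures are the id, the quoted message, and the userid.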

I tried using REGEX_EXTRACT instead of REGEX_EXTRACT_ALL, as I find it much simpler, but I couldn't get the code working except for extracting just the tweets:

B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'[,":-](.*)[",:-]',1))  AS (msg:chararray);

The above alias gets me the tweets, but if I use REGEX_EXTRACT to get the tweet_id, I do not get the desired output:

B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'(.*)[,":-]',1))  AS (tweetid:long);

This is what I get:

(396124554353197056,"Just saw @samantha0wen and @DakotaFears at the drake concert #waddup")
(396124554172432384,"@Yutika_Diwadkar I'm just so bright
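
The output above is consistent with a greedy `(.*)` running to the last delimiter on the line. The sketch below reproduces that in plain Java, assuming `REGEX_EXTRACT` returns the requested group of the first match found anywhere in the line (modelled with `Matcher.find()`). The class name, the `extract` helper, and the alternative pattern `([^,]*),` are hypothetical illustrations, not part of the original script.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexExtractDemo {
    // Hypothetical model of REGEX_EXTRACT(line, regex, index):
    // the given group of the first match in the line, or null if none.
    static String extract(String line, String regex, int index) {
        Matcher m = Pattern.compile(regex).matcher(line);
        return m.find() ? m.group(index) : null;
    }

    public static void main(String[] args) {
        String line =
                "396124436845178880,\"When's 12.4k gonna roll around\",Matty_T_03";

        // Pattern from the question: greedy, so it stops at the LAST delimiter.
        System.out.println(extract(line, "(.*)[,\":-]", 1));

        // Message pattern from the question: text between first and last delimiter.
        System.out.println(extract(line, "[,\":-](.*)[\",:-]", 1));

        // Hypothetical alternative: everything before the FIRST comma.
        String id = extract(line, "([^,]*),", 1);
        System.out.println(Long.parseLong(id));
    }
}
```

On this sample line the greedy pattern returns the id together with the quoted message, while `([^,]*),` stops at the first comma and yields a value that parses as a long.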