pig - parsing string with regex

Question

I'm stuck on string parsing in Pig.

I have looked at the documentation around regex_extract and regex_extract_all and hoped to use one of those functions.

I have the file '/logs/test.log':

cat '/logs/test.log'
user=242562&friend=6226&friend=93856&age=35&friend=35900

I want to extract the friend tags from the URL; in this case there are 3 tags with the same name. regex_extract seems to return only the first match, which is what I expected, while regex_extract_all seems to require that I know the whole string pattern, which changes on each row of the source file.

It looked ok with regex_extract, but this option only gives me the first one.

 [root@test]# pig -x local
 A = LOAD './test.log';
 B = FOREACH A GENERATE REGEX_EXTRACT($0, 'friend=([0-9]*)',1);
 dump B;
 (6226)

The examples I see for regex_extract_all show regex where you seek out all the tags:

  B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL($0, 'user=([0-9]+?)&friend=([0-9]+?)&friend=([0-9]+?)&.+?'));
 dump B;
 (242562,6226,93856)

That seems to work, but I really just want to extract the friends: (6226,93856,35900). I also have cases where there might be more or fewer than 3 friends per user.

Any ideas?

I'm also looking at using something like FLATTEN(TOKENIZE($0,'&')) and then filtering on SUBSTRING($0,0,INDEXOF($0,'=')) == 'friend' or something similar, but I wanted to see whether anyone knows a good regex approach.

Recommended answer

This can be achieved by simple string manipulations:

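-- Load each log line as a single chararray field.
inputs = LOAD 'input' AS (line: chararray);
-- Split each line on '&' and emit one key=value parameter per row.
tokenized = FOREACH inputs GENERATE FLATTEN(TOKENIZE(line, '&')) AS parameter;
-- Keep only the parameters that contain 'friend='.
filtered = FILTER tokenized BY INDEXOF(parameter, 'friend=') != -1;
-- Drop the 7-character 'friend=' prefix to leave just the number.
result = FOREACH filtered GENERATE SUBSTRING(parameter, 7, (int)SIZE(parameter)) AS friend_number;
DESCRIBE tokenized;
DUMP tokenized;
DESCRIBE filtered;
DUMP filtered;
DESCRIBE result;
DUMP result;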

Result:

tokenized: {parameter: chararray}
(user=242562)
(friend=6226)
(friend=93856)
(age=35)
(friend=35900)
filtered: {parameter: chararray}
(friend=6226)
(friend=93856)
(friend=35900)
result: {friend_number: chararray}
(6226)
(93856)
(35900)
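
Since the question asked for a regex approach, the same tokenize-then-filter idea can also be sketched with Pig's MATCHES operator and REGEX_EXTRACT. This is only a variant under assumptions, not part of the original answer: the relation name friends_only is made up for illustration, and the pattern assumes friend values are purely numeric.

```pig
-- Assumed input: one '&'-separated query string per line, as in test.log.
inputs = LOAD 'input' AS (line: chararray);
-- One key=value parameter per row.
tokenized = FOREACH inputs GENERATE FLATTEN(TOKENIZE(line, '&')) AS parameter;
-- MATCHES must match the whole token, so 'bestfriend=1' or 'friend=abc' is rejected.
friends_only = FILTER tokenized BY parameter MATCHES 'friend=[0-9]+';
-- Capture group 1 holds the digits after 'friend='.
result = FOREACH friends_only GENERATE REGEX_EXTRACT(parameter, 'friend=([0-9]+)', 1) AS friend_number;
DUMP result;
```

For the sample line this should yield the same three values as above. Because MATCHES anchors the pattern to the entire token, it is stricter than the INDEXOF check, which accepts any token that merely contains 'friend='.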
