How to use REGEX_EXTRACT_ALL in Pig

Question

This is my sample data:

subId=00001111911128052627,towerid=11232w34532543456345623453456984756894756,bytes=122112212212212218.4621702216543667E17
subId=00001111911128052639,towerid=11232w34532543456345623453456984756894756,bytes=122112212212212219.6726312167218586E17
subId=00001111911128052615,towerid=11232w34532543456345623453456984756894756,bytes=122112212212212216.9431647633139046E17

My expected output is a tuple in which each field holds one matched group:

(capturing_group1, capturing_group2, ..., capturing_groupN)

e.g. (00001111911128052627,11232w34532543456345623453456984756894756,122112212212212216.9431647633139046E17)

This is my approach:

A = load '/home/hduser/Desktop/arrtest1.txt' using TextLoader as (line:chararray);
b = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)[subId=](.*)[towerid=](.*)[bytes=](.*)')) AS (F1,F2,F3);

But I don't get any result.

Recommended answer

Based on your input example, you can try this regex:

REGEX_EXTRACT_ALL(line,'subId=([^,]*),towerid=([^,]*),bytes=(.*)')
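As far as I know, REGEX_EXTRACT_ALL is built on java.util.regex and only returns groups when the pattern matches the whole line. The plain-Java sketch below makes that assumption explicit so the regex can be tried outside Pig. `ExtractAllSketch` and `extractAll` are hypothetical names of mine, not part of Pig's API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Pig): mimics the whole-line match that
// REGEX_EXTRACT_ALL is assumed to perform, returning one entry per group.
final class ExtractAllSketch {
    static String[] extractAll(String line, String regex) {
        Matcher m = Pattern.compile(regex).matcher(line);
        if (!m.matches()) {   // the pattern must cover the entire line
            return null;      // assumption: Pig yields no tuple in this case
        }
        String[] groups = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            groups[i - 1] = m.group(i);   // group 0 is the whole match, skip it
        }
        return groups;
    }

    public static void main(String[] args) {
        // First line of the sample data from the question.
        String line = "subId=00001111911128052627,"
                + "towerid=11232w34532543456345623453456984756894756,"
                + "bytes=122112212212212218.4621702216543667E17";
        String[] fields = extractAll(line, "subId=([^,]*),towerid=([^,]*),bytes=(.*)");
        System.out.println(String.join(" | ", fields));
    }
}
```

Running `main` prints the three captured values for the first sample line. A line that does not fit the pattern returns null in this sketch.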

You can check the behaviour of this regex at this link.

Update: why not use .* to match each field?

The Kleene operator * is greedy by default, so the regex engine first matches all the way to the end of the string, then backs off one character at a time, checking each time whether the next part of the regex matches (e.g. it looks for a comma , after the first .*).

So in the end all of the regexes below match, but they need a different number of steps to get there:

[a-zA-Z]+=(.*),[a-zA-Z]+=(.*),[a-zA-Z]+=(.*) - 1142 steps

subId=([^,]*),towerid=([^,]*),bytes=(.*) - 96 steps
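The backtracking described above can be seen in isolation with a small Java check (Java because Pig's regex functions sit on top of it). This is only a sketch: the class and method names are made up, the input is a shortened sample, and it shows what each pattern captures rather than counting engine steps.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyVsNegatedClass {
    // Returns group 1 of a match anchored at the start of the input only;
    // lookingAt() does not require the match to reach the end of the input.
    static String firstGroupAtStart(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.lookingAt() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "subId=1,towerid=2w,bytes=3.4E17"; // shortened, made-up sample

        // Greedy: .* runs to the end of the line, then backs off to the LAST comma.
        System.out.println(firstGroupAtStart("subId=(.*),", line));    // 1,towerid=2w

        // Negated class: [^,]* stops at the FIRST comma, so nothing has to be given back.
        System.out.println(firstGroupAtStart("subId=([^,]*),", line)); // 1
    }
}
```

In the full three-field regex the greedy version still ends up with the right groups, but only after the engine has retreated from the end of the line for each .*, which is where the extra steps come from.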

If you don't care about the field names and only need them to consist of letters (uppercase or lowercase):

(?i)[a-z]+=([^,]*)[a-z,]+=([^,]*),[a-z,]+=(.*) - 58 steps

NB: the Apache Pig regex engine is based on Java's, so the case-insensitive flag (?i) is likely to work too.
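The flag's behaviour can at least be checked in plain Java, which is what the assumption above rests on. It does not prove anything about Pig itself. The class name and the mixed-case input line are made up for illustration.

```java
import java.util.regex.Pattern;

final class CaseInsensitiveFlagCheck {
    // The generic pattern from above, without the (?i) prefix.
    static final String GENERIC = "[a-z]+=([^,]*)[a-z,]+=([^,]*),[a-z,]+=(.*)";

    // True only if the regex matches the entire line.
    static boolean matchesWholeLine(String regex, String line) {
        return Pattern.compile(regex).matcher(line).matches();
    }

    public static void main(String[] args) {
        String line = "SUBID=1,TowerId=2w,BYTES=3.4E17"; // made-up, mixed-case field names

        // With (?i), [a-z] also accepts uppercase letters.
        System.out.println(matchesWholeLine("(?i)" + GENERIC, line)); // true

        // Without it, the uppercase field names are rejected.
        System.out.println(matchesWholeLine(GENERIC, line));          // false
    }
}
```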
