STRSPLIT and REGEX_EXTRACT_ALL in PigLatin
Problem description
I have the following file:
File
----
12-3 John 121
5-1 Sam 122
The file is tab (\t) delimited. I am loading each row as line:chararray because I don't want the data split into individual fields.
Now I want to pull out and store the details (12-3 and 5-1) as separate data.
I am trying STRSPLIT and REGEX_EXTRACT_ALL, but the data doesn't seem to match.
splitdata = FOREACH filedata {
    regex = REGEX_EXTRACT_ALL(line, '^([0-9]*)\\-([0-9]*)');
    split = STRSPLIT(line, '\\t', 1);
    GENERATE regex, split;
};
This is how I want my final data to look:
(12, 3, 12-3 John 121)
(5, 1, 5-1 Sam 122)
Solution
Thanks, Lorand.
Since you gave me an idea of how to use REGEX_EXTRACT_ALL, here is how I finally used it:
FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line, '^([0-9]*)\\-([0-9]*).*')) AS (FIELD1:chararray, FIELD2:chararray), line;
It is interesting that Matcher.matches() fails for
'^([0-9]*)\\-([0-9]*)'
but works for
'^([0-9]*)\\-([0-9]*).*'
The reason is that matches() succeeds only when the pattern matches the entire input, so the trailing .* is needed to consume the rest of the line after the two numbers.
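The difference between the two patterns can be reproduced outside Pig with plain java.util.regex. The sketch below is only an illustration under the answer's premise that the match goes through Matcher.matches(). The class MatchesDemo and the helper extractAll are hypothetical names made up for this example and are not Pig's actual implementation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchesDemo {
    // Hypothetical helper: returns the capture groups only if the WHOLE
    // input matches the regex (Matcher.matches()), otherwise null.
    static String[] extractAll(String input, String regex) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return null;
        }
        String[] groups = new String[m.groupCount()];
        for (int i = 0; i < groups.length; i++) {
            groups[i] = m.group(i + 1); // group 0 is the full match, so skip it
        }
        return groups;
    }

    public static void main(String[] args) {
        String line = "12-3\tJohn\t121";

        // Without a trailing .* the pattern covers only "12-3", so matches() fails.
        String[] partial = extractAll(line, "^([0-9]*)\\-([0-9]*)");
        System.out.println(partial == null ? "no match" : String.join(",", partial));

        // With .* the rest of the line is consumed and both groups are captured.
        String[] full = extractAll(line, "^([0-9]*)\\-([0-9]*).*");
        System.out.println(full == null ? "no match" : String.join(",", full));
    }
}
```

By contrast, Matcher.find() or Matcher.lookingAt() would accept the shorter pattern, because neither requires the match to extend to the end of the input.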