STRSPLIT and REGEX_EXTRACT_ALL in PigLatin
Question
I have the following file:
File
----
12-3 John 121
5-1 Sam 122
The file is tab (\t) delimited. I am loading each row as line:chararray because I do not want the data split into individual fields.
Now I want to pull out the leading details (12-3 and 5-1) and store their parts as separate fields.
I am trying STRSPLIT and REGEX_EXTRACT_ALL, but the data doesn't seem to match.
splitdata = FOREACH filedata {
regex = REGEX_EXTRACT_ALL(line, '^([0-9]*)\\-([0-9]*)');
split = STRSPLIT(line, '\\t', 1);
GENERATE regex, split;
};
This is how I want my final data to look:
(12, 3, 12-3 John 121)
( 5, 1, 5-1 Sam 122)
Recommended answer
Thanks, Lorand.
Since you gave me an idea of how to use REGEX_EXTRACT_ALL, here is how I finally used it.
FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line, '^([0-9]*)\\-([0-9]*).*'))
AS (FIELD1:chararray, FIELD2:chararray), line;
It is interesting to note that Matcher.matches() fails for '^([0-9]*)\\-([0-9]*)' but works for '^([0-9]*)\\-([0-9]*).*'. This is because matches() only succeeds when the pattern consumes the entire input, so the trailing .* is needed to absorb the rest of the line after the second number.
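The difference between the two patterns can be reproduced outside Pig. The following is a minimal Java sketch, assuming (as the answer states) that REGEX_EXTRACT_ALL relies on java.util.regex.Matcher.matches(). The class name MatchesDemo and the helper extractAll are hypothetical names used only for illustration. They are not part of Pig.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchesDemo {
    // Hypothetical helper: returns the capture groups when the WHOLE input
    // matches the regex, or null otherwise. This mirrors the behaviour the
    // answer attributes to REGEX_EXTRACT_ALL (an assumption, not Pig's code).
    static String[] extractAll(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return null; // matches() requires the entire input to match
        }
        String[] groups = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            groups[i - 1] = m.group(i);
        }
        return groups;
    }

    public static void main(String[] args) {
        String line = "12-3\tJohn\t121"; // one tab-delimited row from the file

        String prefixOnly = "^([0-9]*)\\-([0-9]*)";
        String wholeLine = "^([0-9]*)\\-([0-9]*).*";

        // The prefix-only pattern leaves "\tJohn\t121" unconsumed, so matches() fails.
        System.out.println(Arrays.toString(extractAll(prefixOnly, line))); // prints null

        // The trailing .* consumes the rest of the line, so matches() succeeds.
        System.out.println(Arrays.toString(extractAll(wholeLine, line))); // prints [12, 3]

        // lookingAt() only needs a match at the start of the input,
        // so the prefix-only pattern succeeds here.
        System.out.println(Pattern.compile(prefixOnly).matcher(line).lookingAt()); // prints true
    }
}
```

In Pig itself there is no way to switch the matching mode from the script, so appending .* to the pattern, as in the answer, is the practical fix.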