STRSPLIT and REGEX_EXTRACT_ALL in PigLatin

Question

I have the following file:

File
----
12-3    John    121
 5-1    Sam     122

The file is tab-delimited (\t). I load each row as line:chararray because I do not want the data split into individual fields.

Now I want to extract the two numbers in the leading value of each row (12-3 and 5-1) and store them as separate fields.

I tried STRSPLIT and REGEX_EXTRACT_ALL, but the pattern does not seem to match the data.

splitdata = FOREACH filedata {
    regex = REGEX_EXTRACT_ALL(line, '^([0-9]*)\\-([0-9]*)');
    split = STRSPLIT(line, '\\t', 1);
    GENERATE regex, split;
};

This is the output I want:

(12, 3, 12-3    John    121)
( 5, 1,  5-1    Sam     122)

Solution

Thanks, Lorand.

Your hint about how to use REGEX_EXTRACT_ALL pointed me in the right direction. Here is how I finally used it:

-- filedata is the relation from the question, loaded as (line:chararray)
splitdata = FOREACH filedata GENERATE
    FLATTEN(REGEX_EXTRACT_ALL(line, '^([0-9]*)\\-([0-9]*).*'))
        AS (FIELD1:chararray, FIELD2:chararray),
    line;

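Interestingly, Matcher.matches() fails for '^([0-9]*)\\-([0-9]*)' but succeeds for '^([0-9]*)\\-([0-9]*).*'. The reason is that matches() only succeeds when the pattern covers the entire input string, so the trailing .* is needed to consume the rest of the line after the two numbers.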
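
The difference between the two patterns can be reproduced in plain Java, outside Pig. The following is only a sketch: the class name MatchesDemo and the helper extractAll are made up for illustration and are not part of Pig. It simply assumes that the match is decided by Matcher.matches(), as described here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Pig.
class MatchesDemo {

    // Returns the two captured groups if the WHOLE input matches the regex,
    // or null otherwise (Matcher.matches() never accepts a partial match).
    static String[] extractAll(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String line = "12-3\tJohn\t121";

        // Pig's '\\-' reaches the regex engine as \- ; in a Java literal that is "\\-".
        String prefixOnly = "^([0-9]*)\\-([0-9]*)";
        String wholeLine = "^([0-9]*)\\-([0-9]*).*";

        // The pattern covers only "12-3", so matches() rejects the full line.
        System.out.println(extractAll(prefixOnly, line) == null);

        // lookingAt() accepts a match that covers just the start of the input.
        System.out.println(Pattern.compile(prefixOnly).matcher(line).lookingAt());

        // With the trailing .* the pattern covers the whole line.
        String[] groups = extractAll(wholeLine, line);
        System.out.println(groups[0] + "," + groups[1]);
    }
}
```

Note that . does not match line terminators by default, which is not a problem here because each record is a single line.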
