STRSPLIT and REGEX_EXTRACT_ALL in PigLatin

Views: 21  Published: 2021/11/12 4:14:07  Tags: hadoop, apache-pig

Question

I have the following file:

File
----
12-3    John    121
 5-1    Sam     122

The file is tab (\t) delimited. I load each row as line:chararray because I do not want the data to be split into individual fields.

Now I want to pull out the leading details (12-3 and 5-1) and store their parts as separate fields.

I am trying STRSPLIT and REGEX_EXTRACT_ALL, but the data doesn't seem to match.

splitdata = FOREACH filedata {
    regex = REGEX_EXTRACT_ALL(line, '^([0-9]*)\\-([0-9]*)');
    split = STRSPLIT(line, '\\t', 1);
    GENERATE regex, split;
};

This is how I want my final data to look:

(12, 3, 12-3    John    121)
( 5, 1,  5-1    Sam     122)

Recommended answer

Thanks, Lorand.

Since you gave me an idea of how to use REGEX_EXTRACT_ALL, here is how I finally used it.

-- A holds one line:chararray per row; the trailing .* makes the pattern cover the whole line
B = FOREACH A GENERATE
  FLATTEN(REGEX_EXTRACT_ALL(line, '^([0-9]*)\\-([0-9]*).*'))
  AS (FIELD1:chararray, FIELD2:chararray), line;

It is interesting that Matcher.matches() fails for '^([0-9]*)\\-([0-9]*)' but works for '^([0-9]*)\\-([0-9]*).*'. Matcher.matches() succeeds only when the pattern matches the entire input, so the trailing .* is needed to consume the rest of the line after the second number.
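
The difference between the two patterns can be reproduced outside Pig. The sketch below is plain Java and rests on the assumption, taken from the remark above, that REGEX_EXTRACT_ALL relies on Matcher.matches(). The class MatchesDemo, the helper extractAll, and the whitespace-tolerant pattern at the end are illustrative names introduced here; none of them are part of Pig.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchesDemo {
    // Hypothetical helper: returns the capture groups when the pattern
    // matches the ENTIRE input (Matcher.matches()), otherwise null.
    static String[] extractAll(String input, String regex) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return null;
        }
        String[] groups = new String[m.groupCount()];
        for (int i = 0; i < groups.length; i++) {
            groups[i] = m.group(i + 1);
        }
        return groups;
    }

    public static void main(String[] args) {
        String row = "12-3\tJohn\t121";
        String prefixOnly = "^([0-9]*)\\-([0-9]*)";
        String wholeLine = "^([0-9]*)\\-([0-9]*).*";

        // The prefix-only pattern stops before the first tab, so matches() fails.
        System.out.println(Arrays.toString(extractAll(row, prefixOnly))); // null

        // With the trailing .* the pattern covers the whole row.
        System.out.println(Arrays.toString(extractAll(row, wholeLine)));  // [12, 3]

        // lookingAt() only requires a match at the start of the input,
        // which is why the prefix-only pattern looks correct at first glance.
        Matcher prefix = Pattern.compile(prefixOnly).matcher(row);
        System.out.println(prefix.lookingAt());                           // true

        // Assumption: the leading space shown in the second sample row is
        // really present in the file. The pattern above then finds nothing,
        // and a variant that skips leading whitespace is needed.
        String padded = " 5-1\tSam\t122";
        String tolerant = "^\\s*([0-9]+)-([0-9]+).*";
        System.out.println(Arrays.toString(extractAll(padded, wholeLine))); // null
        System.out.println(Arrays.toString(extractAll(padded, tolerant)));  // [5, 1]
    }
}
```

If the rows in the real file have no leading whitespace, the pattern from the answer is enough and the tolerant variant can be ignored.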
