hive regexp_extract weirdness
I am having some problems with regexp_extract:
I am querying a tab-delimited file; the column I'm checking has strings that look like this:
abc.def.ghi
Now, if I do:
select distinct regexp_extract(name, '[^.]+', 0) from dummy;
MR job runs, it works, and I get "abc" from index 0.
But now, if I want to get "def" from index 1:
select distinct regexp_extract(name, '[^.]+', 1) from dummy;
Hive fails with:
2011-12-13 23:17:08,132 Stage-1 map = 0%, reduce = 0%
2011-12-13 23:17:28,265 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201112071152_0071 with errors
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
Log file says:
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row
Am I doing something fundamentally wrong here?
Thanks, Mario
From the docs https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF it appears that regexp_extract() applies the pattern once per record/line and returns the part you ask for.
It seems to stop at the first match rather than matching globally, so the index does not select the Nth match. It references a capture group within that first match:
0 = the entire match
1 = capture group 1
2 = capture group 2, etc ...
Paraphrased from the manual:
regexp_extract('foothebar', 'foo(.*?)(bar)', 2)
                                ^    ^
groups                          1    2
This returns 'bar'.
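To make the index semantics concrete, here is a minimal Java sketch. It assumes Hive hands the pattern to java.util.regex and returns the requested group of the first match; regexpExtract is a hypothetical stand-in written for illustration, not Hive's actual source. Under that assumption, asking for group 1 from a pattern with no parentheses throws, which would be consistent with the runtime error in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexpExtractSketch {
    // Hypothetical stand-in for regexp_extract: look at the FIRST match only,
    // then return the requested capture group (0 = the entire match).
    static String regexpExtract(String subject, String regex, int index) {
        Matcher m = Pattern.compile(regex).matcher(subject);
        if (!m.find()) {
            return ""; // assumption: no match yields an empty string
        }
        // Throws IndexOutOfBoundsException when the pattern has no such group.
        return m.group(index);
    }

    public static void main(String[] args) {
        System.out.println(regexpExtract("abc.def.ghi", "[^.]+", 0));       // abc
        System.out.println(regexpExtract("foothebar", "foo(.*?)(bar)", 2)); // bar
        try {
            regexpExtract("abc.def.ghi", "[^.]+", 1); // pattern has no group 1
        } catch (IndexOutOfBoundsException e) {
            System.out.println("failed: " + e.getMessage());
        }
    }
}
```

So index 1 is not "the second match" of '[^.]+'; there is simply no capture group 1 to return.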
So, in your case, to get the text after the first dot, something like this might work (the dot has to be escaped, and the backslash doubled inside a Hive string literal):
regexp_extract(name, '\\.([^.]+)', 1)
or this
regexp_extract(name, '[.]([^.]+)', 1)
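A quick way to check the dot-prefixed patterns is to run them through java.util.regex, assuming that is the engine Hive uses. The helper firstGroup below is hypothetical and only mirrors "group 1 of the first match". It also shows why the dot must be a literal: an unescaped dot matches any character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AfterDotSketch {
    // Hypothetical helper: group 1 of the first match, or "" if nothing matches.
    static String firstGroup(String subject, String regex) {
        Matcher m = Pattern.compile(regex).matcher(subject);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        String name = "abc.def.ghi";
        System.out.println(firstGroup(name, "[.]([^.]+)")); // def
        System.out.println(firstGroup(name, "\\.([^.]+)")); // def (regex is \.([^.]+))
        System.out.println(firstGroup(name, ".([^.]+)"));   // bc: the bare dot consumed 'a'
    }
}
```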
Edit
I got interested in this again. Just as an FYI, there could be a shortcut/workaround for you.
It looks like you want a particular segment of a string separated by the dot character, which is almost like a split.
It's more than likely that the regex engine overwrites a capture group each time a quantified group repeats, so after the last repetition the group holds the text from the final iteration.
You can take advantage of that with something like this (the dot is escaped, with the backslash doubled for the Hive string literal):
Returns the first segment (abc) of abc.def.ghi:
regexp_extract(name, '^(?:([^.]+)\\.?){1}', 1)
Returns the second segment (def) of abc.def.ghi:
regexp_extract(name, '^(?:([^.]+)\\.?){2}', 1)
Returns the third segment (ghi) of abc.def.ghi:
regexp_extract(name, '^(?:([^.]+)\\.?){3}', 1)
The index doesn't change (because the index still refers to capture group 1); only the regex repetition count changes.
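The overwriting behaviour can be sketched in Java, again assuming Hive's engine behaves like java.util.regex. The method nthSegment is a hypothetical helper that builds the pattern for a given repetition count and always reads capture group 1.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupSketch {
    // Hypothetical helper: repeat the group n times and read capture group 1,
    // which holds whatever the LAST repetition captured.
    static String nthSegment(String subject, int n) {
        String regex = "^(?:([^.]+)\\.?){" + n + "}";
        Matcher m = Pattern.compile(regex).matcher(subject);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        String name = "abc.def.ghi";
        System.out.println(nthSegment(name, 1)); // abc
        System.out.println(nthSegment(name, 2)); // def
        System.out.println(nthSegment(name, 3)); // ghi
    }
}
```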
Some notes:
This regex
^(?:([^.]+)\.?){n}
has problems though.
It requires at least one character between the dots, or the regex won't match as intended.
It could be this instead:
^(?:([^.]*)\.?){n}
but this will match even if there are fewer than n-1 dots,
including the empty string. This is probably not desirable.
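Both caveats can be seen in a small Java sketch, assuming java.util.regex semantics. The helper groupOrNull is hypothetical: it takes the quantifier ("+" or "*") used inside the capture group and returns null when there is no match, so a failed match can be told apart from an empty segment.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SegmentCaveatsSketch {
    // Hypothetical helper: capture group 1 after n repetitions,
    // or null when the pattern does not match at all.
    static String groupOrNull(String quantifier, String subject, int n) {
        String regex = "^(?:([^.]" + quantifier + ")\\.?){" + n + "}";
        Matcher m = Pattern.compile(regex).matcher(subject);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // "+" needs text between the dots: the empty second segment breaks it.
        System.out.println(groupOrNull("+", "a..c", 2) == null);  // true
        // "*" accepts the empty segment...
        System.out.println("".equals(groupOrNull("*", "a..c", 2))); // true
        // ...but also matches inputs that have too few dots, or nothing at all.
        System.out.println(groupOrNull("*", "abc", 2) != null);   // true
        System.out.println(groupOrNull("*", "", 2) != null);      // true
    }
}
```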
There is a way to do it that doesn't require text between the dots, but still requires at least n-1 dots.
It uses a negative lookahead assertion, with capture buffer 2 acting as a flag:
^(?:(?!\2)([^.]*)(?:\.|$())){2}
Everything else is the same.
So, if Hive uses Java-style regex, then this should work (backslashes doubled, as Hive string literals require):
regexp_extract(name, '^(?:(?!\\2)([^.]*)(?:\\.|$())){2}', 1)
Change {2} to whichever segment is needed (this one returns segment 2).
It still returns capture buffer 1 after the Nth iteration.
Here it is broken down:
^                 # Beginning of string
(?:               # Grouping
  (?!\2)          #   Assertion: capture buffer 2 is UNDEFINED
  ([^.]*)         #   Capture buffer 1: zero or more non-dot characters
  (?:             #   Grouping
    \.            #     Dot character
    |             #     or
    $ ()          #     End of string; the empty group sets capture buffer 2 to DEFINED (stops further iterations once the end of the string is reached)
  )               #   End grouping
){3}              # End grouping; repeat exactly 3 (or N) times, overwriting capture buffer 1 each time
If the engine doesn't support lookahead assertions, then this won't work!
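As a rough check of the flagged pattern, here is a Java sketch under the same assumption that Hive's engine behaves like java.util.regex. The helper nthSegmentOrNull is hypothetical and returns null when nothing matches. Only a few inputs are tried; edge cases such as trailing dots may behave differently from engine to engine and are not covered.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlaggedSegmentSketch {
    // Hypothetical helper: capture group 1 after n repetitions of the
    // flagged pattern, or null when the pattern does not match.
    static String nthSegmentOrNull(String subject, int n) {
        // Java string literal for the regex ^(?:(?!\2)([^.]*)(?:\.|$())){n}
        String regex = "^(?:(?!\\2)([^.]*)(?:\\.|$())){" + n + "}";
        Matcher m = Pattern.compile(regex).matcher(subject);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(nthSegmentOrNull("abc.def.ghi", 2));      // def
        System.out.println(nthSegmentOrNull("abc.def.ghi", 3));      // ghi
        System.out.println("".equals(nthSegmentOrNull("a..c", 2)));  // true: empty segment is allowed
        System.out.println(nthSegmentOrNull("a..c", 3));             // c
        System.out.println(nthSegmentOrNull("abc", 2) == null);      // true: not enough dots
    }
}
```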