Hive regexp_extract weirdness
Problem description
I am having some problems with regexp_extract:
I am querying a tab-delimited file; the column I'm checking has strings that look like this:
abc.def.ghi
Now, if I do:
select distinct regexp_extract(name, '[^.]+', 0) from dummy;
MR job runs, it works, and I get "abc" from index 0.
But now, if I want to get "def" from index 1:
select distinct regexp_extract(name, '[^.]+', 1) from dummy;
Hive fails with:
2011-12-13 23:17:08,132 Stage-1 map = 0%, reduce = 0%
2011-12-13 23:17:28,265 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201112071152_0071 with errors
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
Log file says:
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row
Am I doing something fundamentally wrong here?
Thanks, Mario
From the docs (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF), it appears that regexp_extract() is applied to each record/line separately.
It seems to stop at the first match it finds rather than matching globally, so the index refers to a capture group, not to the nth match. Your pattern '[^.]+' contains no capture groups, so index 1 has nothing to refer to, which is most likely why the job fails.
0 = the entire match
1 = capture group 1
2 = capture group 2, etc ...
Paraphrased from the manual:
regexp_extract('foothebar', 'foo(.*?)(bar)', 2)
                                ^    ^
groups                          1    2
This returns 'bar'.
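To make the index semantics concrete, here is a minimal Java sketch. It assumes, as the answer does, that regexp_extract hands the pattern to java.util.regex and returns the requested group of the first match; the class and method names are hypothetical stand-ins, not Hive's code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for regexp_extract; not Hive's actual implementation.
class GroupIndexDemo {
    // Returns group `index` of the first match, or null when nothing matches.
    static String extract(String subject, String regex, int index) {
        Matcher m = Pattern.compile(regex).matcher(subject);
        return m.find() ? m.group(index) : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("foothebar", "foo(.*?)(bar)", 0)); // whole match
        System.out.println(extract("foothebar", "foo(.*?)(bar)", 1)); // capture group 1
        System.out.println(extract("foothebar", "foo(.*?)(bar)", 2)); // capture group 2
        System.out.println(extract("abc.def.ghi", "[^.]+", 0));       // first match only
        try {
            // The pattern has no capture groups, so asking for group 1 is an error.
            extract("abc.def.ghi", "[^.]+", 1);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("group 1 does not exist: " + e.getMessage());
        }
    }
}
```

In plain Java the last call throws, which would be consistent with the runtime error in the question if Hive behaves the same way.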
So, in your case, to get the text after the dot, something like this might work:
regexp_extract(name, '\.([^.]+)', 1)
or this
regexp_extract(name, '[.]([^.]+)', 1)
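Both patterns can be tried outside Hive with a small Java sketch (the class and method names are made up for illustration, and it assumes Hive passes the pattern to java.util.regex unchanged). One caveat worth checking on your Hive version: the Hive manual notes that backslashes in a pattern literal may themselves need escaping (it gives '\\s' as the example), so the [.] form may be the less fragile of the two.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for trying the two suggested patterns in plain Java.
class AfterDotDemo {
    // Returns capture group 1 of the first match, or null when nothing matches.
    static String group1(String subject, String regex) {
        Matcher m = Pattern.compile(regex).matcher(subject);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // "\\." in Java source is the regex \. (a literal dot).
        System.out.println(group1("abc.def.ghi", "\\.([^.]+)"));
        // A character class needs no backslash at all.
        System.out.println(group1("abc.def.ghi", "[.]([^.]+)"));
    }
}
```

Both calls return the text after the first dot, up to the next dot.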
Edit
I got interested in this again; just an FYI, there may be a shortcut/workaround for you.
It looks like you want a particular segment of a dot-separated string, which is almost like split.
It's more than likely that the regex engine overwrites a capture group each time the group is matched again under a quantifier.
You can take advantage of that with something like this:
Returns the first segment (abc in abc.def.ghi):
regexp_extract(name, '^(?:([^.]+)\.?){1}', 1)
Returns the second segment (def in abc.def.ghi):
regexp_extract(name, '^(?:([^.]+)\.?){2}', 1)
Returns the third segment (ghi in abc.def.ghi):
regexp_extract(name, '^(?:([^.]+)\.?){3}', 1)
The index doesn't change (because the index still refers to capture group 1); only the regex repetition count changes.
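The overwrite behaviour is easy to confirm in Java itself. The sketch below is only an illustration under the assumption that Hive uses java.util.regex; the helper name segment is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration of "the last iteration wins" for a quantified capture group.
class SegmentDemo {
    // Returns the n-th dot-separated segment (1-based), or null when nothing matches.
    static String segment(String subject, int n) {
        // Group 1 is rewritten on every repetition, so after {n} it holds segment n.
        Pattern p = Pattern.compile("^(?:([^.]+)\\.?){" + n + "}");
        Matcher m = p.matcher(subject);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 3; n++) {
            System.out.println(n + " -> " + segment("abc.def.ghi", n));
        }
    }
}
```

Only the repetition count varies between calls; the group index passed to group() stays at 1.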
Some notes:
This regex
^(?:([^.]+)\.?){n}
has problems though.
It requires at least one character in every segment. With an empty segment (two dots in a row, or a leading dot) it will not match the segment you want, and it may not match at all.
It could be this instead
^(?:([^.]*)\.?){n}
but this will match even if there are fewer than n-1 dots, including on the empty string. This is probably not desirable.
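These caveats can be checked with a short Java sketch (hypothetical helper names, and again assuming java.util.regex semantics). It also shows a related pitfall of the + variant: when the string has too few segments, backtracking can satisfy {n} by splitting one segment in two.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Edge cases for the two simple variants; helper names are illustrative only.
class EdgeCaseDemo {
    // Returns capture group 1 of the first match, or null when nothing matches.
    private static String group1(String subject, String regex) {
        Matcher m = Pattern.compile(regex).matcher(subject);
        return m.find() ? m.group(1) : null;
    }

    // Variant that needs at least one character per segment.
    static String plus(String subject, int n) {
        return group1(subject, "^(?:([^.]+)\\.?){" + n + "}");
    }

    // Variant that allows empty segments.
    static String star(String subject, int n) {
        return group1(subject, "^(?:([^.]*)\\.?){" + n + "}");
    }

    public static void main(String[] args) {
        System.out.println(plus("a..c", 2)); // empty second segment: no match (null)
        System.out.println(plus("abc", 2));  // one segment only, yet it matches as "ab" + "c"
        System.out.println(star("abc", 2));  // matches although there is no dot at all
        System.out.println(star("", 2));     // matches on the empty string too
    }
}
```

Neither variant actually enforces that n-1 dots are present, which is what the lookahead version is meant to address.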
There is a way to do it that doesn't require text between the dots but still requires at least n-1 dots.
It uses a lookahead assertion, with capture buffer 2 as a flag:
^(?:(?!\2)([^.]*)(?:\.|$())){2}
Everything else is the same.
So, if Hive uses Java-style regex, this should work:
regexp_extract(name, '^(?:(?!\2)([^.]*)(?:\.|$())){2}', 1)
Change {2} to whichever segment is needed (this one returns segment 2).
It still returns capture buffer 1 after the Nth iteration.
Here it is broken down:
^                 # Beginning of string
(?:               # Grouping
  (?!\2)          # Assertion: capture buffer 2 is UNDEFINED
  ([^.]*)         # Capture buffer 1: zero or more non-dot chars
  (?:             # Grouping
    \.            # Dot character
    |             # or
    $ ()          # End of string; sets capture buffer 2 to DEFINED (stops further iterations once the end of the string has been reached)
  )               # End grouping
){3}              # End grouping, repeat the group exactly 3 (or N) times (overwrites capture buffer 1 each time)
If the engine doesn't support assertions, then this won't work!
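For anyone who wants to try the guarded pattern before running it in Hive, here is a minimal Java sketch. It relies on java.util.regex treating a backreference to a group that has not yet matched as a failure, and the helper name segment is hypothetical. Only the cases printed below were checked; other edge cases, such as a trailing dot, are worth testing separately.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the lookahead-guarded pattern; not Hive's implementation.
class GuardedSegmentDemo {
    // Returns the n-th dot-separated segment (1-based), or null when nothing matches.
    static String segment(String subject, int n) {
        // (?!\2) passes only while group 2 is unset; $() sets it at end of input,
        // so no further iteration can start once the string is exhausted.
        Pattern p = Pattern.compile("^(?:(?!\\2)([^.]*)(?:\\.|$())){" + n + "}");
        Matcher m = p.matcher(subject);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(segment("abc.def.ghi", 2)); // middle segment
        System.out.println("[" + segment("a..c", 2) + "]"); // empty segment is allowed
        System.out.println(segment("abc", 2)); // no dot, so no second segment (null)
    }
}
```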