Hive regexp_extract weirdness

Views: 128 · Published: 2018/6/12 13:34:21 · Tags: regex, hive


Problem description

I am having some problems with regexp_extract:

I am querying a tab-delimited file; the column I'm checking has strings that look like this:
abc.def.ghi
Now, if I do:
select distinct regexp_extract(name, '[^.]+', 0) from dummy;
MR job runs, it works, and I get "abc" from index 0.

But now, if I want to get "def" from index 1:
select distinct regexp_extract(name, '[^.]+', 1) from dummy;
Hive fails with:
2011-12-13 23:17:08,132 Stage-1 map = 0%, reduce = 0%
2011-12-13 23:17:28,265 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201112071152_0071 with errors
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
Log file says:
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row
Am I doing something fundamentally wrong here?

Thanks, Mario
Solution
From the docs https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF it appears that regexp_extract() extracts, from each record/line, the data you want.

It seems to stop at the first match rather than matching globally, so the index refers to a capture group, not to the n-th match. Your pattern '[^.]+' contains no capture group, which is most likely why index 1 fails at runtime.

0 = the entire match
1 = capture group 1
2 = capture group 2, etc ...
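
To make the index semantics concrete, here is a minimal Java sketch. It assumes Hive's regexp_extract behaves like java.util.regex with a single find() followed by group(index). The helper regexpExtract is a hypothetical stand-in written for this illustration, not Hive code, and returning null on no match is a choice of the sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IndexSemantics {
    // Hypothetical stand-in for regexp_extract: take the first match only,
    // then return the requested group. Returning null when nothing matches
    // is a choice made for this sketch, not documented Hive behavior.
    static String regexpExtract(String subject, String regex, int index) {
        Matcher m = Pattern.compile(regex).matcher(subject);
        return m.find() ? m.group(index) : null;
    }

    public static void main(String[] args) {
        // Index 0 is the whole first match: abc
        System.out.println(regexpExtract("abc.def.ghi", "[^.]+", 0));
        // Indexes 1 and 2 are capture groups of that same match: the, bar
        System.out.println(regexpExtract("foothebar", "foo(.*?)(bar)", 1));
        System.out.println(regexpExtract("foothebar", "foo(.*?)(bar)", 2));
        // The pattern has no capture group 1, so asking for it throws.
        try {
            regexpExtract("abc.def.ghi", "[^.]+", 1);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("no such group: " + e.getMessage());
        }
    }
}
```

In plain Java the last call raises IndexOutOfBoundsException, which would be consistent with the runtime error in the question, although the Hive log excerpt does not show the root cause.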

Paraphrased from the manual:
regexp_extract('foothebar', 'foo(.*?)(bar)', 2)
                                ^    ^
                        groups  1    2
This returns 'bar'.
So, in your case, to get the text after the dot, something like this might work:
regexp_extract(name, '\.([^.]+)', 1)
or this
regexp_extract(name, '[.]([^.]+)', 1)
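
A quick way to check both patterns outside Hive is the Java sketch below, again assuming Java regex semantics; regexpExtract is a hypothetical helper for this illustration. The bracket form may be the safer one in Hive itself, since the manual warns that backslashes in string literals need care (it says '\\s' is required to match whitespace).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotAnchored {
    // Hypothetical stand-in for regexp_extract (first match, chosen group).
    static String regexpExtract(String subject, String regex, int index) {
        Matcher m = Pattern.compile(regex).matcher(subject);
        return m.find() ? m.group(index) : null;
    }

    public static void main(String[] args) {
        // "\\." in Java source is the regex \. (a literal dot).
        System.out.println(regexpExtract("abc.def.ghi", "\\.([^.]+)", 1)); // def
        // A character class avoids the backslash entirely.
        System.out.println(regexpExtract("abc.def.ghi", "[.]([^.]+)", 1)); // def
    }
}
```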

edit

I got re-interested in this. Just FYI, there could be a shortcut/workaround for you.

It looks like you want a particular segment of a string separated by the dot (.) character, which is almost like a split.
More than likely, the regex engine overwrites a capture group each time it matches inside a quantified (repeated) group, keeping only the last value.
You can take advantage of that with something like this:

Returns the first segment (abc) of abc.def.ghi:
regexp_extract(name, '^(?:([^.]+)\.?){1}', 1)

Returns the second segment (def) of abc.def.ghi:
regexp_extract(name, '^(?:([^.]+)\.?){2}', 1)

Returns the third segment (ghi) of abc.def.ghi:
regexp_extract(name, '^(?:([^.]+)\.?){3}', 1)

The index doesn't change (it still refers to capture group 1); only the regex repetition count changes.
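
The overwrite-on-repeat behavior can be observed directly in Java, which is assumed here to be the engine behind regexp_extract. The method name segment is hypothetical, and the sketch only covers the well-formed input from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroup {
    // Builds ^(?:([^.]+)\.?){n} and returns capture group 1, which holds
    // whatever the n-th (last) repetition captured. Null means no match.
    static String segment(String subject, int n) {
        Matcher m = Pattern.compile("^(?:([^.]+)\\.?){" + n + "}").matcher(subject);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 3; n++) {
            // Prints abc, then def, then ghi.
            System.out.println(n + " -> " + segment("abc.def.ghi", n));
        }
    }
}
```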

Some notes:

This regex, ^(?:([^.]+)\.?){n}, has problems though.
It requires something between the dots of each segment, or it won't match (for example, on the string ...).

It could be changed to ^(?:([^.]*)\.?){n}, but that will match even if there are fewer than n-1 dots,
including the empty string. This is probably not desirable.
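
These pitfalls can be reproduced with a small Java sketch (assuming Java regex semantics; extract is a hypothetical helper that returns group 1, or null when nothing matches). The last case is an extra observation not made above: because the dot is optional, backtracking can split a single segment in two.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SegmentPitfalls {
    // Returns capture group 1 of the first match, or null if there is none.
    static String extract(String subject, String regex) {
        Matcher m = Pattern.compile(regex).matcher(subject);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String plus2 = "^(?:([^.]+)\\.?){2}"; // segments must be non-empty
        String star2 = "^(?:([^.]*)\\.?){2}"; // segments may be empty

        // An empty segment defeats the + version: no match at all.
        System.out.println(extract("a..c", plus2));                 // null
        // The * version matches without any dot, and even on "".
        System.out.println("[" + extract("abc", star2) + "]");      // []
        System.out.println("[" + extract("", star2) + "]");         // []
        // The + version can split "abc" into "ab" + "c" by backtracking.
        System.out.println(extract("abc", plus2));                  // c
    }
}
```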

There is a way to do it that doesn't require text between the dots but still requires at least n-1 dots.
It uses a negative lookahead assertion and capture buffer 2 as a flag:

^(?:(?!\2)([^.]*)(?:\.|$())){2}

Everything else is the same. So, if Hive uses Java-style regex, this should work:
regexp_extract(name, '^(?:(?!\2)([^.]*)(?:\.|$())){2}', 1)

Change {2} to whichever segment is needed (this one returns segment 2). It still returns capture buffer 1 after the N-th iteration.

Here it is broken down:
^                # Beginning of string
(?:              # Grouping
  (?!\2)         # Assertion: capture buffer 2 is UNDEFINED
  ([^.]*)        # Capture buffer 1: optional non-dot chars, many times
  (?:            # Grouping
    \.           # Dot character
    |            # or
    $ ()         # End of string; sets capture buffer 2 to DEFINED (prevents further repetition once the end of the string is reached)
  )              # End grouping
){3}             # End grouping; repeat exactly 3 (or N) times (overwrites capture buffer 1 each time)
If the engine doesn't support assertions, this won't work!
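
As a rough check of the flag trick under Java regex semantics (an assumption about what Hive uses), the sketch below runs the pattern against a few inputs. The method name segment is hypothetical, and null stands for "no match" in this sketch only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlaggedSegments {
    // Builds ^(?:(?!\2)([^.]*)(?:\.|$())){n}. Group 2 is set only when an
    // iteration ends at end-of-string; once set, (?!\2) blocks any further
    // iteration, so the match needs at least n-1 dots.
    static String segment(String subject, int n) {
        String regex = "^(?:(?!\\2)([^.]*)(?:\\.|$())){" + n + "}";
        Matcher m = Pattern.compile(regex).matcher(subject);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(segment("abc.def.ghi", 1));        // abc
        System.out.println(segment("abc.def.ghi", 2));        // def
        System.out.println(segment("abc.def.ghi", 3));        // ghi
        // An empty segment is allowed and comes back as an empty string.
        System.out.println("[" + segment("a..c", 2) + "]");   // []
        // Too few dots: no match instead of a misleading value.
        System.out.println(segment("abc.def.ghi", 4));        // null
        System.out.println(segment("abc", 2));                // null
    }
}
```

When trying this in Hive itself, the backslashes in \2 and \. may need doubling inside the string literal; that depends on how the Hive version at hand handles escapes.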
