Hive regexp_extract weirdness

Problem description

I am having some problems with regexp_extract:

I am querying a tab-delimited file; the column I'm checking has strings that look like this:

abc.def.ghi

Now, if I do:

select distinct regexp_extract(name, '[^.]+', 0) from dummy;

MR job runs, it works, and I get "abc" from index 0.

But now, if I want to get "def" from index 1:

select distinct regexp_extract(name, '[^.]+', 1) from dummy;

Hive fails with:

2011-12-13 23:17:08,132 Stage-1 map = 0%,  reduce = 0%
2011-12-13 23:17:28,265 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201112071152_0071 with errors
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask

Log file says:

java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row

Am I doing something fundamentally wrong here?

Thanks, Mario

Solution

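From the docs https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF it appears that regexp_extract() is applied to each record/line and returns the part of the string matched by the pattern.

It seems to stop at the first match rather than matching globally. Therefore the index refers to a capture group, not to the n-th match. Your pattern '[^.]+' contains no capture groups, so index 1 asks for a group that does not exist, which is presumably why the job fails.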

0 = the entire match
1 = capture group 1
2 = capture group 2, etc ...

Paraphrased from the manual:

regexp_extract('foothebar', 'foo(.*?)(bar)', 2)
                                  ^    ^   
               groups             1    2

This returns 'bar'.
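
To see the index semantics outside of Hive, here is a minimal Java sketch. It assumes Hive hands the pattern to java.util.regex; regexpExtract below is a hypothetical stand-in written for this example, not Hive's actual implementation, and returning null when nothing matches is an arbitrary choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupIndexDemo {
    // Hypothetical stand-in for regexp_extract: uses the first match only,
    // and the index selects a capture group (0 = the entire match).
    static String regexpExtract(String subject, String regex, int index) {
        Matcher m = Pattern.compile(regex).matcher(subject);
        return m.find() ? m.group(index) : null;
    }

    public static void main(String[] args) {
        System.out.println(regexpExtract("abc.def.ghi", "[^.]+", 0));       // abc
        System.out.println(regexpExtract("foothebar", "foo(.*?)(bar)", 1)); // the
        System.out.println(regexpExtract("foothebar", "foo(.*?)(bar)", 2)); // bar
        try {
            // The pattern has no capture groups, so group 1 does not exist.
            regexpExtract("abc.def.ghi", "[^.]+", 1);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("index 1 rejected: " + e.getMessage());
        }
    }
}
```

The last call mirrors the failing query: the pattern is fine, but there is no group 1 to return.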

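So, in your case, to get the text after the first dot, something like this might work:
regexp_extract(name, '\.([^.]+)', 1)
or this:
regexp_extract(name, '[.]([^.]+)', 1)
The character-class form may be the safer of the two: the manual notes that backslash escapes inside the pattern literal need care (it gives '\\s' rather than '\s' for whitespace), so '\.' may have to be written as '\\.'.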
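
A quick way to check these two patterns is to run them through java.util.regex directly. This is only a sketch under the assumption that Hive uses the same engine; regexpExtract is a hypothetical helper for this example. The third call shows what happens if the backslash is lost before the pattern reaches the regex engine, so that the dot becomes "any character".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AfterDotDemo {
    // Hypothetical helper: first match only, returns the requested group.
    static String regexpExtract(String subject, String regex, int index) {
        Matcher m = Pattern.compile(regex).matcher(subject);
        return m.find() ? m.group(index) : null;
    }

    public static void main(String[] args) {
        String name = "abc.def.ghi";
        // "\\." in Java source is the regex \. (a literal dot).
        System.out.println(regexpExtract(name, "\\.([^.]+)", 1)); // def
        System.out.println(regexpExtract(name, "[.]([^.]+)", 1)); // def
        // Unescaped dot: matches 'a', then the group captures "bc".
        System.out.println(regexpExtract(name, ".([^.]+)", 1));   // bc
    }
}
```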

Edit

I got interested in this again. Just an FYI: there could be a shortcut/workaround for you.

It looks like you want a particular segment of a string separated by the dot (.) character, which is almost like split.
It's more than likely that the regex engine overwrites a capture group each time the group is repeated by a quantifier, so after the match the group holds the text from its last iteration.
You can take advantage of that with something like this:

Returns the first segment (abc from abc.def.ghi):
regexp_extract(name, '^(?:([^.]+)\.?){1}', 1)

Returns the second segment (def from abc.def.ghi):
regexp_extract(name, '^(?:([^.]+)\.?){2}', 1)

Returns the third segment (ghi from abc.def.ghi):
regexp_extract(name, '^(?:([^.]+)\.?){3}', 1)

The index doesn't change (because the index still refers to capture group 1); only the repetition count in the regex changes.
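
The repeated-group trick can be tried in plain Java. This is a sketch that assumes the engine behind regexp_extract behaves like java.util.regex; the segment helper and its name are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SegmentDemo {
    // Builds ^(?:([^.]+)\.?){n} and returns capture group 1, which holds
    // whatever the group captured on its last (n-th) iteration.
    static String segment(String s, int n) {
        Pattern p = Pattern.compile("^(?:([^.]+)\\.?){" + n + "}");
        Matcher m = p.matcher(s);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String name = "abc.def.ghi";
        System.out.println(segment(name, 1)); // abc
        System.out.println(segment(name, 2)); // def
        System.out.println(segment(name, 3)); // ghi
    }
}
```

Only the repetition count varies between the calls; the group index passed to group() stays at 1.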

Some notes:

  • This regex ^(?:([^.]+)\.?){n} has problems, though.
    It requires at least one character between the dots, so a string with an empty segment can fail to match.

  • It could be this ^(?:([^.]*)\.?){n} instead, but that will match even if there are fewer than n-1 dots,
    including the empty string. This is probably not desirable.
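
Both caveats can be reproduced with a small Java sketch, again assuming java.util.regex semantics; plus and star are hypothetical helper names for the two variants. The last call shows a related surprise with the + variant: because the dot is optional, backtracking can split one segment in two when more segments are requested than exist.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SegmentPitfalls {
    static String extract(String regex, String s) {
        Matcher m = Pattern.compile(regex).matcher(s);
        return m.find() ? m.group(1) : null; // null = no match
    }

    // Variant that needs at least one character per segment.
    static String plus(String s, int n) {
        return extract("^(?:([^.]+)\\.?){" + n + "}", s);
    }

    // Variant that allows empty segments.
    static String star(String s, int n) {
        return extract("^(?:([^.]*)\\.?){" + n + "}", s);
    }

    public static void main(String[] args) {
        System.out.println(plus("a..c", 2));        // null: empty segment, no match
        System.out.println(star("abc", 3).isEmpty()); // true: matched with no dots at all
        System.out.println(star("", 2).isEmpty());    // true: even the empty string matches
        System.out.println(plus("abc.def.ghi", 4)); // i: "ghi" was split into "gh" + "i"
    }
}
```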

There is a way to do it where it doesn't require text between the dots, but still requires at least n-1 dots.
This uses a lookahead assertion and capture buffer 2 as a flag.

^(?:(?!\2)([^.]*)(?:\.|$())){2} , everything else is the same.

So, if it uses Java-style regex, then this should work:
regexp_extract(name, '^(?:(?!\2)([^.]*)(?:\.|$())){2}', 1)
Change {2} to whichever segment is needed (this one gets segment 2).

It still returns capture buffer 1 as captured on the N-th iteration.

Here it is broken down:

^                # Beginning of string
 (?:             # Grouping
    (?!\2)            # Assertion: Capture buffer 2 is UNDEFINED
    ( [^.]*)          # Capture buffer 1, optional non-dot chars, many times
    (?:               # Grouping
        \.                # Dot character
      |                 # or,
        $ ()              # End of string; the empty group marks capture buffer 2 as DEFINED (stops further iterations once the end of the string is reached)
    )                 # End grouping
 ){3}            # End grouping, repeat group exactly 3 (or N) times (overwrites capture buffer 1 each time)

If the engine doesn't support lookahead assertions, then this won't work!
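
Here is a hedged Java sketch of the flag-based pattern, assuming java.util.regex (which accepts the forward reference \2 and treats a reference to an unset group as a failed match). The segment helper is hypothetical. Edge cases such as empty input or a trailing dot are not covered here and are worth testing separately.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlaggedSegmentDemo {
    // Builds ^(?:(?!\2)([^.]*)(?:\.|$())){n}. Group 2 is an empty group that
    // only gets set once the end of the string is reached; (?!\2) then blocks
    // any further iteration, so at least n-1 dots are required.
    static String segment(String s, int n) {
        Pattern p = Pattern.compile("^(?:(?!\\2)([^.]*)(?:\\.|$())){" + n + "}");
        Matcher m = p.matcher(s);
        return m.find() ? m.group(1) : null; // null = no match
    }

    public static void main(String[] args) {
        System.out.println(segment("abc.def.ghi", 2));    // def
        System.out.println(segment("abc.def.ghi", 3));    // ghi
        System.out.println(segment("a..c", 2).isEmpty()); // true: empty segment is allowed
        System.out.println(segment("abc", 2));            // null: not enough dots
        System.out.println(segment("abc.def.ghi", 4));    // null: only three segments
    }
}
```

Compared with the simpler variants, this one accepts empty segments between dots while still refusing to match when the string has too few dots for the requested segment.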
