REGEXP_REPLACE capturing groups

Problem description

I was wondering if someone could help me understand how to use Hive's regexp_replace function to capture groups in the regex and use those groups in the replacement string.

I have an example problem I'm working through below that involves date-munging. In this example, my goal is to take a string date that is not compatible with SimpleDateFormat parsing and make a small adjustment to get it to be compatible. The date string (shown below) needs "GMT" prepended to the offset sign (+/-) in the string.

So, given the input:

  '2015-01-01 02:03:04 +0:00' 
  -or-
  '2015-01-01 02:03:04 -1:00' 

I would like the output:

  '2015-01-01 02:03:04 GMT+0:00'
  -or-
  '2015-01-01 02:03:04 GMT-1:00'

Here is a simple example of a statement that I 'thought' would work, but I get strange output.

Hive query:

select regexp_replace('2015-01-01 02:03:04 +0:00', ' ([+-])', ' GMT\1');

Actual result:

2015-01-01 02:03:04 GMT10:00

Note that the "\1" should output the matched group, but instead replaces the matched group with the number "1".

Can someone help me understand the right way to reference/output matched groups in the replacement string?

Thanks!

Recommended answer

Hive's notation for regex backreferences in the replacement string (at least in 0.14, and I think I recall it being this way in 0.13.x as well) seems to be $1 for capture group 1, $2 for capture group 2, etc. It looks like it is based upon (and may even be implemented by) the replaceAll method of Java's java.util.regex.Matcher class. This is the germane portion of that documentation:

Dollar signs may be treated as references to captured subsequences as described above, and backslashes are used to escape literal characters in the replacement string.
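
To see that rule in isolation, here is a minimal Java sketch that calls Matcher.replaceAll directly with the regex and replacement strings from the question. It assumes, as suggested above but not confirmed, that Hive hands the replacement string to java.util.regex unchanged. The class name BackrefDemo and the helper replace are made up for illustration.

```java
import java.util.regex.Pattern;

public class BackrefDemo {
    // Hypothetical helper: a stand-in for regexp_replace(input, regex, replacement),
    // assuming the replacement is interpreted by Matcher.replaceAll.
    static String replace(String input, String regex, String replacement) {
        return Pattern.compile(regex).matcher(input).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String input = "2015-01-01 02:03:04 +0:00";
        String regex = " ([+-])";

        // Replacement text is ` GMT\1`: the backslash escapes the next character,
        // so a literal "1" is emitted and the captured sign is dropped.
        System.out.println(replace(input, regex, " GMT\\1")); // 2015-01-01 02:03:04 GMT10:00

        // Replacement text is ` GMT$1`: $1 refers to capture group 1 (the sign).
        System.out.println(replace(input, regex, " GMT$1"));  // 2015-01-01 02:03:04 GMT+0:00

        // Replacement text is ` GMT\$1`: an escaped dollar sign is a literal "$".
        System.out.println(replace(input, regex, " GMT\\$1")); // 2015-01-01 02:03:04 GMT$10:00
    }
}
```

The first call reproduces the "GMT10:00" output from the question, which is consistent with the backslash being treated as an escape character rather than a backreference.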

So I think what you want is:

select regexp_replace('2015-01-01 02:03:04 +0:00', ' ([+-])', ' GMT$1');

For example:

hive> select regexp_replace('2015-01-01 02:03:04 +0:00', ' ([+-])', ' GMT$1');
OK
2015-01-01 02:03:04 GMT+0:00
Time taken: 0.072 seconds, Fetched: 1 row(s) 
hive> select regexp_replace('2015-01-01 02:03:04 -1:00', ' ([+-])', ' GMT$1');
OK
2015-01-01 02:03:04 GMT-1:00
Time taken: 0.144 seconds, Fetched: 1 row(s)
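
Since the stated goal was to make the string parseable by SimpleDateFormat, the following sketch checks the rewritten values in plain Java. The pattern "yyyy-MM-dd HH:mm:ss z" is an assumption (the question does not say which pattern is used), and the class name OffsetParseDemo and helper parseMillis are hypothetical.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;

public class OffsetParseDemo {
    // Hypothetical helper: parses a timestamp with a "GMT+h:mm" style zone and
    // returns epoch milliseconds. The pattern is an assumption, not from the question.
    static long parseMillis(String text) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss z", Locale.ROOT);
        try {
            return fmt.parse(text).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable date: " + text, e);
        }
    }

    public static void main(String[] args) {
        // 02:03:04 at GMT+0:00 is 02:03:04 UTC.
        System.out.println(parseMillis("2015-01-01 02:03:04 GMT+0:00")); // 1420077784000
        // 02:03:04 at GMT-1:00 is 03:03:04 UTC, one hour later.
        System.out.println(parseMillis("2015-01-01 02:03:04 GMT-1:00")); // 1420081384000
    }
}
```

Comparing epoch milliseconds rather than formatted strings keeps the check independent of the JVM's default time zone.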
