Java or Pig regex to strip out values from UserAgent string
Question
I need to strip out the third and subsequent values in the 'bracketed' component of the user agent string.
To get
Mozilla/4.0 (compatible; MSIE 8.0)
from
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; GTB6; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.5.30729; WinTSI 06.12.2009; .NET CLR 3.0.30729; .NET4.0C)
I can do this successfully with the following sed command:
sed 's/(\([^;]\+; [^;]\+\)[^)]*)/(\1)/'
I need to get the same result in Apache Pig with a Java regex. Could anybody help me to re-write the above sed regular expression into Java?
Something like:
new = FOREACH userAgent GENERATE FLATTEN(EXTRACT(userAgent, 'JAVA REGEX?')) as (term:chararray);
Answer
I don't use Pig, but a look through the docs reveals a REPLACE function which wraps Java's replaceAll() method. Try this:
REPLACE(userAgent, '\(([^;]+; [^;]+)[^)]*\)', '($1)')
That should match the whole parenthesized portion of the UserAgent string and replace its contents with just the first two semicolon-separated terms, just like your sed command does.
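To check the pattern outside Pig, here is a minimal plain-Java sketch of the same replacement. The class name UserAgentTrim and the method trim are made up for illustration; only the regex and the '($1)' replacement come from the answer above. Note that the backslashes are doubled here because this is a Java string literal. A Pig script may need the same doubling inside its quoted regex, but that is an assumption to verify against your Pig version.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, used only to try the regex outside Pig.
class UserAgentTrim {
    // Same pattern as the REPLACE call: an opening parenthesis, two
    // semicolon-separated terms (captured as group 1), then everything
    // up to the closing parenthesis.
    private static final Pattern BRACKETED =
            Pattern.compile("\\(([^;]+; [^;]+)[^)]*\\)");

    // Keeps only the first two terms of the parenthesized component.
    static String trim(String userAgent) {
        return BRACKETED.matcher(userAgent).replaceAll("($1)");
    }

    public static void main(String[] args) {
        String ua = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; "
                + "Trident/4.0; GTB6; SLCC1; .NET CLR 2.0.50727)";
        System.out.println(trim(ua)); // Mozilla/4.0 (compatible; MSIE 8.0)
    }
}
```

A string whose parenthesized part has fewer than two semicolon-separated terms does not match the pattern and is returned unchanged.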