Java or Pig regex to strip out values from UserAgent string
Question
I need to strip out the third and subsequent values in the 'bracketed' component of the user agent string.
To get
Mozilla/4.0 (compatible; MSIE 8.0)
from
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; GTB6; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.5.30729; WinTSI 06.12.2009; .NET CLR 3.0.30729; .NET4.0C)
I can do this successfully with the following sed command:
sed 's/(\([^;]\+; [^;]\+\)[^)]*)/(\1)/'
I need to get the same result in Apache Pig with a Java regex. Could anybody help me to re-write the above sed regular expression into Java?
Something like:
new = FOREACH userAgent GENERATE FLATTEN(EXTRACT(userAgent, 'JAVA REGEX?')) AS (term:chararray);
Recommended answer
I don't use Pig, but a look through the docs reveals a REPLACE function which wraps Java's replaceAll() method. Try this:
REPLACE(userAgent, '\(([^;]+; [^;]+)[^)]*\)', '($1)')
That should match the whole parenthesized portion of the UserAgent string and replace its contents with just the first two semicolon-separated terms, just like your sed command does.
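To check the pattern outside Pig, here is a minimal plain-Java sketch of the same replacement. The class name `UserAgentTrimmer` and the method name `trim` are made up for illustration. Note that the backslashes are doubled because this is a Java string literal; Pig string literals may need the same doubling, so check the REPLACE documentation if the single-backslash form fails to match.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Pig or of the original answer.
class UserAgentTrimmer {
    // Same pattern as the REPLACE call above. Group 1 captures the first two
    // "; "-separated terms; [^)]* consumes everything else up to the ")".
    private static final Pattern BRACKETED =
            Pattern.compile("\\(([^;]+; [^;]+)[^)]*\\)");

    // Returns the user agent with the bracketed part cut down to two terms.
    // Strings without a matching bracketed part are returned unchanged.
    static String trim(String userAgent) {
        return BRACKETED.matcher(userAgent).replaceAll("($1)");
    }

    public static void main(String[] args) {
        String ua = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; "
                + "Trident/4.0; GTB6; SLCC1; .NET CLR 2.0.50727; "
                + "Media Center PC 5.0; .NET CLR 3.5.30729; "
                + "WinTSI 06.12.2009; .NET CLR 3.0.30729; .NET4.0C)";
        System.out.println(trim(ua));
    }
}
```

The only syntax changes from the sed version are that Java treats unescaped parentheses as a capturing group and `+` as a quantifier (so sed's `\(`, `\)` and `\+` lose their backslashes, while the literal parentheses gain them), and that the backreference is written `$1` instead of `\1`.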