Java or Pig regex to strip out values from UserAgent string


Question

I need to strip out the third and subsequent values in the parenthesized ("bracketed") component of the user agent string.

That is, to get

Mozilla/4.0 (compatible; MSIE 8.0)

from

Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; GTB6; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.5.30729; WinTSI 06.12.2009; .NET CLR 3.0.30729; .NET4.0C)

I can do this successfully with the following sed command:

 sed 's/(\([^;]\+; [^;]\+\)[^)]*)/(\1)/'

I need to get the same result in Apache Pig with a Java regex. Could anybody help me rewrite the above sed regular expression as a Java one?

Something like:

new = FOREACH userAgent GENERATE FLATTEN(EXTRACT(userAgent, 'JAVA REGEX?')) as (term:chararray);

Recommended answer

I don't use Pig, but a look through the docs reveals a REPLACE function which wraps Java's replaceAll() method. Try this:

REPLACE(userAgent, '\(([^;]+; [^;]+)[^)]*\)', '($1)')

That should match the whole parenthesized portion of the UserAgent string and replace its contents with just the first two semicolon-separated terms, just like your sed command does.
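To check the pattern outside Pig, here is a minimal Java sketch that applies the same regex through the standard java.util.regex API. The class name UserAgentTrimmer and the method keepFirstTwo are hypothetical names chosen for illustration; they are not part of Pig or of the original answer.

```java
import java.util.regex.Pattern;

// Hypothetical helper for trying the regex in plain Java.
final class UserAgentTrimmer {
    // Same pattern as in the REPLACE call above. The backslashes are doubled
    // only because this is a Java string literal.
    private static final Pattern BRACKETED =
            Pattern.compile("\\(([^;]+; [^;]+)[^)]*\\)");

    // Keeps the first two semicolon-separated terms inside the parentheses
    // and drops everything after them, up to the closing parenthesis.
    static String keepFirstTwo(String userAgent) {
        return BRACKETED.matcher(userAgent).replaceAll("($1)");
    }

    public static void main(String[] args) {
        String ua = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; GTB6)";
        System.out.println(keepFirstTwo(ua));
        // prints: Mozilla/4.0 (compatible; MSIE 8.0)
    }
}
```

A string whose parentheses already hold only two terms comes back unchanged, and a string with no parenthesized part is returned as-is. In a Pig script, the backslashes inside the quoted pattern may themselves need to be doubled (for example '\\(' rather than '\('); this is based on how the Pig documentation describes escaping special characters for REPLACE, so it is worth testing on a small sample first.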
