Java regex performance


Question

I'm trying to parse links with a regex in Java.

But I think it's getting too slow. For example, to extract all links from:

  • http://news.google.com.ar/nwshp?hl=es&tab=wn

...it takes 34642 milliseconds (34 seconds!!!)

This is the regex:

private final String regexp = "<a.*?\\shref\\s*=\\s*([\\\"\\']*)(.*?)([\\\"\\'\\s].*?>|>)";

The flags for the pattern:

private static final int flags = Pattern.CASE_INSENSITIVE | Pattern.DOTALL | Pattern.MULTILINE | Pattern.UNICODE_CASE | Pattern.CANON_EQ;

And the code looks something like this:

private void processURL(URL url){
    URLConnection connection;
    Pattern pattern = Pattern.compile(regexp, flags);
    try {
        connection = url.openConnection();
        InputStream in = connection.getInputStream();
        BufferedReader bf = new BufferedReader(new InputStreamReader(in));
        String html = new String();
        String line = bf.readLine();            
        while(line!=null){
            html += line;
            line = bf.readLine();
        }
        bf.close();
        Matcher matcher = pattern.matcher(html);
        while (matcher.find()) {
            System.out.println(matcher.group(2));
        }
    } catch (Exception e){
    }
}

Can anyone give me a hint?

Extra data:

  • 1 Mbit
  • Core 2 Duo
  • 1 GB RAM
  • Single-threaded

Recommended answer

Hint: Don't use regexes for link extraction or other HTML "parsing" tasks!

Your regex has 6 (SIX) repeating groups in it. Executing it will entail a lot of backtracking. In the worst case, it could even approach O(N^6), where N is the number of input characters. You could ease this a bit by replacing eager matching with lazy matching, but it is almost impossible to avoid pathological cases, e.g. when the input data is sufficiently malformed that the regex does not match.

A far, far better solution is to use an existing strict or permissive HTML parser. Even writing an ad-hoc parser by hand is going to be better than using gnarly regexes.
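
To illustrate the "ad-hoc parser by hand" option, here is a minimal sketch of a single-pass scanner. The class name `LinkScanner` and its methods are hypothetical, not something from the question or a library. It assumes that no attribute value contains a `>` and that the text `href` does not appear inside another attribute's value; it also does not skip comments or scripts. The point is only that the scan position never moves backwards, so there is no backtracking.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not from the original answer): single-pass href scanner. */
final class LinkScanner {

    /** Returns the href value of every <a ...> start tag, in document order. */
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        int pos = 0;
        while (true) {
            int lt = html.indexOf('<', pos);
            if (lt < 0) break;
            int gt = html.indexOf('>', lt + 1);   // assumption: no '>' inside attribute values
            if (gt < 0) break;
            // An anchor start tag is '<', then 'a' or 'A', then whitespace.
            if (lt + 2 < gt
                    && (html.charAt(lt + 1) == 'a' || html.charAt(lt + 1) == 'A')
                    && Character.isWhitespace(html.charAt(lt + 2))) {
                String href = findHref(html, lt + 3, gt);
                if (href != null) links.add(href);
            }
            pos = gt + 1;                          // the scan only ever moves forward
        }
        return links;
    }

    /** Looks for a whitespace-preceded href attribute in s[from, end); null if absent. */
    private static String findHref(String s, int from, int end) {
        for (int i = from; i + 4 <= end; i++) {
            if (!Character.isWhitespace(s.charAt(i - 1))) continue;
            if (!s.regionMatches(true, i, "href", 0, 4)) continue;  // case-insensitive
            int p = skipWhitespace(s, i + 4, end);
            if (p >= end || s.charAt(p) != '=') continue;
            p = skipWhitespace(s, p + 1, end);
            if (p >= end) return "";               // "href=" with nothing after it
            char q = s.charAt(p);
            if (q == '"' || q == '\'') {
                // Quoted value: read up to the matching quote, or to the end of the tag.
                int close = p + 1;
                while (close < end && s.charAt(close) != q) close++;
                return s.substring(p + 1, close);
            }
            // Unquoted value: read up to the next whitespace, or to the end of the tag.
            int stop = p;
            while (stop < end && !Character.isWhitespace(s.charAt(stop))) stop++;
            return s.substring(p, stop);
        }
        return null;
    }

    private static int skipWhitespace(String s, int from, int end) {
        int p = from;
        while (p < end && Character.isWhitespace(s.charAt(p))) p++;
        return p;
    }
}
```

Calling `LinkScanner.extractLinks(html)` on the downloaded page would replace the `Pattern`/`Matcher` loop in the question. A real HTML parser is still the more robust choice for messy markup.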

This page lists various HTML parsers for Java. I've heard good things about TagSoup and HtmlCleaner.
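
As a rough sketch of what the parser-based approach looks like, the example below uses the permissive HTML parser that ships with the JDK (`javax.swing.text.html.parser.ParserDelegator`) instead of TagSoup or HtmlCleaner, purely so that it runs without extra dependencies. The class name `ParserLinkExtractor` is hypothetical. The JDK parser targets an old HTML dialect, so treat this as an illustration of the callback style rather than a recommendation over the libraries named above.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/** Hypothetical helper: collects href values using the JDK's built-in HTML parser. */
final class ParserLinkExtractor {

    /** Parses HTML from a reader and returns the href of every <a> start tag. */
    static List<String> extractLinks(Reader html) throws IOException {
        List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) links.add(href.toString());
                }
            }
        };
        // The third argument tells the parser to ignore charset declarations in the page.
        new ParserDelegator().parse(html, callback, true);
        return links;
    }

    /** Convenience overload for HTML that is already in memory. */
    static List<String> extractLinks(String html) {
        try {
            return extractLinks(new StringReader(html));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In the question's `processURL`, the `InputStreamReader` could be passed straight to `extractLinks(Reader)`, which also avoids building the whole page as one string first.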
