Efficient Regular Expression for big data, if a String contains a word


Problem description

I have code that works but is extremely slow. It determines whether a string contains a keyword. It needs to be efficient for the hundreds of keywords that I will search for in thousands of documents.

What can I do to find the keywords efficiently, without falsely returning a word that merely contains the keyword?

For example:

String keyword = "ac";
String document = "...";  // a file a few pages long

If I use:

if(document.contains(keyword) ){
//do something
}

it will also return true if the document contains a word like "account",

so I tried to use a regular expression as follows:

String pattern = "(.*)([^A-Za-z]"+ keyword +"[^A-Za-z])(.*)";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(document);
if(m.find()){
   //do something
}
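
Two things stand out in the pattern above: the leading and trailing (.*) groups are unnecessary when find() is used and give the regex engine a lot of extra work on long documents, and the [^A-Za-z] groups require a real character on both sides, so a keyword at the very start or end of the document is missed. A minimal sketch of a tighter pattern follows, assuming that "word" means a run of ASCII letters as in the original attempt. The class and method names are hypothetical, and this is not necessarily the pattern proposed in the answers summarized below.

```java
import java.util.regex.Pattern;

class WholeWordRegexSketch {
    // Builds a whole-word pattern for one keyword (hypothetical helper).
    // Lookarounds assert that no ASCII letter touches the keyword, and they
    // also succeed at the start and end of the input.
    // Pattern.quote() escapes any regex metacharacters in the keyword.
    static Pattern wholeWord(String keyword) {
        return Pattern.compile("(?<![A-Za-z])" + Pattern.quote(keyword) + "(?![A-Za-z])");
    }

    static boolean containsWord(Pattern pattern, String document) {
        return pattern.matcher(document).find();
    }

    public static void main(String[] args) {
        Pattern p = wholeWord("ac");
        System.out.println(containsWord(p, "open an account"));        // false
        System.out.println(containsWord(p, "turn on the ac, please")); // true
        System.out.println(containsWord(p, "ac"));                     // true
    }
}
```

Compiling each keyword's pattern once and reusing it across all documents avoids repeating the compilation cost for every document.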

Summary:

This is the summary; hopefully it will be useful to someone else:


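  1. My regular expression works, but it is extremely impractical
    when processing big data. (It did not terminate.)

  2. @anubhava perfected the regular expression.
    It is easy to understand and implement. It managed to terminate, which is a big deal, but it is still a bit slow. (About 240 seconds.)

  3. @Tomalak's solution is somewhat more complex to implement and understand, but
    it is the fastest solution. (18 seconds.)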

So @Tomalak's solution was ~15 times faster than @anubhava's.

Recommended answer

The fastest possible way to find substrings in Java is to use String.indexOf().

To achieve "entire-word-only" matches, you need to add a little logic that checks the characters before and after a possible match to make sure they are non-word characters (or that the match sits at the very start or end of the input):

public class IndexOfWordSample {
    public static void main(String[] args) {
        String input = "There are longer strings than this not very long one.";
        String search = "long";
        int index = indexOfWord(input, search);

        if (index > -1) {
            System.out.println("Hit for \"" + search + "\" at position " + index + ".");
        } else {
            System.out.println("No hit for \"" + search + "\".");
        }
    }

    public static int indexOfWord(String input, String word) {
        String nonWord = "^\\W?$", before, after;               
        int index, before_i, after_i = 0;

        while (true) {
            index = input.indexOf(word, after_i);
            if (index == -1 || word.isEmpty()) break;

            before_i = index - 1;
            after_i = index + word.length();
            before = "" + (before_i > -1 ? input.charAt(before_i) : "");            
            after = "" + (after_i < input.length() ? input.charAt(after_i) : "");

            if (before.matches(nonWord) && after.matches(nonWord)) {
                return index;
            }
        }
        return -1;
    }
}

This prints:

Hit for "long" at position 44.

This should perform better than a pure regular-expression approach.

Consider whether ^\W?$ already matches your expectation of a "non-word" character. (By default, \W in Java excludes ASCII letters, digits, and the underscore, so all of those count as word characters.) The regular expression is a compromise here and may cost performance if your input string contains many "almost"-matches.

For extra speed, ditch the regex and work with the Character class, checking a combination of the many properties it provides (like isAlphabetic, etc.) for before and after.

I've created a Gist with an alternative implementation that does that.
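
The Gist itself is not reproduced here. As a rough sketch of the same idea, the version below replaces the regex check with Character.isLetterOrDigit(). That choice is an assumption about what should count as a word character; unlike \W, it treats the underscore as a non-word character. The class name and the isWordChar helper are hypothetical.

```java
class IndexOfWordCharSketch {
    // Assumption: a "word character" is any Unicode letter or digit.
    // Swap in other Character checks (isAlphabetic, etc.) as needed.
    static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c);
    }

    // Returns the index of the first whole-word occurrence of word, or -1.
    static int indexOfWord(String input, String word) {
        if (word.isEmpty()) return -1;
        int from = 0;
        while (true) {
            int index = input.indexOf(word, from);
            if (index == -1) return -1;
            int end = index + word.length();
            // A boundary is the edge of the input or a non-word character.
            boolean startOk = index == 0 || !isWordChar(input.charAt(index - 1));
            boolean endOk = end == input.length() || !isWordChar(input.charAt(end));
            if (startOk && endOk) return index;
            from = index + 1; // keep scanning after a rejected candidate
        }
    }

    public static void main(String[] args) {
        String input = "There are longer strings than this not very long one.";
        System.out.println(indexOfWord(input, "long"));   // 44
        System.out.println(indexOfWord(input, "string")); // -1
    }
}
```

Because no regex is evaluated per candidate, inputs with many "almost"-matches cost only two character lookups each.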
