Efficient Regular Expression for big data, if a String contains a word

Views: 113  Published: 2020/10/8 21:58:50  Tags: java regex string contains

Problem description

I have code that works but is extremely slow. It determines whether a string contains a keyword. It needs to be efficient for hundreds of keywords that I will search for in thousands of documents.

What can I do to find the keywords efficiently, without falsely matching a longer word that merely contains the keyword?

For example:

String keyword = "ac";
String document = "...";  // a file a few pages long

If I use:

if(document.contains(keyword) ){
//do something
}

it will also return true if the document contains a word like "account",

so I tried to use a regular expression as follows:

String pattern = "(.*)([^A-Za-z]"+ keyword +"[^A-Za-z])(.*)";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(document);
if(m.find()){
   //do something
}

Summary (hopefully it will be useful to someone else):

My regular expression works, but it is very impractical for big data (it did not terminate).

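@anubhava refined the regular expression. It is easy to understand and implement. It managed to terminate, which is a big deal, but it is still a bit slow (about 240 seconds).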

@Tomalak's solution is somewhat more complex to implement and understand, but it was the fastest solution (18 seconds).

So @Tomalak's solution was ~15 times faster than @anubhava's.
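The refined pattern from @anubhava's answer is not quoted in this summary. As an assumption about its general shape only, a whole-word regex in Java is often written with \b word boundaries and Pattern.quote, and without leading or trailing (.*) groups, since find() already scans the input. The class and method names below are hypothetical:

```java
import java.util.regex.Pattern;

public class WordBoundaryRegexSketch {
    // Hypothetical helper: builds a whole-word pattern for one keyword.
    // Compile it once per keyword and reuse it across documents.
    public static Pattern wholeWord(String keyword) {
        // Pattern.quote treats the keyword as a literal, so characters
        // such as '.' or '+' in it are not read as regex syntax.
        return Pattern.compile("\\b" + Pattern.quote(keyword) + "\\b");
    }

    public static void main(String[] args) {
        Pattern p = wholeWord("ac");
        System.out.println(p.matcher("open an account today").find()); // false
        System.out.println(p.matcher("switch the ac on").find());      // true
    }
}
```

Unlike [^A-Za-z] on both sides, \b also matches at the very start and end of the document, so a keyword in the first or last position is still found.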

Recommended answer

The fastest possible way to find substrings in Java is to use String.indexOf().

To achieve "entire-word-only" matches, you need to add a little bit of logic that checks the characters before and after a possible match to make sure they are non-word characters:

public class IndexOfWordSample {
    public static void main(String[] args) {
        String input = "There are longer strings than this not very long one.";
        String search = "long";
        int index = indexOfWord(input, search);

        if (index > -1) {
            System.out.println("Hit for \"" + search + "\" at position " + index + ".");
        } else {
            System.out.println("No hit for \"" + search + "\".");
        }
    }

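    public static int indexOfWord(String input, String word) {
        // Matches an empty string (start/end of input) or a single non-word character.
        String nonWord = "^\\W?$", before, after;
        int index, before_i, after_i = 0;

        // An empty search word can never be a whole-word hit.
        if (word.isEmpty()) return -1;

        while (true) {
            // Find the next plain substring hit, starting after the previous one.
            index = input.indexOf(word, after_i);
            if (index == -1) break;

            before_i = index - 1;
            after_i = index + word.length();
            // Neighbouring characters as strings; empty at the input boundaries.
            before = "" + (before_i > -1 ? input.charAt(before_i) : "");
            after = "" + (after_i < input.length() ? input.charAt(after_i) : "");

            // Accept the hit only if both neighbours are non-word characters (or absent).
            if (before.matches(nonWord) && after.matches(nonWord)) {
                return index;
            }
        }
        return -1;
    }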
}

This prints:

Hit for "long" at position 44.

This should perform better than a pure regular expression approach.

Consider whether ^\W?$ really matches your expectation of a "non-word" character. The regular expression is a compromise here and may cost performance if your input string contains many "almost"-matches.

For extra speed, ditch the regex and work with the Character class, checking a combination of the many properties it provides (like isAlphabetic, etc.) for before and after.

I've created a Gist with an alternative implementation that does that.
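The Gist itself is not reproduced here. The following is only a minimal sketch of the idea, not the author's implementation. It assumes a "word character" is a letter, a digit or an underscore, tested through a hypothetical isWordChar helper built on Character.isLetterOrDigit; adjust that helper to your own definition of a word:

```java
public class IndexOfWordCharSketch {
    // Assumption: letters, digits and '_' count as word characters.
    // Unlike the default \W, this also treats non-ASCII letters as word characters.
    public static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_';
    }

    // Returns the index of the first whole-word occurrence of word, or -1.
    public static int indexOfWord(String input, String word) {
        if (word.isEmpty()) return -1;
        int from = 0;
        while (true) {
            int index = input.indexOf(word, from);
            if (index == -1) return -1;
            int end = index + word.length();
            // A hit counts only if it is not glued to a word character on either side.
            boolean startOk = index == 0 || !isWordChar(input.charAt(index - 1));
            boolean endOk = end == input.length() || !isWordChar(input.charAt(end));
            if (startOk && endOk) return index;
            from = index + 1; // step one character forward and search again
        }
    }

    public static void main(String[] args) {
        String input = "There are longer strings than this not very long one.";
        System.out.println(indexOfWord(input, "long")); // prints 44
    }
}
```

No regex is compiled or matched per candidate hit here; each check is two charAt calls, which is where the extra speed would come from when the input contains many "almost"-matches.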
