Efficient Regular Expression for big data, if a String contains a word
Problem description
I have code that works but is extremely slow. It determines whether a string contains a keyword. It needs to be efficient for hundreds of keywords that I will search for in thousands of documents.
How can I find the keywords efficiently, without falsely matching a longer word that merely contains the keyword?
For example:
String keyword = "ac";
String document = "..."; // a file a few pages long
If I use:
if(document.contains(keyword) ){
//do something
}
it will also return true if the document contains a word like "account", so I tried a regular expression as follows:
String pattern = "(.*)([^A-Za-z]"+ keyword +"[^A-Za-z])(.*)";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(document);
if(m.find()){
//do something
}
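Summary:
This is the summary; hopefully it will be useful to someone else:
- My regular expression works, but it is impractical for big data (it did not terminate).
- @anubhava refined the regular expression. It is easy to understand and implement, and it managed to terminate, which is a big deal, but it is still a bit slow (about 240 seconds).
- @Tomalak's solution is somewhat more complex to implement and understand, but it is the fastest (18 seconds).
So @Tomalak's solution was ~15 times faster than @anubhava's.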
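The refined regex from @anubhava is not reproduced in this summary. As a rough sketch only, and not necessarily what was posted, one way to write a whole-word regex without the leading and trailing (.*) groups is to use lookarounds and let find() do the scanning. The class and method names here (WholeWordRegex, wholeWord, containsWord) are hypothetical, and treating only ASCII letters as word characters is an assumption carried over from the [^A-Za-z] classes in the question.

```java
import java.util.regex.Pattern;

class WholeWordRegex {

    // Compile once per keyword and reuse the Pattern for every document.
    static Pattern wholeWord(String keyword) {
        // Pattern.quote escapes any regex metacharacters in the keyword.
        // The lookarounds assert that no ASCII letter touches the keyword
        // (assumption: same notion of "word" as [^A-Za-z] in the question).
        return Pattern.compile("(?<![A-Za-z])" + Pattern.quote(keyword) + "(?![A-Za-z])");
    }

    static boolean containsWord(String document, Pattern keywordPattern) {
        // find() searches anywhere in the input, so no (.*) wrappers are needed.
        return keywordPattern.matcher(document).find();
    }

    public static void main(String[] args) {
        Pattern ac = wholeWord("ac");
        System.out.println(containsWord("my account is open", ac)); // false
        System.out.println(containsWord("the ac unit is on", ac));  // true
        System.out.println(containsWord("ac", ac));                 // true
    }
}
```

Unlike the original pattern, the lookarounds also accept a keyword at the very start or end of the document, because they do not require a character to be present on either side.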
Recommended answer
The fastest possible way to find substrings in Java is to use String.indexOf().
To achieve "entire-word-only" matches, you need to add a little logic that checks the characters before and after a possible match to make sure they are non-word characters:
public class IndexOfWordSample {

    public static void main(String[] args) {
        String input = "There are longer strings than this not very long one.";
        String search = "long";
        int index = indexOfWord(input, search);
        if (index > -1) {
            System.out.println("Hit for \"" + search + "\" at position " + index + ".");
        } else {
            System.out.println("No hit for \"" + search + "\".");
        }
    }

    // Returns the index of the first whole-word occurrence of word in input, or -1.
    public static int indexOfWord(String input, String word) {
        // Matches the empty string (start or end of input) or one non-word character.
        String nonWord = "^\\W?$", before, after;
        int index, before_i, after_i = 0;
        while (true) {
            // Continue searching after the previous candidate.
            index = input.indexOf(word, after_i);
            if (index == -1 || word.isEmpty()) break;
            before_i = index - 1;
            after_i = index + word.length();
            // Neighboring characters, or "" when the match touches either end of the input.
            before = "" + (before_i > -1 ? input.charAt(before_i) : "");
            after = "" + (after_i < input.length() ? input.charAt(after_i) : "");
            if (before.matches(nonWord) && after.matches(nonWord)) {
                return index;
            }
        }
        return -1;
    }
}
This prints:
Hit for "long" at position 44.
This should perform better than a pure regular-expression approach.
Consider whether ^\W?$ already matches your expectation of a "non-word" character. By default, \W in Java matches anything outside [a-zA-Z_0-9], so digits and underscores count as word characters while accented letters do not. The regular expression is a compromise here and may cost performance if your input string contains many "almost" matches.
For extra speed, ditch the regex and work with the Character class, checking a combination of the many properties it provides (such as isAlphabetic) for before and after.
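A minimal sketch of that regex-free variant might look like the following. The class name IndexOfWordChars and the helper isWordChar are hypothetical, and treating "letter or digit" as the definition of a word character is an assumption. It differs from \W in that underscores count as non-word characters and non-ASCII letters count as word characters.

```java
class IndexOfWordChars {

    // Assumption: a "word character" is any letter or digit; adjust as needed.
    static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c);
    }

    // Returns the index of the first whole-word occurrence of word, or -1.
    static int indexOfWord(String input, String word) {
        if (word.isEmpty()) return -1;
        int from = 0;
        while (true) {
            int index = input.indexOf(word, from);
            if (index == -1) return -1;
            int end = index + word.length();
            // A candidate counts only if it is bounded by the input edges
            // or by non-word characters on both sides.
            boolean startOk = index == 0 || !isWordChar(input.charAt(index - 1));
            boolean endOk = end == input.length() || !isWordChar(input.charAt(end));
            if (startOk && endOk) return index;
            from = index + 1; // resume just past the rejected candidate
        }
    }

    public static void main(String[] args) {
        String input = "There are longer strings than this not very long one.";
        System.out.println(indexOfWord(input, "long"));   // 44
        System.out.println(indexOfWord("account", "ac")); // -1
    }
}
```

The boundary check works on individual char values, so it avoids creating temporary strings and running a regex for every candidate match. Characters outside the Basic Multilingual Plane would need the int code point overloads of the Character methods instead.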