Using Java Regex, how to check if a string contains any of the words in a set?


Question

I have a set of words, say: apple, orange, pear, banana, kiwi.

I want to check whether a sentence contains any of the words listed above and, if it does, find which word matched. How can I accomplish this with a regex?

I am currently calling String.indexOf() for each word in my set. I assume this is less efficient than regex matching?

Recommended answer

TL;DR: For simple substring checks, contains() is the best choice; for matching whole words only, a regular expression is probably better.

The best way to see which method is more efficient is to test it.

You can use String.contains() instead of String.indexOf() to simplify your non-regexp code.
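
As a minimal sketch of that simplification (the class and method names here are illustrative, not from the original question), the two checks below are equivalent; contains() just removes the comparison against -1:

```java
class ContainsVsIndexOf {

    // indexOf() returns -1 when the substring is absent, so the result
    // has to be compared against 0 to get a yes/no answer.
    static boolean foundWithIndexOf(String sentence, String word) {
        return sentence.indexOf(word) >= 0;
    }

    // contains() expresses the same check directly as a boolean.
    static boolean foundWithContains(String sentence, String word) {
        return sentence.contains(word);
    }

    public static void main(String[] args) {
        String sentence = "An apple is nice";
        System.out.println(foundWithIndexOf(sentence, "apple"));  // true
        System.out.println(foundWithContains(sentence, "apple")); // true
        System.out.println(foundWithContains(sentence, "kiwi"));  // false
    }
}
```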

To search for any one of several words, the regular expression looks like this:

apple|orange|pear|banana|kiwi

The | works as an OR in Regular Expressions.
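
If the words live in a Set rather than a hard-coded string, the alternation can be assembled at runtime. The following is a sketch under that assumption; the class name and the helpers buildPattern and firstMatch are made up for illustration. Pattern.quote() is used so that any regex metacharacters inside a word are matched literally.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class AlternationFromSet {

    // Joins the words with | and quotes each one so that regex
    // metacharacters inside a word are treated as plain text.
    static Pattern buildPattern(Set<String> words) {
        String alternation = words.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile(alternation);
    }

    // Returns the first word found in the sentence, or null if none is found.
    static String firstMatch(Pattern p, String sentence) {
        Matcher m = p.matcher(sentence);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        Set<String> words = new LinkedHashSet<>();
        words.add("apple");
        words.add("orange");
        words.add("pear");
        words.add("banana");
        words.add("kiwi");

        Pattern p = buildPattern(words);
        System.out.println(firstMatch(p, "I like orange juice"));  // orange
        System.out.println(firstMatch(p, "The quick brown fox"));  // null
    }
}
```

Note that an empty set would produce an empty pattern, which matches at every position, so it is worth guarding against that case if it can occur.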

My very simple test code looks like this:

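import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestContains {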

   private static String containsWord(Set<String> words,String sentence) {
     for (String word : words) {
       if (sentence.contains(word)) {
         return word;
       }
     }

     return null;
   }

   private static String matchesPattern(Pattern p,String sentence) {
     Matcher m = p.matcher(sentence);

     if (m.find()) {
       return m.group();
     }

     return null;
   }

   public static void main(String[] args) {
     Set<String> words = new HashSet<String>();
     words.add("apple");
     words.add("orange");
     words.add("pear");
     words.add("banana");
     words.add("kiwi");

     Pattern p = Pattern.compile("apple|orange|pear|banana|kiwi");

     String noMatch = "The quick brown fox jumps over the lazy dog.";
     String startMatch = "An apple is nice";
     String endMatch = "This is a longer sentence with the match for our fruit at the end: kiwi";

     long start = System.currentTimeMillis();
     int iterations = 10000000;

     for (int i = 0; i < iterations; i++) {
       containsWord(words, noMatch);
       containsWord(words, startMatch);
       containsWord(words, endMatch);
     }

     System.out.println("Contains took " + (System.currentTimeMillis() - start) + "ms");
     start = System.currentTimeMillis();

     for (int i = 0; i < iterations; i++) {
       matchesPattern(p,noMatch);
       matchesPattern(p,startMatch);
       matchesPattern(p,endMatch);
     }

     System.out.println("Regular Expression took " + (System.currentTimeMillis() - start) + "ms");
   }
}

The results I get are as follows:

Contains took 5962ms
Regular Expression took 63475ms

Obviously, timings will vary with the number of words being searched for and the strings being searched, but contains() does seem to be roughly 10 times faster than regular expressions for a simple search like this.

Using regular expressions to search for plain strings inside another string is using a sledgehammer to crack a nut, so we shouldn't be surprised that it's slower. Save regular expressions for when the patterns you want to find are more complex.

One case where you may want regular expressions is when indexOf() and contains() won't do the job because you want to match whole words only, not arbitrary substrings; for example, you want to match pear but not spears. Regular expressions handle this case well because they have the concept of word boundaries.

In this case we'd change our pattern to:

\b(apple|orange|pear|banana|kiwi)\b

The \b matches only at a word boundary (the beginning or end of a word), and the parentheses group the OR alternatives together.

Note, when defining this pattern in your code you need to escape the backslashes with another backslash:

 Pattern p = Pattern.compile("\\b(apple|orange|pear|banana|kiwi)\\b");
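
To see the difference the word boundaries make, here is a small sketch comparing the two patterns on the pear/spears example. The class name, the constants, and the find helper are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeWordMatch {

    // Matches the words anywhere, including inside longer words.
    static final Pattern SUBSTRING =
            Pattern.compile("apple|orange|pear|banana|kiwi");

    // Matches the words only when they stand alone as whole words.
    static final Pattern WHOLE_WORD =
            Pattern.compile("\\b(apple|orange|pear|banana|kiwi)\\b");

    // Returns the matched word, or null if the pattern finds nothing.
    static String find(Pattern p, String sentence) {
        Matcher m = p.matcher(sentence);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String spears = "He carried two spears";
        System.out.println(find(SUBSTRING, spears));   // pear
        System.out.println(find(WHOLE_WORD, spears));  // null

        String pear = "A pear, please";
        System.out.println(find(WHOLE_WORD, pear));    // pear
    }
}
```

Keep in mind that the whole-word pattern will also skip plurals such as pears, since the trailing \b requires the word to end right after the r.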
