Java: Matching Phrases in a String

Question

I have a list of phrases (phrase might consist of one or more words) in a database and an input string. I need to find out which of those phrases appear in the input string.

Is there an efficient way to perform such matching in Java?

Recommended answer

A quick hack would be:

  1. Build a regexp based on the combined phrases
  2. Construct a set listing the phrases that haven't matched so far
  3. Repeatedly run find until all phrases have been found or end of input is reached, removing matches from the set of remaining phrases to find

That way, the input is traversed only once, regardless of how many phrases you provide. If the regexp compiler generates an efficient matcher for multiple alternatives, this should yield decent performance. However, this depends a lot on your phrases and input string, as well as on the quality of the Java regexp engine.

Sample code (tested, but not optimized or profiled for performance):

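import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static boolean hasAllPhrasesInInput(List<String> phrases, String input) {
    // Lower-cased phrases that have not been seen in the input yet.
    Set<String> phrasesToFind = new HashSet<String>();
    // Build one alternation: \Qphrase1\E|\Qphrase2\E|...
    StringBuilder sb = new StringBuilder();
    for (String phrase : phrases) {
        if (sb.length() > 0) {
            sb.append('|');
        }
        // Quote the phrase so regex metacharacters in it are matched literally.
        sb.append(Pattern.quote(phrase));
        phrasesToFind.add(phrase.toLowerCase());
    }
    Pattern pattern = Pattern.compile(sb.toString(), Pattern.CASE_INSENSITIVE);
    Matcher matcher = pattern.matcher(input);
    // Single pass over the input; stop as soon as every phrase has been seen.
    while (matcher.find()) {
        phrasesToFind.remove(matcher.group().toLowerCase());
        if (phrasesToFind.isEmpty()) {
            return true;
        }
    }
    return false;
}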
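
Since the question asks which phrases appear rather than whether all of them do, the same loop can collect the matches instead of returning a boolean. The following is only a sketch under that reading: the class name PhraseFinder and the method name findPhrasesInInput are made up for illustration, and it shares the limitations of the sample code (substring matching, no handling of phrases contained in other phrases).

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is not from the original answer.
public class PhraseFinder {

    /** Returns the phrases, in their original spelling, that occur in the input. */
    public static Set<String> findPhrasesInInput(List<String> phrases, String input) {
        // Lower-cased phrase -> original phrase, for phrases not yet seen.
        Map<String, String> remaining = new LinkedHashMap<>();
        StringBuilder sb = new StringBuilder();
        for (String phrase : phrases) {
            if (phrase.isEmpty()) {
                continue; // an empty alternative would match everywhere
            }
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append(Pattern.quote(phrase));
            remaining.put(phrase.toLowerCase(Locale.ROOT), phrase);
        }

        Set<String> found = new LinkedHashSet<>();
        if (remaining.isEmpty()) {
            return found;
        }

        Matcher matcher = Pattern.compile(sb.toString(), Pattern.CASE_INSENSITIVE).matcher(input);
        // Single pass over the input; stop early once nothing is left to find.
        while (!remaining.isEmpty() && matcher.find()) {
            String original = remaining.remove(matcher.group().toLowerCase(Locale.ROOT));
            if (original != null) {
                found.add(original);
            }
        }
        return found;
    }
}
```

For example, calling PhraseFinder.findPhrasesInInput with the phrases "quick brown", "lazy dog" and "purple cat" on the input "The QUICK brown fox jumps over the lazy dog" yields the first two phrases and omits the third.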

A few notes:

  • The code above will match phrases as substrings of words. If only complete words should match, you will need to add word boundaries ("\b") to the generated regexps.
  • The code must be modified if some phrases may be substrings of other phrases.
  • If you need to match non-ASCII text, you should add the regexp option Pattern.UNICODE_CASE and call toLowerCase(Locale) instead of toLowerCase(), using a suitable Locale.
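
One possible way to address the first two notes is sketched below. It is an illustration, not the answer author's code: the class WholeWordPhraseFinder and the method findWholePhrases are hypothetical names. The sketch wraps the alternation in "\b", adds Pattern.UNICODE_CASE, and handles overlapping phrases by restarting the search one character after each match start and checking every remaining phrase at that position. It assumes every phrase starts and ends with a word character; otherwise "\b" will not behave as intended. It does not try to settle how "\b" treats non-ASCII letters.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is not from the original answer.
public class WholeWordPhraseFinder {

    /** Returns the phrases that occur in the input as whole words, ignoring case. */
    public static Set<String> findWholePhrases(List<String> phrases, String input) {
        int flags = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;

        // Phrase -> its own whole-word pattern, for phrases not yet found.
        Map<String, Pattern> remaining = new LinkedHashMap<>();
        StringBuilder sb = new StringBuilder();
        for (String phrase : phrases) {
            if (phrase.isEmpty() || remaining.containsKey(phrase)) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append(Pattern.quote(phrase));
            remaining.put(phrase, Pattern.compile("\\b" + Pattern.quote(phrase) + "\\b", flags));
        }

        Set<String> found = new LinkedHashSet<>();
        if (remaining.isEmpty()) {
            return found;
        }

        // Combined pattern: locates positions where at least one phrase starts.
        Matcher candidates = Pattern.compile("\\b(?:" + sb + ")\\b", flags).matcher(input);
        int from = 0;
        while (!remaining.isEmpty() && from <= input.length() && candidates.find(from)) {
            int start = candidates.start();
            // Several phrases may start here (e.g. "new" and "new york"), so test each one.
            Iterator<Map.Entry<String, Pattern>> it = remaining.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<String, Pattern> entry = it.next();
                Matcher m = entry.getValue().matcher(input);
                m.region(start, input.length());
                m.useTransparentBounds(true); // let \b see the character before 'start'
                if (m.lookingAt()) {
                    found.add(entry.getKey());
                    it.remove();
                }
            }
            // Resume one character later so phrases inside this match are still found.
            from = start + 1;
        }
        return found;
    }
}
```

The trade-off is that the input is no longer scanned strictly once: each candidate position costs one lookingAt call per remaining phrase, which may be acceptable for modest phrase lists but should be profiled for large ones.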
