Regex Optimization for large lists

Problem description

I am comparing two lists of strings to find possible matches. Example:

import java.util.ArrayList;
import java.util.List;

public class Tester {

    public static void main(String[] args) {

        List<String> test = new ArrayList<String>();
        List<String> test2 = new ArrayList<String>();

        test.add("3H0875AAAA0012");
        test.add("3H0875AABB0018");
        test.add("3H0875AAAC0010");
        test2.add("3H0875AA");

        // Compare every string in test2 against every string in test
        for (String s2 : test2) {
            for (String s : test) {
                // Regex-based substring check: this is the slow part
                if (s.matches(".*" + s2 + ".*")) {
                    System.out.println("Match");
                }
            }
        }
    }
}

Basically, for every string in test2 I want to see whether any string in test contains it, either as the whole string or as part of it. The output for the above code should be:

Match 
Match 
Match

However, in my real case I have around 225K strings in test and around 5K strings in test2. The comparison takes too long, and I wanted to see whether it could be optimized. It takes about 10 minutes to analyze the first 1.5K items in test2, so finishing the whole comparison will take at least 30 to 40 minutes.

Thanks in advance.

Recommended answer

I think you shouldn't use regex for this: I believe String#contains (see its Javadoc entry) would give you better results in terms of performance ;)

For example, your code could be:

for(final String s2: test2){
    for (final String s: test){
        if(s.contains(s2)) {
            System.out.println("Match");
        }
    }
}
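
To make the difference concrete, here is a minimal self-contained sketch of the same idea. The class name ContainsDemo and the helper countMatches are made up for illustration and are not part of the original code. Part of the reason contains is cheaper is that String#matches compiles its pattern again on every call, whereas contains does a plain literal search. The last two lines also show that the two are not strictly equivalent: contains takes regex metacharacters in the search string literally.

```java
import java.util.Arrays;
import java.util.List;

public class ContainsDemo {

    // Hypothetical helper: counts (fragment, candidate) pairs in which
    // the candidate contains the fragment as a literal substring.
    public static int countMatches(List<String> candidates, List<String> fragments) {
        int count = 0;
        for (final String fragment : fragments) {
            for (final String candidate : candidates) {
                if (candidate.contains(fragment)) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> test = Arrays.asList("3H0875AAAA0012", "3H0875AABB0018", "3H0875AAAC0010");
        List<String> test2 = Arrays.asList("3H0875AA");

        System.out.println(countMatches(test, test2)); // 3

        // contains treats "." literally; matches treats it as "any character".
        System.out.println("abc".contains("a.c"));    // false
        System.out.println("abc".matches(".*a.c.*")); // true
    }
}
```

If your search strings can contain characters such as "." or "*", contains is probably what you want for a plain substring check anyway, since it never interprets them as regex syntax.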
