Increasing speed of matching strings in list


Question

I have text files larger than 20 MB in which some lines contain * at some position. Every position that matches such a line should be removed from the file (e.g. 700670* should remove all positions from 70067000000 to 70067099999). First I build a list of positions to remove; the code is:

Parallel.ForEach(List, (pos) =>
{
    if (pos.IndexOf("*") != -1)
    {
        var lineWithStar = pos.Substring(0, pos.IndexOf("*"));
        var result = from single in List
                     where single.Substring(0, lineWithStar.Length) == lineWithStar
                     select single;
        listWithPositionsToDel.AddRange(result.Skip(1).ToList());
    }
});

It takes ages to get a result.

For the input below, I need to remove the line "123456" from the file, i.e. everything that matches 123*.

123*

123456

1245

E.g. the result should look like:

700204*
700205100614136*
700205100662305*
7002051006623443904
700205100667271*
700205120015472*

The source is:

700204*
700205100614136*
7002041232323234332
700205100662305*
7002051006141362332
7002051006623443904
700205100667271*
700205120015472

Recommended answer

You have a nested loop, which is hurting your performance. You are also doing a lot of extra string and list allocations.

I would do it this way: go through the file once to find all the patterns that need to be removed. Then iterate a second time and, for every line, decide immediately whether to remove it or keep it. You can then either build a new list of the lines to keep, write them directly to a new file, or add the lines to be removed to a separate collection. Something like this:

var linePatternsToRemove = new List<String>();
var resultList = new ConcurrentBag<String>();
foreach (var line in List)
{
    var asteriskIndex = line.IndexOf("*");
    if (asteriskIndex != -1)
    {
        linePatternsToRemove.Add(line.Substring(0, asteriskIndex));
    }
}

Parallel.ForEach(List, currentLine =>
{
    Boolean needDeleteLine = false;
    foreach (var pattern in linePatternsToRemove)
    {
        if (currentLine.StartsWith(pattern, StringComparison.Ordinal))
        {
            // A line that starts with a pattern such as "700204" may be the pattern line itself ("700204*"), which we keep,
            // or a regular line such as "70020412", which we need to delete.
            if (currentLine.Length > pattern.Length && currentLine[pattern.Length] != '*')
            {
                needDeleteLine = true;
                break;
            }
        }
    }
    if (!needDeleteLine)
        resultList.Add(currentLine);
});

Update: You probably won't need Parallel.ForEach; a plain for loop will likely be fast enough. But if you do need parallelism, you should use a thread-safe collection for the results.
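As a rough illustration of the plain-loop variant mentioned in the update, here is a minimal sequential sketch. The class and method names (SequentialPrefixFilter, Filter, ShouldDelete) are hypothetical and not from the original answer; the deletion rule is the same one used in the parallel code, and ordinal comparison is assumed to be acceptable because the lines are plain digit strings.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the original answer.
public static class SequentialPrefixFilter
{
    // Returns the lines to keep, in their original order.
    public static List<string> Filter(IReadOnlyList<string> lines)
    {
        // Pass 1: collect the text in front of the first '*' on each pattern line.
        var patterns = new List<string>();
        foreach (var line in lines)
        {
            int star = line.IndexOf('*');
            if (star != -1)
                patterns.Add(line.Substring(0, star));
        }

        // Pass 2: keep a line unless it extends one of the patterns.
        var kept = new List<string>(lines.Count);
        foreach (var line in lines)
        {
            if (!ShouldDelete(line, patterns))
                kept.Add(line);
        }
        return kept;
    }

    private static bool ShouldDelete(string line, List<string> patterns)
    {
        foreach (var pattern in patterns)
        {
            // The line must be longer than the pattern, the character right
            // after the pattern must not be '*' (that would be the pattern
            // line itself), and the line must start with the pattern.
            if (line.Length > pattern.Length
                && line[pattern.Length] != '*'
                && line.StartsWith(pattern, StringComparison.Ordinal))
            {
                return true;
            }
        }
        return false;
    }
}
```

Because nothing runs concurrently here, a plain List<string> is enough for the results and the kept lines stay in file order. The cost is still proportional to the number of lines times the number of patterns.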

Update 2: I changed the code to reflect the new information. Be aware that with a parallel loop the output collection will be out of order. Performance will also depend heavily on the number of patterns in the file. If you have a large number of patterns, a more sophisticated solution is needed to test every line against all of them. In that case, using trees is probably a good option.
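One possible reading of the "trees" suggestion is a prefix tree (trie) built from the patterns, so that each line is checked in a single walk over its characters instead of once per pattern. The sketch below is an assumption about what such a structure could look like, not the answer author's implementation; PrefixTrie, Add and ShouldDelete are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: a trie holding the prefixes taken from the '*' lines.
public sealed class PrefixTrie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public bool IsPatternEnd;
    }

    private readonly Node _root = new Node();

    // Stores one prefix, e.g. "700204" for the line "700204*".
    public void Add(string prefix)
    {
        var node = _root;
        foreach (char c in prefix)
        {
            if (!node.Children.TryGetValue(c, out var next))
            {
                next = new Node();
                node.Children[c] = next;
            }
            node = next;
        }
        node.IsPatternEnd = true;
    }

    // True when some stored prefix is followed in the line by at least one
    // more character that is not '*' (same rule as the flat pattern loop).
    public bool ShouldDelete(string line)
    {
        var node = _root;
        for (int i = 0; i < line.Length; i++)
        {
            // Here 'node' represents the first i characters of the line.
            if (node.IsPatternEnd && line[i] != '*')
                return true;
            if (!node.Children.TryGetValue(line[i], out node))
                return false;
        }
        return false;
    }
}
```

A usage sketch under the same assumptions: call Add once for every prefix collected in the first pass, then call ShouldDelete for every line in the second pass. After the trie is built it is only read, so the lookups could also run inside a parallel loop. Each lookup costs at most the length of the line, regardless of how many patterns there are.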
