c#所有文件中最快的字符串搜索 [英] c# Fastest string search in all files

查看:107
本文介绍了c#所有文件中最快的字符串搜索的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

问题(检查编辑以得到澄清

Problem (Check edit for clarifications)

我有大约1500个字符串的列表,每个字符串在这些字符串中,我必须检查目录(和子目录)中的任何文件是否包含该字符串(大约4000个文件)。

I have a list of about 1500 strings and for each of these strings I have to check if any of the files in a directory (and subdirectories) contains that string (there are about 4000 files).

代码

我现在所拥有的是这两个变体:

What I have now are these two variants:

原始

foreach(var str in stringList)
{
    allFiles.Any(f => File.ReadAllText(f).Contains(str));
}

第二种变体(使用ReadLines代替ReadAllText,如VladL在此问题中建议的

Second variant (using ReadLines instead of ReadAllText, as suggested from VladL in this question)

foreach(var string in stringList)
{
    allFiles.SelectMany(File.ReadLines).Any(line => line.Contains(str));
}

我只测试了原始变体的完整程序执行,只花了21分钟完成。然后,我测试了一条语句(检查任何文件中是否包含1个字符串),以查找我知道该字符串不包含的字符串以检查最坏的情况,这是我的计时(每3次执行):

I only tested the complete program execution of the original variant and it took 21 minutes to finish. I then tested a single statement (check if 1 string is contained in any file) searching for a string that I knew it wasn't contained to check the worst case scenario, and this are my timings (executed each 3 times):

原始 1285、1369、1336 ms

第二个变体 2718、2804、2831 ms

我还尝试在原始语句中用ReadAllLines替换ReadAllText (无需更改任何其他内容),但性能没有变化。

I also tryed to replace ReadAllText with ReadAllLines in the Original statement (without changing anything else), but with no performance changes.

问题

是否有更快的方法来检查任何文件中是否包含字符串(大量大文件)?

Is there any faster way for checking if a string is contained in any file (large amount of large files)?

编辑

我承认我没有像我这样清楚地表达自己想说我有一个字符串列表。我实际拥有的是一个csv文件列表,然后将它们设置为倾斜,然后遍历这些文件的每一行(忽略第一行)。在每一行中,我创建一个字符串并将其与该行字段的 some 组成,然后查看是否有文件包含该字符串。

I admit I didn't expressed myself as clear as I wanted, by saying I have a list of strings. What I actually have is a list of csv files, I then itarate trhough those and then iterate through each line of these file (ignoring the first line). With each line I create a string composing it with some of the fields of the line, and then look if any file contains that string.

foreach(var csvFile in csvFiles)
{
    var lines = File.ReadAllLines(csvFile);
    foreach(var line in lines)
    {
        if (IsHeader(line)) continue;
        var str = ComposeString(line);
        var bool = allFiles.Any(f => File.ReadAllText(f).Contains(str));
        // do stuff with the line and bool
     }
 }

编辑2

public void ExecuteAhoCorasick()
{
    var table = CreateDataTable();
    var allFiles = GetAllFiles();
    var csvFiles = GetCsvFiles();
    var resList = new List<string>();

    foreach(var csvFile in csvFiles)
    {
        if (file.Contains("ValueList_")) continue;
        var lines = File.ReadAllLines(file);
        foreach (var line in lines)
        {
            if (line == HeaderLine) continue;
            var res = line.Split(';');
            if (res.Length <= 7) continue;
            var resPath = $"{res[0]}.{res[1]}.{res[2]}".Trim('.');
            resList.Add(resPath);

            var row = table.NewRow();
            row[0] = res[0]; // Group
            row[1] = res[1]; // Type
            row[2] = res[2]; // Key
            row[3] = res[3]; // Global
            row[4] = res[4]; // De
            row[5] = res[5]; // Fr
            row[6] = res[6]; // It
            row[7] = res[7]; // En
            row[8] = resPath; // Resource Path
            row[9] = false;
            row[10] = ""; // Comment
            row[11] = file; // File Path

            table.Rows.Add(row);
        }
    }

    var foundRes = new List<string>();

    foreach (var file in allFiles)
    {
        // var chars = File.ReadLines(file).SelectMany(line => line);
        var text = File.ReadAllText(file);

        var trie = new Trie();
        trie.Add(resList);

        foundRes.AddRange(trie.Find(text));
        // foundRes.AddRange(trie.Find(chars));
    }

    // update row[9] to true foreach res in foundRes
}


推荐答案

我认为最快的方法是:


  1. 将每个文件完全读取到内存中。这样可以简化代码。

  2. 使用 Aho -Corasick算法,以在每个文件的文本中搜索关键字。

  1. Read each file completely into memory. This simplifies the code.
  2. Use the Aho-Corasick algorithm to search for the keywords in the text for each file.

Aho-Corasick的实现可用这里

An implementation of Aho-Corasick is available here.

我写了一个简单的程序,使用来自Github的实现来测试最坏情况下的性能(即,当文本中没有任何关键字时),以将Aho-Corasick与 Contains()):

I've written a simple program using that implementation from Github that tests the worst-case performance (that is, when none of the keywords are present in the text) to compare Aho-Corasick with Contains()):

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;
using ConsoleApp1;

namespace Demo
{
    class Program
    {
        static void Main()
        {
            string[] needles = createNeedles(1500).ToArray();
            string haystack = createHaystack(100000);

            var sw = Stopwatch.StartNew();
            anyViaContains(needles, haystack);
            Console.WriteLine("anyViaContains() took " + sw.Elapsed);

            sw.Restart();
            anyViaAhoCorasick(needles, haystack);
            Console.WriteLine("anyViaAhoCorasick() took " + sw.Elapsed);
        }

        static bool anyViaContains(string[] needles, string haystack)
        {
            return needles.Any(haystack.Contains);
        }

        static bool anyViaAhoCorasick(string[] needles, string haystack)
        {
            var trie = new Trie();
            trie.Add(needles);
            trie.Build();
            return trie.Find(haystack).Any();
        }

        static IEnumerable<string> createNeedles(int n)
        {
            for (int i = 0; i < n; ++i)
                yield return n + "." + n + "." + n;
        }

        static string createHaystack(int n)
        {
            var sb = new StringBuilder();

            for (int i = 0; i < n; ++i)
                sb.AppendLine("Sample Text Sample Text Sample Text Sample Text Sample Text Sample Text Sample Text Sample Text");

            return sb.ToString();
        }
    }
}

我得到的结果是64位RELEASE构建(在调试器外部运行)如下:

The results I get for a 64-bit RELEASE build (run outside the debugger) are as follows:


anyViaContains()花了00:00:09.8216836

anyViaContains() took 00:00:09.8216836

anyViaAhoCorasick()采取了00:00:00.4002765

anyViaAhoCorasick() took 00:00:00.4002765

对于此测试用例,它似乎与使用 Contains()相比,Aho-Corasick的速度要快25倍。但是,这是一个有点人为的测试用例,您的实际结果可能会有所不同。您应该用实际数据进行检测,看它是否确实有帮助。

For this test case, it appears that Aho-Corasick is around 25 times faster than using Contains(). However, this is a somewhat artificial test case and your actual results may vary. You should instrument with your actual data to see if it really does help.

请注意,使用Aho-Corasick实现时,实际上可以避免将整个文件加载到内存中,因为其 Find()方法接受 IEnumerable< char>

Note that you can actually avoid loading the entire file into memory when using the Aho-Corasick implementation, because its Find() method accepts an IEnumerable<char>.

您可以将文件内容转换为 IEnumerable< char> ,如下所示:

You can turn a the contents of a file into an IEnumerable<char> as follows:

var chars = File.ReadLines(filename).SelectMany(line => line);

实际上删除了所有换行符,这可能对您的应用程序来说是可以的。如果要保留换行符,则必须像这样放回它们:

That actually removes all the newline characters, which is probably OK for your application. If you wanted to keep the newline characters, you'd have to put them back like so:

IEnumerable<char> newline = Enumerable.Repeat('\n', 1);
var chars = File.ReadLines(filename).SelectMany(line => Enumerable.Concat(line, newline));

将每个文件完全加载到内存中与枚举每个文件中的字符比较是值得的(如上所述) ),看是否有任何区别。对于非常大的文件,避免将其全部内容加载到内存中可能很重要。

It would be worth comparing loading each file completely into memory with enumerating the chars in each file (as above) to see if there is any difference. For very large files, it may be important to avoid loading their entire contents into memory.

这篇关于c#所有文件中最快的字符串搜索的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆