在大文本文件中搜索字符串的最快方法 [英] Fastest way to search string in large text file

查看:832
本文介绍了在大文本文件中搜索字符串的最快方法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

什么是最快的技术/算法,可以用来在相当大的文本文件(最多1GB文本文件)中查找字符串列表。
首先,我正在使用C#并能够实现逻辑(只需将文件与字符串列表进行匹配,每次都逐字符串进行匹配。这意味着该文件将被读取为 n 匹配的字符串数 times),但是由于我要处理大量文件,因此要花很多时间才能遍历它们并获得匹配项。
我愿意接受任何建议,即使它是而不是C#。

What is the fastest technology/algorithm that can be implemented in order to lookup list of strings in a quite large text files (Up to 1GB text files). For starters, I'm using C# and was able to achieve the logic (Simply by matching a file with list of strings, string by string every time. Which means that the file will be read n number of strings to match with" times ), but since I'm dealing with a lot of files, it is taking forever running through them and get matches. I'm open to any suggestion even if it was not C#..

为了详细说明,我有一个文本文件,其中包含许多数字(A),并且我有很多大文件(B)。 m尝试获取(A)中的每个元素,并逐行查看(B)中是否有匹配项。如果存在匹配项,则将整行内容写入文本文件。真的很传统,单个文件要花很多时间,而我却有数百个文件,最大容量为1GB

To elaborate more, I have a text file that contains many numbers(A), and I have a lot of large files(B). I'm trying to take every element in (A) and see if there is a match for it in (B) line by line. If there is a match I write the whole line into a text file. The way I'm doing it is really traditional and it takes a lot of time to get done with a single file, while I have hundreds of them with sizes up to 1GB

感谢您的时间

推荐答案

做到这一点的标准方法是实现 Aho-Corasick算法。它会读取文件一次,并找到您提供给它的所有字符串的所有出现。请参阅 https://www.informit.com/guides/content。 aspx?g = dotnet& seqNum = 869 ,其中提供了实现和一些示例。

The standard way to do this is to implement the Aho-Corasick algorithm. It reads the file one time and finds all occurrences of all the strings you give it. See https://www.informit.com/guides/content.aspx?g=dotnet&seqNum=869 for an article that provides an implementation and some examples.

假设文件A中的数字列表足够小以适合内存,这是您要使用上面链接的文章中的实现进行的操作:

Assuming that the list of numbers in your file A is small enough to fit in memory, here's what you'd do, using the implementation in the above-linked article:

// Construct the automaton
AhoCorasickStringSearcher matcher = new AhoCorasickStringSearcher();
foreach (var searchWord in File.ReadLines(File_a)
{
    matcher.AddItem(searchWord);
}
matcher.CreateFailureFunction();

// And then do the search on each file
foreach (var fileName in listOfFiles)
{
    foreach (var line in File.ReadLines(filename))
    {
        var matches = matcher.Search(line);
        foreach (m in matches)
        {
            // output match
        }
    }
}

请注意,它只对每个文件进行一次遍历,而不必加载更多文件随时可以将文件的每一行存储到内存中。这里的限制因素是构建自动机所需的内存。

Note that it only makes one pass through each file, and it never has to load more than one line of the file into memory at any time. The limiting factor here is the memory it takes to build the automaton.

我用它来搜索文件总容量超过100 GB,用于大约1500万个不同的字符串。构建自动机需要几分钟,但是随后搜索非常快,该算法的一个很好的特性是它的复杂度为O(n + m),w这里n是输入文件的大小,m是匹配项的数量。搜索的字符串数量无关紧要。它可以搜索一百万个不同的字符串,就像搜索一两个字符串一样快。

I've used this to search files that totaled over 100 gigabytes, for about 15 million different strings. It takes a few minutes to build the automaton, but then it searches very quickly. One really nice property of the algorithm is that its complexity is O(n + m), where n is the size of the input files, and m is the number of matched items. The number of strings it's searching for doesn't matter. It can search for a million different strings just as quickly as it can search for one or two.

100 GB将带您……大约40字节分钟阅读。如果花了一个小时才能在100 GB的数据中找到1500万个不同字符串的所有出现,我会感到非常惊讶。

100 gigabytes will take you ... something on the order of about 40 minutes to read. I'd be really surprised if it took an hour for this to find all occurrences of 15 million different strings in 100 gigabytes of data.

如果要搜索整个单词,另一个选择是放弃Aho-Corasick算法。而是将您要查找的所有数字加载到 HashSet< string> 中。然后读取每一行,并使用正则表达式查找该行中的所有数字,并检查它们是否存在于哈希集中。例如:

Another option, if you're searching for whole words is to ditch the Aho-Corasick algorithm. Instead, load all of the numbers you're looking for into a HashSet<string>. Then read each line and use a regular expression to find all of the numbers in the line and check to see if they exist in the hash set. For example:

Regex re = new Regex("\w+");
foreach (var line in File.ReadLines(filename))
{
    var matches = re.Matchs(line);
    foreach (var m in matches)
    {
        if (hashSetOfValues.Contains(m))
        {
            // output match
        }
    }
}

这可能会比Aho-Corasick算法要慢一些,但是它仍然只通过一次数据。当然,这是假设您有足够的内存来将所有这些数字保存在哈希集中。

This will likely be somewhat slower than the Aho-Corasick algorithm, but it still makes only one pass through the data. This assumes, of course, that you have enough memory to hold all of those numbers in a hash set.

还有整个单词的其他选择,正如我在

There are other options for whole words, as I mention in the comments.

另一种选择是,如果您要查找的单词始终以空格分隔,则在单词的开头和结尾添加空格您添加到自动机。或者,通过对实现本身进行一些修改,您可以强制匹配器的 Search 方法仅返回整个单词中出现的匹配项。这样可以更轻松地处理行首和结尾处的匹配以及其他非单词字符。

Another option, if you know that the words you're looking for are always separated by spaces, is to add spaces to the start and end of the words that you add to the automaton. Or, with some modification to the implementation itself, you could force the matcher's Search method to only return matches that occur in whole words. That could more easily handle matches at the start and end of lines, and additional non-word characters.

这篇关于在大文本文件中搜索字符串的最快方法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆