Search Large Text File for Thousands of Strings


Problem Description


I have a large text file that is 20 GB in size. The file contains lines of text that are relatively short (40 to 60 characters per line). The file is unsorted.

I have a list of 20,000 unique strings. I want to know the offset for each string each time it appears in the file. Currently, my output looks like this:

netloader.cc found at offset: 46350917
netloader.cc found at offset: 48138591
netloader.cc found at offset: 50012089
netloader.cc found at offset: 51622874
netloader.cc found at offset: 52588949
...
360doc.com found at offset: 26411474
360doc.com found at offset: 26411508
360doc.com found at offset: 26483662
360doc.com found at offset: 26582000

I am loading the 20,000 strings into a std::set (to ensure uniqueness), then reading a 128MB chunk from the file, and then using string::find to search for the strings (start over by reading another 128MB chunk). This works and completes in about 4 days. I'm not concerned about a read boundary potentially breaking a string I'm searching for. If it does, that's OK.
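A minimal sketch of that baseline approach (the file name, the two sample strings, and details such as the use of std::string_view are stand-ins; the question only specifies the 128MB chunks and string::find):

#include <cstdint>
#include <fstream>
#include <iostream>
#include <set>
#include <string>
#include <string_view>
#include <vector>

int main() {
    // Two sample strings stand in for the real list of 20,000.
    std::set<std::string> needles = {"netloader.cc", "360doc.com"};
    const std::size_t kChunkSize = 128 * 1024 * 1024;   // 128MB per read, as in the question

    std::ifstream file("huge.txt", std::ios::binary);   // hypothetical file name
    std::vector<char> buffer(kChunkSize);
    std::uint64_t base = 0;                              // file offset of the current chunk

    while (file) {
        file.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        const std::size_t bytes = static_cast<std::size_t>(file.gcount());
        if (bytes == 0) break;
        std::string_view chunk(buffer.data(), bytes);
        for (const std::string& needle : needles) {
            // Report every occurrence of this needle within the current chunk.
            for (std::size_t pos = chunk.find(needle); pos != std::string_view::npos;
                 pos = chunk.find(needle, pos + 1)) {
                std::cout << needle << " found at offset: " << (base + pos) << '\n';
            }
        }
        base += bytes;
    }
}

Note that every chunk is scanned once per search string, so the total work grows with both the file size and the number of strings, which is what the trie-based answer below attacks.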

I'd like to make it faster. Completing the search in 1 day would be ideal, but any significant performance improvement would be nice. I prefer to use standard C++ with Boost (if necessary) while avoiding other libraries.

So I have two questions:

  1. Does the 4 day time seem reasonable considering the tools I'm using and the task?
  2. What's the best approach to make it faster?

Thanks.

Edit: Using the Trie solution, I was able to shorten the run-time to 27 hours. Not within one day, but certainly much faster now. Thanks for the advice.

Solution

Algorithmically, I think the best way to approach this problem is to use a tree (a trie) to store the strings you want to search for, one character at a time. For example, if you have the following patterns you would like to look for:

hand, has, have, foot, file

The resulting tree would look something like this:
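Sketched in text, with the leaf nodes marked:

(root)
 +- h - a -+- n - d     (leaf: hand)
 |         +- s         (leaf: has)
 |         +- v - e     (leaf: have)
 +- f -+- o - o - t     (leaf: foot)
       +- i - l - e     (leaf: file)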

The generation of the tree is worst-case O(n), and it generally has a sub-linear memory footprint.

Using this structure, you can begin processing your file by reading in a character at a time from your huge file and walking the tree.

  • If you get to a leaf node (the ones shown in red), you have found a match, and can store it.
  • If there is no child node corresponding to the letter you have read, you can discard the current line and begin checking the next line, starting from the root of the tree.

This technique results in linear time, O(n), to check for matches, and it scans the huge 20 GB file only once.

Edit

The algorithm described above is certainly sound (it doesn't give false positives), but it is not complete (it can miss some results). However, with a few minor adjustments it can be made complete, assuming that we don't have search terms with common roots like go and gone. The following is pseudocode of the complete version of the algorithm:

tree = construct_tree(['hand', 'has', 'have', 'foot', 'file'])
# Keeps track of where I currently am in the tree
nodes = []
for character in huge_file:
  next_nodes = []
  foreach node in nodes:
    if node.has_child(character):
      child = node.get_child(character)
      if child.isLeaf():
        # You found a match!!
        report_match(child)
      next_nodes.add(child)
    # otherwise this partial match is a dead end and is simply dropped
  if tree.has_child(character):
    next_nodes.add(tree.get_child(character))
  nodes = next_nodes

Note that the list of nodes that has to be checked on each step is at most as long as the longest word being searched for. Therefore it should not add much complexity.
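For a concrete picture in the asker's language, here is a rough C++ sketch of the same idea; the trie layout, the names, and the character-by-character reading are choices made for the sketch rather than anything specified in the answer:

// Rough sketch only: it keeps a list of "active" trie nodes and advances
// each of them by one character per byte of the input.
#include <cstdint>
#include <fstream>
#include <iostream>
#include <map>
#include <memory>
#include <string>
#include <vector>

struct Node {
    std::map<char, std::unique_ptr<Node>> children;
    const std::string* word = nullptr;   // set on a leaf: the pattern that ends here
};

// Build the trie, one character per edge, from the search strings.
Node build_trie(const std::vector<std::string>& words) {
    Node root;
    for (const std::string& w : words) {
        Node* cur = &root;
        for (char c : w) {
            auto& child = cur->children[c];
            if (!child) child = std::make_unique<Node>();
            cur = child.get();
        }
        cur->word = &w;
    }
    return root;
}

int main() {
    // Hypothetical inputs standing in for the real 20,000 strings and 20 GB file.
    std::vector<std::string> words = {"hand", "has", "have", "foot", "file"};
    Node root = build_trie(words);

    std::ifstream file("huge.txt", std::ios::binary);
    std::vector<Node*> active;           // partial matches currently being extended
    std::uint64_t offset = 0;            // byte position of the character being read

    char c;
    while (file.get(c)) {
        std::vector<Node*> next;
        active.push_back(&root);         // a new match may start at this character
        for (Node* node : active) {
            auto it = node->children.find(c);
            if (it == node->children.end()) continue;      // dead end: drop this path
            Node* child = it->second.get();
            if (child->word)
                std::cout << *child->word << " found at offset: "
                          << (offset + 1 - child->word->size()) << '\n';
            if (!child->children.empty()) next.push_back(child);
        }
        active = std::move(next);
        ++offset;
    }
}

In practice you would read the 20 GB file through a large buffer rather than one character at a time, but the matching logic stays the same.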
