Faster way of searching a string in text files

Question

I need to search for a string, roughly 13 characters, in a group of text files using C#. The number of text files changes and can range between 100 and 1000. The size of the files can range between 1 KB and 10 MB.

I tried the naive way of opening each file, reading it line by line, and checking whether the string exists (using IndexOf), but this is too slow. I also tried the Boyer-Moore algorithm, which did improve the timing by 5 seconds, but it still feels slow.

Any idea on how to speed up the search?

Answer

Depending on how many times you want to run the search, you may or may not want a search engine. If you will search many times, use a search engine; otherwise, don't. I'm going to describe how to implement both scenarios here.

When using a search engine: it sounds like you're looking for substrings, which means you should index your files accordingly using your favorite search engine, preferably one you can customize (Lucene, Terrier, etc.). The technique you need here is trigram indexing, that is: all 3-character combinations are indexed. For example, 'foobar' generates 'foo', 'oob', 'oba' and 'bar'. When searching, do the same with your query and issue a search engine query for the AND of all these trigrams. (That runs a merge join on the posting lists of the documents, which returns their IDs or whatever else you put in the posting lists.)
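To make the trigram idea concrete, here is a minimal in-memory sketch (the class and method names are my own, not from any particular search engine): it builds a posting list per trigram and intersects them to get candidate documents, which you would then verify with a real substring check.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class TrigramIndex
{
    // Maps each trigram to the set of document IDs containing it.
    private readonly Dictionary<string, HashSet<int>> _postings =
        new Dictionary<string, HashSet<int>>();

    // Splits a string into its overlapping 3-character trigrams.
    public static IEnumerable<string> Trigrams(string text)
    {
        for (int i = 0; i + 3 <= text.Length; i++)
            yield return text.Substring(i, 3);
    }

    public void Add(int docId, string content)
    {
        foreach (var tri in Trigrams(content))
        {
            if (!_postings.TryGetValue(tri, out var set))
                _postings[tri] = set = new HashSet<int>();
            set.Add(docId);
        }
    }

    // Intersects the posting lists of all query trigrams: the AND query.
    // Returns candidate documents only; queries shorter than 3 characters
    // are not handled by this sketch.
    public IEnumerable<int> Candidates(string query)
    {
        HashSet<int> result = null;
        foreach (var tri in Trigrams(query))
        {
            if (!_postings.TryGetValue(tri, out var set))
                return Enumerable.Empty<int>();
            if (result == null) result = new HashSet<int>(set);
            else result.IntersectWith(set);
        }
        return result ?? Enumerable.Empty<int>();
    }
}
```

A candidate can still be a false positive (all trigrams present but not adjacent), which is why the final IndexOf verification on the candidate documents is still needed.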

Alternatively, you can implement suffix arrays and index your files once. This gives a little more flexibility if you want to search for short (1-2 character) substrings, but the index is harder to maintain. (There has been research at CWI/Amsterdam on fast suffix-array indexing.)
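As an illustration of the suffix-array idea, here is a naive sketch: sort all suffix start positions once, then binary-search for the pattern as a prefix of some suffix. The construction below is O(n² log n), far too slow for 10 MB files, where you would want a proper construction algorithm (e.g. SA-IS); it only shows the query side.

```csharp
using System;
using System.Linq;

class SuffixArraySearch
{
    private readonly string _text;
    private readonly int[] _sa; // suffix start positions, sorted lexicographically

    public SuffixArraySearch(string text)
    {
        _text = text;
        // Naive construction: sort all suffixes by ordinal comparison.
        _sa = Enumerable.Range(0, text.Length).ToArray();
        Array.Sort(_sa, (a, b) =>
            string.CompareOrdinal(_text, a, _text, b, _text.Length));
    }

    // Binary-search the sorted suffixes for one starting with 'pattern'.
    public bool Contains(string pattern)
    {
        int lo = 0, hi = _sa.Length - 1;
        while (lo <= hi)
        {
            int mid = (lo + hi) / 2;
            // Compare at most pattern.Length characters: a result of 0
            // means the suffix at _sa[mid] starts with the pattern.
            int cmp = string.CompareOrdinal(_text, _sa[mid],
                                            pattern, 0, pattern.Length);
            if (cmp == 0) return true;
            if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return false;
    }
}
```

Each Contains call costs O(m log n), independent of how many times you search, which is what makes the one-time indexing pay off.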

When you want to search only a few times, the algorithm to use is either Boyer-Moore (I usually use the Boyer-Moore-Sunday variant, as described in [Graham A. Stephen, String Search]) or a compiled DFA (you can construct one from an NFA, which is easier to build). However, this will only give you a small speed increase, for the simple reason that disk IO is probably your bottleneck, and comparing a bunch of bytes that you need to decode anyway is quite fast.
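For the few-searches case, here is a sketch of the Sunday variant of Boyer-Moore (also known as quick search): on a mismatch, it looks at the character just past the current window and shifts by a precomputed distance, often skipping the whole pattern length at once.

```csharp
using System;
using System.Collections.Generic;

static class SundaySearch
{
    // Returns the index of the first occurrence of 'pattern' in 'text', or -1.
    public static int IndexOf(string text, string pattern)
    {
        int n = text.Length, m = pattern.Length;
        if (m == 0) return 0;

        // Shift table: how far to move when the character just past the
        // window is c. Later occurrences overwrite earlier ones, so the
        // rightmost occurrence of c in the pattern wins.
        var shift = new Dictionary<char, int>();
        for (int i = 0; i < m; i++)
            shift[pattern[i]] = m - i;

        int pos = 0;
        while (pos + m <= n)
        {
            int i = 0;
            while (i < m && text[pos + i] == pattern[i]) i++;
            if (i == m) return pos;

            if (pos + m >= n) return -1;  // no character past the window
            char next = text[pos + m];
            // Characters absent from the pattern allow the maximum shift.
            pos += shift.TryGetValue(next, out int s) ? s : m + 1;
        }
        return -1;
    }
}
```

For a fixed 13-character needle the shift table is built once and reused across all files.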

The biggest improvement you can make is not reading your file line by line, but in blocks. Configure NTFS to use a 64 KB block size if you can, and read the files in multiples of 64 KB; think 4 MB or more in a single read. I'd even suggest using asynchronous IO so that you can read and process previously read data at the same time. Done correctly, that should already give you a split-second implementation for 10 MB on most modern hardware.
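A sketch of block-wise reading (names, block size, and the UTF-8 assumption are mine): read large chunks, keep a small overlap from the previous chunk so a match spanning two chunks isn't missed, and run IndexOf on each chunk. For the async overlap of reading and scanning described above you would combine this with Stream.ReadAsync.

```csharp
using System;
using System.IO;
using System.Text;

static class BlockSearcher
{
    // Searches the file at 'path' for 'needle' in large blocks instead of
    // line by line. Carries the last (needle.Length - 1) characters of each
    // chunk forward so a match straddling two chunks is still found.
    public static bool ContainsString(string path, string needle,
                                      int blockSize = 4 * 1024 * 1024)
    {
        var buffer = new char[blockSize];
        string carry = "";
        using (var reader = new StreamReader(path, Encoding.UTF8))
        {
            int read;
            while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                string chunk = carry + new string(buffer, 0, read);
                if (chunk.IndexOf(needle, StringComparison.Ordinal) >= 0)
                    return true;
                carry = chunk.Length >= needle.Length
                    ? chunk.Substring(chunk.Length - needle.Length + 1)
                    : chunk;
            }
        }
        return false;
    }
}
```

The method also returns as soon as the needle is found, so on average only half of each file is read.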

Last but not least, a neat trick used throughout information retrieval is to compress your data using a fast compression algorithm. Since disk IO is slower than memory/CPU operations, this will probably help as well. Google's Snappy compressor is a good example of a fast compression algorithm.
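Snappy itself needs a third-party .NET binding, so purely as a standard-library illustration of the same idea, here is a sketch using GZipStream with CompressionLevel.Fastest (trading compression ratio for speed): store the files compressed once, then decompress on the fly while searching.

```csharp
using System;
using System.IO;
using System.IO.Compression;

static class CompressedStore
{
    // Writes 'text' to 'path' gzip-compressed, favoring speed over ratio.
    public static void Save(string path, string text)
    {
        using (var fs = File.Create(path))
        using (var gz = new GZipStream(fs, CompressionLevel.Fastest))
        using (var w = new StreamWriter(gz))
            w.Write(text);
    }

    // Decompresses the file as a stream and checks for 'needle'.
    public static bool Contains(string path, string needle)
    {
        using (var fs = File.OpenRead(path))
        using (var gz = new GZipStream(fs, CompressionMode.Decompress))
        using (var r = new StreamReader(gz))
            // For brevity this reads everything at once; combine with the
            // block-wise reading above for large files.
            return r.ReadToEnd().IndexOf(needle, StringComparison.Ordinal) >= 0;
    }
}
```

Whether this wins depends on the machine: it pays off when the disk is the bottleneck and the CPU can decompress faster than the saved IO would have taken.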
