Searching for matches in 3 million text files
Problem description
I have a simple requirement where a user enters a bunch of words and the system scans over 3 million text files and finds the files that contain those keywords. What would be the most efficient and simple way to implement this without a complex searching / indexing algorithm?
I thought of using the Scanner class for this, but have no idea about its performance over such a large number of files. Performance isn't a very high priority, but it should meet an acceptable standard.

Solution

What would be the most efficient and simple way to implement this without a complex searching / indexing algorithm?
A complex searching/indexing algorithm. There's no need to reinvent the wheel here. Since the user can enter any words, you can't get away with a simple preprocessing step, but rather have to index all words in the text. This is what something like Lucene does for you.
There is no other fast way to search through text other than by preprocessing it and building an index. You can roll your own solution for this or you can just use Lucene.
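For reference, indexing and querying with Lucene might look roughly like the sketch below. This is a minimal illustration under assumptions, not the definitive implementation: it assumes `lucene-core` and `lucene-queryparser` are on the classpath, the class and method names (`indexFiles`, `search`) are hypothetical, and exact API details vary between Lucene versions.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.FSDirectory;

public class LuceneSketch {

    // One-time preprocessing: index every file once; all later searches hit the
    // index instead of re-reading 3 million files.
    static void indexFiles(Path indexDir, Iterable<Path> files) throws IOException {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir), config)) {
            for (Path file : files) {
                Document doc = new Document();
                // Store the path so results can point back to the file;
                // index (but don't store) the full contents.
                doc.add(new StringField("path", file.toString(), Field.Store.YES));
                doc.add(new TextField("contents", Files.readString(file), Field.Store.NO));
                writer.addDocument(doc);
            }
        }
    }

    // Query the index for files matching the user's words.
    static void search(Path indexDir, String userQuery) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(indexDir))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
            Query query = parser.parse(userQuery); // e.g. "foo AND bar"
            for (ScoreDoc hit : searcher.search(query, 100).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("path"));
            }
        }
    }
}
```

Indexing 3 million files takes a while up front, but it is paid once; each subsequent query then runs in milliseconds against the index.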
Naïve text search with no preprocessing will be far far too slow to be of any use.
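To make the contrast concrete, here is what the brute-force version (the Scanner idea from the question) looks like; the class and method names are made up for illustration. It works, but it re-reads every byte of every file on every query, which is why it cannot scale to millions of files.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.List;
import java.util.Scanner;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class NaiveSearch {

    // Returns the files under root whose contents contain ALL of the keywords.
    // Every call re-reads every file — this is the cost an index avoids.
    static List<Path> search(Path root, Set<String> keywords) throws IOException {
        try (Stream<Path> files = Files.walk(root)) {
            return files.filter(Files::isRegularFile)
                        .filter(p -> containsAll(p, keywords))
                        .collect(Collectors.toList());
        }
    }

    static boolean containsAll(Path file, Set<String> keywords) {
        // Scanner tokenizes on whitespace by default; stop early once
        // every keyword has been seen.
        try (Scanner sc = new Scanner(file)) {
            Set<String> remaining = new HashSet<>(keywords);
            while (sc.hasNext() && !remaining.isEmpty()) {
                remaining.remove(sc.next().toLowerCase());
            }
            return remaining.isEmpty();
        } catch (IOException e) {
            return false; // unreadable file: treat as non-matching
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("docs");
        Files.writeString(dir.resolve("a.txt"), "the quick brown fox");
        Files.writeString(dir.resolve("b.txt"), "lorem ipsum dolor");
        List<Path> hits = search(dir, Set.of("quick", "fox"));
        System.out.println(hits.size()); // prints 1: only a.txt has both words
    }
}
```

With two tiny files this is instant; with 3 million files every query becomes a full scan of the corpus, which is exactly the "far far too slow" case described above.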