How to search for multiple strings in a text file

Question

I am working with text files and want to implement a search algorithm in Java. I have a text file that I need to search.

If I want to find a single word, I can put all the text into a HashMap and store each word's occurrences. But is there an algorithm for searching for two strings (or maybe more)? Should I hash the strings in pairs?

Recommended answer

It depends a lot on the size of the text file. There are usually several cases you should consider:

  1. Lots of queries on very short documents (web pages, essay-length texts, etc.) whose text is distributed like normal language. A simple algorithm with an O(n^2) worst case is fine: for a query of length n, take a window of length n and slide it over the text, comparing and moving the window until you find a match. This algorithm does not care about words, so you treat both the text and the query as plain character strings (including spaces). This is probably what most browsers do. KMP or Boyer-Moore is not worth the effort, since the O(n^2) case is very rare.

  2. Lots of queries on one large document. Preprocess your document and store the preprocessed form. Common storage options are suffix trees and inverted lists. If you have multiple documents, you can build one document from them by concatenating them and storing the document end positions separately. This is the way to go for document databases where the collection is almost constant.

  3. Several documents with high redundancy and a collection that changes often. Use KMP or Boyer-Moore. For example, if you want to find certain sequences in DNA data, and you often get new sequences to find as well as new DNA from experiments, the O(n^2) part of the naive algorithm would kill your running time.
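As a rough illustration of cases 1 and 3, here is a minimal Java sketch of the sliding-window search next to a KMP search, each run once per query string. The class and method names (`SubstringSearch`, `naiveFindAll`, `kmpFindAll`, `findEach`) are made up for this example, and it assumes the whole text fits in memory as a single `String`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; all names here are illustrative only.
class SubstringSearch {

    // Case 1: slide a window of the query's length over the text.
    // Returns every start index of a match (overlapping matches included).
    static List<Integer> naiveFindAll(String text, String query) {
        List<Integer> hits = new ArrayList<>();
        int n = text.length();
        int m = query.length();
        if (m == 0) {
            return hits;
        }
        for (int start = 0; start + m <= n; start++) {
            int k = 0;
            while (k < m && text.charAt(start + k) == query.charAt(k)) {
                k++;
            }
            if (k == m) {
                hits.add(start);
            }
        }
        return hits;
    }

    // Case 3: KMP never re-reads a text character, which helps on
    // highly repetitive input such as DNA sequences.
    static List<Integer> kmpFindAll(String text, String query) {
        List<Integer> hits = new ArrayList<>();
        int m = query.length();
        if (m == 0) {
            return hits;
        }
        // fail[i] = length of the longest proper prefix of query[0..i]
        // that is also a suffix of query[0..i].
        int[] fail = new int[m];
        for (int i = 1, k = 0; i < m; i++) {
            while (k > 0 && query.charAt(i) != query.charAt(k)) {
                k = fail[k - 1];
            }
            if (query.charAt(i) == query.charAt(k)) {
                k++;
            }
            fail[i] = k;
        }
        for (int i = 0, k = 0; i < text.length(); i++) {
            while (k > 0 && text.charAt(i) != query.charAt(k)) {
                k = fail[k - 1];
            }
            if (text.charAt(i) == query.charAt(k)) {
                k++;
            }
            if (k == m) {
                hits.add(i - m + 1);
                k = fail[k - 1]; // keep going to find overlapping matches
            }
        }
        return hits;
    }

    // Searching for several strings: run the single-query search per query.
    static Map<String, List<Integer>> findEach(String text, List<String> queries) {
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        for (String q : queries) {
            result.put(q, naiveFindAll(text, q));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "the cat sat on the mat";
        // prints {the=[0, 15], at=[5, 9, 20], dog=[]}
        System.out.println(findEach(text, List.of("the", "at", "dog")));
    }
}
```

Both methods return the same positions; swapping `naiveFindAll` for `kmpFindAll` inside `findEach` only changes the worst-case running time.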

There are probably lots more possibilities that need different algorithms and data structures, so you should figure out which one is best in your case.
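For the preprocessing case (many queries on one large document), a positional inverted list is one way to handle two or more words without hashing them in pairs. The following is only a sketch under simplifying assumptions: the `InvertedIndex` class and its `findPhrase` method are hypothetical names, tokenization is a plain split on non-word characters (ASCII-oriented), and matching is case-insensitive.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical positional inverted index; names are illustrative only.
class InvertedIndex {

    // word -> sorted token positions at which the word occurs
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    InvertedIndex(String text) {
        // Simplistic tokenization (assumption): lowercase, split on non-word chars.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("\\W+");
        int pos = 0;
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // split can yield a leading empty token
            }
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(pos);
            pos++;
        }
    }

    // Returns the token positions where the given words occur consecutively.
    // With a single word this is just that word's posting list.
    List<Integer> findPhrase(String... words) {
        List<Integer> hits = new ArrayList<>();
        if (words.length == 0) {
            return hits;
        }
        TreeSet<Integer> first = postings.get(words[0].toLowerCase(Locale.ROOT));
        if (first == null) {
            return hits;
        }
        for (int start : first) {
            boolean match = true;
            for (int i = 1; i < words.length && match; i++) {
                TreeSet<Integer> p = postings.get(words[i].toLowerCase(Locale.ROOT));
                match = p != null && p.contains(start + i);
            }
            if (match) {
                hits.add(start);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        InvertedIndex index = new InvertedIndex("The cat sat on the mat and the cat ran");
        System.out.println(index.findPhrase("the", "cat")); // prints [0, 7]
    }
}
```

The index is built once and reused for every query, which is why it suits a collection that rarely changes; the positions returned are token indices, not character offsets.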
