Search for multiple strings


Problem description

I know of efficient ways to look for one string in a file (KMP), or for several strings in a file (a trie).

But for years now I have been wondering whether there is a way (occasionally thinking it impossible) to search multiple files for multiple strings.

Say I have a million files, and I want to answer queries like "find the files that contain the strings 'banana', 'motorboat' and 'the white fox'". What would be an efficient algorithm? Is there one?

Of course, such a search can be done in time linear in the total size of the files to search. But that seems infeasible for a large number of large files. The existence of Google suggests that there actually is a very fast algorithm for this, maybe even one whose query time depends only on the query size and not on the size of the text database (such an algorithm would, of course, involve some pre-processing of the input files).
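For concreteness, the linear-time baseline the question refers to can be written down directly. This is a minimal sketch in Python; the function name and file handling are illustrative, not part of the original question:

```python
def naive_search(paths, keywords):
    """O(total file size) baseline: scan every file and keep the
    ones that contain every one of the query strings."""
    hits = []
    for path in paths:
        with open(path, encoding="utf-8", errors="ignore") as f:
            text = f.read()
        if all(kw in text for kw in keywords):
            hits.append(path)
    return hits
```

Every query re-reads every file, which is exactly why this does not scale to a million files.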
I think there must be such an algorithm (Google does it!), but my searches turned up nothing.

Solution

Parallel programming

At large scale this is definitely a task for parallel programming: distribute the files to different computing units, let them search, then gather the results. This is in fact what Google does; for example, they once solved a translation problem by combining thousands of commodity PCs. (They may use other hardware for the real Google search results, though.) You can read popular articles about it on the internet.



MapReduce as one concept

Google, for example, invented a paradigm called MapReduce, which they described in a whitepaper. It basically boils down to mapping input to output (widely distributed) in a first step, then reducing all the little results into one main result in a second step.

One could implement the search like this:


  • map: Distribute the documents together with the keywords to search for. If the search terms are found in the current file, return the filename from the computing node; otherwise return nothing.
  • reduce: Gather all the filenames from all nodes into one list.



(This is practically the same as the "distributed grep" problem they present in their paper.)
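The two steps above can be sketched in a few lines of Python. This is a single-process illustration of the idea rather than a real distributed framework, and the corpus and keywords are made-up examples; in an actual deployment each `map_phase` call would run on a separate computing node:

```python
# Hypothetical in-memory "corpus" standing in for the million files.
FILES = {
    "a.txt": "a banana and a motorboat",
    "b.txt": "the white fox ate a banana on the motorboat",
    "c.txt": "nothing relevant here",
}

KEYWORDS = ["banana", "motorboat", "the white fox"]

def map_phase(item):
    """Runs on each computing node: emit the filename if every
    keyword occurs in the file, otherwise emit nothing."""
    filename, contents = item
    return [filename] if all(kw in contents for kw in KEYWORDS) else []

def reduce_phase(partial_results):
    """Gather the filenames emitted by all nodes into one list."""
    matches = []
    for part in partial_results:
        matches.extend(part)
    return matches

result = reduce_phase(map(map_phase, FILES.items()))
print(result)  # ['b.txt']
```

Swapping Python's built-in `map` for a framework that ships the work to many machines is exactly what MapReduce automates.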

The problem of deciding whether a given string occurs in a given text is well studied under the name string matching; see for example the Rabin-Karp algorithm or the Knuth-Morris-Pratt algorithm (just to get your hands on something). So the implementation of map is fairly easy.
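As a reference point, here is a minimal sketch of the Rabin-Karp rolling-hash search in Python. The base and modulus are arbitrary illustrative choices; in practice Python's built-in `in` operator would do the same job:

```python
def rabin_karp(text: str, pattern: str, base: int = 256, mod: int = 10**9 + 7) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    if m > n:
        return -1
    # base^(m-1) mod mod, used to drop the leading character when rolling.
    high = pow(base, m - 1, mod)
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    for i in range(n - m + 1):
        # On a hash match, compare the substrings to rule out collisions.
        if t_hash == p_hash and text[i:i + m] == pattern:
            return i
        if i < n - m:
            # Roll the window: remove text[i], append text[i + m].
            t_hash = ((t_hash - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return -1
```

Its expected running time is O(n + m), since the hash of each window is computed in constant time from the previous one.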



Many different techniques can be used to distribute the files. If you want a proper view of what is possible with distributed file systems, you can gather information about the Google File System (GFS), e.g. in the corresponding whitepaper.



reduce does almost nothing, so it is really easy.



Finished.

That is the biggest advantage of the MapReduce paradigm: once you understand how map and reduce combine into one result, it is fairly easy to implement those two functions. If a MapReduce framework is already in place, you do not have to worry at all about the parallelism of the computation, which can otherwise cause severe headaches.

Other concepts

This is definitely not the only possible concept.


  • You can vary the hardware you use (independent PCs, as MapReduce assumes, or something more like a supercomputer with dozens of CPUs).
  • You can vary the distributed (or non-distributed) file system you use.
  • You can vary the programming language, which can also make a huge difference.

If you are interested in this field of study you will find lots of other possibilities, and I am sure many more will come up in the near future, as distributed systems are on the rise more than ever. But I hope I have provided some insight into what is possible, what to watch out for, and even a direction for how one could implement this right away.


