Algorithm to find the smallest snippet from searching a document?

Problem description

I've been going through Skiena's excellent "The Algorithm Design Manual" and got hung up on one of the exercises.

The question is: "Given a search string of three words, find the smallest snippet of the document that contains all three of the search words, i.e., the snippet with smallest number of words in it. You are given the index positions where these words occur in search strings, such as word1: (1, 4, 5), word2: (4, 9, 10), and word3: (5, 6, 15). Each of the lists are in sorted order, as above."

Anything I come up with is O(n^2)... This question is in the "Sorting and Searching" chapter, so I assume there is a simple and clever way to do it. I'm trying something with graphs right now, but that seems like overkill.

Ideas? Thanks

Solution

I already posted a rather straightforward algorithm that solves exactly that problem in this answer:

http://stackoverflow.com/questions/2734313/google-search-results-how-to-find-the-minimum-window-that-contains-all-the-searc/2734606#2734606

However, in that question we assumed that the input is represented by a text stream and the words are stored in an easily searchable set.

In your case the input is represented slightly differently: as a bunch of vectors with sorted positions for each word. This representation is easily transformed into what the above algorithm needs by simply merging all these vectors into a single vector of (position, word) pairs ordered by position. It can be done literally, or it can be done "virtually", by placing the original vectors into a priority queue (ordered by their first elements). Popping an element from the queue in this case means popping the first element from the first vector in the queue and, if that vector is not yet empty, sinking it back into the queue according to its new first element.
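A minimal Python sketch of the "virtual" merge described above. The names `merged_positions` and `smallest_snippet` are hypothetical, and the window-tracking step (remember the latest position of each word and measure the span once every word has been seen) is one common way to consume the merged stream; it is an assumption here, not necessarily the exact algorithm from the linked answer.

```python
import heapq
from typing import Dict, Iterator, Optional, Sequence, Tuple


def merged_positions(lists: Sequence[Sequence[int]]) -> Iterator[Tuple[int, int]]:
    """Yield (position, word_id) pairs in position order using a priority queue."""
    # Each heap entry is (current head position, word_id, index into that word's list).
    heap = [(lst[0], w, 0) for w, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    while heap:
        pos, w, i = heapq.heappop(heap)
        yield pos, w
        # "Sink" the same list back into the queue under its new first element.
        if i + 1 < len(lists[w]):
            heapq.heappush(heap, (lists[w][i + 1], w, i + 1))


def smallest_snippet(lists: Sequence[Sequence[int]]) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the tightest window containing every word, or None."""
    last_seen: Dict[int, int] = {}  # word_id -> most recent position seen so far
    best: Optional[Tuple[int, int]] = None
    for pos, w in merged_positions(lists):
        last_seen[w] = pos
        if len(last_seen) == len(lists):
            # The tightest window ending at pos starts at the oldest "latest" position.
            start = min(last_seen.values())
            if best is None or pos - start < best[1] - best[0]:
                best = (start, pos)
    return best
```

With the lists from the question, `smallest_snippet([[1, 4, 5], [4, 9, 10], [5, 6, 15]])` returns `(4, 5)`. The heap only ever holds one entry per word, so the original vectors are never copied into a single merged vector.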

Of course, since your statement of the problem explicitly fixes the number of words as three, you can simply check the first elements of all three arrays and pop the smallest one at each iteration. That gives you an O(N) algorithm, where N is the total length of all arrays.
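The three-array variant can be sketched as follows. This is a rough illustration under the assumption that "popping" is modelled by advancing an index into each list; the function name `smallest_snippet_three` is hypothetical.

```python
from typing import Optional, Sequence, Tuple


def smallest_snippet_three(
    a: Sequence[int], b: Sequence[int], c: Sequence[int]
) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the tightest window covering one position from each list."""
    lists = (a, b, c)
    if not all(lists):
        return None  # some word never occurs, so no snippet exists
    idx = [0, 0, 0]  # current "first element" of each list
    best: Optional[Tuple[int, int]] = None
    while True:
        heads = [lists[w][idx[w]] for w in range(3)]
        lo, hi = min(heads), max(heads)
        if best is None or hi - lo < best[1] - best[0]:
            best = (lo, hi)
        # Pop the smallest head: keeping it can only make later windows wider.
        w = heads.index(lo)
        idx[w] += 1
        if idx[w] == len(lists[w]):
            # The list holding the minimum is exhausted; no tighter window remains.
            return best
```

For the example in the question, `smallest_snippet_three([1, 4, 5], [4, 9, 10], [5, 6, 15])` returns `(4, 5)`. Each iteration advances exactly one index and does a constant amount of work, so the loop runs at most N times in total.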

Also, your statement of the problem seems to suggest that target words can overlap in the text, which is rather strange (given that you use the term "word"). Is it intentional? In any case, it doesn't present any problem for the above linked algorithm.
