一个高效的算法找出最小pangrammatic窗口? [英] An efficient algorithm for finding smallest pangrammatic windows?

查看:192
本文介绍了一个高效的算法找出最小pangrammatic窗口?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

A pangrammatic窗口 是一块较大的子串文本包含字母表中所有的26个字母。引述维基百科为例,给出这样的文字:

A pangrammatic window is a substring of a larger piece of text that contains all 26 letters of the alphabet. To quote an example from Wikipedia, given this text:

我唱了,以为我唱的非常好;但他只是抬起头来望着我的脸很古怪EX pression,并说,你有多久一直唱歌,小姐?

I sang, and thought I sang very well; but he just looked up into my face with a very quizzical expression, and said, 'How long have you been singing, Mademoiselle?'

在文本中的最小pangrammatic窗口是这样的字符串:

The smallest pangrammatic window in the text is this string:

G极好;但他只是抬起头来望着我的脸很古怪的前

g very well; but he just looked up into my face with a very quizzical ex

这的确包含了每一个字母至少一次。

Which indeed contains every letter at least once.

我的问题是这样的:?给定一个语料库,什么是最有效的算法用于查找文本中的最小pangrammatic窗口

My question is this: Given a text corpus, what is the most efficient algorithm for finding the smallest pangrammatic window in the text?

我考虑了一些思考,并提出了以下的算法了。我有一个强烈的感觉,这些都不是最优的,但我想我会发布他们作为一个起点。

I've given this some thought and come up with the following algorithms already. I have a strong feeling that these are not optimal, but I thought I'd post them as a starting point.

目前,在时间为O(N 2 )和空间O(1)一个简单朴素的算法:对于字符串中的每个位置,扫描着从该位置并跟踪哪些字母你见过(也许在一个位向量,其中,由于只有26个字母,需要空间O(1))。一旦你找到所有的26个字母,你必须在最短的pangrammatic窗口的长度起始于给定一点。每次扫描可能需要的时间为O(n),有O(n)的扫描,对于总计为O(n 2 )的时间。

There is a simple naive algorithm that runs in time O(n2) and space O(1): For each position in the string, scan forward from that position and track what letters you've seen (perhaps in a bit vector, which, since there are only 26 different letters, takes space O(1)). Once you've found all 26 letters, you have the length of the shortest pangrammatic window starting at that given point. Each scan might take time O(n), and there are O(n) scans, for a grand total of O(n2) time.

我们也可以解决时间为O这个问题(N日志N),并使用改进的二进制搜索空间为O(n)。建设26阵列,一个用于每个字母,然后填充这些阵列在有序输入文本的每个字母的位置。我们可以通过简单地扫描整个文本,附加每个索引到相应于当前字符的数组做到这一点。一旦我们有了这个,我们可以发现,在时间O(log n)的,最短pangrammatic窗口的长在一些指数在阵列上运行26二进制搜索,发现每个字符出现的输入数组中的最早时间或者给定索引之后。取其这些数字是最大给出出现在字符串中的最远向下长极字符,从而给出了pangrammatic窗口的终点。运行该搜索步骤需要O(log n)的时间,因为我们必须这样做的字符串中的所有n个字符,总运行时间为O(n log n)的,与阵列O(n)的内存使用情况。

We can also solve this problem in time O(n log n) and space O(n) using a modified binary search. Construct 26 arrays, one for each letter of the alphabet, then populate those arrays with the positions of each letter in the input text in sorted order. We can do this by simply scanning across the text, appending each index to the array corresponding to the current character. Once we have this, we can find, in time O(log n), the length of the shortest pangrammatic window beginning at some index by running 26 binary searches in the arrays to find the earliest time that each character appears in the input array at or after the given index. Whichever of these numbers is greatest gives the "long pole" character that appears furthest down in the string, and thus gives the endpoint of the pangrammatic window. Running this search step takes O(log n) time, and since we have to do it for all n characters in the string, the total runtime is O(n log n), with O(n) memory usage for the arrays.

进一步细化上述方法是更换阵列,并与面包车昂德博阿斯树二进制搜索和predecessor搜索。这增加了创建时间为O(N日志log n)的,但降低每次搜索时间为O(log log n)的时间,对于邻网运行时(N日志日志N)为O(n)的空间使用。

A further refinement for the above approach is to replace the arrays and binary search with van Emde Boas trees and predecessor searches. This increases the creation time to O(n log log n), but reduces each search time to O(log log n) time, for a net runtime of O(n log log n) with O(n) space usage.

有没有更好的算法在那里?

Are there any better algorithms out there?

推荐答案

这个算法有O(M)空间复杂度和O(n)的时间复杂度(时间不依赖于字母尺寸M):

This algorithm has O(M) space complexity and O(N) time complexity (time does not depend on alphabet size M):

  1. 在提前第一个迭代器,增加柜台的每个处理信件。停止在所有26计数器是非零。
  2. 在推进第二个迭代器和计数器下降对于每个处理信件。停止当任何这些计数器是零。
  3. 在迭代器之间使用不同的更新最好的那么远的结果,并继续执行步骤1。

该算法可以提高一点点,而不是字符计数器如果在字符串中的位置被存储。在这种情况下,步骤2应该只读这些位置,并与当前位置比较,和步骤1应该更新这些位置和(大部分时间)搜索在文本一些字符


This algorithm may be improved a little bit if instead of character counters, positions in the string are stored. In this case step 2 should only read these positions and compare with current position, and step 1 should update these positions and (most of the time) search for some character in the text.

这篇关于一个高效的算法找出最小pangrammatic窗口?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆