找到包含给定单词的最短子串的方法:需要优化 [英] method to find the shortest substring containing the given words:optimization required

查看:140
本文介绍了找到包含给定单词的最短子串的方法:需要优化的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个程序要求我找到给定String的最短子段,其中包含一个单词列表。即使我的程序是正确的,我也无法在执行的时间范围内(5秒)交付。我认为问题是由于我使用的复杂(平凡)算法。它由嵌套循环组成,需要多次扫描list_of_words数组。这是我的搜索功能代码。 a [] 包含原始字符串,由单词存储, b [] 包含单词列表被发现形成子段。 字符串g 存储由原始字符串中的单词形成的临时子分段,包括列表中的单词。

I have a program that requires me to find the shortest subsegment of a given String, containing a list of words. Even though my program is correct, I am failing to deliver within a time frame of execution(5 seconds). I figured the problem is because of the complex(trivial) algorithm I am using. It consists of nested loops, and requires multiple scan of the list_of_words array.Here is my code for the search function. a[] contains the original string,stored by words, b[] contains the list of the words which is to be found to form the subsegment. String gstores the temporary subsegment formed by words from original string including the words in the list.

private static void search() // Function to find the subsegment with a required list of words
{
   int tail,x;//counters 
   String c[]=new String[b.length]; //initializing a temporary array to copy the list of words array.

   for(int i =0; i<a.length;i++)// looping throw original string array
    {
       System.arraycopy(b, 0, c, 0, b.length);//copying the list of words array to the temporary array

        for (int j=0;j<b.length;j++)//looping throw the temporary array
        { 
            x=0; //counter for temporary array

            if(a[i].equalsIgnoreCase(b[j]))//checking for a match with list of words
            {
                tail=i;
//adds the words starting from the position of the first found word(tail) till all the words from the list are found
                while((x<b.length)&&(tail<a.length))

                {
                    g=g+" "+a[tail];//adds the words in the string g

                    for(int k=0;k<c.length;k++) //checks for the found word from the temporary array and replaces it with ""    
                    {
                        if(c[k].equalsIgnoreCase(a[tail]))
                        {
                            c[k]=""; x++;
                        }
                    }
                    tail++;

                }
                if((tail==a.length)&&(x!=b.length))//checks if the string g does not contains all the listed word
                {
                    g="";
                }
                else
                    {
                    count(g);g="";//check for the shortest string.
                    }
            }
        }

    }print();
}

示例:

原始字符串:这是一个测试。这是一个编程测试。这是一个编程测试。

Original String :This is a test. This is a programming test. a programming test this is.

找到的词语:this,test,a,programming。

Words to be found : this , test, a ,programming.

子分段:

这是一个测试这是一个编程

This is a test This is a programming

这是编程测试

编程测试编程测试这个

编程测试a编程测试这个

programming test a programming test this

测试编程测试这个

编程测试这个

最短子段:编程测试

有关数据结构或循环结构变化的任何建议,甚至是优化算法的算法变化同样会有所帮助。

Any suggestion regarding change in the data structures or looping structures or even changes in the algorithm that optimizes the same will be helpful.

推荐答案

动态编程解决方案:

为您要查找的每个单词都设置一个最后一个位置变量。

Have a last position variable for each of the words you're looking for.

总计您要查找的不同单词(永远不会减少,最大=您正在寻找的单词的数量)。

Have a total count of distinct seen words you're looking for (will never decrease, max = count of words you're looking for).

对于输入中的每个单词位置:

For each word position in the input:


    如果您要查找的单词列表中存在该单词,请更新该单词的最后位置。
  • 如果更新的最后位置未初始化,则增加总计数。

  • 如果总计数等于最大值,则遍历最后一个位置并找到最小位置。当前位置与该值之间的距离将是子字符串的长度。记录这些值并在所有位置找到最佳值。

优化是一堆,用于减少查找最小位置所需的时间(应与一些结构一起使用(可能是散列或树形图) )允许快速查找指向一个单词的堆中的指针。)

An optimization is to have a heap for last positions to reduce the time taken to find the smallest one (should be used together with some structure (possibly a hash- or tree-map) that allows fast lookup of pointers into the heap given a word).

示例:

输入:这是一个测试。这是一个编程测试。编程测试这是

寻找:这个,测试,a,编程

                1    2  3  4     5    6  7  8           9     10 11          12   13   14
                This is a  test. This is a  programming test. a  programming test this is
this         -1 1    1  1  1     5    5  5  5           5     5  5           5    13   13
test         -1 -1   -1 -1 4     4    4  4  4           9     9  9           12   12   12
a            -1 -1   -1 3  3     3    3  7  7           7     10 10          10   10   10
programming  -1 -1   -1 -1 -1    -1   -1 -1 8           8     8  11          11   11   11
Count        0  1    1  2  3     3    3  3  4           4     4  4           4    4    4
Substr len   NA NA   NA NA NA    NA   NA NA 5           5     6  7           8    4    5
Shortest len NA NA   NA NA NA    NA   NA NA 5           5     5  5           5    4    4

最佳结果:编程测试,长度= 4。

Best result: a programming test this, length = 4.

复杂性分析:

n 是输入中的字数和 k 我们的字数寻找。

Let n be the number of words in the input and k the number of words we're looking for.

a lgorithm只会通过输入,并在每一步, O(log k) getMin 工作操作(使用堆优化)。

The algorithm only does one pass through the input and, at each step, does O(log k) work for the getMin operation (with the heap optimization).

因此,总时间为 O(n log k)

处理重复项:

如果我们在单词中允许重复项寻找(并且目标序列必须匹配所有出现的),上面的算法将不会按原样工作,但一个简单的解决方法是让每个不同的单词都有自己的指针堆到原始堆中(此堆中的值为与原始堆中的值相同),此堆的最大大小是我们要查找的单词中该单词的出现次数。

If duplicates are allowed in the words we're looking for (and the target sequence must match all occurrences), the algorithm above won't work as is, but a simple fix is to have each distinct word have its own heap of pointers into the original heap (the value in this heap being the same as the value in the original heap), with the maximum size of this heap being the number of occurrences of that word in the words we're looking for.

这篇关于找到包含给定单词的最短子串的方法:需要优化的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆