这是可以接受的算法吗? [英] Is this an acceptable algorithm?

查看:82
本文介绍了这是可以接受的算法吗?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我设计了一种算法来查找最长的公共子序列.这些是步骤:

I've designed an algorithm to find the longest common subsequence. these are steps:

  • 选择第一个字符串中的第一个字母.

  • Pick the first letter in the first string.

在第二个字符串中查找它,如果找到它,则将该字母添加到 common_subsequence并将其位置存储在index中,否则 比较common_subsequence的长度和lcs的长度 如果值更大,则将其值赋给lcs.

Look for it in the second string and if its found, Add that letter to common_subsequence and store its position in index, Otherwise compare the length of common_subsequence with the length of lcs and if its greater, asign its value to lcs.

返回第一个字符串并选择下一个字母,然后重复 再次执行上一步,但是这次从第index个字母

Return to the first string and pick the next letter and repeat the previous step again, But this time start searching from indexth letter

重复此过程,直到第一个字符串中没有字母为止 挑选.最后,lcs的值是最长公共 子序列.

Repeat this process until there is no letter in the first string to pick. At the end the value of lcs is the Longest Common Subsequence.

这是一个示例:

This is an example:
‫‪

X=A, B, C, B, D, A, B‬‬  
‫‪Y=B, D, C, A, B, A‬‬ 

在第一个字符串中选择A.
Y中查找A.
现在第二个字符串中有一个A,将其附加到common_subsequence.
返回第一个字符串并选择下一个字母B. 这次从A的位置开始在第二个字符串中查找B.
A之后有一个B,因此将B附加到common_subsequence.
现在,在第一个字符串C中选择下一个字母.第二个字符串中的B旁边没有C.因此,将common_subsequence的值分配给lcs,因为它的长度大于lcs的长度. 重复前面的步骤,直到到达第一个字符串的末尾.最后,lcs的值是最长公共子序列.

Pick A in the first string.
Look for A in Y.
Now that there is an A in the second string, append it to common_subsequence.
Return to the first string and pick the next letter that is B. Look for B in the second string this time starting from the position of A.
There is a B after A so append B to common_subsequence.
Now pick the next letter in the first string that is C. There isn't a C next to B in the second string. So assign the value of common_subsequence to lcs because its length is greater than the length of lcs. repeat the previous steps until reaching the end of the first string. In the end the value of lcs is the Longest Common Subsequence.

此算法的复杂度为theta(n * m). 这是我的实现:

The complexity of this algorithm is theta(n*m). Here is my implementations:

第一个算法:

import time
def lcs(xstr, ystr):
    if not (xstr and ystr): return # if string is empty
    lcs = [''] #  longest common subsequence
    lcslen = 0 # length of longest common subsequence so far
    for i in xrange(len(xstr)):
        cs = '' # common subsequence
        start = 0 # start position in ystr
        for item in xstr[i:]:
            index = ystr.find(item, start) # position at the common letter
            if index != -1: # if common letter has found
                cs += item # add common letter to the cs
                start = index + 1
            if index == len(ystr) - 1: break # if reached end of the ystr
        # update lcs and lcslen if found better cs
        if len(cs) > lcslen: lcs, lcslen = [cs], len(cs) 
        elif len(cs) == lcslen: lcs.append(cs)
    return lcs

file1 = open('/home/saji/file1')
file2 = open('/home/saji/file2')
xstr = file1.read()
ystr = file2.read()

start = time.time()
lcss = lcs(xstr, ystr)
elapsed = (time.time() - start)
print elapsed

使用哈希表的相同算法:

The same algorithm using hash table:

import time
from collections import defaultdict
def lcs(xstr, ystr):
    if not (xstr and ystr): return # if strings are empty
    lcs = [''] #  longest common subsequence
    lcslen = 0 # length of longest common subsequence so far
    location = defaultdict(list) # keeps track of items in the ystr
    i = 0
    for k in ystr:
        location[k].append(i)
        i += 1
    for i in xrange(len(xstr)):
        cs = '' # common subsequence
        index = -1
        reached_index = defaultdict(int)
        for item in xstr[i:]:
            for new_index in location[item][reached_index[item]:]:
                reached_index[item] += 1
                if index < new_index:
                    cs += item # add item to the cs
                    index = new_index
                    break
            if index == len(ystr) - 1: break # if reached end of the ystr
        # update lcs and lcslen if found better cs
        if len(cs) > lcslen: lcs, lcslen = [cs], len(cs) 
        elif len(cs) == lcslen: lcs.append(cs)
    return lcs

file1 = open('/home/saji/file1')
file2 = open('/home/saji/file2')
xstr = file1.read()
ystr = file2.read()

start = time.time()
lcss = lcs(xstr, ystr)
elapsed = (time.time() - start)
print elapsed

推荐答案

如果您的教授希望您发明自己的LCS算法,那么您就完成了.您的算法并不是有史以来最优化的算法,但是它属于正确的复杂性类,您可以清楚地理解它,并且您显然没有从互联网复制实现.您可能需要准备捍卫自己的算法,或讨论替代方法.如果我是您的教授,则在以下情况下给您A:

If your professor wants you to invent your own LCS algorithm, you're done. Your algorithm is not the most optimal one ever created, but it's in the right complexity class, you clearly understand it, and you clearly didn't copy your implementation from the internet. You might want to be prepared to defend your algorithm, or discuss alternatives. If I were your prof, I'd give you an A if:

  • 您打开了该程序.
  • 您能够解释为什么没有可能的O(N)或O(N log M)替代方案.
  • 您能够参加关于其他算法的合理讨论,这些算法可能具有更好的 lower 界限(或常数,等等),以及时间/空间折衷等,甚至如果您事先不知道讨论的结果.
  • You turned in that program.
  • You were able to explain why there's no possible O(N) or O(N log M) alternative.
  • You were able to participate in a reasonable discussion about other algorithms that might have a better lower bound (or significantly lower constants, etc.), and the time/space tradeoffs, etc., even if you didn't know the outcome of that discussion in advance.

另一方面,如果您的教授希望您选择一种众所周知的算法并编写自己的实现,则您可能希望使用标准的LP算法.由于某种原因,这是一种标准算法-您可能需要继续阅读,直到了解为止. (即使不进行测试,您也要上这堂课来学习,而不仅仅是打动教授,对吧?)

On the other hand, if your professor wants you to pick one of the well-known algorithms and write your own implementation, you probably want to use the standard LP algorithm. It's a standard algorithm for a reason—which you probably want to read up on until you understand. (Even if it isn't going to be on the test, you're taking this class to learn, not just to impress the prof, right?)

维基百科具有用于基本实现的伪代码,然后是常用优化的英语描述.我非常确定,基于该页面上的内容编写自己的Python代码不会被视为窃,甚至不会被视为琐碎的端口,特别是如果您可以证明自己了解代码在做什么,为什么,为什么以及为什么的话这是一个很好的算法.另外,您正在用Python编写该代码,该代码具有比该文章中演示的更好的方式来进行记忆,因此,如果您了解它的工作原理,那么您的代码实际上应该比Wikipedia所提供的更好.

Wikipedia has pseudocode for a basic implementation, then English-language descriptions of common optimizations. I'm pretty sure that writing your own Python code based on what's on that page wouldn't count as plagiarism, or even as a trivial port, especially if you can demonstrate that you understand what your code is doing, and why, and why it's a good algorithm. Plus, you're writing it in Python, which has much better ways to memoize than what's demonstrated in that article, so if you understand how it works, your code should actually be substantially better than what Wikipedia gives you.

无论哪种方式,正如我在评论中建议的那样,我都会阅读对最长的通用子序列算法的调查,作者是Bergroth,Hakonen和Raita,并在网上搜索类似的论文.

Either way, as I suggested in the comments, I'd read A survey of longest common subsequence algorithms by Bergroth, Hakonen, and Raita, and search for similar papers online.

这篇关于这是可以接受的算法吗?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆