Effcient的方法找到Python的最长重复的字符串(从编程珍珠) [英] Effcient way to find longest duplicate string for Python (From Programming Pearls)

查看:238
本文介绍了Effcient的方法找到Python的最长重复的字符串(从编程珍珠)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

从编程珍珠15.2节

在C codeS可以看这里:<一href=\"http://www.cs.bell-labs.com/cm/cs/pearls/longdup.c\">http://www.cs.bell-labs.com/cm/cs/pearls/longdup.c

The C codes can be viewed here: http://www.cs.bell-labs.com/cm/cs/pearls/longdup.c

当我使用后缀阵列中实现它的Python:

When I implement it in Python using suffix-array:

example = open("iliad10.txt").read()
def comlen(p, q):
    i = 0
    for x in zip(p, q):
        if x[0] == x[1]:
            i += 1
        else:
            break
    return i

suffix_list = []
example_len = len(example)
idx = list(range(example_len))
idx.sort(cmp = lambda a, b: cmp(example[a:], example[b:]))  #VERY VERY SLOW

max_len = -1
for i in range(example_len - 1):
    this_len = comlen(example[idx[i]:], example[idx[i+1]:])
    print this_len
    if this_len > max_len:
        max_len = this_len
        maxi = i

我找到了 idx.sort 一步很慢。我认为这是缓慢的,因为Python中需要通过价值,而不是通过指针传递字符串(如C codeS以上)。

I found it very slow for the idx.sort step. I think it's slow because Python need to pass the substring by value instead of by pointer (as the C codes above).

该测试的文件可以从这里下载

The tested file can be downloaded from here

在C codeS需要仅需0.3秒完成。

The C codes need only 0.3 seconds to finish.

time cat iliad10.txt |./longdup 
On this the rest of the Achaeans with one voice were for
respecting the priest and taking the ransom that he offered; but
not so Agamemnon, who spoke fiercely to him and sent him roughly
away. 

real    0m0.328s
user    0m0.291s
sys 0m0.006s

但对于Python codeS,它从来没有在我的电脑上结束(我等了10分钟,把它打死了)

But for Python codes, it never ends on my computer (I waited for 10 minutes and killed it)

有没有人有想法如何使codeS高效? (例如,小于10秒)

Does anyone have ideas how to make the codes efficient? (For example, less than 10 seconds)

推荐答案

主要的问题似乎是Python不通过复制切片: http://stackoverflow.com/a/5722068/538551

The main problem seems to be that python does slicing by copy: http://stackoverflow.com/a/5722068/538551

您将不得不使用 memoryview ,而不是获得引用而不是复制。当我这样做,程序挂起的 idx.sort 功能(这是非常快)。

You'll have to use a memoryview instead to get a reference instead of a copy. When I did this, the program hung after the idx.sort function (which was very fast).

我敢肯定有一点的工作,就可以得到剩下的工作。

I'm sure with a little work, you can get the rest working.

编辑:

<击>上面的改变将无法正常工作,一个简易替换,因为 CMP 不工作方式与 STRCMP 。例如,尝试下面的C code:

The above change will not work as a drop-in replacement because cmp does not work the same way as strcmp. For example, try the following C code:

#include <stdio.h>
#include <string.h>

int main() {
    char* test1 = "ovided by The Internet Classics Archive";
    char* test2 = "rovided by The Internet Classics Archive.";
    printf("%d\n", strcmp(test1, test2));
}

和比较的结果,以这条巨蟒:

And compare the result to this python:

test1 = "ovided by The Internet Classics Archive";
test2 = "rovided by The Internet Classics Archive."
print(cmp(test1, test2))

的C code打印 -3 在我的机器上,而Python版本版画 1 。它看起来像例如 C code滥用 STRCMP 的返回值(它是在<$使用C $ C>的qsort 毕竟)。我找不到就当 STRCMP 将返回比其他东西的任何文档[ - 1,0,1] ,但添加的printf pstrcmp 在原来的code表现出了很多的值的范围之外(3 ,-31,5是第3个值)。

The C code prints -3 on my machine while the python version prints -1. It looks like the example C code is abusing the return value of strcmp (it IS used in qsort after all). I couldn't find any documentation on when strcmp will return something other than [-1, 0, 1], but adding a printf to pstrcmp in the original code showed a lot of values outside of that range (3, -31, 5 were the first 3 values).

要确保 -3 不是某些错误code,如果我们反向TEST1和TEST2,我们会得到 3

To make sure that -3 wasn't some error code, if we reverse test1 and test2, we'll get 3.

编辑:

以上是有趣的琐事,但在影响code任一大块方面实际上并不正确。我意识到这一点,正如我闭上我的笔记本电脑,离开一个W​​iFi区......真的应该仔细检查一切我打之前保存

The above is interesting trivia, but not actually correct in terms of affecting either chunks of code. I realized this just as I shut my laptop and left a wifi zone... Really should double check everything before I hit Save.

FWIW, CMP 最肯定也适用于 memoryview 对象(打印 1 如预期):

FWIW, cmp most certainly works on memoryview objects (prints -1 as expected):

print(cmp(memoryview(test1), memoryview(test2)))

我不知道为什么像预期的那样code不工作。我的机器上打印出清单看起来并不如预期。我一下,并试图找到,而不是在抓救命稻草更好的解决方案。

I'm not sure why the code isn't working as expected. Printing out the list on my machine does not look as expected. I'll look into this and try to find a better solution instead of grasping at straws.

这篇关于Effcient的方法找到Python的最长重复的字符串(从编程珍珠)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆