Performance issues in Burrows-Wheeler in python

Question

I was trying to implement Burrows-Wheeler transform in python. (This is one of the assignments in online course, but I hope I have done some work to be qualified to ask for help).

The algorithm works as follows. Take a string which ends with a special character ($ in my case) and create all cyclic rotations of this string. Sort all these strings alphabetically, with the special character always comparing less than any other character. After this, take the last character of each rotation.
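For instance (my own illustration, not part of the original question), here is the classic 'banana$' input worked through: the sorted rotations and the last character of each, which together form the transform.

text = 'banana$'
# sorted rotations    last char
#   $banana        ->  a
#   a$banan        ->  n
#   ana$ban        ->  n
#   anana$b        ->  b
#   banana$        ->  $
#   na$bana        ->  a
#   nana$ba        ->  a
rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
assert ''.join(r[-1] for r in rotations) == 'annb$aa'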

This gives me a one-liner:

''.join([i[-1] for i in sorted([text[i:] + text[0:i] for i in xrange(len(text))])])

Which is correct and reasonably fast for reasonably big strings (which is enough to solve the problem):

 60 000 chars - 16 secs
 40 000 chars - 07 secs
 25 000 chars - 02 secs

But when I tried to process a really huge string with a few million characters, I failed (it takes too much time to process).

I assume that the problem is with storing too many strings in memory.

Is there any way to overcome this?

P.S. I just want to point out that although this might look like a homework problem, my solution already passes the grader and I am just looking for a way to make it faster. I am also not spoiling the fun for other people, because if they want to find a solution, the wiki article has one which is similar to mine. I also checked this question, which sounds similar but answers a harder question: how to decode a string encoded with this algorithm.

Answer

It takes a long time to make all those string slices with long strings. It's at least O(N^2) (since you create N strings of N length, and each one has to be copied into memory taking its source data from the original), which destroys the overall performance and makes the sorting irrelevant. Not to mention the memory requirement!
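To put a rough number on that (my own back-of-the-envelope figure, not from the original answer): a few million characters means a few million rotations of a few million characters each.

n = 3 * 10**6    # say, a 3-million-character input
print(n * n)     # ~9e12 characters' worth of rotations to build and hold before sorting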

Instead of actually slicing the string, the next thought is to order the i values you use to create the cyclic strings, in order of how the resulting string would compare - without actually creating it. This turns out to be somewhat tricky. (Removed/edited some stuff here that was wrong; please see @TimPeters' answer.)

The approach I've taken here is to bypass the standard library - which makes it difficult (though not impossible) to compare those strings 'on demand' - and do my own sorting. The natural choice of algorithm here is radix sort, since we need to consider the strings one character at a time anyway.

Let's get set up first. I am writing code for version 3.2, so season to taste. (In particular, in 3.3 and up, we could take advantage of yield from.) I am using the following imports:

from random import choice
from timeit import timeit
from functools import partial

I wrote a general-purpose radix sort function like this:

def radix_sort(values, key, step=0):
    # Base case: zero or one value is already sorted.
    if len(values) < 2:
        for value in values:
            yield value
        return

    # Bucket the values by their radix at the current step.
    bins = {}
    for value in values:
        bins.setdefault(key(value, step), []).append(value)

    # Emit the buckets in sorted key order, recursively sorting each
    # bucket's contents on the next radix.
    for k in sorted(bins.keys()):
        for r in radix_sort(bins[k], key, step + 1):
            yield r
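For example (my own quick illustration, not part of the original answer), the generator sorts plain equal-length strings if the key simply indexes into the value:

words = ['ba', 'ab', 'aa', 'bb']
# All strings are distinct and equally long, so the recursion bottoms out
# before `step` can run past the end of any string.
print(list(radix_sort(words, lambda v, n: v[n])))   # ['aa', 'ab', 'ba', 'bb']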

Of course, we don't need to be general-purpose (our 'bins' can only be labelled with single characters, and presumably you really mean to apply the algorithm to a sequence of bytes ;) ), but it doesn't hurt. Might as well have something reusable, right? Anyway, the idea is simple: we handle a base case, and then we drop each element into a "bin" according to the result from the key function, and then we pull values out of the bins in sorted bin order, recursively sorting each bin's contents.

The interface requires that key(value, n) gives us the nth "radix" of value. So for simple cases, like comparing strings directly, that could be as simple as lambda v, n: v[n]. Here, though, the idea is to compare indices into the string, according to the data in the string at that point (considered cyclically). So let's define a key:

def bw_key(text, value, step):
    return text[(value + step) % len(text)]

Now the trick to getting the right results is to remember that we're conceptually joining up the last characters of the strings we aren't actually creating. If we consider the virtual string made using index n, its last character is at index n - 1, because of how we wrap around - and a moment's thought will confirm to you that this still works when n == 0 ;) . [However, when we wrap forwards, we still need to keep the string index in-bounds - hence the modulo operation in the key function.]
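A minimal sketch of both wrap-arounds (my own example, assuming the bw_key above and the 'banana$' input from earlier):

text = 'banana$'
# Forwards: bw_key reads the 4th character of the rotation starting at index 5
# ('a$banan') without ever building that rotation.
assert bw_key(text, 5, 3) == 'a$banan'[3] == 'a'
# Backwards: the last character of the rotation starting at index 0 is
# text[0 - 1] == text[-1] == '$', so negative indexing handles that wrap for us.
assert text[0 - 1] == '$'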

This is a general key function that needs to be passed in the text to which it will refer when transforming the values for comparison. That's where functools.partial comes in - you could also just mess around with lambda, but this is arguably cleaner, and I've found it's usually faster, too.

Anyway, now we can easily write the actual transform using the key:

def burroughs_wheeler_custom(text):
    return ''.join(text[i - 1] for i in radix_sort(range(len(text)), partial(bw_key, text)))
    # Notice I've dropped the square brackets; this means I'm passing a generator
    # expression to `join` instead of a list comprehension. In general, this is
    # a little slower, but uses less memory. And the underlying code uses lazy
    # evaluation heavily, so :)

Nice and pretty. Let's see how it does, shall we? We need a standard to compare it against:

def burroughs_wheeler_standard(text):
    return ''.join([i[-1] for i in sorted([text[i:] + text[:i] for i in range(len(text))])])
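As a quick spot check (my own example, not from the original answer), both versions agree on the 'banana$' input used above:

print(burroughs_wheeler_custom('banana$'))    # annb$aa
print(burroughs_wheeler_standard('banana$'))  # annb$aa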

And a timing routine:

def test(n):
    data = ''.join(choice('abcdefghijklmnopqrstuvwxyz') for i in range(n)) + '$'
    custom = partial(burroughs_wheeler_custom, data)
    standard = partial(burroughs_wheeler_standard, data)
    assert custom() == standard()
    trials = 1000000 // n
    custom_time = timeit(custom, number=trials)
    standard_time = timeit(standard, number=trials)
    print("custom: {} standard: {}".format(custom_time, standard_time))

Notice the math I've done to decide on a number of trials, inversely related to the length of the test string. This should keep the total time used for testing in a reasonably narrow range - right? ;) (Wrong, of course, since we established that the standard algorithm is at least O(N^2).)

Let's see how it does (*drumroll*):

>>> imp.reload(burroughs_wheeler)
<module 'burroughs_wheeler' from 'burroughs_wheeler.py'>
>>> burroughs_wheeler.test(100)
custom: 4.7095093091438684 standard: 0.9819262643716229
>>> burroughs_wheeler.test(1000)
custom: 5.532266880287807 standard: 2.1733253807396977
>>> burroughs_wheeler.test(10000)
custom: 5.954826800612864 standard: 42.50686064849015

Whoa, that's a bit of a frightening jump. Anyway, as you can see, the new approach adds a ton of overhead on short strings, but enables the actual sorting to be the bottleneck instead of string slicing. :)
