优化整数系数列表及其长整数表示之间的转换 [英] Optimize conversion between list of integer coefficients and its long integer representation

查看:96
本文介绍了优化整数系数列表及其长整数表示之间的转换的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试优化我的多项式实现.特别地,我要处理系数为n(可能为>2^64)且多项式为x^r - 1(r< 2^64)的多项式.目前,我将系数表示为一个整数列表(*),并且已经以最直接的方式实现了所有基本运算.

我希望求幂和乘法尽可能快,并且要获得这个,我已经尝试了不同的方法.我目前的方法是将系数列表转换成巨大的整数,然后将这些整数乘以并解压缩系数.

问题在于包装和拆箱要花费很多时间.

那么,有没有一种方法可以改善我的打包"功能?

def _coefs_to_long(coefs, window):
    '''Given a sequence of coefficients *coefs* and the *window* size return a
    long-integer representation of these coefficients.
    '''

    res = 0
    adder = 0
    for k in coefs:
        res += k << adder
        adder += window
    return res
    #for k in reversed(coefs): res = (res << window) + k is slower


def _long_to_coefs(long_repr, window, n):
    '''Given a long-integer representing coefficients of size *window*, return
    the list of coefficients modulo *n*.
    '''

    mask = 2**window - 1
    coefs = [0] * (long_repr.bit_length() // window + 1)
    for i in xrange(len(coefs)):
        coefs[i] = (long_repr & mask) % n
        long_repr >>= window

    # assure that the returned list is never empty, and hasn't got an extra 0.
    if not coefs:
        coefs.append(0)
    elif not coefs[-1] and len(coefs) > 1:
        coefs.pop()

    return coefs

请注意,我选择n,它是用户的输入,并且我的程序想证明其原始性(使用AKS测试),因此我无法将其分解


(*)我尝试了几种方法:

  1. 使用numpy数组而不是列表,并使用numpy.convolve进行乘法运算. n < 2^64很快,但是n > 2^64却非常慢[我也想避免使用外部库]
  2. 使用scipy.fftconvolve.对于n > 2^64完全无效.
  3. 从一开始就将系数表示为整数(无需每次都进行转换).问题在于,我不知道一种简单的方法来执行mod x^r -1操作,而不将整数转换为系数列表(这使使用这种表示形式的原因无法实现).

解决方案

尽管我仍然希望有人可以帮助我进一步提高转化率,并找到其他聪明的主意,但我还是找到了一种优化转化的方法. /p>

这些函数的基本问题是,在打包整数或解压缩整数时,它们具有某种二次存储分配行为. (有关此类行为的其他示例,请参见Guido van Rossum的帖子).

意识到这一点后,我决定尝试使用Divide et Impera原理,并获得了一些结果.我只是将数组分为两部分,分别进行转换,然后最终加入结果(稍后,我将尝试使用类似于Rossum帖子中的f5的迭代版本. ).

修改后的功能

def _coefs_to_long(coefs, window):
    """Given a sequence of coefficients *coefs* and the *window* size return a
    long-integer representation of these coefficients.
    """

    length = len(coefs)
    if length < 100:
        res = 0
        adder = 0
        for k in coefs:
            res += k << adder
            adder += window
        return res
    else:
        half_index = length // 2
        big_window = window * half_index
        low = _coefs_to_long(coefs[:half_index], window)
        high = _coefs_to_long(coefs[half_index:], window)
        return low + (high << big_window)


def _long_to_coefs(long_repr, window, n):
    """Given a long-integer representing coefficients of size *window*, return
    the list of coefficients modulo *n*.
    """

    win_length = long_repr.bit_length() // window
    if win_length < 256:
        mask = 2**window - 1
        coefs = [0] * (long_repr.bit_length() // window + 1)
        for i in xrange(len(coefs)):
            coefs[i] = (long_repr & mask) % n
            long_repr >>= window

        # assure that the returned list is never empty, and hasn't got an extra 0.
        if not coefs:
            coefs.append(0)
        elif not coefs[-1] and len(coefs) > 1:
            coefs.pop()

        return coefs
    else:
        half_len = win_length // 2
        low = long_repr & (((2**window) ** half_len) - 1)
        high = long_repr >> (window * half_len)
        return _long_to_coefs(low, window, n) + _long_to_coefs(high, window, n) 

结果:

>>> import timeit
>>> def coefs_to_long2(coefs, window):
...     if len(coefs) < 100:
...         return coefs_to_long(coefs, window)
...     else:
...         half_index = len(coefs) // 2
...         big_window = window * half_index
...         least = coefs_to_long2(coefs[:half_index], window) 
...         up = coefs_to_long2(coefs[half_index:], window)
...         return least + (up << big_window)
... 
>>> coefs = [1, 2, 3, 1024, 256] * 567
>>> # original function
>>> timeit.timeit('coefs_to_long(coefs, 11)', 'from __main__ import coefs_to_long, coefs',
...               number=1000)/1000
0.003283214092254639
>>> timeit.timeit('coefs_to_long2(coefs, 11)', 'from __main__ import coefs_to_long2, coefs',
...               number=1000)/1000
0.0007998988628387451
>>> 0.003283214092254639 / _
4.104536516782767
>>> coefs = [2**64, 2**31, 10, 107] * 567
>>> timeit.timeit('coefs_to_long(coefs, 66)', 'from __main__ import coefs_to_long, coefs',...               number=1000)/1000

0.009775240898132325
>>> 
>>> timeit.timeit('coefs_to_long2(coefs, 66)', 'from __main__ import coefs_to_long2, coefs',
...               number=1000)/1000
0.0012255229949951173
>>> 
>>> 0.009775240898132325 / _
7.97638309362875

如您所见,此版本可将转换速度大大提高,从48快了好几倍(输入越大,速度越大). 使用第二个函数可以获得类似的结果:

>>> import timeit
>>> def long_to_coefs2(long_repr, window, n):
...     win_length = long_repr.bit_length() // window
...     if win_length < 256:
...         return long_to_coefs(long_repr, window, n)
...     else:
...         half_len = win_length // 2
...         least = long_repr & (((2**window) ** half_len) - 1)
...         up = long_repr >> (window * half_len)
...         return long_to_coefs2(least, window, n) + long_to_coefs2(up, window, n)
... 
>>> long_repr = coefs_to_long([1,2,3,1024,512, 0, 3] * 456, 13)
>>> # original function
>>> timeit.timeit('long_to_coefs(long_repr, 13, 1025)', 'from __main__ import long_to_coefs, long_repr', number=1000)/1000
0.005114212036132813
>>> timeit.timeit('long_to_coefs2(long_repr, 13, 1025)', 'from __main__ import long_to_coefs2, long_repr', number=1000)/1000
0.001701267957687378
>>> 0.005114212036132813 / _
3.006117885794327
>>> long_repr = coefs_to_long([1,2**33,3**17,1024,512, 0, 3] * 456, 40)
>>> timeit.timeit('long_to_coefs(long_repr, 13, 1025)', 'from __main__ import long_to_coefs, long_repr', number=1000)/1000
0.04037192392349243
>>> timeit.timeit('long_to_coefs2(long_repr, 13, 1025)', 'from __main__ import long_to_coefs2, long_repr', number=1000)/1000
0.005722791910171509
>>> 0.04037192392349243 / _
7.0545853417694

我试图避免在第一个函数中通过起始索引和结束索引来避免更多的内存重新分配,并避免切片,但是事实证明,对于小输入,这会使该函数的运行速度大大降低,并且对于实际输入. 即使我认为我不会获得更好的结果,也许我可以尝试将它们混合在一起.


我在上一阶段编辑了问题,因此有些人给了我一些建议,其目的与我最近的要求有所不同.我认为重要的是要澄清注释和答案中不同来源指出的结果,以便它们对希望实现快速多项式和AKS测试的其他人有用.

  • 正如J.F. Sebastian指出的那样,AKS算法得到了许多改进,因此尝试实现该算法的旧版本始终会导致程序运行缓慢.这并不排除以下事实:如果您已经很好地实现了AKS,则可以加快其改进多项式的速度.
  • 如果您对以较小的n为模的系数感兴趣(读取:字长数字),并且您不介意外部依赖性,则选择numpy并使用numpy.convolvescipy.fftconvolve进行乘法.这将比您可以编写的任何东西都要快得多.不幸的是,如果n不是单词大小,则根本无法使用scipy.fftconvolve,而且numpy.convolve也会变得很慢.
  • 如果您不必进行模运算(在系数和多项式上),那么即使我不承诺会有惊人的结果,即使使用ZBDD也是个好主意(如harold所指出的那样)我认为这真的很有趣,您应该阅读Minato的论文].
  • 如果您不必对系数进行模运算,那么最好使用Origin所声明的RNS表示.然后,您可以组合多个numpy阵列以有效运行.
  • 如果您希望系数为大n的多项式的纯Python实现,那么我的解决方案似乎是最快的.即使我没有尝试在python中的系数数组之间实现fft乘法(可能可能会更快).

I'm trying to optimize a polynomial implementation of mine. In particular I'm dealing with polynomials with coefficients modulo n(might be >2^64) and modulo a polynomial in the form x^r - 1(r is < 2^64). At the moment I represent the coefficient as a list of integers(*) and I've implemented all the basic operations in the most straightforward way.

I'd like the exponentiation and multiplication to be as fast as possible, and to obtain this I've already tried different approaches. My current approach is to convert the lists of coefficients into huge integers multiply the integers and unpack back the coefficients.

The problem is that packing and unpacking takes a lot of time.

So, is there a way of improving my "pack/unpack" functions?

def _coefs_to_long(coefs, window):
    '''Given a sequence of coefficients *coefs* and the *window* size return a
    long-integer representation of these coefficients.
    '''

    res = 0
    adder = 0
    for k in coefs:
        res += k << adder
        adder += window
    return res
    #for k in reversed(coefs): res = (res << window) + k is slower


def _long_to_coefs(long_repr, window, n):
    '''Given a long-integer representing coefficients of size *window*, return
    the list of coefficients modulo *n*.
    '''

    mask = 2**window - 1
    coefs = [0] * (long_repr.bit_length() // window + 1)
    for i in xrange(len(coefs)):
        coefs[i] = (long_repr & mask) % n
        long_repr >>= window

    # assure that the returned list is never empty, and hasn't got an extra 0.
    if not coefs:
        coefs.append(0)
    elif not coefs[-1] and len(coefs) > 1:
        coefs.pop()

    return coefs

Note that I do not choose n, it is an input from the user, and my program wants to prove its primality(using the AKS test), so I can't factorize it.


(*) I've tried several approaches:

  1. Using a numpy array instead of a list and multiply using numpy.convolve. It's fast for n < 2^64 but terribly slow for n > 2^64[also I'd like to avoid using external libraries]
  2. Using scipy.fftconvolve. Doesn't work at all for n > 2^64.
  3. Represent the coefficients as integers from the start(without converting them every time). The problem is that I don't know of an easy way to do the mod x^r -1 operation without converting the integer to a list of coefficients(which defeats the reason of using this representation).

解决方案

I found a way to optimize the conversions, even though I still hope that someone could help me improve them even more, and hopefully find some other clever idea.

Basically what's wrong with those functions is that they have some kind of quadratic memory allocation behaviour, when packing the integer, or when unpacking it. (See this post of Guido van Rossum for an other example of this kind of behaviour).

After I realized this I've decided to give a try with the Divide et Impera principle, and I've obtained some results. I simply divide the array in two parts, convert them separately and eventually join the results(later I'll try to use an iterative version similar to the f5 in Rossum's post[edit: it doesn't seem to be much faster]).

The modified functions:

def _coefs_to_long(coefs, window):
    """Given a sequence of coefficients *coefs* and the *window* size return a
    long-integer representation of these coefficients.
    """

    length = len(coefs)
    if length < 100:
        res = 0
        adder = 0
        for k in coefs:
            res += k << adder
            adder += window
        return res
    else:
        half_index = length // 2
        big_window = window * half_index
        low = _coefs_to_long(coefs[:half_index], window)
        high = _coefs_to_long(coefs[half_index:], window)
        return low + (high << big_window)


def _long_to_coefs(long_repr, window, n):
    """Given a long-integer representing coefficients of size *window*, return
    the list of coefficients modulo *n*.
    """

    win_length = long_repr.bit_length() // window
    if win_length < 256:
        mask = 2**window - 1
        coefs = [0] * (long_repr.bit_length() // window + 1)
        for i in xrange(len(coefs)):
            coefs[i] = (long_repr & mask) % n
            long_repr >>= window

        # assure that the returned list is never empty, and hasn't got an extra 0.
        if not coefs:
            coefs.append(0)
        elif not coefs[-1] and len(coefs) > 1:
            coefs.pop()

        return coefs
    else:
        half_len = win_length // 2
        low = long_repr & (((2**window) ** half_len) - 1)
        high = long_repr >> (window * half_len)
        return _long_to_coefs(low, window, n) + _long_to_coefs(high, window, n) 

And the results:

>>> import timeit
>>> def coefs_to_long2(coefs, window):
...     if len(coefs) < 100:
...         return coefs_to_long(coefs, window)
...     else:
...         half_index = len(coefs) // 2
...         big_window = window * half_index
...         least = coefs_to_long2(coefs[:half_index], window) 
...         up = coefs_to_long2(coefs[half_index:], window)
...         return least + (up << big_window)
... 
>>> coefs = [1, 2, 3, 1024, 256] * 567
>>> # original function
>>> timeit.timeit('coefs_to_long(coefs, 11)', 'from __main__ import coefs_to_long, coefs',
...               number=1000)/1000
0.003283214092254639
>>> timeit.timeit('coefs_to_long2(coefs, 11)', 'from __main__ import coefs_to_long2, coefs',
...               number=1000)/1000
0.0007998988628387451
>>> 0.003283214092254639 / _
4.104536516782767
>>> coefs = [2**64, 2**31, 10, 107] * 567
>>> timeit.timeit('coefs_to_long(coefs, 66)', 'from __main__ import coefs_to_long, coefs',...               number=1000)/1000

0.009775240898132325
>>> 
>>> timeit.timeit('coefs_to_long2(coefs, 66)', 'from __main__ import coefs_to_long2, coefs',
...               number=1000)/1000
0.0012255229949951173
>>> 
>>> 0.009775240898132325 / _
7.97638309362875

As you can see this version gives quite a speed up to the conversion, from 4 to 8 times faster(and bigger the input, bigger is the speed up). A similar result is obtained with the second function:

>>> import timeit
>>> def long_to_coefs2(long_repr, window, n):
...     win_length = long_repr.bit_length() // window
...     if win_length < 256:
...         return long_to_coefs(long_repr, window, n)
...     else:
...         half_len = win_length // 2
...         least = long_repr & (((2**window) ** half_len) - 1)
...         up = long_repr >> (window * half_len)
...         return long_to_coefs2(least, window, n) + long_to_coefs2(up, window, n)
... 
>>> long_repr = coefs_to_long([1,2,3,1024,512, 0, 3] * 456, 13)
>>> # original function
>>> timeit.timeit('long_to_coefs(long_repr, 13, 1025)', 'from __main__ import long_to_coefs, long_repr', number=1000)/1000
0.005114212036132813
>>> timeit.timeit('long_to_coefs2(long_repr, 13, 1025)', 'from __main__ import long_to_coefs2, long_repr', number=1000)/1000
0.001701267957687378
>>> 0.005114212036132813 / _
3.006117885794327
>>> long_repr = coefs_to_long([1,2**33,3**17,1024,512, 0, 3] * 456, 40)
>>> timeit.timeit('long_to_coefs(long_repr, 13, 1025)', 'from __main__ import long_to_coefs, long_repr', number=1000)/1000
0.04037192392349243
>>> timeit.timeit('long_to_coefs2(long_repr, 13, 1025)', 'from __main__ import long_to_coefs2, long_repr', number=1000)/1000
0.005722791910171509
>>> 0.04037192392349243 / _
7.0545853417694

I've tried to avoid more memory reallocation in the first function passing around the start and end indexes and avoiding slicing, but it turns out that this slows the function down quite much for small inputs and it's a just a bit slower for real-case inputs. Maybe I could try to mix them, even though I don't think I'll obtain much better results.


I've edited my question in the last period therefore some people gave me some advice with a different aim then what I required recently. I think it's important to clarify a bit the results pointed out by different sources in the comments and the answers, so that they can be useful for other people looking to implement fast polynomials and or AKS test.

  • As J.F. Sebastian pointed out the AKS algorithm receive many improvements, and so trying to implement an old version of the algorithm will always result in a very slow program. This does not exclude the fact that if you already have a good implementation of AKS you can speed it up improving the polynomials.
  • If you are interested in coefficients modulo a small n(read: word-size number) and you don't mind external dependencies,then go for numpy and use numpy.convolve or scipy.fftconvolve for the multiplication. It will be much faster than anything you can write. Unfortunately if n is not word size you can't use scipy.fftconvolve at all, and also numpy.convolve becomes slow as hell.
  • If you don't have to do modulo operations(on the coefficients and on the polynomial), then probably using ZBDDs is a good idea(as pointed out by harold), even though I do not promise spectacular results[even though I think it's really interesting and you ought to read Minato's paper].
  • If you don't have to do modulo operations on the coefficients then probably using an RNS representation, as stated by Origin, is a good idea. Then you can combine multiple numpy arrays to operate efficiently.
  • If you want a pure-python implementation of polynomials with coefficient modulo a big n, then my solution seems to be the fastest. Even though I did not try to implement fft multiplication between arrays of coefficients in python(which may be faster).

这篇关于优化整数系数列表及其长整数表示之间的转换的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆