在素性条件测试由审判庭 [英] Conditional tests in primality by trial division

查看:112
本文介绍了在素性条件测试由审判庭的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我的问题是有关审判庭条件测试。似乎有什么条件测试采用一些争论。让我们来看看在code这从罗塞塔code

  INT is_prime(无符号整数N)
{
    无符号整型磷;
    如果((N&安培;!1)|| N'2)返回否== 2;    / *比较P * P< = n可以溢出* /
    为(p值= 3; P&下; = N / P; P + = 2)
        如果返回0((N%P)!);
    返回1;
}

轮式分解或使用质数的predetermined列表不会改变我的问题的实质。

有三种情况我能想到的做测试条件:


  1. P< = N / P

  2. P * P< = N

  3. INT切=开方(N);为(P = 3; P< =切换; P + = 2)

案例1:适用于所有的n,但它必须做一个额外的分工每次迭代(编辑:实际上它并不需要一个额外的分工,但它仍然较慢,我不知道为什么看到汇编输出如下图)。我发现它是两倍的情况下,2那里是主要的(我的Sandy Bridge系统上)N值很大缓慢。

案例2:是signficantly比1的情况下快,但它有它溢出了大n和进入不​​定式循环有问题。最大值,它可以处理为

 (开方(N)+ C)^ 2 = INT_MAX //解决
N = INT_MAX -2 * C *开方(INT_MAX)+ C ^ 2
// INT_MAX = 2 ^ 32 - > N = 2 ^ 32 - ç* S ^ 17 + C ^ 2;在我们的例子中C = 2

例如对于uint64_t中2的情况下将进入一个无限循环的X = -1L-58(X ^ 64-59),这是一个素数。

案例3:无乘除必须做每个迭代,它不会溢出样病例2.比2的情况下稍快为好。唯一的问题是,如果的sqrt(n)是足够准确的

为什么情况下2比情况1,以便更快有人能解释一下吗?案例1不使用额外的分工,我虽然不过尽管它仍然是慢了很多。

下面是黄金2 ^ 56-5的时代;

  1的情况下9.0s
案例2 4.6s
案例3 4.5S

下面是code我用来测试这个 HTTP://coliru.stacked- crooked.com/a/69497863a97d8953 。我还添加的功能对这个问题的末尾。

下面是汇编输出形式GCC 4.8 -O3与病例1 2的情况下他们都只有一个师。案例2的乘积,以及让我的第一个猜测是这种情况下,2会较慢,但它是关于快两倍于GCC和MSVC。我不知道为什么。

案例1:

  .L5:
  为test1%EDX,EDX%
  JE .L8
.L4:
  ADDL $ 2,ECX%
  xorl%EDX,EDX%
  MOVL%EDI,EAX%
  DIVL%ECX
  CMPL%ECX,EAX%
  宰.L5

案例2:

  .L20:
  xorl%EDX,EDX%
  MOVL%EDI,EAX%
  DIVL%ECX
  为test1%EDX,EDX%
  JE .L23
.L19:
  ADDL $ 2,ECX%
  MOVL%ECX,EAX%
  imull%ECX,EAX%
  CMPL%EAX,EDI%
  宰.L20

下面是我使用测试时间功能:

  INT is_prime(uint64_t中N)
{
    uint64_t中磷;
    如果((N&安培;!1)|| N'2)返回否== 2;    / *比较P * P< = n可以溢出* /
    为(p值= 3; P&下; = N / P; P + = 2)
        如果返回0((N%P)!);
    返回1;
}INT is_prime2(uint64_t中N)
{
    uint64_t中磷;
    如果((N&安培;!1)|| N'2)返回否== 2;    / *比较P * P< = n可以溢出* /
    为(p值= 3; P * P&下; = N; P + = 2)
        如果返回0((N%P)!);
    返回1;
}INT is_prime3(uint64_t中N)
{
    uint64_t中磷;
    如果((N&安培;!1)|| N'2)返回否== 2;    / *比较P * P< = n可以溢出* /
    uint32_t的切=开方(N);
    为(P = 3; P< =切换; P + = 2)
        如果返回0((N%P)!);
    返回1;
}

赏金后的添加的内容。

Aean发现,在节约商以及剩下的,只是比2的情况下以最快的速度(或速度稍快)的情况下1让我们把这种情况4.下列以下code是快一倍的情况下1

  INT is_prime4(uint64_t中N)
{
    uint64_t中P,Q,R;
    如果((N&安培;!1)|| N'2)返回否== 2;    为(P = 3,Q = N / P,R = N%P; P< = Q,P + = 2,Q = N / P,R = N%P)
        如果返回0(R!);
    返回1;
}

我不知道为什么这会有所帮助。在任何情况下,没有必要使用壳2了。对于案例3,的大多数版本的sqrt 功能的硬件或软件获得完美的正方形右侧所以它是安全的一般使用。案例三是将使用OpenMP工作的唯一情况。


解决方案

UPD:这是一个编译器优化的问题,很明显。虽然MinGW的只使用了一个 DIV 在循环体指令,在Linux和MSVC GCC都未能重用previous迭代商。

我想我们能做的最好是明确定义现状 REM ,并在相同的基本指令计算它们块,以示我们希望双方商和余数的编译器。

  INT is_prime(uint64_t中N)
{
    uint64_t中P = 3,现状,REM;
    如果((N&安培;!1)|| N'2)返回否== 2;    现状= N / P;
    为(; P&下; =现状; P + = 2){
        现状= N / P; REM = N%P;
        (!(REM))如果返回0;
    }
    返回1;
}


我想你的code从 http://coliru.stacked-crooked.com/在MinGW的-W64的编译器/ 69497863a97d8953 案例1 2的情况下

所以我猜您正在编译针对32位架构和使用 uint64_t中键入。您的装配说明它不使用任何64位的寄存器。

如果我这样做是正确,还有就是原因。

在32位体系结构,64位数字重新在两个32位寄存器psented $ P $,你的编译器将尽一切串联的作品。这很简单做64位加法,减法和乘法。但模和除法是由一个小的函数调用它命名为 ___ umoddi3 ___ udivdi3 在GCC,<$ C $完成C> aullrem 和 aulldiv 在MSVC。

所以,实际上你需要有一个 ___ umoddi3 和一个 ___ udivdi3 每次迭代的情况下1 ,一是 ___ udivdi3 64位乘法的一个串联的情况下2 。这就是为什么案例1 似乎比 2的情况下在您的测试。

慢两倍

你真正得到案例1

  L5:
    ADDL $ 2,%ESI
    ADCL $ 0%EDI
    MOVL%ESI,8(%ESP)
    MOVL%EDI,12(%ESP)
    MOVL%EBX(%ESP)
    MOVL%EBP,4(%ESP)
    呼吁___udivdi3 //为DIV的调用
    CMPL%EDI,EDX%
    JA L6
    宰L21
L6:
    MOVL%ESI,8(%ESP)
    MOVL%EDI,12(%ESP)
    MOVL%EBX(%ESP)
    MOVL%EBP,4(%ESP)
    呼吁___umoddi3 //用于模的呼叫。
    ORL%EAX,EDX%
    JNE L5

你真正得到 2的情况下

  L26:
    ADDL $ 2,%ESI
    ADCL $ 0%EDI
    MOVL%ESI,EAX%
    MOVL%EDI,ECX%
    imull%ESI,ECX%
    马尔%ESI
    ADDL%ECX,ECX%
    ADDL%ECX,EDX%
    CMPL%EDX,EBX%
    JA L27
    宰L41
L27:
    MOVL%ESI,8(%ESP)
    MOVL%EDI,12(%ESP)
    MOVL%EBP(%ESP)
    MOVL%EBX,4(%ESP)
    调用___umoddi3 //只需一个电话用于模
    ORL%EAX,EDX%
    JNE L26

MSVC未能重用 DIV 的结果。优化是通过收益打破。
试试这些code:

  __ declspec(noinline始终)INT is_prime_A(无符号整数N)
{
    无符号整型磷;
    INT RET = -1;
    如果((N&安培;!1)|| N'2)返回否== 2;    / *比较P * P&LT; = n可以溢出* /
    p为1;
    做{
        的p + = 2;
        如果(P&GT; = N / P)RET = 1; / *让我们回到后者外循环。 * /
        如果RET = 0((N%P)!);
    }而(RET℃下);
    返回RET;
}__declspec(noinline始终)INT is_prime_B(无符号整数N)
{
    无符号整型磷;
    如果((N&安培;!1)|| N'2)返回否== 2;    / *比较P * P&LT; = n可以溢出* /
    p为1;
    做{
        的p + = 2;
        如果(对将N / P)返回1; / *常用程序。 * /
        如果返回0((N%P)!);
    }而(1);
}

is_prime_B 将会比 is_prime_A 两次慢上MSVC / ICC的窗口。

My question is about the conditional test in trial division. There seems to be some debate on what conditional test to employ. Let's look at the code for this from RosettaCode.

int is_prime(unsigned int n)
{
    unsigned int p;
    if (!(n & 1) || n < 2 ) return n == 2;

    /* comparing p*p <= n can overflow */
    for (p = 3; p <= n/p; p += 2)
        if (!(n % p)) return 0;
    return 1;
}

Wheel factorization or using a predetermined list of primes will not change the essence of my question.

There are three cases I can think of to do the conditional test:

  1. p<=n/p
  2. p*p<=n
  3. int cut = sqrt(n); for (p = 3; p <= cut; p += 2)

Case 1: works for all n but it has to do an extra division every iteration (edit: actually it does not need an extra division but it's still slower. I'm not sure why. See the assembly output below). I have found it to be twice as slow as case 2 for large values of n which are prime (on my Sandy Bridge system).

Case 2: is signficantly faster than case 1 but it has a problem that it overflows for large n and goes into an infinitive loop. The max value it can handle is

(sqrt(n) + c)^2 = INT_MAX  //solve
n = INT_MAX -2*c*sqrt(INT_MAX) + c^2
//INT_MAX = 2^32 -> n = 2^32 - c*s^17 + c^2; in our case c = 2

For example for uint64_t case 2 will go into an infinite loop for x =-1L-58 (x^64-59) which is a prime.

Case 3: no division or multiplication has to be done every iteration and it does not overflow like case 2. It's slightly faster than case 2 as well. The only question is if sqrt(n) is accurate enough.

Can someone explain to me why case 2 is so much faster than case 1? Case 1 does NOT use an extra division as I though but despite that it's still a lot slower.

Here are the times for the prime 2^56-5;

case 1 9.0s
case 2 4.6s
case 3 4.5s

Here is the code I used to test this http://coliru.stacked-crooked.com/a/69497863a97d8953. I also added the functions to the end of this question.

Here is the assembly output form GCC 4.8 with -O3 for case 1 and case 2. They both only have one division. Case 2 has a multiplication as well so my first guess is that case 2 would be slower but it's about twice as fast on both GCC and MSVC. I don't know why.

Case 1:

.L5:
  testl %edx, %edx
  je  .L8
.L4:
  addl  $2, %ecx
  xorl  %edx, %edx
  movl  %edi, %eax
  divl  %ecx
  cmpl  %ecx, %eax
  jae .L5

Case 2:

.L20:
  xorl  %edx, %edx
  movl  %edi, %eax
  divl  %ecx
  testl %edx, %edx
  je  .L23
.L19:
  addl  $2, %ecx
  movl  %ecx, %eax
  imull %ecx, %eax
  cmpl  %eax, %edi
  jae .L20

Here are the functions I'm using to test the time:

int is_prime(uint64_t n)
{
    uint64_t p;
    if (!(n & 1) || n < 2 ) return n == 2;

    /* comparing p*p <= n can overflow */
    for (p = 3; p <= n/p; p += 2)
        if (!(n % p)) return 0;
    return 1;
}

int is_prime2(uint64_t n)
{
    uint64_t p;
    if (!(n & 1) || n < 2 ) return n == 2;

    /* comparing p*p <= n can overflow */
    for (p = 3; p*p <= n; p += 2)
        if (!(n % p)) return 0;
    return 1;
}

int is_prime3(uint64_t n)
{
    uint64_t p;
    if (!(n & 1) || n < 2 ) return n == 2;

    /* comparing p*p <= n can overflow */
    uint32_t cut = sqrt(n);
    for (p = 3; p <= cut; p += 2)
        if (!(n % p)) return 0;
    return 1;
}

Added content after the bounty.

Aean discovered that in case 1 saving the quotient as well as the remainder is just as fast (or slightly faster) than case 2. Let's call this case 4. The following following code is twice as fast as case 1.

int is_prime4(uint64_t n)
{
    uint64_t p, q, r;
    if (!(n & 1) || n < 2 ) return n == 2;

    for (p = 3, q=n/p, r=n%p; p <= q; p += 2, q = n/p, r=n%p)
        if (!r) return 0;
    return 1;
}

I'm not sure why this helps. In any case there is no need to use case 2 anymore. For case 3, most versions of the sqrt function in hardware or software get the perfect squares right so it's safe to use in general. Case 3 is the only case that will work with OpenMP.

解决方案

UPD: This is a compiler optimization issue, obviously. While MinGW used only one div instruction in loop body, both GCC on Linux and MSVC failed to reuse the quotient from previous iteration.

I think the best we could do is explicitly define quo and rem and calculate them in the same basic instruction block, to show the compiler we want both quotient and remainder.

int is_prime(uint64_t n)
{
    uint64_t p = 3, quo, rem;
    if (!(n & 1) || n < 2) return n == 2;

    quo = n / p;
    for (; p <= quo; p += 2){
        quo = n / p; rem = n % p;
        if (!(rem)) return 0;
    }
    return 1;
}


I tried your code from http://coliru.stacked-crooked.com/a/69497863a97d8953 on a MinGW-w64 compiler, case 1 is faster than case 2.

So I guess you are compiling targeted to a 32-bit architecture and used uint64_t type. Your assembly shows it doesn't use any 64-bit register.

If I got it right, there is the reason.

On 32-bit architecture, 64-bit numbers is represented in two 32-bit registers, your compiler will do all concatenation works. It's simple to do 64-bit addition, subtraction and multiplication. But modulo and division is done by a small function call which named as ___umoddi3 and ___udivdi3 in GCC, aullrem and aulldiv in MSVC.

So actually you need one ___umoddi3 and one ___udivdi3 for each iteration in case 1, one ___udivdi3 and one concatenation of 64-bit multiplication in case 2. That's why case 1 seems twice slower than case 2 in your test.

What you really get in case 1:

L5:
    addl    $2, %esi
    adcl    $0, %edi
    movl    %esi, 8(%esp)
    movl    %edi, 12(%esp)
    movl    %ebx, (%esp)
    movl    %ebp, 4(%esp)
    call    ___udivdi3         // A call for div
    cmpl    %edi, %edx
    ja  L6
    jae L21
L6:
    movl    %esi, 8(%esp)
    movl    %edi, 12(%esp)
    movl    %ebx, (%esp)
    movl    %ebp, 4(%esp)
    call    ___umoddi3        // A call for modulo.
    orl %eax, %edx
    jne L5

What you really get in case 2:

L26:
    addl    $2, %esi
    adcl    $0, %edi
    movl    %esi, %eax
    movl    %edi, %ecx
    imull   %esi, %ecx
    mull    %esi
    addl    %ecx, %ecx
    addl    %ecx, %edx
    cmpl    %edx, %ebx
    ja  L27
    jae L41
L27:
    movl    %esi, 8(%esp)
    movl    %edi, 12(%esp)
    movl    %ebp, (%esp)
    movl    %ebx, 4(%esp)
    call    ___umoddi3         // Just one call for modulo
    orl %eax, %edx
    jne L26

MSVC failed to reuse the result of div. The optimization is broken by return. Try these code:

__declspec(noinline) int is_prime_A(unsigned int n)
{
    unsigned int p;
    int ret = -1;
    if (!(n & 1) || n < 2) return n == 2;

    /* comparing p*p <= n can overflow */
    p = 1;
    do {
        p += 2;
        if (p >= n / p) ret = 1; /* Let's return latter outside the loop. */
        if (!(n % p)) ret = 0;
    } while (ret < 0);
    return ret;
}

__declspec(noinline) int is_prime_B(unsigned int n)
{
    unsigned int p;
    if (!(n & 1) || n < 2) return n == 2;

    /* comparing p*p <= n can overflow */
    p = 1;
    do {
        p += 2;
        if (p > n / p) return 1; /* The common routine. */
        if (!(n % p)) return 0;
    } while (1);
}

The is_prime_B will be twice slower than is_prime_A on MSVC / ICC for windows.

这篇关于在素性条件测试由审判庭的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆