通过试验划分的素数条件测试 [英] Conditional tests in primality by trial division

查看:17
本文介绍了通过试验划分的素数条件测试的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我的问题是关于审判部门的条件测试.关于使用什么条件测试似乎存在一些争论.让我们看看 RosettaCode 的代码.

My question is about the conditional test in trial division. There seems to be some debate on what conditional test to employ. Let's look at the code for this from RosettaCode.

int is_prime(unsigned int n)
{
    unsigned int p;
    if (!(n & 1) || n < 2 ) return n == 2;

    /* comparing p*p <= n can overflow */
    for (p = 3; p <= n/p; p += 2)
        if (!(n % p)) return 0;
    return 1;
}

轮式分解或使用预定的素数列表不会改变我的问题的本质.

Wheel factorization or using a predetermined list of primes will not change the essence of my question.

我可以想到三种情况来做条件测试:

There are three cases I can think of to do the conditional test:

  1. p<=n/p
  2. p*p<=n
  3. int cut = sqrt(n);for (p = 3; p <= cut; p += 2)

案例 1:适用于所有 n,但每次迭代都必须进行额外的除法(实际上它不需要额外的除法,但它仍然较慢.我不知道为什么.请参阅程序集输出下面).我发现对于大的 n 值(在我的 Sandy Bridge 系统上),它的速度是案例 2 的两倍.

Case 1: works for all n but it has to do an extra division every iteration (edit: actually it does not need an extra division but it's still slower. I'm not sure why. See the assembly output below). I have found it to be twice as slow as case 2 for large values of n which are prime (on my Sandy Bridge system).

案例 2:比案例 1 快得多,但它有一个问题,即它对于大 n 溢出并进入不定式循环.它可以处理的最大值是

Case 2: is signficantly faster than case 1 but it has a problem that it overflows for large n and goes into an infinitive loop. The max value it can handle is

(sqrt(n) + c)^2 = INT_MAX  //solve
n = INT_MAX -2*c*sqrt(INT_MAX) + c^2
//INT_MAX = 2^32 -> n = 2^32 - c*s^17 + c^2; in our case c = 2

例如对于 uint64_t 情况 2 将进入一个无限循环,因为 x =-1L-58 (x^64-59) 这是一个素数.

For example for uint64_t case 2 will go into an infinite loop for x =-1L-58 (x^64-59) which is a prime.

案例 3:每次迭代都不必进行除法或乘法运算,并且不会像案例 2 那样溢出.它也比案例 2 稍快.唯一的问题是 sqrt(n) 是否足够准确.

Case 3: no division or multiplication has to be done every iteration and it does not overflow like case 2. It's slightly faster than case 2 as well. The only question is if sqrt(n) is accurate enough.

有人可以向我解释为什么案例 2 比案例 1 快得多吗? 案例 1 不像我那样使用额外的除法,但尽管它仍然慢很多.

Can someone explain to me why case 2 is so much faster than case 1? Case 1 does NOT use an extra division as I though but despite that it's still a lot slower.

这是素数 2^56-5 的时间;

Here are the times for the prime 2^56-5;

case 1 9.0s
case 2 4.6s
case 3 4.5s

这是我用来测试这个的代码 http://coliru.stacked-crooked.com/a/69497863a97d8953.我还在这个问题的末尾添加了函数.

Here is the code I used to test this http://coliru.stacked-crooked.com/a/69497863a97d8953. I also added the functions to the end of this question.

这是 GCC 4.8 的汇编输出,案例 1 和案例 2 使用 -O3.它们都只有一个分区.案例 2 也有一个乘法,所以我的第一个猜测是案例 2 会更慢,但它在 GCC 和 MSVC 上的速度大约是前者的两倍.我不知道为什么.

Here is the assembly output form GCC 4.8 with -O3 for case 1 and case 2. They both only have one division. Case 2 has a multiplication as well so my first guess is that case 2 would be slower but it's about twice as fast on both GCC and MSVC. I don't know why.

案例一:

.L5:
  testl %edx, %edx
  je  .L8
.L4:
  addl  $2, %ecx
  xorl  %edx, %edx
  movl  %edi, %eax
  divl  %ecx
  cmpl  %ecx, %eax
  jae .L5

案例 2:

.L20:
  xorl  %edx, %edx
  movl  %edi, %eax
  divl  %ecx
  testl %edx, %edx
  je  .L23
.L19:
  addl  $2, %ecx
  movl  %ecx, %eax
  imull %ecx, %eax
  cmpl  %eax, %edi
  jae .L20

这是我用来测试时间的函数:

Here are the functions I'm using to test the time:

int is_prime(uint64_t n)
{
    uint64_t p;
    if (!(n & 1) || n < 2 ) return n == 2;

    /* comparing p*p <= n can overflow */
    for (p = 3; p <= n/p; p += 2)
        if (!(n % p)) return 0;
    return 1;
}

int is_prime2(uint64_t n)
{
    uint64_t p;
    if (!(n & 1) || n < 2 ) return n == 2;

    /* comparing p*p <= n can overflow */
    for (p = 3; p*p <= n; p += 2)
        if (!(n % p)) return 0;
    return 1;
}

int is_prime3(uint64_t n)
{
    uint64_t p;
    if (!(n & 1) || n < 2 ) return n == 2;

    /* comparing p*p <= n can overflow */
    uint32_t cut = sqrt(n);
    for (p = 3; p <= cut; p += 2)
        if (!(n % p)) return 0;
    return 1;
}

在赏金之后添加内容.

Aean 发现,在案例 1 中保存商和余数的速度与案例 2 一样快(或略快).我们称之为案例 4.下面的代码是案例 1 的两倍.

Aean discovered that in case 1 saving the quotient as well as the remainder is just as fast (or slightly faster) than case 2. Let's call this case 4. The following following code is twice as fast as case 1.

int is_prime4(uint64_t n)
{
    uint64_t p, q, r;
    if (!(n & 1) || n < 2 ) return n == 2;

    for (p = 3, q=n/p, r=n%p; p <= q; p += 2, q = n/p, r=n%p)
        if (!r) return 0;
    return 1;
}

我不确定这有什么帮助.无论如何,不​​再需要使用案例 2.对于案例 3,硬件或软件中的大多数版本的 sqrt 函数都获得了完美的正方形,因此通常可以安全使用.案例 3 是唯一适用于 OpenMP 的案例.

I'm not sure why this helps. In any case there is no need to use case 2 anymore. For case 3, most versions of the sqrt function in hardware or software get the perfect squares right so it's safe to use in general. Case 3 is the only case that will work with OpenMP.

推荐答案

UPD: 这显然是一个编译器优化问题.虽然 MinGW 在循环体中只使用了一条 div 指令,但 Linux 上的 GCC 和 MSVC 都未能重用上一次迭代的商.

UPD: This is a compiler optimization issue, obviously. While MinGW used only one div instruction in loop body, both GCC on Linux and MSVC failed to reuse the quotient from previous iteration.

我认为我们能做的最好的事情是明确定义 quorem 并在同一个基本指令块中计算它们,以显示我们想要商和余数的编译器.

I think the best we could do is explicitly define quo and rem and calculate them in the same basic instruction block, to show the compiler we want both quotient and remainder.

int is_prime(uint64_t n)
{
    uint64_t p = 3, quo, rem;
    if (!(n & 1) || n < 2) return n == 2;

    quo = n / p;
    for (; p <= quo; p += 2){
        quo = n / p; rem = n % p;
        if (!(rem)) return 0;
    }
    return 1;
}

<小时>

我从 http://coliru.stacked-crooked.com/a/尝试了您的代码69497863a97d8953 在 MinGW-w64 编译器上,case 1case 2 快.


I tried your code from http://coliru.stacked-crooked.com/a/69497863a97d8953 on a MinGW-w64 compiler, case 1 is faster than case 2.

所以我猜想您正在编译针对 32 位架构并使用 uint64_t 类型.您的程序集显示它不使用任何 64 位寄存器.

So I guess you are compiling targeted to a 32-bit architecture and used uint64_t type. Your assembly shows it doesn't use any 64-bit register.

如果我猜对了,那是有原因的.

If I got it right, there is the reason.

在 32 位架构上,64 位数字在两个 32 位寄存器中表示,您的编译器将完成所有串联工作.做 64 位加法、减法和乘法很简单.但是取模和除法是通过一个小函数调用完成的> 在 MSVC 中.

On 32-bit architecture, 64-bit numbers is represented in two 32-bit registers, your compiler will do all concatenation works. It's simple to do 64-bit addition, subtraction and multiplication. But modulo and division is done by a small function call which named as ___umoddi3 and ___udivdi3 in GCC, aullrem and aulldiv in MSVC.

所以实际上你需要一个 ___umoddi3 和一个 ___udivdi3 用于 case 1 中的每次迭代,一个 ___udivdi3case 2 中的一个 64 位乘法连接.这就是为什么在您的测试中 case 1 似乎比 case 2 慢两倍.

So actually you need one ___umoddi3 and one ___udivdi3 for each iteration in case 1, one ___udivdi3 and one concatenation of 64-bit multiplication in case 2. That's why case 1 seems twice slower than case 2 in your test.

你在case 1中真正得到了什么:

What you really get in case 1:

L5:
    addl    $2, %esi
    adcl    $0, %edi
    movl    %esi, 8(%esp)
    movl    %edi, 12(%esp)
    movl    %ebx, (%esp)
    movl    %ebp, 4(%esp)
    call    ___udivdi3         // A call for div
    cmpl    %edi, %edx
    ja  L6
    jae L21
L6:
    movl    %esi, 8(%esp)
    movl    %edi, 12(%esp)
    movl    %ebx, (%esp)
    movl    %ebp, 4(%esp)
    call    ___umoddi3        // A call for modulo.
    orl %eax, %edx
    jne L5

案例 2 中真正得到的东西:

L26:
    addl    $2, %esi
    adcl    $0, %edi
    movl    %esi, %eax
    movl    %edi, %ecx
    imull   %esi, %ecx
    mull    %esi
    addl    %ecx, %ecx
    addl    %ecx, %edx
    cmpl    %edx, %ebx
    ja  L27
    jae L41
L27:
    movl    %esi, 8(%esp)
    movl    %edi, 12(%esp)
    movl    %ebp, (%esp)
    movl    %ebx, 4(%esp)
    call    ___umoddi3         // Just one call for modulo
    orl %eax, %edx
    jne L26

MSVC 未能重用 div 的结果.return 破坏了优化.试试这些代码:

MSVC failed to reuse the result of div. The optimization is broken by return. Try these code:

__declspec(noinline) int is_prime_A(unsigned int n)
{
    unsigned int p;
    int ret = -1;
    if (!(n & 1) || n < 2) return n == 2;

    /* comparing p*p <= n can overflow */
    p = 1;
    do {
        p += 2;
        if (p >= n / p) ret = 1; /* Let's return latter outside the loop. */
        if (!(n % p)) ret = 0;
    } while (ret < 0);
    return ret;
}

__declspec(noinline) int is_prime_B(unsigned int n)
{
    unsigned int p;
    if (!(n & 1) || n < 2) return n == 2;

    /* comparing p*p <= n can overflow */
    p = 1;
    do {
        p += 2;
        if (p > n / p) return 1; /* The common routine. */
        if (!(n % p)) return 0;
    } while (1);
}

is_prime_B 在 MSVC/ICC for windows 上将比 is_prime_A 慢两倍.

The is_prime_B will be twice slower than is_prime_A on MSVC / ICC for windows.

这篇关于通过试验划分的素数条件测试的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆