Segmented prime sieve


Problem description


I came across this efficient segmented prime sieve on the internet. Please help me understand how it works, especially the use of the next vector.

How does the specific choice of segment size affect performance?

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

const int L1D_CACHE_SIZE = 32768;
void segmented_sieve(int64_t limit, int segment_size = L1D_CACHE_SIZE)
{
    int sqrt = (int) std::sqrt((double) limit);
    int64_t count = (limit < 2) ? 0 : 1;
    int64_t s = 2;
    int64_t n = 3;

    // vector used for sieving
    std::vector<char> sieve(segment_size);

    // generate small primes <= sqrt
    std::vector<char> is_prime(sqrt + 1, 1);
    for (int i = 2; i * i <= sqrt; i++)
        if (is_prime[i])
            for (int j = i * i; j <= sqrt; j += i)
                is_prime[j] = 0;

    std::vector<int> primes;
    std::vector<int> next;

    for (int64_t low = 0; low <= limit; low += segment_size)
    {
        std::fill(sieve.begin(), sieve.end(), 1);

        // current segment = interval [low, high]
        int64_t high = std::min(low + segment_size - 1, limit);

        // store small primes needed to cross off multiples
        for (; s * s <= high; s++)
        {
            if (is_prime[s])
            {
                primes.push_back((int) s);
                next.push_back((int)(s * s - low));
            }
        }
        // sieve the current segment
        for (std::size_t i = 1; i < primes.size(); i++)
        {
            int j = next[i];
            for (int k = primes[i] * 2; j < segment_size; j += k)
                sieve[j] = 0;
            next[i] = j - segment_size;
        }

        for (; n <= high; n += 2)
            if (sieve[n - low]) // n is a prime
                count++;
    }

    std::cout << count << " primes found." << std::endl;
} 

Solution

Here is a more concise formulation of the same algorithm, which should make the principle much clearer (it is part of a full, runnable .cpp for segment-size timings, posted on pastebin). It initialises a packed (odds-only) sieve instead of counting primes, but the principles involved are the same. Download and run the .cpp to see the influence of segment size. Basically, the optimum should be around the L1 cache size of your CPU: too small, and the overhead of the increased number of rounds starts to dominate; too big, and you get penalised by the slower timings of the L2 and L3 caches. See also How does segmentation improve the running time of Sieve of Eratosthenes?

void initialise_packed_sieve_4G (void *data, unsigned segment_bytes = 1 << 15, unsigned end_bit = 1u << 31)
{
   typedef std::vector<prime_and_offset_t>::iterator prime_iter_t;
   std::vector<prime_and_offset_t> small_factors;

   initialise_odd_primes_and_offsets_64K(small_factors);

   unsigned segment_bits = segment_bytes * CHAR_BIT;
   unsigned partial_bits = end_bit % segment_bits;
   unsigned segments     = end_bit / segment_bits + (partial_bits != 0);

   unsigned char *segment = static_cast<unsigned char *>(data);
   unsigned bytes = segment_bytes;

   for ( ; segments--; segment += segment_bytes)
   {
      if (segments == 0 && partial_bits)
      {
         segment_bits = partial_bits;
         bytes = (partial_bits + CHAR_BIT - 1) / CHAR_BIT;
      }

      std::memset(segment, 0, bytes);

      for (prime_iter_t p = small_factors.begin(); p != small_factors.end(); ++p)
      {
         unsigned n = p->prime;
         unsigned i = p->next_offset;

         for ( ; i < segment_bits; i += n)
         {
            set_bit(segment, i);
         }

          p->next_offset = i - segment_bits;
      }
   }
}

If the offsets were not remembered from segment to segment then they would have to be recomputed every time using at least one division and one multiplication per re-computed index, plus conditionals or serious bit trickery. When sieving the full 2^32 number range (8192 segments of 32 KBytes each) that's at least 53,583,872 slow divisions and the same number of somewhat faster multiplications; that's roughly one second added to the time needed for initialising a full 2^32 sieve (2^31 bits for the odds-only Eratosthenes).

Here's some actual code from one of my older sieves that uses this 'reconstitutive' math:

for (index_t k = 1; k <= max_factor_bit; ++k)
{
   if (bitmap_t::traits::bt(bm.bm, k))  continue;

   index_t n = (k << 1) + 1;     // == index_for_value(value_for_index(k) * 2) == n
   index_t i = square(n) >> 1;   // == index_for_value(square(n))

   if (i < offset)
   {
      i += ((offset - i) / n) * n;
   }

   for ( ; i <= new_max_bit; i += n)
   {
      bitmap_t::traits::bts(bm.bm, i); 
   }
}

That takes about 5.5 seconds for the full sieve (VC++); the code shown first takes only 4.5 seconds with the same compiler, or 3.5 seconds using gcc 4.8.1 (MinGW64).

Here are the gcc timings:

sieve bits = 2147483648 (equiv. number = 4294967295)

segment size    4096 (2^12) bytes ...   4.091 s   1001.2 M/s
segment size    8192 (2^13) bytes ...   3.723 s   1100.2 M/s
segment size   16384 (2^14) bytes ...   3.534 s   1159.0 M/s
segment size   32768 (2^15) bytes ...   3.418 s   1198.4 M/s
segment size   65536 (2^16) bytes ...   3.894 s   1051.9 M/s
segment size  131072 (2^17) bytes ...   4.265 s    960.4 M/s
segment size  262144 (2^18) bytes ...   4.453 s    919.8 M/s
segment size  524288 (2^19) bytes ...   5.002 s    818.9 M/s
segment size 1048576 (2^20) bytes ...   5.176 s    791.3 M/s
segment size 2097152 (2^21) bytes ...   5.135 s    797.7 M/s
segment size 4194304 (2^22) bytes ...   5.251 s    780.0 M/s
segment size 8388608 (2^23) bytes ...   7.412 s    552.6 M/s

digest { 203280221, 0C903F86, 5B253F12, 774A3204 }

Note: an additional second can be shaved from that time by a trick called 'presieving', i.e. blasting a pre-computed pattern into the bitmap instead of zeroing it at the beginning. That brings the gcc timing down to 2.1 s for the full sieve, with this hacked copy of the earlier .cpp. This trick works extremely well together with segmented sieving in cache-sized chunks.
