Implementing the Page Segmented Sieve of Eratosthenes in Javascript

Question

I recently read about a faster implementation of Segmented Sieve of Eratosthenes for really big numbers.

Following is the implementation of the same:

function sieve(low, high) {

    var primeArray = [], ll = Math.sqrt(low), output = [];

    for (var i = 0; i < high; i++) {
        primeArray[i] = true;
    }

    for (var i = 2; i <= ll; i++) {
        if (primeArray[i]) {
            for (var j = i * i; j < high; j += i) {
                primeArray[j] = false;
            }
        }
    }

    for (var i = 2; i < ll; i++) {
        if(primeArray[i])
        {
            var segmentStart = Math.floor(low/i) * i;

            for(var j = segmentStart; j <= high; j+=i)
            {
                primeArray[j] = false;
            }
        }
    }

    for(var i = low; i <= high; i++)
    {
        if(primeArray[i])
        {
            output.push(i);
        }
    }

    return output;
};

I cannot seem to figure out where have I got it wrong. Probably been working at it too long.

For example:

sieve(4,10) should return [5,7], but it is returning [5,7,9].

Answer

Although you have gathered from your reading that a Page Segmented Sieve of Eratosthenes is a fast way of finding the primes over a large range, your question code (even when corrected) does not implement a Page Segmented SoE, does not test the code over a large range, and is not particularly fast as SoE implementations go. The following discussion will show how to progress toward using a true Page Segmented SoE over a large range.
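For reference, here is a minimally corrected sketch of the question's code (an assumption about the intent, not the answer's method): the culling limit must be `Math.sqrt(high)` rather than `Math.sqrt(low)`, after which the second "segment" loop can be dropped entirely since the first loop already culls the whole array up to `high`:

```javascript
// Minimally corrected sketch of the question's code: cull by all primes up
// to sqrt(high) (not sqrt(low)!) and then report the primes in [low, high].
// This is still one array over the whole range, not a true segmented sieve.
function sieve(low, high) {
  var primeArray = [], output = [];
  var ll = Math.sqrt(high); // the bug: the question used Math.sqrt(low)
  for (var i = 0; i <= high; i++) primeArray[i] = true;
  for (var i = 2; i <= ll; i++)
    if (primeArray[i])
      for (var j = i * i; j <= high; j += i) primeArray[j] = false;
  for (var i = Math.max(low, 2); i <= high; i++)
    if (primeArray[i]) output.push(i);
  return output;
}
console.log(sieve(4, 10)); // [ 5, 7 ]
```

With this fix, sieve(4,10) returns [5,7] as expected: 9 now gets culled because 3 ≤ √10 is included among the culling primes.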

Following is a staged progression of increasingly fast implementations leading to your intention, with commentary explaining the reasons and implementation details of every step. It includes runnable snippets in JavaScript, but these techniques are not limited to JavaScript. Other languages don't impose limits on some further refinements, such as multi-threading of the resulting pages (other than via Web Workers, which are difficult to control as to order of processing) and some further optimizations in extreme loop unrolling; that last is related to the limited efficiency of the code due to having to be Just In Time (JIT) compiled to native code by the JavaScript engine in your browser, as compared to languages that compile directly to very efficient native code such as C/C++, Nim, Rust, Free Pascal, Haskell, Julia, etc.

First, let's start with a working version of your current code algorithm used over a reasonably large range, with timing information to establish a baseline. The following code starts culling per prime at the square of the culling prime, which avoids the problem of culling the given prime values as well as some redundant starting culls, and there is no reason to generate an output array of the resulting primes for the large range as we can produce the primes directly from the culling array. As well, the determination of the answer is outside the timing because we will be developing better techniques to find an "answer", which for large ranges is usually the count of the number of primes found, the sum of the primes, the first occurrences of prime gaps, etc., none of which need to actually view the found primes:

"use strict";

function soePrimesTo(limit) {
  var sz = limit - 1;
  var cmpsts = new Uint8Array(sz); // index 0 represents 2; sz-1 is limit
  // no need to zero above composites array; zeroed on creation...
  for (var p = 2; ; ++p) {
    var sqr = p * p;
    if (sqr > limit) break; // while p is the square root of limit -> cull...
    if (cmpsts[p - 2] == 0 >>> 0) // 0/1 is false/true; false means prime...
      for (var c = sqr - 2; c < sz; c += p) // set true for composites...
          cmpsts[c] = 1 >>> 0; // use asm.js for some extra efficiency...
  }
  var bi = 0
  return function () {
    while (bi < sz && cmpsts[bi] != 0 >>> 0) ++bi;
    if (bi >= sz) return null;
    return bi++ + 2;
  };
}

// show it works...
var gen = soePrimesTo(100);
var p = gen();
var output = [];
while (p != null) { output.push(p); p = gen(); }
console.log("Primes to 100 are:  " + output + ".");

var n = 1000000000; // set the range here...
var elpsd = -new Date();
gen = soePrimesTo(n);
elpsd += +new Date();
var count = 0;
while (gen() != null) { count++; }
console.log("Found " + count + " primes up to " + n + " in " + elpsd + " milliseconds.");

Now, there is little purpose in quoting my run times for the above code sieving to a billion, as your times will almost certainly be faster; I am using a very low end tablet CPU, an Intel x5-Z8350 running at 1.92 GHz (CPU clock speed for a single active thread). I will quote my execution time exactly once, as an example of how to calculate average CPU clock cycles per cull: I take the execution time for the above code of about 43,350 milliseconds (43.35 seconds), multiply by 1.92 billion clocks per second, and divide by about 2.514 billion culling operations to sieve to a billion (which can be calculated from the formula, or from the table for no wheel factorization, on the SoE Wikipedia page) to get about 33.1 CPU clock cycles per cull to a billion. From now on, we will only use CPU clock cycles per cull to compare performance.
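That arithmetic can be spelled out directly (the numbers are the ones quoted above):

```javascript
// Average CPU clock cycles per cull = seconds * (clocks/second) / culls
const elapsedMs = 43350;   // measured run time quoted above
const clockRate = 1.92e9;  // 1.92 GHz, single active thread
const cullOps = 2.514e9;   // culling operations to sieve to a billion, no wheel
const clocksPerCull = (elapsedMs / 1000) * clockRate / cullOps;
console.log(clocksPerCull.toFixed(1) + " clocks per cull"); // "33.1 clocks per cull"
```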

Further note that these performance scores are very dependent on the browser JavaScript engine used, with the above score as run on Google Chrome (version 72); Microsoft Edge (version 44) is about seven times slower for the version of Page Segmented SoE towards which we are moving, and Firefox and Safari are likely to be close to Chrome in performance.

This performance is likely better than that of the previous answer's code due to using a Uint8Array TypedArray and more asm.js-style coding, but the timing for such "one huge array" sieves (a gigabyte of memory used here) is subject to bottlenecks from main RAM memory access speed once outside the CPU cache limits. This is why we are working toward a Page Segmented Sieve, but first let's do something about reducing the amount of memory used and the number of required culling cycles.

The following code does bit packing, which requires slightly more complexity in the tight inner culling loop, but as it uses just one bit per composite number considered, it reduces the memory use by a factor of eight; as well, since two is the only even prime, it uses basic wheel factorization to sieve odds only for a further reduction of memory use by a factor of two and a reduction in number of culling operations by a factor of about 2.5.

This minimal wheel factorization of odds-only works as follows:

  1. We make a "wheel" with two positions on it, one half that "hits" on the even numbers, and the other half that hits on the odd numbers; thus this "wheel" has a circumferential span of two numbers as we "roll" it over all the potential prime numbers, but it only "windows" one out of two or half of the numbers over which it "rolls".
  2. We then divide all of the numbers over which we "roll" into two bit planes of successive values: in one bit plane we place all of the evens starting at four, and in the other bit plane we place all of the odds starting at three.
  3. Now we discard the even plane as none of the represented numbers there can be prime.
  4. The starting index of p * p for considered odd base primes always is an odd number because multiplying an odd number by an odd number is always an odd number.
  5. When we increment the index into the odd bit plane by the prime value, we are actually adding two times the value since adding an odd and an odd produces an even, which would be in the bit plane we discarded, so we add the prime value again to get back to the odd bit plane again.
  6. The odd bit plane index positions automatically account for this since we have thrown away all the even values that were previously between each odd bit position index.
  7. That is why we can cull by just repeatedly adding prime to the index in the following code.
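The odds-only index arithmetic described in the steps above can be sketched numerically (the helper names are illustrative, matching the expressions used in the code below):

```javascript
// Bit index i in the odds-only plane represents the odd value 2*i + 3.
function indexToValue(i) { return (i << 1) + 3; }
function valueToIndex(v) { return (v - 3) >> 1; }
// For the base prime at index i (p = 2*i + 3), the bit index of p*p is
// (p*p - 3) / 2, which simplifies to 2*i*(i + 3) + 3:
function squareIndex(i) { return (i << 1) * (i + 3) + 3; }

console.log(indexToValue(1));  // 5 : index 1 is the prime 5
console.log(squareIndex(1));   // 11 : the bit index of 25 = 5*5
console.log(valueToIndex(35)); // 16 : one culling step (+5) past index 11
```

Stepping the index by p (here 5) walks the values 25, 35, 45, … — exactly the odd multiples — which is why the culling loop just keeps adding p.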

"use strict";

function soeOddPrimesTo(limit) {
  var lmti = (limit - 3) >> 1; // bit index for limit value
  var sz = (lmti >> 3) + 1; // size in bytes
  var cmpsts = new Uint8Array(sz); // index 0 represents 3
  // no need to zero above composites array; zeroed on creation...
  for (var i = 0; ; ++i) {
    var p = i + i + 3; // the square index is (p * p - 3) / 2 but we
    var sqri = (i << 1) * (i + 3) + 3; // calculate start index directly 
    if (sqri > lmti) break; // while p is < square root of limit -> cull...
    // following does bit unpacking to test the prime bit...
    // 0/1 is false/true; false means prime...
    // use asm.js with the shifts to make uint8's for some extra efficiency...
    if ((cmpsts[i >> 3] & ((1 >>> 0) << (i & 7))) == 0 >>> 0)
      for (var c = sqri; c <= lmti; c += p) // set true for composites...
          cmpsts[c >> 3] |= (1 >>> 0) << (c & 7); // masking in the bit
  }
  var bi = -1
  return function () { // return function to return successive primes per call...
    if (bi < 0) { ++bi; return 2 } // the only even prime is a special case
    while (bi <= lmti && (cmpsts[bi >> 3] & ((1 >>> 0) << (bi & 7))) != 0 >>> 0) ++bi;
    if (bi > lmti) return null; // return null following the last prime
    return (bi++ << 1) + 3; // else calculate the prime from the index
  };
}

// show it works...
var gen = soeOddPrimesTo(100);
var p = gen();
var output = [];
while (p != null) { output.push(p); p = gen(); }
console.log("Primes to 100 are:  " + output + ".");

var n = 1000000000; // set the range here...
var elpsd = -new Date();
gen = soeOddPrimesTo(n);
elpsd += +new Date();
var count = 0;
while (gen() != null) { count++; }
console.log("Found " + count + " primes up to " + n + " in " + elpsd + " milliseconds.");

The performance is now about 34.75 CPU clock cycles per cull operation for 1.0257 billion culling operations for a range of a billion for odds only (from Wikipedia), meaning that the reduction in time is almost entirely due to the reduction in number of culling operations, with the extra complexity of the "bit-twiddling" bit packing only taking about the same amount of extra time as the savings due to the reduction in memory use by a factor of sixteen. Thus, this version uses one sixteenth the memory and is about 2.5 times faster.

But we aren't done yet, page segmentation can speed us up even more just as your source told you.

So what is Page Segmentation applied to the SoE and what does it do for us?

Page Segmentation is dividing the work of sieving from one huge array sieved at once to a series of smaller pages that are sieved successively. It then requires a little more complexity in that there must be a separate stream of base primes available, which can be obtained by sieving recursively with an inner sieve generating the list of memoized base primes to be used by the main sieve. As well, the generation of output results is generally a little more complex as it involves successive scanning and reduction for each of the generated sieved pages.
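Abstractly, the flow just described looks something like the following (hypothetical helper names; the real implementation follows further below):

```javascript
// Abstract page-segmented driver: sieve [3, limit] in fixed-size pages,
// reusing one small buffer; cullPage marks composites for the page using
// the base primes, and consumePage scans/reduces the finished page.
function sievePaged(limit, pageBits, cullPage, consumePage) {
  const buf = new Uint8Array(pageBits >> 3);
  for (let lowi = 0; 3 + (lowi << 1) <= limit; lowi += pageBits) {
    buf.fill(0);            // fresh page (odds-only, bit-packed)
    cullPage(lowi, buf);    // cull composites among values 2*(lowi..)+3
    consumePage(lowi, buf); // count, sum, or otherwise reduce this page
  }
}
```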

Page Segmentation has many benefits as follows:

  1. It further reduces the memory requirements. Sieving to a billion with the previous odds-only code needs about 64 MegaBytes of RAM, and that algorithm can't be used to sieve ranges beyond about 16 to 32 billion (even if we were willing to wait) without using all of the memory available to JavaScript. By sieving in pages of (say) the square root of the range, we can sieve to the square of that number, or to whatever we have the time to wait for. We must also store the base primes up to the square root of the desired range, but that is only a few tens of MegaBytes for any range we would consider.
  2. It improves memory access speed, which directly affects the number of CPU clock cycles per culling operation. If the culling operations happen mostly within the CPU caches, memory access goes from tens of CPU clock cycles per access to main RAM memory (depending on the CPU and the quality and type of RAM) down to about one clock cycle for the CPU L1 cache (smaller, at about 16 KiloBytes) and about eight clock cycles for the CPU L2 cache (larger, at least about 128 KiloBytes); we can arrange the culling algorithms so that all the caches are used to their best advantage, with the small fast L1 cache used for the majority of operations with small base primes, the larger slightly slower L2 cache used for medium base primes, and main RAM used only for the very few operations with large base primes over very large ranges.
  3. It opens the possibility of multi-threading, by assigning the work of sieving each of the larger pages to a different CPU core for fairly coarse-grained concurrency, although this isn't applicable to JavaScript other than through the use of Web Workers (messy).

Other than the added complexity, Page Segmentation has one other problem to work around: unlike the "one huge array" sieve where start indexes are calculated easily once and then used for the entire array, segmented sieves require either that the start addresses be calculated by modulo (division) operations per prime per page (computationally expensive), or additional memory needs to be used to store the index reached per base prime per page so the start index doesn't have to be recalculated, but this last technique precludes multi-threading as these arrays get out of sync. The best solution that will be used in the Ultimate version is to use a combination of these techniques, where several page segments are grouped to form a reasonably large work unit for threading so that these address calculations take an acceptably small portion of the total time, and the index storage tables are used for the base primes across these larger work units per thread so that the complex calculations need only be done once per larger work unit. Thus we get both the possibility of multi-threading and a reduced overhead. However, the following code doesn't reduce this overhead, which costs about 10% to 20% when sieving to a billion.
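The start-index adjustment just described is the heart of the per-page overhead; extracted from the code below, it looks like this (the function name is illustrative):

```javascript
// For a base prime p and a page starting at odds-only bit index lowi,
// return the first in-page index to cull: at p*p if that lands in or after
// this page, otherwise the next multiple-of-p index at or after lowi.
function startIndex(p, lowi) {
  let s = (p * p - 3) >> 1;        // global bit index of p squared
  if (s >= lowi) return s - lowi;  // p*p is at or beyond this page's start
  let r = (lowi - s) % p;          // the modulo (division) per prime per page
  return r === 0 ? 0 : p - r;
}
console.log(startIndex(3, 0));  // 3 : bit 3 is the value 9 = 3*3
console.log(startIndex(3, 16)); // 2 : bit 16 + 2 = 18 is the value 39, a multiple of 3
```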

In addition to Page Segmentation, the following code adds an efficient counting of the found primes by using a Counting Look Up Table (CLUT) population counting algorithm that counts 32 bits at a time, so that the overhead of continuously finding the count of found primes is a small percentage of the sieving time. If this were not done, enumerating the individual found primes just to determine how many there are would take at least as long as it takes to sieve over a given range. Similar fast routines can easily be developed to do such things as sum the found primes, find prime gaps, etc.
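The CLUT idea can be shown in isolation: precompute the population count of every 16-bit value once, then count any 32-bit word with two table lookups instead of a per-bit loop:

```javascript
// 16-bit population count table built with the Kernighan recurrence:
// clearing the lowest set bit of i yields a smaller, already-computed entry.
const CLUT = new Uint8Array(65536);
for (let i = 1; i < 65536; ++i) CLUT[i] = CLUT[i & (i - 1)] + 1;
function popCount32(v) { return CLUT[v & 0xFFFF] + CLUT[v >>> 16]; }

console.log(popCount32(0xFFFFFFFF)); // 32
console.log(popCount32(0b10110));    // 3
```

The countPage routine below subtracts these set-bit (composite) counts from the page's total bit count to get the number of primes.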

START_EDIT

The following code adds one other speed-up: for smaller primes (where this optimization is effective), the code does a form of loop separation by recognizing that the culling operations follow an eight-step pattern. This is because a byte has an even number of bits and we are culling by odd primes that return to the same bit position in a byte every eight culls; this means that for each of the bit positions we can simplify the inner culling loop to OR a constant mask, greatly simplifying the inner loop and making culling up to about two times faster due to each culling loop in the pattern not needing to do the "bit-twiddling" bit packing operations. This change saves about 35% of the execution time sieving to a billion. It can be disabled by changing the 32 in the `p <= 32` test to 0. It also sets the stage for the native code extreme unrolling of the eight loops due to this pattern, which can increase the culling operation speed by another factor of about two when native code compilers are used.
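Extracted and simplified from the code below, the small-prime loop separation looks like this (the function name is illustrative):

```javascript
// Cull an odds-only bit-packed buffer by a small odd prime p starting at
// bit index s: eight sub-loops, one per bit-in-byte position, each ORing a
// constant mask while stepping p whole bytes (8 bit positions) at a time.
function cullSmallPrime(buf, bitsz, p, s) {
  const len = buf.length;
  for (let slmt = Math.min(bitsz, s + (p << 3)); s < slmt; s += p) {
    const msk = 1 << (s & 7);  // constant for this entire sub-loop
    for (let c = s >> 3; c < len; c += p) buf[c] |= msk; // no shifts inside
  }
}
const demo = new Uint8Array(4);  // 32 bits representing the odd values 3..65
cullSmallPrime(demo, 32, 3, 3);  // cull by p = 3 from bit 3 (the value 9)
```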

A further minor modification makes the loop faster for the larger primes (greater than 8192) by using a Look Up Table (LUT) for the mask value rather than the left shift operation to save about half a CPU clock cycle per culling operation on average when culling a range of a billion; this saving will be increased slightly as ranges go up from a billion but isn't that effective in JavaScript and has been removed.

END_EDIT

ANOTHER_EDIT

As well as the above edits, we have removed the LUT BITMASK but now zero the Sieve Buffer by fast byte copying from a zero buffer of the same size, and added a Counting LUT population count technique for about a 10% overall gain in speed.

END_ANOTHER_EDIT

// JavaScript implementation of Page Segmented Sieve of Eratosthenes...
// This takes almost no memory as it is bit-packed and odds-only,
// and only uses memory proportional to the square root of the range;
// it is also quite fast for large ranges due to improved cache associativity...

"use strict";

const ZEROSPTRN = new Uint8Array(16384);

function soePages(bitsz) {
  let len = bitsz >> 3;
  let bpa = [];
  let buf =  new Uint8Array(len);
  let lowi = 0;
  let gen;
  return function () {
    let nxt = 3 + ((lowi + bitsz) << 1); // just beyond the current page
    buf.set(ZEROSPTRN.subarray(0,buf.length)); // clear sieve array!
    if (lowi <= 0 && bitsz < 131072) { // special culling for first page as no base primes yet:
      for (let i = 0, p = 3, sqr = 9; sqr < nxt; ++i, p += 2, sqr = p * p)
        if ((buf[i >> 3] & (1 << (i & 7))) === 0)
          for (let j = (sqr - 3) >> 1; j < bitsz; j += p)
            buf[j >> 3] |= 1 << (j & 7);
    } else { // other than the first "zeroth" page:
      if (!bpa.length) { // if this is the first page after the zero one:
        gen = basePrimes(); // initialize separate base primes stream:
        bpa.push(gen()); // keep the next prime (3 in this case)
      }
      // get enough base primes for the page range...
      for (let p = bpa[bpa.length - 1], sqr = p * p; sqr < nxt;
           p = gen(), bpa.push(p), sqr = p * p);
      for (let i = 0; i < bpa.length; ++i) { // for each base prime in the array
        let p = bpa[i] >>> 0;
        let s = (p * p - 3) >>> 1; // compute the start index of the prime squared
        if (s >= lowi) // adjust start index based on page lower limit...
          s -= lowi;
        else { // for the case where this isn't the first prime squared instance
          let r = (lowi - s) % p;
          s = (r != 0) ? p - r : 0;
        }
        if (p <= 32) {
          for (let slmt = Math.min(bitsz, s + (p << 3)); s < slmt; s += p) {
            let msk = ((1 >>> 0) << (s & 7)) >>> 0;
            for (let c = s >>> 3; c < len; c += p)
              buf[c] |= msk;
          }
        }
        else
          // inner tight composite culling loop for given prime number across page
          for (; s < bitsz; s += p)
            buf[s >> 3] |=  ((1 >>> 0) << (s & 7)) >>> 0;
      }
    }
    let olowi = lowi;
    lowi += bitsz;
    return [olowi, buf];
  };
}

function basePrimes() {
  var bi = 0;
  var lowi;
  var buf;
  var len;
  var gen = soePages(256);
  return function () {
    while (true) {
      if (bi < 1) {
        var pg = gen();
        lowi = pg[0];
        buf = pg[1];
        len = buf.length << 3;
      }
      //find next marker still with prime status
      while (bi < len && buf[bi >> 3] & ((1 >>> 0) << (bi & 7))) bi++;
      if (bi < len) // within buffer: output computed prime
        return 3 + ((lowi + bi++) << 1);
      // beyond buffer range: advance buffer
      bi = 0;
      lowi += len; // and recursively loop to make a new page buffer
    }
  };
}

const CLUT = function () {
  let arr = new Uint8Array(65536);
  for (let i = 0; i < 65536; ++i) {
    let nmbts = 0|0; let v = i;
    while (v > 0) { ++nmbts; v &= (v - 1)|0; }
    arr[i] = nmbts|0;
  }
  return arr;
}();

function countPage(bitlmt, sb) {
  let lst = bitlmt >> 5;
  let pg = new Uint32Array(sb.buffer);
  let cnt = (lst << 5) + 32;
  for (let i = 0 | 0; i < lst; ++i) {
    let v = pg[i]; cnt -= CLUT[v & 0xFFFF]; cnt -= CLUT[v >>> 16];
  }
  var n = pg[lst] | (0xFFFFFFFE << (bitlmt & 31));
  cnt -= CLUT[n & 0xFFFF]; cnt -= CLUT[n >>> 16];
  return cnt;
}

function countSoEPrimesTo(limit) {
  if (limit < 3) {
    if (limit < 2) return 0;
    return 1;
  }
  var cnt = 1;
  var lmti = (limit - 3) >>> 1;
  var lowi;
  var buf;
  var len;
  var nxti;
  var gen = soePages(131072);
  while (true) {
    var pg = gen();
    lowi = pg[0];
    buf = pg[1];
    len = buf.length << 3;
    nxti = lowi + len;
    if (nxti > lmti) {
      cnt += countPage(lmti - lowi, buf);
      break;
    }
    cnt += countPage(len - 1, buf);
  }
  return cnt;
}

var limit = 1000000000; // sieve to this limit...
var start = +new Date();
var answr = countSoEPrimesTo(limit);
var elpsd = +new Date() - start;
console.log("Found " + answr + " primes up to " + limit + " in " + elpsd + " milliseconds.");

As implemented here, that code takes about 13.8 CPU clock cycles per cull when sieving to a billion. The roughly 20% extra cost of the start-address calculations is automatically reduced when using the following maximum wheel factorization algorithm, because the effective page size increases by a factor of 105, so that this overhead becomes only a few percent, comparable to the few percent used for array filling and result counting.

Now we consider the more extensive changes to use maximum wheel factorization (by not only 2 for "odds-only", but also 3, 5, and 7 for a wheel that covers a span of 210 potential primes instead of a span of just 2) and also pre-culling on initialization of the small sieving arrays so that it is not necessary to cull by the following primes of 11, 13, 17, and 19. This reduces the number of composite number culling operations when using the page segmented sieve by a factor of about four to a range of a billion (as shown in the tables/calculated from the formulas in the Wikipedia article - "combo wheel") and can be written so that it runs about four times as fast due to the reduced operations with each culling operation about the same speed as for the above code.

The way to do the 210-span wheel factorization efficiently is to follow on from the "odds-only" method: the current algorithm above can be thought of as sieving one bit-packed plane out of two, as explained in the previous section, where the other plane can be eliminated as it contains only the even numbers above two. For the 210-span we can define 48 bit-packed arrays of this size representing possible primes of 11 and above, where all of the other 162 planes contain numbers which have factors of two, three, five, or seven and thus don't need to be considered. Each bit plane can then be culled just by repeated indexing in increments of the base prime, just as the odd number plane was, with the multiplications automatically taken care of by the structure just as for odds-only. In this way it is just as efficient to sieve with lower memory requirements (48/210 as compared to "odds-only" at 1/2) and with as much efficiency as for odds-only, where one 48-plane "page" of 16 KiloBytes = 131072 bits per plane times 210 represents a range of 27,525,120 numbers per sieve page segment; thus there are only about 40 page segments to sieve to a billion (instead of almost four thousand as above), and therefore less overhead in the start address calculation per base prime per page segment, for a further gain in efficiency.
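The 48-plane claim is easy to verify: of the 210 residues in one wheel span, exactly 48 are coprime to 2, 3, 5, and 7 (Euler's totient of 210), and one 131072-bits-per-plane page then spans 27,525,120 numbers:

```javascript
// Residue classes modulo 210 that survive the 2,3,5,7 wheel; each gets
// its own bit plane in the maximally wheel-factorized sieve.
const residues = [];
for (let r = 1; r <= 210; ++r)
  if (r % 2 && r % 3 && r % 5 && r % 7) residues.push(r);
console.log(residues.length); // 48
console.log(131072 * 210);    // 27525120 numbers per 48-plane page segment
```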

Although the extended code described above is a couple of hundred lines and long to post here, it can count the number of primes to a billion in about two seconds on my low end Intel 1.92 Gigahertz CPU using the Google V8 JavaScript engine, which is about five times slower than the same algorithm run in native code.

Although the above code is quite efficient up to a range of about 16 billion, other improvements can help maintain the efficiency to even larger ranges of several tens of thousands of billions, such as 1e14 or more. We accomplish this by adjusting the effective page sizes upward so that they are never smaller than the square root of the full range being sieved, yet sieving them incrementally by 16 KiloByte chunks for small primes and 128 KiloByte chunks for medium primes, and only sieving as one huge array as per our baseline implementation for the very few culling operations used for the largest base primes. In this way, our clocks per cull don't increase by more than a small factor of up to about two for the largest range we would likely consider.

As this answer is close to the limited size of 30,000 characters, further discussion on Maximal Wheel Factorization is continued on my followup Chapter 4.5a and (future) Chapter 4.5b answers for actual implementations of the techniques described above.

For JavaScript and other Virtual Machine languages, the minimum culling loop time is in the order of ten CPU cycles per culling loop and that is not likely to change much. This is about three to four times slower than the about three CPU clock cycles that can easily be achieved with languages that compile directly to efficient machine code using the same algorithm such as C/C++, Nim, Rust, Free Pascal, Haskell, Julia, etc.

In addition, there are extreme loop unrolling techniques that can be used with at least some of those languages that can produce a reduction in average culling operation cycles by a further factor of about two that is denied to JavaScript.

Multi-threading can reduce the execution time by about the factor of effective CPU cores used, but with JavaScript the only way to get this is through the use of Web Workers, and that is messy to synchronize. On my machine, I have four cores but only gain a factor of three in speed, due to the CPU clock rate being reduced to three quarters with all cores active; even this factor of three is not easy to gain for JavaScript.

So this is about the state of the art using JavaScript; other current VM languages have about the same limitations, other than that they can easily use multi-threading. The combination of the above factors means that native code compilers can be about twenty times faster than JavaScript (including multi-threading, and even more with new CPUs with huge numbers of cores).

However, I believe that the future of Web programming in three to five years will be Web Assembly, and that has the potential to overcome all of these limitations. It is very near to supporting multi-threading now, and although currently only about 30% faster than JavaScript for this algorithm on Chrome, it is only a little slower than native code on some current browsers when compiled from some languages using some Web Assembly compilers. It is still early days for the development of efficient compilers to Web Assembly and of efficient browser compilation to native code, but as Web Assembly is closer to native code than most VMs, it could easily be improved to produce native code that is as fast or nearly as fast as code from those other native-code compiling languages.

However, other than for compilation of JavaScript libraries and frameworks into Web Assembly, I don't believe the future of the Web will be JavaScript-to-Web-Assembly compilers, but rather compilation from some other language. My favourite picks for the future of Web programming are F#, with perhaps the Fable implementation converted to produce Web Assembly rather than JavaScript (asm.js), or Nim. There is even some possibility that Web Assembly could be produced that supports and shows the advantage of the extreme loop unrolling, for very close to the fastest of known Page Segmented SoE speeds.

We have built a Page Segmented Sieve of Eratosthenes in JavaScript that is suitable for sieving large ranges in the billions, with a means to further extend this work. The resulting code has about ten times fewer culling operations (when fully wheel factorized) and the culling operations are about three times faster, meaning the code for a given (large) range is about 30 times faster, while the reduced memory use means that one can sieve to ranges near the 53-bit mantissa limit of about 9e15 in something on the order of a year (just leave a Chrome browser tab open and back up the power supply).

Although there are some other minor tweaks possible, this is about the state of the art in sieving primes using JavaScript: while it isn't as fast as native code for the given reasons, it is fast enough to find the number of primes to 1e14 in a day or two (even on a mid-range smartphone) by leaving a browser tab open for the required computation time. This is quite amazing, as the number of primes in this range wasn't known until 1985, and then only by using numerical analysis techniques rather than a sieve, because the computers of that day weren't fast enough to do it with the fastest coding techniques in a reasonable and economic amount of time. While we can do this in just a few hours using these algorithms with the best of the native code compilers on modern desktop computers, now we can do it in an acceptable time with JavaScript on a smartphone!
