Why do two algorithms for finding primes differ in speed so much even though they seem to do the same number of iterations?
I have two algorithms for finding primes in Python. The inner loop of each one seems to be executed the same number of times and is equally simple. However, one of them takes 10 times as long as the other. My question is:
Why? Is this some quirk of Python that can be optimized away (how?), or am I missing something else?
The problem I am solving is essentially from http://www.spoj.pl/problems/PRIME1/. In my case, I have N = 10 ** 9, delta = 10 ** 5, and I want to find all primes between N-delta and N. I also have smallprimes, a pre-made list of all primes less than or equal to the square root of N. The first algorithm is very simple:
def range_f1(lo, hi, smallprimes):
    """Finds all primes p with lo <= p <= hi.

    smallprimes is the sorted list of all primes up to (at least) the square root of hi.
    hi and lo might be large, but hi-lo+1 must fit into a long."""
    primes = []
    for i in xrange(hi-lo+1):
        n = lo + i
        isprime = True
        for p in smallprimes:
            if n % p == 0:
                isprime = False
                break
        if isprime:
            primes.append(n)
    return primes
Calling range_f1(N-delta, N, smallprimes) takes a long time (about 10 seconds); the inner loop is called 195170 times. I also have a version of this algorithm that replaces the loop with a generator expression (that is the one I actually use for profiling; see the end of the question), but it is no faster.
The second version is (an ugly implementation of) the sieve of Eratosthenes:
def range_sieve(lo, hi, smallprimes):
    """Parameters as before"""
    # So ugly! But SO FAST!! How??
    delta = hi-lo+1
    iamprime = [True] * delta  # iamprime[i] says whether lo + i is prime
    if lo <= 1:
        iamprime[1 - lo] = False
    def sillyfun():  # For profiling & speed comparison
        pass
    for p in smallprimes:
        rem = lo % p
        if rem == 0:
            iamprime[0] = False
        for i in xrange(p - rem, delta, p):
            iamprime[i] = False
            sillyfun()
        if p >= lo and p <= hi:
            iamprime[p - lo] = True
    return [p + lo for (p, ami) in enumerate(iamprime) if ami]
This is about 10 times as fast and takes less than 2 seconds. However, the inner loop (sillyfun()) is called 259982 times, more than in the previous case. I am at a loss to explain why this is fast.
I thought that maybe the reason was that the inner loop of the first algorithm contains modular arithmetic, while the second only has an assignment. However, the following seems to imply that assignment is no faster than modular arithmetic:
>>> from timeit import timeit
>>> timeit("10**9 % 2341234")
0.23445186469234613
>>> timeit("a[5000]=False", "a = [True] * 10000")
0.47924750212666822
Here's the (less readable) version of the first algorithm I actually use:
def range_f2(lo, hi, smallprimes):
    primes = []
    for i in xrange(hi-lo+1):
        n = lo + i
        try:
            (1 for p in smallprimes if n % p == 0).next()
        except StopIteration:
            primes.append(n)
    return primes
Here are the results of calling the profiler for range_f2() (note the number of times the generator expression is evaluated):
>>> from cProfile import run as prof
>>> prof("range_f2(N-delta,N,sp)")
200005 function calls in 13.866 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 13.866 13.866 <string>:1(<module>)
195170 12.632 0.000 12.632 0.000 prime1.py:101(<genexpr>)
1 1.224 1.224 13.865 13.865 prime1.py:90(range_f2)
4832 0.009 0.000 0.009 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Here is the profiler result for range_sieve():
>>> prof("range_sieve(N-delta,N,sp)")
259985 function calls in 1.370 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.003 0.003 1.370 1.370 <string>:1(<module>)
1 0.877 0.877 1.367 1.367 prime1.py:39(range_sieve)
259982 0.490 0.000 0.490 0.000 prime1.py:51(sillyfun)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Finally, here is the complete code that generates the list of small primes (in a very silly way), so that you can check what results you get: http://pastebin.com/7sfN4mG4
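In case the pastebin link rots: smallprimes is just the list of all primes up to sqrt(N), which a basic sieve can produce. A minimal sketch (Python 3; small_primes_up_to is my name for it, not the question's):

```python
def small_primes_up_to(limit):
    """Simple sieve of Eratosthenes returning all primes <= limit."""
    sieve = [True] * (limit + 1)
    sieve[0:2] = [False, False]          # 0 and 1 are not prime
    for p in range(2, int(limit ** 0.5) + 1):
        if sieve[p]:
            # Cross off multiples of p, starting at p*p (smaller
            # multiples were already crossed off by smaller primes).
            for m in range(p * p, limit + 1, p):
                sieve[m] = False
    return [p for p, ok in enumerate(sieve) if ok]

# For N = 10**9, smallprimes = small_primes_up_to(31623) gives the
# 3401 primes up to sqrt(N) that the question's functions expect.
```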
UPDATE: By popular demand, here is the profiling data for the first chunk of code. There is no data on how many times the inner loop is executed, but it seems pretty clear that it is the same as for the third chunk (range_f2).
>>> prof("range_f1(N-delta,N,sp)")
4835 function calls in 14.184 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 14.184 14.184 <string>:1(<module>)
1 14.162 14.162 14.183 14.183 prime1.py:69(range_f1)
4832 0.021 0.000 0.021 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
The difference is an algorithmic one. In the first version, trial division, you test each candidate against all small primes. The fact that you don't stop when the small prime exceeds candidate ** 0.5 doesn't matter for range(10**9 - 10**5, 10**9) if smallprimes has a good upper bound, but it would if the length of the range were much larger in relation to the upper bound. For composites, that doesn't incur much cost, since most of them have at least one small prime divisor. But for primes, you have to go all the way to N**0.5. There are roughly 10**5/log(10**9) primes in that interval, each of them is trial-divided by about 10**4.5/log(10**4.5) primes, so that makes about 1.47*10**7 trial divisions against primes.
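The early stop mentioned above (breaking once p*p > n) is easy to graft onto range_f1. A hedged sketch of that variant (my own name and Python 3 range, not the question's code):

```python
def range_f1_earlystop(lo, hi, smallprimes):
    """Trial division that stops once p*p > n.

    smallprimes must be sorted and contain all primes up to sqrt(hi)."""
    primes = []
    for n in range(lo, hi + 1):
        isprime = n > 1
        for p in smallprimes:
            if p * p > n:
                break           # no divisor <= sqrt(n) found: n is prime
            if n % p == 0:
                isprime = False
                break
        if isprime:
            primes.append(n)
    return primes
```

For a short range near 10**9 this changes little, since every prime candidate still needs the whole list; the stop pays off when the range is long relative to the smallprimes bound.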
On the other hand, with the sieve, you only cross off composites; each composite is crossed off as many times as it has prime divisors, while primes aren't crossed off at all (so primes are free). The number of prime divisors of n is bounded by (a multiple of) log(n) (a crude upper bound, usually a great overestimate), so that gives an upper bound of 10**5 * log(10**9) (times a small constant) crossings-off, about 2*10**6. In addition to that, the crossing off may be less work than a division (I don't know about Python; it is for C arrays). So you're doing less work with the sieve.
Edit: collected the actual numbers for the range 10**9 - 10**5 to 10**9 (which contains 4832 primes):

Ticks: 259987
Divisions: 20353799
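The same asymmetry can be measured on a small range with simple counters. A rough instrumented sketch (count_work is a hypothetical helper, Python 3; it counts crossings-off slightly differently from the question's sieve, which special-cases index 0):

```python
def count_work(lo, hi, smallprimes):
    """Count sieve crossings-off ('ticks') vs. trial divisions on [lo, hi]."""
    delta = hi - lo + 1
    # Sieve side: one tick per crossing-off of a multiple of p.
    ticks = 0
    for p in smallprimes:
        rem = lo % p
        start = 0 if rem == 0 else p - rem   # first index i with (lo+i) % p == 0
        for i in range(start, delta, p):
            ticks += 1
    # Trial-division side: one count per '%' actually evaluated.
    divisions = 0
    for n in range(lo, hi + 1):
        for p in smallprimes:
            divisions += 1
            if n % p == 0:
                break
    return ticks, divisions

# count_work(90, 100, [2, 3, 5, 7]) -> (15, 21)
```

Even on [90, 100] the divisions outnumber the ticks, and the gap widens rapidly as the range and the smallprimes list grow.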
The sieve does only 259987 crossings-off; you see that the crude upper bound above overestimates by a large factor. The trial division needs more than 20 million divisions: 16433632 of those are for primes (x/log x underestimates the number of primes by about 10% for x = 10**4.5), and 3435183 are used for the 3297 numbers in that range whose smallest prime factor is larger than n**(1/3).