Why is statistics.mean() so slow?


Problem description

I compared the performance of the mean function of the statistics module with the simple sum(l)/len(l) method and found the mean function to be very slow for some reason. I used timeit with the two code snippets below to compare them. Does anyone know what causes the massive difference in execution speed? I'm using Python 3.5.

from timeit import repeat
print(min(repeat('mean(l)',
                 '''from random import randint; from statistics import mean; \
                 l=[randint(0, 10000) for i in range(10000)]''', repeat=20, number=10)))

The code above executes in about 0.043 seconds on my machine.

from timeit import repeat
print(min(repeat('sum(l)/len(l)',
                 '''from random import randint; from statistics import mean; \
                 l=[randint(0, 10000) for i in range(10000)]''', repeat=20, number=10)))

The code above executes in about 0.000565 seconds on my machine.

Recommended answer

Python's statistics module is not built for speed, but for precision.

From the specification of this module:

The built-in sum can lose accuracy when dealing with floats of wildly differing magnitude. Consequently, the naive mean above fails this "torture test":

assert mean([1e30, 1, 3, -1e30]) == 1

returning 0 instead of 1, a purely computational error of 100%.
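The failure is easy to reproduce with nothing but the standard library. A minimal check contrasting the naive mean with one based on math.fsum, which tracks the partial sums exactly:

```python
import math

data = [1e30, 1, 3, -1e30]

# Naive approach: 1e30 + 1 rounds straight back to 1e30 in double
# precision, so the small terms are swallowed and the total cancels to 0.
naive_mean = sum(data) / len(data)
print(naive_mean)  # 0.0

# math.fsum keeps exact partial sums, so nothing is lost.
fsum_mean = math.fsum(data) / len(data)
print(fsum_mean)  # 1.0
```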

Using math.fsum inside mean will make it more accurate with float data, but it also has the side-effect of converting any arguments to float even when unnecessary. E.g. we should expect the mean of a list of Fractions to be a Fraction, not a float.
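A small sketch of that type-preservation point (the example values are illustrative): statistics.mean keeps exact Fraction arithmetic, while a math.fsum-based mean coerces everything to float:

```python
import math
from fractions import Fraction
from statistics import mean

data = [Fraction(1, 3), Fraction(2, 3), Fraction(1, 2)]

# statistics.mean sums exactly and preserves the input type.
exact = mean(data)
print(type(exact).__name__, exact)  # Fraction 1/2

# math.fsum coerces each Fraction to float, so exactness is lost.
lossy = math.fsum(data) / len(data)
print(type(lossy).__name__)  # float
```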

Conversely, if we take a look at the implementation of _sum() in this module, the first lines of the method's docstring seem to confirm that:

def _sum(data, start=0):
    """_sum(data [, start]) -> (type, sum, count)

    Return a high-precision sum of the given numeric data as a fraction,
    together with the type to be converted to and the count of items.

    [...] """

So yes, the statistics implementation of sum, instead of being a simple one-line call to Python's built-in sum() function, takes about 20 lines by itself, with a nested for loop in its body.

This happens because statistics._sum chooses to guarantee the maximum precision for all types of number it could encounter (even if they widely differ from one another), instead of simply emphasizing speed.

Hence, it appears normal that the built-in sum proves a hundred times faster. The cost is much lower precision if you happen to call it with exotic numbers.

Other options

If you need to prioritize speed in your algorithms, you should have a look at NumPy instead, whose algorithms are implemented in C.

NumPy's mean is not nearly as precise as statistics, but it has implemented (since 2013) a routine based on pairwise summation, which is better than a naive sum/len.
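Pairwise summation itself is simple enough to sketch in pure Python (the function name and block size below are illustrative, not NumPy's actual implementation): the array is split in half recursively, so rounding errors accumulate over O(log n) levels rather than O(n) sequential additions.

```python
def pairwise_sum(xs, block=8):
    """Illustrative recursive pairwise summation (not NumPy's code).

    Slices at or below `block` elements fall back to a plain loop;
    larger slices are split in half and summed independently, so
    rounding errors grow with the recursion depth, not the length.
    """
    if len(xs) <= block:
        total = 0.0
        for x in xs:
            total += x
        return total
    mid = len(xs) // 2
    return pairwise_sum(xs[:mid], block) + pairwise_sum(xs[mid:], block)

# With integer-valued floats every addition is exact, so the result
# matches the true sum; on large float data the pairwise order
# typically lands much closer to math.fsum than a left-to-right sum.
print(pairwise_sum([float(i) for i in range(1000)]))  # 499500.0
```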

However...

import numpy as np
import statistics

np_mean = np.mean([1e30, 1, 3, -1e30])
statistics_mean = statistics.mean([1e30, 1, 3, -1e30])

print('NumPy mean: {}'.format(np_mean))
print('Statistics mean: {}'.format(statistics_mean))

> NumPy mean: 0.0
> Statistics mean: 1.0

