用Python计算熵的最快方法 [英] Fastest way to compute entropy in Python
问题描述
在我的项目中,我需要多次计算0-1个向量的熵.这是我的代码:
In my project I need to compute the entropy of 0-1 vectors many times. Here's my code:
def entropy(labels):
""" Computes entropy of 0-1 vector. """
n_labels = len(labels)
if n_labels <= 1:
return 0
counts = np.bincount(labels)
probs = counts[np.nonzero(counts)] / n_labels
n_classes = len(probs)
if n_classes <= 1:
return 0
return - np.sum(probs * np.log(probs)) / np.log(n_classes)
有更快的方法吗?
推荐答案
@Sanjeet Gupta的回答很好,但可能会简短.这个问题专门询问最快"的方式,但是我只看到一个答案的时间,因此我将比较使用scipy和numpy与原始海报的entropy2答案进行了比较.
@Sanjeet Gupta answer is good but could be condensed. This question is specifically asking about the "Fastest" way but I only see times on one answer so I'll post a comparison of using scipy and numpy to the original poster's entropy2 answer with slight alterations.
四种不同的方法: scipy/numpy , numpy/math , pandas/numpy , numpy >
Four different approaches: scipy/numpy, numpy/math, pandas/numpy, numpy
import numpy as np
from scipy.stats import entropy
from math import log, e
import pandas as pd
import timeit
def entropy1(labels, base=None):
value,counts = np.unique(labels, return_counts=True)
return entropy(counts, base=base)
def entropy2(labels, base=None):
""" Computes entropy of label distribution. """
n_labels = len(labels)
if n_labels <= 1:
return 0
value,counts = np.unique(labels, return_counts=True)
probs = counts / n_labels
n_classes = np.count_nonzero(probs)
if n_classes <= 1:
return 0
ent = 0.
# Compute entropy
base = e if base is None else base
for i in probs:
ent -= i * log(i, base)
return ent
def entropy3(labels, base=None):
vc = pd.Series(labels).value_counts(normalize=True, sort=False)
base = e if base is None else base
return -(vc * np.log(vc)/np.log(base)).sum()
def entropy4(labels, base=None):
value,counts = np.unique(labels, return_counts=True)
norm_counts = counts / counts.sum()
base = e if base is None else base
return -(norm_counts * np.log(norm_counts)/np.log(base)).sum()
Timeit操作:
repeat_number = 1000000
a = timeit.repeat(stmt='''entropy1(labels)''',
setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy1''',
repeat=3, number=repeat_number)
b = timeit.repeat(stmt='''entropy2(labels)''',
setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy2''',
repeat=3, number=repeat_number)
c = timeit.repeat(stmt='''entropy3(labels)''',
setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy3''',
repeat=3, number=repeat_number)
d = timeit.repeat(stmt='''entropy4(labels)''',
setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy4''',
repeat=3, number=repeat_number)
时间结果:
# for loop to print out results of timeit
for approach,timeit_results in zip(['scipy/numpy', 'numpy/math', 'pandas/numpy', 'numpy'], [a,b,c,d]):
print('Method: {}, Avg.: {:.6f}'.format(approach, np.array(timeit_results).mean()))
Method: scipy/numpy, Avg.: 63.315312
Method: numpy/math, Avg.: 49.256894
Method: pandas/numpy, Avg.: 884.644023
Method: numpy, Avg.: 60.026938
优胜者: numpy/math (熵2)
还值得注意的是,上面的entropy2
函数可以处理数字AND文本数据.例如:entropy2(list('abcdefabacdebcab'))
.原始海报的答案是从2013年开始的,有一个用于合并整数的特定用例,但不适用于文本.
It's also worth noting that the entropy2
function above can handle numeric AND text data. ex: entropy2(list('abcdefabacdebcab'))
. The original poster's answer is from 2013 and had a specific use-case for binning ints but it won't work for text.
这篇关于用Python计算熵的最快方法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!