How to compute the Shannon entropy and mutual information of N variables


Problem description

I need to compute the mutual information, and therefore the Shannon entropy, of N variables.

I wrote some code that computes the Shannon entropy of a given distribution. Say I have a variable x, an array of numbers. Following the definition of Shannon entropy I need the normalized probability density function, which is easy to obtain with numpy.histogram.

import scipy.integrate as scint
from numpy import *
from scipy import *

def shannon_entropy(a, bins):
    p, binedg = histogram(a, bins, normed=True)
    p = p / len(p)

    x = binedg[:-1]
    g = -p * log2(p)
    g[isnan(g)] = 0.

    return scint.simps(g, x=x)

With a suitable input x and a carefully chosen bin number, this function works.

But the result depends strongly on the bin number: choosing different values of this parameter gives different results.

In particular, if my input is a constant array:

x=[0,0,0,....,0,0,0]

the entropy of this variable obviously has to be 0. If I choose a bin number equal to 1 I get the right answer, but with other values I get strange, nonsensical (negative) answers. My feeling is that numpy.histogram, with the argument normed=True or density=True, should (as the official documentation says) give back the normalized histogram, and that I probably make an error when I switch from the probability density function (the output of numpy.histogram) to the probability mass function (the input of the Shannon entropy). I do:

p,binedg= histogram(a,bins,normed=True)
p=p/len(p)
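
For what it's worth, my understanding is that with density=True (or the older normed=True) the returned p is a density, so the probability mass of each bin should come from multiplying by the bin width rather than dividing by len(p); a minimal sketch of what I mean (the data here is just an example, not my real input):

import numpy as np

a = np.random.randn(1000)                  # example data only
p, binedg = np.histogram(a, bins=30, density=True)
pmf = p * np.diff(binedg)                  # probability mass per bin; sums to 1
pmf = pmf[pmf > 0]                         # drop empty bins
H = -np.sum(pmf * np.log2(pmf))            # discrete Shannon entropy in bits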

I would like to find a way to solve these problems; ideally, an efficient method to compute the Shannon entropy that does not depend on the bin number.

I also wrote a function to compute the Shannon entropy of a distribution of several variables, but I run into the same problem. The code is below; the input of shannon_entropydd is the array in which each position holds one of the variables that has to be involved in the statistical computation.

def intNd(c, axes):
    assert len(c.shape) == len(axes)
    assert all([c.shape[i] == axes[i].shape[0] for i in range(len(axes))])
    if len(axes) == 1:
        return scint.simps(c, axes[0])
    else:
        return intNd(scint.simps(c, axes[-1]), axes[:-1])


def shannon_entropydd(c, bins=30):
    hist, ax = histogramdd(c, bins, normed=True)

    for i in range(len(ax)):
        ax[i] = ax[i][:-1]

    p = -hist * log2(hist)
    p[isnan(p)] = 0

    return intNd(p, ax)
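
For context, I call these functions on an array shaped (n_samples, n_variables), roughly like this (the data here is only an illustration):

import numpy as np

# made-up data just to show the call: three variables, shape (n_samples, 3)
data = np.random.multivariate_normal([0, 0, 0],
                                     [[1.0, 0.5, 0.2],
                                      [0.5, 1.0, 0.3],
                                      [0.2, 0.3, 1.0]],
                                     size=5000)
H_xyz = shannon_entropydd(data, bins=30)   # joint entropy estimate
H_x = shannon_entropy(data[:, 0], 30)      # marginal entropy of the first variable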

I need these quantities in order to compute the mutual information between certain sets of variables:

M_info(x,y,z) = H(x) + H(y) + H(z) - H(x,y,z)

where H(x) is the Shannon entropy of the variable x.

I have to find a way to compute these quantities, so if someone has a completely different kind of code that works I can switch to it; I don't need this code repaired, just a correct way to compute these statistical functions!

Solution

I think that if you choose bins = 1 you will always find an entropy of 0, because there is no "uncertainty" over which bin the values fall into ("uncertainty" is what entropy measures). You should choose a number of bins "big enough" to account for the diversity of values your variable can take. For discrete values: with binary values you should use bins >= 2; if the values your variable can take are in {0,1,2}, you should have bins >= 3, and so on...

I must say that I did not read your code, but this works for me:

import numpy as np

x = [0,1,1,1,0,0,0,1,1,0,1,1]
bins = 10
cx = np.histogram(x, bins)[0]

def entropy(c):
    c_normalized = c/float(np.sum(c))
    c_normalized = c_normalized[np.nonzero(c_normalized)]
    h = -sum(c_normalized * np.log(c_normalized))  
    return h

hx = entropy(cx)
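
If you go with this counting approach, here is a rough sketch (not a definitive implementation) of how the mutual information from your formula could be computed, using np.histogramdd for the joint counts; the data below is made up purely for illustration:

import numpy as np

def entropy_from_counts(c):
    # Shannon entropy (in nats) of an array of histogram counts
    p = c / float(np.sum(c))
    p = p[p > 0]                       # drop empty bins
    return -np.sum(p * np.log(p))

# illustrative data; replace with your own variables
x = np.random.randint(0, 2, 1000)
y = np.random.randint(0, 2, 1000)
z = np.random.randint(0, 2, 1000)
bins = 2

cx = np.histogram(x, bins)[0]
cy = np.histogram(y, bins)[0]
cz = np.histogram(z, bins)[0]
cxyz = np.histogramdd(np.column_stack([x, y, z]), bins)[0]   # joint counts

# M_info(x,y,z) = H(x) + H(y) + H(z) - H(x,y,z)
m_info = (entropy_from_counts(cx) + entropy_from_counts(cy)
          + entropy_from_counts(cz) - entropy_from_counts(cxyz))

Note that with counts the constant array from your question gives an entropy of 0 for any number of bins, since all the mass falls into a single bin and the empty bins are dropped.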

