Calculating Gini coefficient in Python/numpy
Question
I'm calculating the Gini coefficient (similar to: Python - Gini coefficient calculation using Numpy) but I get an odd result. For a uniform distribution sampled from np.random.rand(), the Gini coefficient is 0.3, but I would have expected it to be close to 0 (perfect equality). What is going wrong here?
import numpy as np
import matplotlib.pyplot as plt

def G(v):
    # Percentile grid: 0, 10, ..., 100
    bins = np.linspace(0., 100., 11)
    total = float(np.sum(v))
    yvals = []
    for b in bins:
        # Share of the total held by values at or below the b-th percentile
        bin_vals = v[v <= np.percentile(v, b)]
        bin_fraction = (np.sum(bin_vals) / total) * 100.0
        yvals.append(bin_fraction)
    # perfect equality area
    # (np.trapz is named np.trapezoid in NumPy 2.0 and later)
    pe_area = np.trapz(bins, x=bins)
    # lorenz area
    lorenz_area = np.trapz(yvals, x=bins)
    gini_val = (pe_area - lorenz_area) / float(pe_area)
    return bins, yvals, gini_val

v = np.random.rand(500)
bins, result, gini_val = G(v)
plt.figure()
plt.subplot(2, 1, 1)
plt.plot(bins, result, label="observed")
plt.plot(bins, bins, '--', label="perfect eq.")
plt.xlabel("fraction of population")
plt.ylabel("fraction of wealth")
plt.title("GINI: %.4f" %(gini_val))
plt.legend()
plt.subplot(2, 1, 2)
plt.hist(v, bins=20)
plt.show()
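For the given set of numbers, the above code calculates the fraction of the distribution's total value that falls at or below each percentile bin.
The result: (figure not preserved; per the plotting code, it showed the observed Lorenz curve against the perfect-equality line, with a histogram of v below).
Uniform distributions should be near "perfect equality", so the bending of the Lorenz curve looks off.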
Answer
This is to be expected. A random sample from a uniform distribution does not result in uniform values (i.e. values that are all relatively close to each other). With a little calculus, it can be shown that the expected value (in the statistical sense) of the Gini coefficient of a sample from the uniform distribution on [0, 1] is 1/3, so getting values around 1/3 for a given sample is reasonable.
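To make the distinction concrete, here is a minimal sketch that computes the Gini coefficient from the empirical Lorenz curve, once for a sample in which every value is identical and once for a uniform random sample. The helper name lorenz_gini, the seeded generator, and the sample size are assumptions chosen for illustration; they are not part of the original question or answer.

```python
import numpy as np

def lorenz_gini(values):
    """Gini coefficient from the empirical Lorenz curve (illustrative helper)."""
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    # Lorenz curve: cumulative share of the total held by the lowest k of n
    # values, with the starting point (0, 0) prepended.
    lorenz = np.concatenate(([0.0], np.cumsum(x) / x.sum()))
    # Trapezoid rule on the evenly spaced grid k/n, written out by hand.
    area = (lorenz[:-1] + lorenz[1:]).sum() / (2.0 * n)
    # The area under the equality line is 1/2, so G = (1/2 - area) / (1/2).
    return 1.0 - 2.0 * area

rng = np.random.default_rng(0)   # seed chosen arbitrarily
equal = np.full(500, 0.5)        # every value identical
uniform = rng.random(500)        # values spread uniformly over [0, 1)

gini_equal = lorenz_gini(equal)
gini_uniform = lorenz_gini(uniform)
print(gini_equal, gini_uniform)
```

The all-equal sample gives a value of essentially 0, while the uniform sample lands near 1/3. "Perfect equality" refers to equal values, not to a flat histogram.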
You'll get a lower Gini coefficient with a sample such as v = 10 + np.random.rand(500). Those values are all close to 10.5; the relative variation is lower than in the sample v = np.random.rand(500).
In fact, the expected value of the Gini coefficient for the sample base + np.random.rand(n) is 1/(6*base + 3).
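The formula above can be derived in a few lines. The sketch below is a population-level calculation (a finite sample fluctuates around this value), writing b for base and using the fact that the Gini coefficient is half the mean absolute difference divided by the mean:

```latex
\begin{align*}
X, Y &\sim \mathrm{Uniform}[b,\, b+1] \ \text{independent}, \qquad
  \mu = \mathbb{E}[X] = b + \tfrac{1}{2} \\
\mathbb{E}\,|X - Y| &= \int_0^1\!\!\int_0^1 |u - v| \, du \, dv
  = 2 \int_0^1\!\!\int_0^u (u - v) \, dv \, du
  = 2 \int_0^1 \frac{u^2}{2} \, du = \frac{1}{3} \\
G &= \frac{\mathbb{E}\,|X - Y|}{2\mu} = \frac{1/3}{2b + 1} = \frac{1}{6b + 3}
\end{align*}
```

Setting b = 0 recovers 1/3, and b = 10 gives 1/63, roughly 0.0159.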
Here's a simple implementation of the Gini coefficient. It uses the fact that the Gini coefficient is half the relative mean absolute difference.
import numpy as np

def gini(x):
    # (Warning: This is a concise implementation, but it is O(n**2)
    # in time and memory, where n = len(x). *Don't* pass in huge
    # samples!)
    # Mean absolute difference
    mad = np.abs(np.subtract.outer(x, x)).mean()
    # Relative mean absolute difference
    rmad = mad/np.mean(x)
    # Gini coefficient
    g = 0.5 * rmad
    return g
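For larger samples, the quadratic cost can be avoided. The following is a sketch of a sort-based alternative, not part of the original answer: once the values are sorted, the sum of all pairwise absolute differences equals 2 * sum((2*i - n - 1) * x[i]) for i = 1..n, so the same coefficient can be computed in O(n log n) time and O(n) memory. The names gini_sorted and gini_outer are chosen here for illustration.

```python
import numpy as np

def gini_sorted(x):
    """Sort-based Gini coefficient: O(n log n) time, O(n) memory."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    i = np.arange(1, n + 1)  # 1-based ranks of the sorted values
    # sum_{i,j} |x_i - x_j| = 2 * sum_i (2*i - n - 1) * x_(i); dividing by
    # n**2 (mean difference), by the mean, and by 2 leaves this expression.
    return ((2 * i - n - 1) * x).sum() / (n * x.sum())

def gini_outer(x):
    """The O(n**2) version from the answer, repeated here for comparison."""
    mad = np.abs(np.subtract.outer(x, x)).mean()
    return 0.5 * mad / np.mean(x)

rng = np.random.default_rng(42)  # seed chosen arbitrarily
sample = 1 + rng.random(500)
print(gini_sorted(sample), gini_outer(sample))
```

Both functions should agree up to floating-point rounding, since they evaluate the same quantity in a different order.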
Here's the Gini coefficient for several samples of the form v = base + np.random.rand(500):
In [80]: v = np.random.rand(500)
In [81]: gini(v)
Out[81]: 0.32760618249832563
In [82]: v = 1 + np.random.rand(500)
In [83]: gini(v)
Out[83]: 0.11121487509454202
In [84]: v = 10 + np.random.rand(500)
In [85]: gini(v)
Out[85]: 0.01567937753659053
In [86]: v = 100 + np.random.rand(500)
In [87]: gini(v)
Out[87]: 0.0016594595244509495
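These outputs can be set against the formula 1/(6*base + 3). The sketch below repeats the experiment with a seeded generator and prints the sample value next to the theoretical one. The helper expected_gini and the seed are assumptions added for illustration, and the exact sample values depend on the random draw.

```python
import numpy as np

def gini(x):
    # Half the relative mean absolute difference (O(n**2), small samples only)
    mad = np.abs(np.subtract.outer(x, x)).mean()
    return 0.5 * mad / np.mean(x)

def expected_gini(base):
    # Theoretical value for the uniform distribution on [base, base + 1]
    return 1.0 / (6.0 * base + 3.0)

rng = np.random.default_rng(12345)  # seed chosen arbitrarily
results = []
for base in (0, 1, 10, 100):
    v = base + rng.random(500)
    results.append((base, gini(v), expected_gini(base)))

for base, sample_value, theory in results:
    print(f"base={base:>3}  sample={sample_value:.6f}  theory={theory:.6f}")
```

The theoretical values are 1/3, 1/9, 1/63 and 1/603 (about 0.3333, 0.1111, 0.01587 and 0.001658), which sit close to the outputs shown in the session above.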