Randomly selecting from Pandas groups with equal probability -- unexpected behavior


Problem description


I have 12 unique groups that I am trying to randomly sample from, each with a different number of observations. I want to randomly sample from the entire population (dataframe) with each group having the same probability of being selected from. The simplest example of this would be a dataframe with 2 groups.

    groups  probability
0       a       0.25
1       a       0.25
2       b       0.5

Using np.random.choice(df['groups'], p=df['probability'], size=100), each draw now has a 50% chance of selecting group a and a 50% chance of selecting group b.

To come up with the probabilities I used the formula:

(1. / num_groups) / size_of_groups

or in Python:

num_groups = len(df['groups'].unique())  # 2
size_of_groups = df.groupby('groups').size()  # {a: 2, b: 1}
(1. / num_groups) / size_of_groups

Which returns

    groups
a    0.25
b    0.50
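
As a quick sanity check of the formula above, here is a minimal sketch on the two-group example. It assumes the column is named groups as in the table, and the names group_sizes and per_group_total are introduced only for this illustration. It attaches the weight to each row and confirms that every group ends up with the same total probability.

```python
import pandas as pd

# Toy frame from the example: group a has two rows, group b has one
df = pd.DataFrame({"groups": ["a", "a", "b"]})

num_groups = df["groups"].nunique()

# Size of each row's own group, aligned with the original rows
group_sizes = df.groupby("groups")["groups"].transform("size")

# Per-row weight: (1 / number of groups) / size of the row's group
df["probability"] = (1.0 / num_groups) / group_sizes

# Total probability mass per group; equal values mean equal selection chance
per_group_total = df.groupby("groups")["probability"].sum()

print(df)
print(per_group_total)
```

If the per-group totals are all equal and the weights sum to 1, the formula itself is doing what was intended, whatever the number of groups.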

This works great until I get past 10 unique groups, after which I start getting weird distributions. Here is a small example:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(1234)

group_size = 12  # number of distinct groups
groups = np.arange(group_size)

# Draw unequal group frequencies so the groups have different sizes
probs = np.random.uniform(size=group_size)
probs = probs / probs.sum()

g = np.random.choice(groups, size=10000, p=probs)
df = pd.DataFrame({'groups': g})

# Per-row weight: (1 / number of groups) / size of the row's group
prob_map = ((1. / len(df['groups'].unique())) / df.groupby('groups').size()).to_dict()

df['probability'] = df['groups'].map(prob_map)

plt.hist(np.random.choice(df['groups'], p=df['probability'], size=10000, replace=True))
plt.xticks(np.arange(group_size))
plt.show()

I would expect a fairly uniform distribution with a large enough sample size, but I am getting these wings when the number of groups is 11+. If I change the group_size variable to 10 or lower, I do get the desired uniform distribution.

I can't tell if the problem is with my formula for calculating the probabilities, or possibly a floating point precision problem? Anyone know a better way to accomplish this, or a fix for this example?

Thanks in advance!

Solution

You are using plt.hist, which defaults to 10 bins:

plt.rcParams['hist.bins']

10


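With 12 integer group labels spread over 10 equal-width bins, the first and last bins each collect two groups, which produces the wings. Pass group_size as the bins parameter so that each group gets its own bin: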

plt.hist(
    np.random.choice(df['groups'], p=df['probability'], size=10000, replace=True),
    bins=group_size)
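
The binning effect can be reproduced without plotting. The sketch below is an illustration, not part of the original answer: it uses np.histogram (which also defaults to 10 equal-width bins) on a perfectly uniform, hypothetical sample of 12 integer labels, and compares it with 12 bins and with np.bincount, which counts each integer label directly and so does not depend on bin edges at all.

```python
import numpy as np

group_size = 12

# Perfectly uniform stand-in data: each label 0..11 appears exactly 100 times
samples = np.repeat(np.arange(group_size), 100)

# Default binning: 10 equal-width bins over [0, 11], so each bin is 1.1 wide
counts_default, edges_default = np.histogram(samples)

# One bin per group
counts_fixed, _ = np.histogram(samples, bins=group_size)

# Direct per-label counts for non-negative integer labels
counts_exact = np.bincount(samples, minlength=group_size)

print("10 bins:", counts_default)
print("12 bins:", counts_fixed)
print("bincount:", counts_exact)
```

With 10 bins, labels 0 and 1 share the first bin and labels 10 and 11 share the last one, so those two counts come out doubled even though the data is uniform. For integer group labels, np.bincount or df['groups'].value_counts() may be a more direct check than a histogram.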
