Assign control vs. treatment groupings randomly based on % for more than 2 groups
Question
Piggybacking off my own previous question, "python pandas: assign control vs. treatment groupings randomly based on %".
Thanks to @maxU, I know how to assign random control/treatment groupings when there are 2 groups, but what if I have 3 groups or more?
For example:
df.head()
customer_id | Group | many other columns
ABC 1
CDE 3
BHF 2
NID 1
WKL 3
SDI 2
JSK 1
OSM 3
MPA 2
MAD 1
pd.pivot_table(df,index=['Group'],values=["customer_id"],aggfunc=lambda x: len(x.unique()))
Group 1 : 270
Group 2 : 180
Group 3 : 330
I have a great answer for when I only have two groups:
df['Flag'] = df.groupby('Group')['customer_id']\
.transform(lambda x: np.random.choice(['Control','Test'], len(x),
p=[.5,.5] if x.name==1 else [.4,.6]))
But what if I want to split it this way:
- Group 1: 50% control & 50% test
- Group 2: 40% control & 60% test
- Group 3: 20% control & 80% test
@MaxU's answer is great, but unfortunately the split is not exact:
d = {1:[.5,.5], 2:[.4,.6], 3:[.2,.8]}
df['Flag'] = df.groupby('Group')['customer_id'] \
.transform(lambda x: np.random.choice(['Control','Test'], len(x), p=d[x.name]))
When I test it, I don't get exact splits:
pd.pivot_table(df,index=['Group'],values=["customer_id"],columns=['Flag'], aggfunc=lambda x: len(x.unique()))
Control Treatment
Group 1: 138 132
Group 2: 78 102
Group 3: 79 251
Group 1 should be 135/135.
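For reference, the exact targets implied by the group sizes and percentages above can be worked out directly. np.random.choice draws each row independently, so its counts fluctuate around these targets instead of hitting them. The snippet below is only an illustrative sketch: the names group_sizes and exact_counts are made up here, and it assumes the first percentage in each pair is the control share.

```python
# Group sizes taken from the pivot table in the question.
group_sizes = {1: 270, 2: 180, 3: 330}
# [control share, test share] per group, as listed in the question.
proportions = {1: [.5, .5], 2: [.4, .6], 3: [.2, .8]}

def exact_counts(n, props):
    """Return (control, test) counts that sum to n (hypothetical helper)."""
    control = round(n * props[0])  # round in case n * share is not a whole number
    return control, n - control

targets = {g: exact_counts(n, proportions[g]) for g, n in group_sizes.items()}
print(targets)
```

Under these assumptions the targets are 135/135 for Group 1, 72/108 for Group 2 and 66/264 for Group 3.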
Recommended answer
It sounds like you're looking for a way to split your customer_ids into exact proportions rather than relying on chance. Here's one way to do that using pandas.qcut and np.random.permutation.
In [228]: df = pd.DataFrame({'customer_id': np.random.normal(size=10000),
'group': np.random.choice(['a', 'b', 'c'], size=10000)})
In [229]: proportions = {'a':[.5,.5], 'b':[.4,.6], 'c':[.2,.8]}
In [230]: df.head()
Out[230]:
customer_id group
0 0.6547 c
1 1.4190 a
2 0.4205 a
3 2.3266 a
4 -0.5691 b
In [231]: def assigner(gp):
     ...:     group = gp['group'].iloc[0]
     ...:     # labels=False makes qcut return integer bin codes directly
     ...:     # (Categorical.get_values() was removed in pandas 1.0)
     ...:     cut = pd.qcut(
     ...:         np.arange(gp.shape[0]),
     ...:         q=np.cumsum([0] + proportions[group]),
     ...:         labels=False
     ...:     )
     ...:     return pd.Series(cut[np.random.permutation(gp.shape[0])], index=gp.index, name='assignment')
     ...:
In [232]: df['assignment'] = df.groupby('group', group_keys=False).apply(assigner)
In [233]: df.head()
Out[233]:
customer_id group assignment
0 0.6547 c 1
1 1.4190 a 1
2 0.4205 a 0
3 2.3266 a 1
4 -0.5691 b 0
In [234]: (df.groupby(['group', 'assignment'])
.size()
.unstack()
.assign(proportion=lambda x: x[0] / (x[0] + x[1])))
Out[234]:
assignment 0 1 proportion
group
a 1659 1658 0.5002
b 1335 2003 0.3999
c 669 2676 0.2000
What's going on here?
- Within each group we call the function assigner.
- assigner grabs the group name and proportions from the predefined dictionary and calls pd.qcut to split the rows into 0 (control) and 1 (treatment).
- np.random.permutation then shuffles the assignments.
- The result is added as a new column in the original dataframe.
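To see the same idea applied to the question's own numbers, here is a minimal self-contained sketch. It is not the answer's exact code: it loops over the groups instead of using groupby().apply, passes labels=False to pd.qcut to get integer codes, and uses a toy dataframe with made-up customer IDs. The names rng and flag_names and the seed are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Toy data with the question's group sizes (270 / 180 / 330); IDs are made up.
df = pd.DataFrame({
    "customer_id": [f"C{i:04d}" for i in range(780)],
    "Group": np.repeat([1, 2, 3], [270, 180, 330]),
})
proportions = {1: [.5, .5], 2: [.4, .6], 3: [.2, .8]}
flag_names = np.array(["Control", "Test"])

df["Flag"] = ""
for group, row_labels in df.groupby("Group").groups.items():
    n = len(row_labels)
    # Cut positions 0..n-1 at the cumulative proportions -> integer codes 0/1.
    codes = pd.qcut(np.arange(n), q=np.cumsum([0] + proportions[group]), labels=False)
    # Shuffle the codes so that which customer gets which flag is random.
    df.loc[row_labels, "Flag"] = flag_names[rng.permutation(codes)]

counts = df.groupby(["Group", "Flag"]).size().unstack()
print(counts)
```

Only the order of the flags is random here; the number of each flag per group is fixed by the cut points, so on this toy data the counts come out as 135/135, 72/108 and 66/264. When a group size times its percentage is not a whole number, the split is off by at most one row, as in the answer's output above (1659 vs. 1658).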