Python - Pandas random sampling per group
Question
I have a DataFrame very similar to this one, but with thousands of values:
import numpy as np
import pandas as pd
# Setup fake data.
np.random.seed([3, 1415])
df = pd.DataFrame({
    'Class': list('AAAAAAAAAABBBBBBBBBB'),
    'type': (['short']*5 + ['long']*5) * 2,
    'image name': (['image01']*2 + ['image02']*2) * 5,
    'Value2': np.random.random(20)})
I was able to randomly sample 2 values per image, per Class and per type with the following code:
df2 = df.groupby(['type', 'Class', 'image name'])[['Value2']].apply(lambda s: s.sample(min(len(s),2)))
I got the following result:
I'm looking for a way to subset that table so that one image ('image name') is chosen at random per type and per Class, while keeping the 2 values for the randomly selected image.
Excel example of my desired output:
Recommended answer
IIUC, the issue is that you do not want to group by the column image name, but if that column is not included in the groupby, you will lose it.
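To make the column loss concrete, here is a minimal sketch that rebuilds the example frame and runs the question's sampling without image name in the groupby keys. The seed value is arbitrary and only there for reproducibility; it is an assumption, not part of the original question.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # arbitrary seed, assumed here only for reproducibility
df = pd.DataFrame({
    'Class': list('AAAAAAAAAABBBBBBBBBB'),
    'type': (['short']*5 + ['long']*5) * 2,
    'image name': (['image01']*2 + ['image02']*2) * 5,
    'Value2': np.random.random(20)})

# Group by type and Class only, selecting just Value2 as in the question.
out = (df.groupby(['type', 'Class'])[['Value2']]
         .apply(lambda s: s.sample(min(len(s), 2))))

# 'image name' is neither a group key nor a selected column, so it is gone.
print(out.columns.tolist())
print(out.index.names)
```

The sampled values are still there, but nothing in the result says which image each value came from.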
You can first create the groupby object:
gb = df.groupby(['type', 'Class'])
Now you can iterate over the groupby blocks using a list comprehension:
blocks = [data.sample(n=1) for _,data in gb]
Now you can concatenate the blocks to reconstruct your randomly sampled DataFrame:
pd.concat(blocks)
Output
Class Value2 image name type
7 A 0.817744 image02 long
17 B 0.199844 image01 long
4 A 0.462691 image01 short
11 B 0.831104 image02 short
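As a possible shortcut, the same one-row-per-group sampling can be sketched with the groupby object's own sample method. This assumes pandas 1.1 or newer, where DataFrameGroupBy.sample is available; the seed and random_state values are arbitrary choices for reproducibility, not part of the original answer.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # arbitrary seed, assumed here only for reproducibility
df = pd.DataFrame({
    'Class': list('AAAAAAAAAABBBBBBBBBB'),
    'type': (['short']*5 + ['long']*5) * 2,
    'image name': (['image01']*2 + ['image02']*2) * 5,
    'Value2': np.random.random(20)})

# Draw one random row from each (type, Class) group; all columns are kept.
sampled = df.groupby(['type', 'Class']).sample(n=1, random_state=0)
print(sampled)
```

Because whole rows are returned, image name stays in the result without being a groupby key.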
Or
You can modify your code so that image name is no longer a groupby key but is kept among the selected columns, like this:
df.groupby(['type', 'Class'])[['Value2','image name']].apply(lambda s: s.sample(min(len(s),2)))
                  Value2 image name
type  Class
long  A     8   0.777962    image01
            9   0.757983    image01
      B     19  0.100702    image02
            15  0.117642    image02
short A     3   0.465239    image02
            2   0.460148    image02
      B     10  0.934829    image02
            11  0.831104    image02
Keeping the same image per group
I'm not sure you can avoid an iterative process for this problem. You could loop over the groupby blocks, pick one image name at random in each group, keep only the rows for that image, and then randomly sample from those rows, like this:
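import random
gb = df.groupby(['Class','type'])
ls = []
for index, frame in gb:
    # Pick one image name at random within this (Class, type) group.
    image = random.choice(frame['image name'].unique())
    subset = frame[frame['image name'] == image]
    # Keep at most 2 values, so images with a single row do not raise.
    ls.append(subset.sample(n=min(len(subset), 2)))
pd.concat(ls)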
Output
Class Value2 image name type
6 A 0.850445 image02 long
7 A 0.817744 image02 long
4 A 0.462691 image01 short
0 A 0.444939 image01 short
19 B 0.100702 image02 long
15 B 0.117642 image02 long
10 B 0.934829 image02 short
14 B 0.721535 image02 short
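If the explicit loop becomes slow on thousands of rows, one loop-free variant might look like the sketch below. It is not the answer's method: it assumes pandas 1.1 or newer for DataFrameGroupBy.sample, and the names keys, chosen, subset and result, as well as the seed values, are hypothetical choices made for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # arbitrary seed, assumed here only for reproducibility
df = pd.DataFrame({
    'Class': list('AAAAAAAAAABBBBBBBBBB'),
    'type': (['short']*5 + ['long']*5) * 2,
    'image name': (['image01']*2 + ['image02']*2) * 5,
    'Value2': np.random.random(20)})

keys = ['Class', 'type']

# Step 1: choose one distinct image name at random per (Class, type) group.
chosen = (df[keys + ['image name']]
          .drop_duplicates()
          .groupby(keys)
          .sample(n=1, random_state=0))

# Step 2: keep only the rows of the chosen images, preserving the original index.
subset = (df.reset_index()
            .merge(chosen, on=keys + ['image name'])
            .set_index('index'))

# Step 3: shuffle the rows, then take up to 2 rows per group.
result = subset.sample(frac=1, random_state=0).groupby(keys).head(2)
print(result.sort_values(keys))
```

Shuffling followed by head(2) is used in place of a per-group apply so that groups with fewer than 2 rows are handled without special casing.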