Python - Pandas random sampling per group


Question

I have a DataFrame very similar to this one, but with thousands of values:

import numpy as np
import pandas as pd 

# Setup fake data.
np.random.seed([3, 1415])      
df = pd.DataFrame({
    'Class': list('AAAAAAAAAABBBBBBBBBB'),
    'type': (['short']*5 + ['long']*5) *2,
    'image name': (['image01']*2  + ['image02']*2)*5,
    'Value2': np.random.random(20)})

I was able to find a way to randomly sample 2 values per image, per Class and per type with the following code:

df2 = df.groupby(['type', 'Class', 'image name'])[['Value2']].apply(lambda s: s.sample(min(len(s),2)))

This gives me the following result:

I'm looking for a way to subset that table so that one image ('image name') is chosen at random per type and per Class, while keeping the 2 values for the randomly selected image.

Excel example of my desired output:

Recommended answer

IIUC, the issue is that you do not want to group by the column image name, but if that column is not included in the groupby, you will lose it.

You can first create the groupby object:

gb = df.groupby(['type', 'Class'])

Now you can iterate over the groupby blocks using a list comprehension, sampling one row from each:

blocks = [data.sample(n=1) for _,data in gb]

Now you can concatenate the blocks to reconstruct your randomly sampled DataFrame:

pd.concat(blocks)


Output

   Class    Value2 image name   type
7      A  0.817744    image02   long
17     B  0.199844    image01   long
4      A  0.462691    image01  short
11     B  0.831104    image02  short
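As a possible shortcut for the loop-and-concatenate steps above: pandas 1.1 and later provide a `sample` method directly on groupby objects. The sketch below assumes such a version is installed; the seed and `random_state` values are arbitrary choices made here for reproducibility, so the values will not match the output shown above.

```python
import numpy as np
import pandas as pd

# Rebuild the fake data (arbitrary seed, values differ from the thread's output).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Class': list('AAAAAAAAAABBBBBBBBBB'),
    'type': (['short'] * 5 + ['long'] * 5) * 2,
    'image name': (['image01'] * 2 + ['image02'] * 2) * 5,
    'Value2': rng.random(20),
})

# One random row per (type, Class) group; all columns, including
# 'image name', are kept and the original row index is preserved.
sampled = df.groupby(['type', 'Class']).sample(n=1, random_state=0)
print(sampled)
```

This should return the same kind of result as `pd.concat(blocks)`: one row for each of the four (type, Class) combinations.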

You can also modify your code: group by type and Class only, and add the column image name to the selected columns so that it is kept in the result, like this:

df.groupby(['type', 'Class'])[['Value2','image name']].apply(lambda s: s.sample(min(len(s),2)))

                  Value2 image name
type  Class
long  A     8   0.777962    image01
            9   0.757983    image01
      B     19  0.100702    image02
            15  0.117642    image02
short A     3   0.465239    image02
            2   0.460148    image02
      B     10  0.934829    image02
            11  0.831104    image02


Keeping the image the same per group

I'm not sure you can avoid an iterative process for this problem. You could loop over the groupby blocks, pick one image name at random within each group, keep only the rows with that image name, and then randomly sample from those rows, like this:

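import random

gb = df.groupby(['Class', 'type'])
ls = []

for _, frame in gb:
    # Pick one image name at random among those present in this group.
    image = random.choice(list(frame['image name'].unique()))
    same_image = frame[frame['image name'] == image]
    # Keep up to 2 rows of that image (fewer if it has only one row).
    ls.append(same_image.sample(n=min(len(same_image), 2)))

pd.concat(ls)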

Output

   Class    Value2 image name   type
6      A  0.850445    image02   long
7      A  0.817744    image02   long
4      A  0.462691    image01  short
0      A  0.444939    image01  short
19     B  0.100702    image02   long
15     B  0.117642    image02   long
10     B  0.934829    image02  short
14     B  0.721535    image02  short
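If the explicit loop turns out to be slow on thousands of rows, one possible loop-free variant is sketched below. It is not from the original answer: it assumes pandas 1.1 or later (for sampling on groupby objects), and the names `keys`, `chosen`, `subset`, and `result`, as well as the seed and `random_state` values, are illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the fake data (arbitrary seed, values differ from the output above).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Class': list('AAAAAAAAAABBBBBBBBBB'),
    'type': (['short'] * 5 + ['long'] * 5) * 2,
    'image name': (['image01'] * 2 + ['image02'] * 2) * 5,
    'Value2': rng.random(20),
})

keys = ['Class', 'type']

# Step 1: list each distinct image once per (Class, type), then pick
# one image per group uniformly at random.
chosen = (
    df[keys + ['image name']]
    .drop_duplicates()
    .groupby(keys)
    .sample(n=1, random_state=0)
)

# Step 2: keep only the rows of the chosen image in each group.
# reset_index() carries the original row labels through the merge.
subset = df.reset_index().merge(chosen, on=keys + ['image name'])

# Step 3: shuffle the rows, then take up to 2 per group.
result = (
    subset.sample(frac=1, random_state=1)
    .groupby(keys)
    .head(2)
    .set_index('index')
    .rename_axis(None)
    .sort_index()
)
print(result)
```

Shuffling followed by `head(2)` stands in for per-group sampling without replacement, and it simply returns fewer rows when the chosen image has only one row in a group.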
