Randomly sampling Pandas dataframe based on distribution of column

Question

Say I have a very large dataframe, which I want to sample to match the distribution of a column of the dataframe as closely as possible (in this case, the 'bias' column).

I run:

train['bias'].value_counts(normalize=True)

and see:

least           0.277220
left            0.250000
right           0.250000
left-center     0.141244
right-center    0.081536

If I want to take a sample of the train dataframe where the distribution of the sample's 'bias' column matches this distribution, what would be the best way to go about it?

Recommended answer

You can use sample, from the documentation:

Return a random sample of items from an axis of object.
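As a hedged aside: sample accepts either n (a row count) or frac (a fraction), and, as far as I know, pandas 1.1 and later expose the same method on groupby objects. When the target is simply the dataframe's own distribution, sampling the same fraction from every group keeps each group's share. A minimal sketch, where the frame below is a made-up stand-in for the asker's train dataframe:

```python
import pandas as pd

# Hypothetical frame with an uneven "bias" column (500 / 300 / 200 rows).
frame = pd.DataFrame({"bias": ["left"] * 500 + ["right"] * 300 + ["least"] * 200})

# Draw 10% of the rows from each group; random_state only makes the draw repeatable.
subset = frame.groupby("bias").sample(frac=0.1, random_state=0)

# The sample keeps the original 50% / 30% / 20% split.
print(subset["bias"].value_counts(normalize=True))
```

This shortcut only applies when the desired proportions are the ones already present in the data; for an arbitrary target distribution, the per-group counts have to be supplied explicitly.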

The trick is to use sample within each group. A code example:

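import pandas as pd

# Target fraction of the sample for each group.
positions = {"least": 0.277220, "left": 0.250000, "right": 0.250000, "left-center": 0.141244, "right-center": 0.081536}
# Toy data: 1000 rows per position, 5000 rows in total.
data = [['title-{}-{}'.format(i, position), position] for i in range(1000) for position in positions.keys()]
frame = pd.DataFrame(data=data, columns=['title', 'position'])
print(frame.shape)


def sample(obj, replace=False, total=1000):
    # obj.name is the group key; int() truncates, so the counts can sum to less than total.
    return obj.sample(n=int(positions[obj.name] * total), replace=replace)

# Sample each group separately, then flatten the index.
# Note: newer pandas releases may warn about apply operating on the grouping column.
result = frame.groupby('position', as_index=False).apply(sample).reset_index(drop=True)
print(result.groupby('position').agg('count'))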

Output

(5000, 2)
              title
position           
least           277
left            250
left-center     141
right           250
right-center     81

In the example above I created a dataframe with 5000 rows and 2 columns, which is the first part of the output.

I am assuming you have a positions dictionary (to convert a DataFrame to a dictionary, see this) with the fraction to be sampled from each group, and a total parameter (i.e. the total number of rows to be sampled).
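One way to obtain such a dictionary, sketched under the assumption that the fractions should come from the dataframe itself, is to call to_dict on the value_counts(normalize=True) result from the question. The small train frame below is a made-up stand-in for the asker's data:

```python
import pandas as pd

# Hypothetical stand-in for the asker's train dataframe.
train = pd.DataFrame({"bias": ["left"] * 5 + ["right"] * 3 + ["least"] * 2})

# value_counts(normalize=True) returns a Series of fractions indexed by category;
# to_dict() turns it into a {category: fraction} mapping.
positions = train["bias"].value_counts(normalize=True).to_dict()

print(positions)
```

The resulting mapping has the same shape as the positions dictionary used in the answer's code, with the column named 'bias' rather than 'position'.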

In the second part of the output you can see that there are 277 'least' rows out of 1000, and 277 / 1000 = 0.277. That approximates the requested fraction of 0.277220, and the same goes for the rest of the groups. There is a caveat though: the sample contains 999 rows instead of the intended 1000, because int() truncates each group's count (for example, 0.081536 * 1000 = 81.536 becomes 81).
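If the total has to be exact, one possible fix (not part of the original answer) is to compute the per-group counts up front with the largest-remainder method: floor every count, then hand the leftover rows to the groups with the largest fractional parts. The helper name allocate is hypothetical; the positions values are the ones from the answer:

```python
import math

positions = {"least": 0.277220, "left": 0.250000, "right": 0.250000,
             "left-center": 0.141244, "right-center": 0.081536}


def allocate(fractions, total):
    """Split total into integer counts proportional to fractions (largest remainder)."""
    weight_sum = sum(fractions.values())
    # Exact (non-integer) share of each group, normalised in case the fractions do not sum to 1.
    exact = {key: value / weight_sum * total for key, value in fractions.items()}
    counts = {key: math.floor(share) for key, share in exact.items()}
    shortfall = total - sum(counts.values())
    # Give one extra row to the groups with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda key: exact[key] - counts[key], reverse=True)
    for key in by_remainder[:shortfall]:
        counts[key] += 1
    return counts


counts = allocate(positions, 1000)
print(counts, sum(counts.values()))
```

With these inputs the extra row goes to 'right-center' (81.536 has the largest fractional part), so the counts sum to 1000. Inside the sample function, counts[obj.name] could then take the place of int(positions[obj.name] * total).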
