跨多列的 Pandas 代表性抽样 [英] Pandas representative sampling across multiple columns

查看:42
本文介绍了跨多列的 Pandas 代表性抽样的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个代表人口的数据框,每一列表示该人的不同品质/特征.我如何获得该数据框/总体的样本,该样本代表总体上的所有特征.

I have a dataframe which represents a population, with each column denoting a different quality/ characteristic of that person. How can I get a sample of that dataframe/ population, which is representative of the population as a whole across all characteristics.

假设我有一个代表 650 人的数据框,如下所示:

Suppose I have a dataframe which represents a workforce of 650 people as follows:

import pandas as pd
import numpy as np
c = np.random.choice

colours = ['blue', 'yellow', 'green', 'green... no, blue']
knights = ['Bedevere', 'Galahad', 'Arthur', 'Robin', 'Lancelot']
qualities = ['wise', 'brave', 'pure', 'not quite so brave']

df = pd.DataFrame({'name_id':c(range(3000), 650, replace=False),
              'favourite_colour':c(colours, 650),
              'favourite_knight':c(knights, 650),
              'favourite_quality':c(qualities, 650)})

我可以得到上面反映单列分布的样本,如下所示:

I can get a sample of the above that reflects the distribution of a single column as follows:

# Find the distribution of a particular column using value_counts and normalize:
knight_weight = df['favourite_knight'].value_counts(normalize=True)

# Add this to my dataframe as a weights column:
df['knight_weight'] = df['favourite_knight'].apply(lambda x: knight_weight[x])

# Then sample my dataframe using the weights column I just added as the 'weights' argument:
df_sample = df.sample(140, weights=df['knight_weight'])

这将返回一个示例数据帧 (df_sample),使得:

This will return a sample dataframe (df_sample) such that:

df_sample['favourite_knight'].value_counts(normalize=True)
is approximately equal to
df['favourite_knight'].value_counts(normalize=True)

我的问题是:我如何生成示例数据帧 (df_sample) 以便上述即:

My question is this: How can I generate a sample dataframe (df_sample) such that the above i.e.:

df_sample[column].value_counts(normalize=True)
is approximately equal to
df[column].value_counts(normalize=True)

是否适用于所有列('name_id' 除外)而不是其中之一?样本量为 140 的 650 人口大约是我正在使用的规模,因此性能不是太大问题.我很乐意接受需要几分钟才能运行的解决方案,因为这仍然比手动生成上述示例要快得多.感谢您的帮助.

is true for all columns (except 'name_id') instead of just one of them? population of 650 with a sample size of 140 is approximately the sizes I'm working with so performance isn't too much of an issue. I'll happily accept solutions that take a couple of minutes to run as this will still be considerably faster than producing the above sample manually. Thank you for any help.

推荐答案

您创建一个组合特征列,对该列进行加权并将其绘制为权重:

You create a combined feature column, weight that one and draw with it as weights:

df["combined"] = list(zip(df["favourite_colour"],
                          df["favourite_knight"],
                          df["favourite_quality"]))

combined_weight = df['combined'].value_counts(normalize=True)

df['combined_weight'] = df['combined'].apply(lambda x: combined_weight[x])

df_sample = df.sample(140, weights=df['combined_weight'])

这篇关于跨多列的 Pandas 代表性抽样的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆