Pandas: Sampling from a DataFrame according to a target distribution


Problem Description

I have a Pandas DataFrame containing a dataset D of instances which all have some continuous value x. x is distributed in a certain way, say uniform, could be anything.

I want to draw n samples from D for which x has a target distribution that I can sample or approximate. This comes from a dataset; here I just take a normal distribution.

How can I sample instances from D such that the distribution of x in the sample is equal/similar to an arbitrary distribution which I specify?

Right now, I sample a value x, subset D such that it contains all x +- eps and sample from that. But this is quite slow when the datasets get bigger. People must have come up with a better solution. Maybe the solution is already good but could be implemented more efficiently?

I could split x into strata, which would be faster, but is there a solution without this?
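For reference, a stratified variant might look like the following sketch, assuming the df and x_target_distribution defined in the code further down and a bin width of 0.2 chosen to mirror eps:

# Illustrative sketch of a stratified variant (assumes df and
# x_target_distribution as defined in the code below).
bins = np.linspace(-5, 5, 51)  # 50 bins of width 0.2, mirroring eps
strata = {k: g for k, g in df.groupby(pd.cut(df["x"], bins), observed=True)}

def sample_instance_stratified(x):
    group = strata.get(pd.cut([x], bins)[0])
    # fall back to the full DataFrame if x lands in an empty or out-of-range bin
    return group.sample(1) if group is not None else df.sample(1)

df_sampled_strata = pd.concat(sample_instance_stratified(x) for x in x_target_distribution)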

My current code, which works fine but is slow (1 min for 30k/100k, but I have 200k/700k or so):

import numpy as np
import pandas as pd
import numpy.random as rnd
from matplotlib import pyplot as plt
from tqdm import tqdm

n_target = 30000
n_dataset = 100000

x_target_distribution = rnd.normal(size=n_target)
# In reality this would be x_target_distribution = my_dataset["x"].sample(n_target, replace=True)

df = pd.DataFrame({
    'instances': np.arange(n_dataset),
    'x': rnd.uniform(-5, 5, size=n_dataset)
    })

plt.hist(df["x"], histtype="step", density=True)
plt.hist(x_target_distribution, histtype="step", density=True)

def sample_instance_with_x(x, eps=0.2):
    try:
        return df.loc[abs(df["x"] - x) < eps].sample(1)
    except ValueError: # fallback if no instance possible
        return df.sample(1)

df_sampled_ = [sample_instance_with_x(x) for x in tqdm(x_target_distribution)]
df_sampled = pd.concat(df_sampled_)

plt.hist(df_sampled["x"], histtype="step", density=True)
plt.hist(x_target_distribution, histtype="step", density=True)

Recommended Answer

Rather than generating new points and finding a closest neighbor in df.x, define the probability that each point should be sampled according to your target distribution. You can use np.random.choice. A million points are sampled from df.x in a second or so for a gaussian target distribution like this:

x = np.sort(df.x)
f_x = np.gradient(x)*np.exp(-x**2/2)
sample_probs = f_x/np.sum(f_x)
samples = np.random.choice(x, p=sample_probs, size=1000000)

sample_probs is the key quantity, as it can be joined back to the dataframe or used as an argument to df.sample, e.g.:

# sample df rows without replacement
df_samples = df["x"].sort_values().sample(
    n=1000, 
    weights=sample_probs, 
    replace=False,
)
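If whole rows are needed rather than just the x values, one way to join sample_probs back to the dataframe is the sketch below; the weight column name is illustrative, and the row order is kept identical to x = np.sort(df.x):

# Sketch: attach the weights to the rows sorted by x, then sample entire rows.
df_sorted = df.sort_values("x").reset_index(drop=True)  # same order as np.sort(df.x)
df_sorted["weight"] = sample_probs
df_row_samples = df_sorted.sample(n=1000, weights="weight", replace=False)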

Result of plt.hist(samples, bins=100, density=True):

Let's see how this method performs when the original samples are drawn from a gaussian distribution and we wish to sample them from a uniform target distribution:

x = np.sort(np.random.normal(size=100000))
f_x = np.gradient(x)*np.ones(len(x))
sample_probs = f_x/np.sum(f_x)
samples = np.random.choice(x, p=sample_probs, size=1000000)

Pretty well, actually. Few points were sampled from the tails of the gaussian, yet those points are assigned large probabilities under the uniform target; this is why the tails look sparse and noisier than the middle section.

Approximate probabilities are calculated for samples in x in the form:

prob(x_i) ~ delta_x*rho(x_i)

where rho(x_i) is the density function and np.gradient(x) is used as a differential value. If the differential weight is ignored, f_x will over-represent close points and under-represent sparse points in the resampling. I made this mistake initially; the effect is small if x is uniformly distributed (but can generally be significant).
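A minimal sketch of that comparison, using the uniform df.x from above (where the difference should indeed be small) and an arbitrary sample size of 100000:

# Compare resampling with and without the np.gradient(x) differential term.
x = np.sort(df.x)
rho = np.exp(-x**2/2)             # gaussian target density, up to a constant
w_correct = np.gradient(x) * rho  # prob(x_i) ~ delta_x * rho(x_i)
w_naive = rho                     # ignores the local spacing of x
samples_correct = np.random.choice(x, p=w_correct/np.sum(w_correct), size=100000)
samples_naive = np.random.choice(x, p=w_naive/np.sum(w_naive), size=100000)

plt.hist(samples_correct, bins=100, histtype="step", density=True)
plt.hist(samples_naive, bins=100, histtype="step", density=True)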
