Altering distribution of one dataset to match another dataset


Problem description

I have 2 datasets, one of modeled (artificial) data and another of observed data. Their statistical distributions differ slightly, and I want to force the modeled data to match the spread of the observed data. In other words, I need the modeled data to better represent the tails of the observed data. Here's an example.

model <- c(37.50,46.79,48.30,46.04,43.40,39.25,38.49,49.51,40.38,36.98,40.00,
38.49,37.74,47.92,44.53,44.91,44.91,40.00,41.51,47.92,36.98,43.40,
42.26,41.89,38.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
49.81,38.87,40.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
45.34,43.37,54.15,42.77,42.88,44.26,27.14,39.31,24.80,16.62,30.30,
36.39,28.60,28.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
38.65,34.54,37.70,38.11,43.05,29.95,32.48,24.63,35.33,41.34)

observed <- c(39.50,44.79,58.28,56.04,53.40,59.25,48.49,54.51,35.38,39.98,28.00,
28.49,27.74,51.92,42.53,44.91,44.91,40.00,41.51,47.92,36.98,53.40,
42.26,42.89,43.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
52.81,36.87,47.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
51.34,43.37,51.15,42.77,42.88,44.26,27.14,39.31,24.80,12.62,30.30,
34.39,25.60,38.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
34.65,39.54,47.70,38.11,43.05,29.95,22.48,24.63,35.33,41.34)

summary(model)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
16.62   36.98   40.38   40.28   44.91   54.15 

summary(observed)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
12.62   35.54   42.58   41.10   47.76   59.25

How can I force the model data to have the same variability as the observed data in R?

Recommended answer

Are you just trying to model the distribution of observed? If so, you could generate a kernel density estimate from the observations and then resample from that modeled density distribution. For example:

library(ggplot2)

First we generate a density estimate from the observed values. This is our model of the distribution of the observed values. adjust is a multiplier applied to the bandwidth. The default value is 1. Smaller values result in less smoothing (i.e., a density estimate that more closely follows small-scale structure in the data):

dens.obs = density(observed, adjust=0.8)
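To see what adjust does concretely, the following minimal sketch compares the bandwidth with and without it. It uses a simulated stand-in vector, obs.demo (a hypothetical name and made-up data, not the question's observed), so that it runs on its own.

```r
# Stand-in data (hypothetical); swap in `observed` from the question if available
set.seed(1)
obs.demo <- rnorm(76, mean = 41, sd = 9)

d.default <- density(obs.demo)                # adjust = 1
d.narrow  <- density(obs.demo, adjust = 0.8)  # 20% narrower kernel

# The bandwidth actually used is adjust times the default bandwidth
d.default$bw
d.narrow$bw
all.equal(d.narrow$bw, 0.8 * d.default$bw)

# density() evaluates the estimate on a regular grid (512 points by default)
length(d.narrow$x)
```

The grid length matters for the next step, because resampling from dens.obs$x can only return values that lie on that grid.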

Now, resample from the density estimate to get the modeled values. dens.obs$x is the grid of points at which the density was evaluated, and dens.obs$y is the estimated density at each point. We set prob=dens.obs$y so that the probability of a value in dens.obs$x being chosen is proportional to its modeled density.

set.seed(439)
resample.obs = sample(dens.obs$x, 1000, replace=TRUE, prob=dens.obs$y)
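Sampling from the grid returns only the discrete grid values. As a possible alternative (not part of the original answer), a Gaussian kernel density estimate can also be sampled without the grid: draw observations with replacement and add Gaussian noise whose standard deviation equals the bandwidth. The sketch below assumes the default Gaussian kernel and uses hypothetical stand-in data named obs.demo.

```r
set.seed(439)
obs.demo  <- rnorm(76, mean = 41, sd = 9)      # hypothetical stand-in for `observed`
dens.demo <- density(obs.demo, adjust = 0.8)   # Gaussian kernel by default

n <- 1000
# Pick observations with replacement, then jitter each one with
# Gaussian noise whose sd is the kernel bandwidth
smooth.draws <- sample(obs.demo, n, replace = TRUE) +
  rnorm(n, mean = 0, sd = dens.demo$bw)

summary(smooth.draws)
```

This works because density() scales its kernels so that bw is the standard deviation of the smoothing kernel. With a non-Gaussian kernel the noise distribution would have to change accordingly.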

Put the observed and modeled values in a data frame in preparation for plotting:

dat = data.frame(value=c(observed,resample.obs), 
                 group=rep(c("Observed","Modeled"), c(length(observed),length(resample.obs))))

The ECDF (empirical cumulative distribution function) plot below shows that sampling from the kernel density estimate gives samples with a distribution similar to that of the observed data:

ggplot(dat, aes(value, fill=group, colour=group)) +
  stat_ecdf(geom="step") +
  theme_bw()
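A numerical check can complement the ECDF plot. The sketch below compares a few quantiles of the two samples side by side. It rebuilds the resampling step on hypothetical stand-in data (obs.demo, resample.demo) so that it runs on its own; with the question's data, observed and resample.obs would take their place.

```r
set.seed(439)
obs.demo      <- rnorm(76, mean = 41, sd = 9)   # hypothetical stand-in for `observed`
dens.demo     <- density(obs.demo, adjust = 0.8)
resample.demo <- sample(dens.demo$x, 1000, replace = TRUE, prob = dens.demo$y)

# Compare selected quantiles, including the tails
probs <- c(0.05, 0.25, 0.5, 0.75, 0.95)
q.compare <- rbind(observed  = quantile(obs.demo, probs),
                   resampled = quantile(resample.demo, probs))
round(q.compare, 2)
```

Rows that agree closely at the 5% and 95% columns suggest the tails are being reproduced, which is what the question asks for.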

You can also plot the density of the observed data alongside that of the values sampled from the modeled distribution (using the same value for the adjust parameter as above).

ggplot(dat, aes(value, fill=group, colour=group)) +
  geom_density(alpha=0.4, adjust=0.8) +
  theme_bw()
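The approach above generates new values and does not transform the existing modeled series. If the aim is instead to keep the ordering of the modeled values while giving them the observed distribution, one commonly used option is empirical quantile mapping. This is not part of the original answer. The following is only a sketch, run on hypothetical stand-in data (model.demo, obs.demo) in place of model and observed.

```r
set.seed(1)
model.demo <- rnorm(76, mean = 40, sd = 7)   # hypothetical stand-in for `model`
obs.demo   <- rnorm(76, mean = 41, sd = 9)   # hypothetical stand-in for `observed`

# Position of each modeled value within its own sample, strictly inside (0, 1)
p <- (rank(model.demo) - 0.5) / length(model.demo)

# Replace each modeled value with the observed quantile at the same position
model.mapped <- quantile(obs.demo, probs = p, names = FALSE)

summary(model.mapped)
summary(obs.demo)
```

The mapped values preserve the rank order of the modeled values while their spread follows the observed sample. Because the positions stop short of 0 and 1, the mapped values stay within the observed range.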
