Create a fake dataset that fits the following parameters: N, mean, sd, min, and max

Question

Is there a way to create a fake dataset that fits the following parameters: N, mean, sd, min, and max?

I want to create a sample of 187 integer scale scores that have a mean of 67 and a standard deviation of 17, with observations within the range [30, 210]. I'm trying to demonstrate a conceptual lesson about statistical power, and I would like to create data with a distribution that looks like a published result. The scale score in this example is the sum of 30 items, each of which can range from 1 to 7. I don't need data for the individual items that make up the scale score, but that would be a bonus.

I know I could use rnorm(), but the values are not integers, and the min and max can fall outside my range of possible values.

scaleScore <- rnorm(187, mean = 67, sd = 17)
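One obvious patch, sketched here as an illustration rather than as part of the original question, is to round the rnorm() draws and clamp them into [30, 210]. The seed is arbitrary. The result is integer-valued and in range, but the mean and standard deviation only approximate the targets:

```r
set.seed(1)  # arbitrary seed, for reproducibility only
raw <- rnorm(187, mean = 67, sd = 17)

# Round to integers, then clamp into the allowed range [30, 210].
scaleScore <- pmin(pmax(round(raw), 30), 210)

# Integer-valued and in range, but mean and sd are only close to 67 and 17.
c(mean = mean(scaleScore), sd = sd(scaleScore))
range(scaleScore)
```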

I also know I could use sample() to get integers that stay within this range, but the mean and standard deviation won't be right.

scaleScore <- sample(30:210, 187, replace=TRUE)
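To see how far off a uniform draw is, the expected mean and standard deviation of a single value sampled from 30:210 can be computed directly. This is a small sketch added for illustration, using only base R:

```r
vals <- 30:210

# Expected mean and standard deviation of one uniform draw from vals.
unif.mean <- mean(vals)                       # 120
unif.sd <- sqrt(mean((vals - unif.mean)^2))   # about 52.25

c(mean = unif.mean, sd = unif.sd)
```

Both are far from the targets of 67 and 17, so no amount of resampling with equal weights is likely to come close.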

@Pascal's tip led me to urnorm() in the Runuran package:

library(Runuran)
set.seed(5)
scaleScore <- urnorm(n=187, mean=67, sd=17, lb=30, ub=210)
mean(scaleScore)
# [1] 68.51758
sd(scaleScore)
# [1] 16.38056
min(scaleScore)
# [1] 32.15726
max(scaleScore)
# [1] 107.6758

The mean and SD are not exact, of course, and the vector does not consist of integers.

Are there any other options?

Recommended answer

Integer Optimization With No Template

Since you want an exact mean and standard deviation, plus hard limits on the min and max, my first choice wouldn't be random number generation: a random sample is unlikely to match the mean and standard deviation of the distribution it is drawn from exactly. Instead, I would take an integer optimization approach. Define the variable x_i to be the number of times integer i appears in your sample. The decision variables are x_30, x_31, ..., x_210, and the constraints below ensure that all your conditions are met:

  • 187 samples: This can be encoded by the constraint x_30 + x_31 + ... + x_210 = 187
  • Mean of 67: This can be encoded by the constraint 30*x_30 + 31*x_31 + ... + 210*x_210 = 187 * 67
  • Logical constraints on variables: Variables must take non-negative integer values
  • "Looks like real data": This is obviously an ill-defined concept, but we could require that the frequencies of adjacent numbers differ by no more than 1. These are linear constraints of the form x_30 - x_31 <= 1, x_30 - x_31 >= -1, and so on for every consecutive pair. We can also require that each frequency does not exceed some arbitrarily chosen upper bound (I'll use 10).

Finally, we want the standard deviation to be as close to 17 as possible, meaning we want the variance to be as close as possible to 17^2 = 289. Because the mean is already fixed at 67, the sample variance is the sum of (i - 67)^2 * x_i divided by 187 - 1, which is linear in the x_i. We can define a variable y to be an upper bound on the absolute gap between that sum of squares and its target, 289 * (187 - 1), and minimize y:

y >= ((30-67)^2 * x_30 + (31-67)^2 * x_31 + ... + (210-67)^2 * x_210) - (289 * (187-1))
y >= -((30-67)^2 * x_30 + (31-67)^2 * x_31 + ... + (210-67)^2 * x_210) + (289 * (187-1))
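The two inequalities rest on the identity that, for a fixed mean, the frequency-weighted sum of squared deviations equals (n - 1) times the sample variance. A quick check in base R, on a made-up vector chosen only for illustration:

```r
# Made-up integer sample, for illustration only.
x <- c(3, 5, 5, 6, 8, 9)
n <- length(x)
avg <- mean(x)                               # 6

vals <- min(x):max(x)                        # candidate integer values
freq <- as.vector(table(factor(x, vals)))    # x_i: count of each value

# Sum of squared deviations written in terms of the frequencies.
ss <- sum((vals - avg)^2 * freq)

# Matches (n - 1) times the sample variance, i.e. sd(x)^2 * (n - 1).
all.equal(ss, var(x) * (n - 1))
```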

This is a pretty easy optimization problem to solve with a solver like lpSolve:

library(lpSolve)
get.sample <- function(n, avg, stdev, lb, ub) {
  vals <- lb:ub
  nv <- length(vals)
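  # Decision variables: one frequency x_i per integer in vals, plus y (last column).
  mod <- lp(direction = "min",
            objective.in = c(rep(0, nv), 1),  # minimize y
            const.mat = rbind(c(rep(1, nv), 0),  # frequencies sum to n
                              c(vals, 0),  # scores sum to avg * n
                              c(-(vals-avg)^2, 1),  # y >= SS - stdev^2 * (n-1)
                              c((vals-avg)^2, 1),  # y >= stdev^2 * (n-1) - SS
                              cbind(diag(nv), rep(0, nv)),  # each frequency <= 10
                              # Adjacent frequencies differ by at most 1 (one block per direction).
                              # The last row of each block involves only the final frequency.
                              cbind(diag(nv)-cbind(rep(0, nv), diag(nv)[,-nv]), rep(0, nv)),
                              cbind(diag(nv)-cbind(rep(0, nv), diag(nv)[,-nv]), rep(0, nv))),
            const.dir = c("=", "=", ">=", ">=", rep("<=", nv), rep("<=", nv), rep(">=", nv)),
            const.rhs = c(n, avg*n, -stdev^2 * (n-1), stdev^2 * (n-1), rep(10, nv), rep(1, nv), rep(-1, nv)),
            all.int = TRUE)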
  rep(vals, head(mod$solution, -1))
}
samp <- get.sample(187, 67, 17, 30, 210)
summary(samp)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#      30      64      69      67      74     119
sd(samp)
# [1] 17
plot(table(samp))

For the parameters you provided, we were able to get the exact mean and standard deviation while returning all integer values, and the computation completed on my computer in 0.4 seconds.
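A generated sample can be checked against all of the requirements at once. The helper below, check.sample, is a hypothetical name introduced here for illustration; it is not part of lpSolve or of the original answer, and the tolerance is an arbitrary choice:

```r
# check.sample is a hypothetical helper, not part of lpSolve or the answer.
check.sample <- function(x, n, avg, stdev, lb, ub, tol = 1e-8) {
  c(size     = length(x) == n,
    integers = all(x == round(x)),
    in.range = all(x >= lb & x <= ub),
    mean     = abs(mean(x) - avg) < tol,
    sd       = abs(sd(x) - stdev) < tol)
}

# Toy input: 1, 2, 3 has mean 2 and sample sd 1.
check.sample(c(1, 2, 3), n = 3, avg = 2, stdev = 1, lb = 1, ub = 3)
```

With the output of get.sample, the equivalent call would be check.sample(samp, 187, 67, 17, 30, 210).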

Another approach to getting something that resembles "real data" would be to start from a continuous sample (e.g. the result of the urnorm function that you include in the original post) and to round the values to integers in a way that best achieves your mean and standard deviation objectives. This really only introduces two new classes of constraints: an upper bound on the number of samples at each value, equal to the number of continuous samples that could be rounded either up or down to that value, and a lower bound on the sum of two consecutive frequencies, equal to the number of continuous samples that fall between those two integers. Again, this is easy to implement with lpSolve and reasonably efficient to run:

library(lpSolve)
get.sample2 <- function(n, avg, stdev, lb, ub, init.dist) {
  vals <- lb:ub
  nv <- length(vals)
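  # lims[i]: number of draws that could round down or up to vals[i] (upper bound on x_i)
  lims <- as.vector(table(factor(c(floor(init.dist), ceiling(init.dist)), vals)))
  # floors[i]: number of draws whose floor is vals[i] (lower bound on x_i + x_(i+1))
  floors <- as.vector(table(factor(c(floor(init.dist)), vals)))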
  mod <- lp(direction = "min",
            objective.in = c(rep(0, nv), 1),
            const.mat = rbind(c(rep(1, nv), 0),
                              c(vals, 0),
                              c(-(vals-avg)^2, 1),
                              c((vals-avg)^2, 1),
                              cbind(diag(nv), rep(0, nv)),
                              cbind(diag(nv) + cbind(rep(0, nv), diag(nv)[,-nv]), rep(0, nv))),
            const.dir = c("=", "=", ">=", ">=", rep("<=", nv), rep(">=", nv)),
            const.rhs = c(n, avg*n, -stdev^2 * (n-1), stdev^2 * (n-1), lims, floors),
            all.int = TRUE)
  rep(vals, head(mod$solution, -1))
}

library(Runuran)
set.seed(5)
init.dist <- urnorm(n=187, mean=67, sd=17, lb=30, ub=210)
samp2 <- get.sample2(187, 67, 17, 30, 210, init.dist)
summary(samp2)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#      32      57      66      67      77     107
sd(samp2)
# [1] 17
plot(table(samp2))
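To make the two bound vectors inside get.sample2 concrete, here is a toy run of the same table() expressions on three made-up continuous values and a four-value integer range. The numbers are assumptions chosen only to keep the output small:

```r
# Toy continuous draws and a small integer range, for illustration only.
init.dist <- c(30.2, 30.7, 31.5)
vals <- 30:33

# Upper bounds: how many draws could round (down or up) to each integer.
lims <- as.vector(table(factor(c(floor(init.dist), ceiling(init.dist)), vals)))

# Lower bounds on x_i + x_(i+1): draws whose floor is i, i.e. that lie in [i, i+1).
floors <- as.vector(table(factor(floor(init.dist), vals)))

rbind(vals, lims, floors)
```

Here lims is 2, 3, 1, 0 and floors is 2, 1, 0, 0: at most two samples may land on 30, and at least two must land on 30 or 31 combined.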

This approach is even faster (under 0.1 seconds) and still returns a distribution that exactly meets the required mean and standard deviation. Further, given sufficiently high-quality samples from continuous distributions, it can be used to get distributions of different shapes that take integer values and meet the required statistical properties.
