Sampling data based on posterior joint-probabilities

Problem Description

I have a dataset and would like to get a sample from it based on probabilities that I set manually.

Example: (id = user, score (sorted descending), b1-b6 (dummy variables)); a 1 means the user has that feature, 0 otherwise:

id  score  b1  b2  b3  b4  b5  b6
 1   0.99   1   0   0   0   1   0
 2   0.98   1   0   0   0   0   0
 3   0.97   1   1   1   0   1   1
 4   0.96   0   1   0   0   0   0

A parameter set (p1,p2,p3,p4,p5,p6) is given that controls the proportion of users having the feature in columns (b1,b2,b3,b4,b5,b6), respectively.

Say I set p1 = 0.1, p2 = 0.2, p3 = 0.9, p4 = 0.32, p5 = 0.2, p6 = 0.21. I then expect to draw a sample from the dataset whose distribution approximately follows the p1-p6 values (i.e., about 10% of the sampled users have a 1 in b1, 20% have a 1 in b2, and so on).

The problem is that the original dataset has its own distributions across b1 to b6. How can I draw a sample from it whose distributions follow the p1-p6 values?

Any ideas would be appreciated.

UPDATE: The goal is to draw a sample from a large existing dataset (e.g., 1k rows out of 1000k) that follows the distributions (p1, p2, etc.), not to simulate phony data.

Approach 1: It might be solved by repeated random sampling, keeping the draw whose proportions come closest to the targets (needs resampling or iteration tricks); see the sketch below.
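A hedged sketch of this idea in R. The names big and target are hypothetical stand-ins for the full dataset's b1-b6 columns and the p1-p6 vector; big is simulated here only to make the sketch runnable:

target <- c(b1=0.1, b2=0.2, b3=0.9, b4=0.32, b5=0.2, b6=0.21)
set.seed(1)
# stand-in for the real 1000k-row data; its marginals intentionally differ
big <- as.data.frame(sapply(c(0.3, 0.4, 0.7, 0.5, 0.4, 0.4),
                            function(p) rbinom(1e5, 1, p)))
names(big) <- names(target)

best <- NULL; best_err <- Inf
for (k in 1:200) {                          # number of resampling attempts
  idx <- sample(nrow(big), 1000)            # candidate sample of 1k rows
  err <- sum((colMeans(big[idx, ]) - target)^2)
  if (err < best_err) { best_err <- err; best <- idx }
}
colMeans(big[best, ])                       # proportions of the best draw

Caveat: a uniform random sample keeps the population's marginals in expectation, so if those are far from the targets this loop can only get as close as sampling variability allows; per-column or weighted selection (as in the answer below) is then needed.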

Approach 2: use a linear (integer) optimisation algorithm (possibly complicated: there are 2^6 = 64 row patterns, so a fairly large system of constraints must be solved); a sketch follows.
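One hedged way to sketch that formulation, assuming the lpSolve package and the big/target stand-ins above: group rows by their 2^6 bit pattern, then pick how many rows to take from each pattern so that the column totals match the targets.

library(lpSolve)

n_samp <- 1000
pat   <- apply(big, 1, paste, collapse = "")   # pattern key per row
grp   <- split(seq_len(nrow(big)), pat)        # row indices per pattern
pmat  <- do.call(rbind, lapply(names(grp), function(s)
  as.integer(strsplit(s, "")[[1]])))           # pattern -> 0/1 matrix
avail <- lengths(grp)                          # rows available per pattern

# x[g] = rows taken from pattern g; constraints: sum(x) = n_samp,
# per-column counts of 1s = round(target * n_samp), and x[g] <= avail[g]
const_mat <- rbind(rep(1, nrow(pmat)), t(pmat), diag(nrow(pmat)))
const_dir <- c("=", rep("=", ncol(pmat)), rep("<=", nrow(pmat)))
const_rhs <- c(n_samp, round(target * n_samp), avail)

sol <- lp("max", objective.in = rep(0, nrow(pmat)),
          const.mat = const_mat, const.dir = const_dir,
          const.rhs = const_rhs, all.int = TRUE)

if (sol$status == 0) {                 # 0 means a feasible solution was found
  take <- sol$solution
  idx  <- unlist(mapply(function(rows, k) rows[sample.int(length(rows), k)],
                        grp, take))
  print(colMeans(big[idx, ]))          # matches target up to rounding
}

The exact targets may be jointly infeasible given the available patterns, hence the status check; relaxing the equalities to ranges is a common workaround.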

Answer

Henry, as suggested in the comments, there are two general ways to produce data like this. One is to compute, for each cell, the probability that it will be 0 or 1; the other is to randomly sample positions within each column so that exactly n% are selected. The two are quite different (at least at small scales).

A demonstration. Base probabilities/proportions:

probs <- c(0.1, 0.2, 0.9, 0.32, 0.2, 0.21)
names(probs) <- paste0('b', seq_along(probs))

set.seed(2)
n <- 1e5
dat <- cbind.data.frame(sapply(probs, function(p) {
  # each cell is an independent Bernoulli(p) draw
  sample(0:1, size=n, replace=TRUE, prob=c(1-p, p))
}))
head(dat)
#   b1 b2 b3 b4 b5 b6
# 1  0  0  1  1  0  1
# 2  0  0  0  1  1  0
# 3  0  0  1  1  0  0
# 4  0  0  1  0  0  0
# 5  1  0  1  0  0  0
# 6  1  0  1  0  1  0
colSums(dat)/n
#      b1      b2      b3      b4      b5      b6 
# 0.10125 0.20100 0.89975 0.32013 0.20182 0.20827 

This looks about right; the proportions are pretty close. Now let's look at a smaller population:

set.seed(2)
n <- 10
dat <- cbind.data.frame(sapply(probs, function(p) {
  sample(0:1, size=n, replace=TRUE, prob=c(1-p, p))
}))
dat
#    b1 b2 b3 b4 b5 b6
# 1   0  0  1  0  1  0
# 2   0  0  1  0  0  0
# 3   0  0  1  1  0  0
# 4   0  0  1  1  0  1
# 5   1  0  1  0  1  0
# 6   1  1  1  0  0  1
# 7   0  1  1  1  1  0
# 8   0  0  1  0  0  1
# 9   0  0  0  0  0  0
# 10  0  0  1  0  1  0
colSums(dat)/n
#  b1  b2  b3  b4  b5  b6 
# 0.2 0.2 0.9 0.3 0.4 0.3 

That's not even "close" for some of the columns, even allowing for rounding. This is the problem: here, our "view" of the randomness is effectively one cell at a time, not one column at a time.

Okay, let's try doing it one column at a time.

set.seed(2)
n <- 10
dat <- cbind.data.frame(sapply(probs, function(p) {
  i <- sample(n, size=n*p)   # choose exactly trunc(n*p) positions
  vec <- integer(n)
  vec[i] <- 1                # mark the chosen positions
  vec
}))
dat
#    b1 b2 b3 b4 b5 b6
# 1   0  0  1  0  0  0
# 2   1  0  1  1  0  0
# 3   0  0  1  0  0  1
# 4   0  0  0  1  0  0
# 5   0  0  1  0  0  1
# 6   0  1  1  0  0  0
# 7   0  0  1  0  0  0
# 8   0  1  1  1  0  0
# 9   0  0  1  0  1  0
# 10  0  0  1  0  1  0
colSums(dat)/n
#  b1  b2  b3  b4  b5  b6 
# 0.1 0.2 0.9 0.3 0.2 0.2 

This looks much closer, within rounding. (You may choose size=ceiling(n*p) or perhaps size=max(1, n*p) to handle low probabilities, since otherwise n*p is truncated, not rounded, and small proportions can yield zero selections.) Note that with larger populations it still behaves as well as the implementation above.
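For example, the truncation is easy to see with toy numbers (plain base R, illustrative only):

n <- 10; p <- 0.06
n * p                                      # 0.6
length(sample(n, size = n * p))            # 0: size truncated, never selected
length(sample(n, size = ceiling(n * p)))   # 1: at least one selection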

Luckily, the two perform about the same, so you can choose whichever meets your sampling requirements.

library(microbenchmark)
n <- 10
microbenchmark(
  probability = cbind.data.frame(sapply(probs, function(p) { sample(0:1, size=n, replace=TRUE, prob=c(1-p, p)) })),
  proportion = cbind.data.frame(sapply(probs, function(p) { i <- sample(n, size=n*p); vec <- integer(n); vec[i] <- 1; vec; }))
)
# Unit: microseconds
#         expr     min       lq     mean   median       uq     max neval
#  probability  99.191 104.6620 126.0461 114.5075 139.4880 384.001   100
#   proportion 106.485 113.2315 131.9465 122.7135 149.1515 213.334   100
n <- 1e5
...
# Unit: milliseconds
#         expr      min       lq     mean   median       uq      max neval
#  probability 254.9634 298.0875 349.3892 331.2826 364.0245 680.3098   100
#   proportion 281.7271 351.9515 418.4833 386.5976 449.6032 931.0893   100
