Simulating data in R with multiple probability distributions


Question

I am trying to simulate data via bootstrapping to create confidence bands for my real data in a funnel plot. I am building on the strategy of the accepted answer to a previous question. Instead of using a single probability distribution to simulate my data, I want to modify the approach so that it uses different probability distributions depending on which part of the data is being simulated.

I greatly appreciate anyone who can help answer the question or help me phrase it more clearly.

My problem is writing the appropriate R code for this more complicated form of data simulation.

The current code is:

n <- 1e4
set.seed(42)
sims <- sapply(1:80, 
               function(k) 
                 rowSums(
                   replicate(k, sample((1:7)/10, n, TRUE, ps))) / k)

This code simulates data in which each data point's value is the mean of k observations, for k from 1 to 80. For example, when k = 10 it randomly samples 10 values (each one of 0.1, 0.2, 0.3, 0.4, 0.5, 0.6 or 0.7) according to the probability distribution ps, which gives the probability of each value (based on the entire empirical distribution), and takes their mean.

ps looks like this:

ps <- prop.table(table((DF$mean_score)[DF$total_number_snps == 1]))
#        0.1         0.2         0.3         0.4         0.5         0.6         0.7 
#0.582089552 0.194029851 0.124378109 0.059701493 0.029850746 0.004975124 0.004975124 

For example, the probability that an observation has the value 0.1 is 0.582089552.
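To make the baseline concrete, here is a small self-contained sketch of the single-distribution simulation for one value of k. It is only an illustration: `ps` is hard-coded from the printout above (because `DF` is not shown), `n` is reduced for speed, and the names `values`, `draws`, `sim_k` and `expected` are introduced here for the example.

```r
# Probabilities copied from the printed `ps` above (hard-coded; DF is not available)
ps <- c(0.582089552, 0.194029851, 0.124378109, 0.059701493,
        0.029850746, 0.004975124, 0.004975124)
values <- (1:7) / 10

n <- 1000   # number of simulated data points (smaller than 1e4, for speed)
k <- 10     # observations per data point

set.seed(42)
# n x k matrix: each row holds the k sampled observations of one data point
draws <- matrix(sample(values, n * k, replace = TRUE, prob = ps), nrow = n)
sim_k <- rowMeans(draws)

# The simulated means should scatter around the expected value of one observation
expected <- sum(values * ps)
```

Drawing all `n * k` values in one `sample` call and reshaping them is equivalent in distribution to the `replicate`/`rowSums` construction in the question, but the random number stream differs, so the individual numbers will not match.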

Now, instead of using one frequency distribution for all simulations, I would like to use different frequency distributions, chosen conditionally on the number of observations underlying each data point.

I made a table, cond_probs, that has a row for each of my real data points. One column holds the total number of observations, and the remaining columns give the frequency of each value among that data point's observations.

Example of the cond_probs table:

gene_name   0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 total
A1  0.664   0.319   0.018   0.000   0.000   0.000   0.000   0.000   0.000   113.000
A2  0.000   1.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   1.000

So for the data point A2 there is only 1 observation, which has a value of 0.1. Therefore the frequency of the 0.1 observations is 1. For A1 there are 113 observations, and the majority of those (0.664) have the value 0.1. The idea is that cond_probs is like ps, except that cond_probs holds one probability distribution per data point rather than a single distribution for all the data.

I would like to modify the above code so that the sampling uses cond_probs instead of ps for the frequency distribution, and so that the number of observations, k, is the criterion for choosing which row of cond_probs to sample from. It would work like this:

For data points with k observations:

Look in the cond_probs table and randomly select a row whose total number of observations is similar in size to k, i.e. between 0.9k and 1.1k. If no such row exists, continue with the next k.

Once a data point is selected, use the probability distribution from that row of cond_probs, just as ps is used in the original code, to randomly sample k observations and output their mean.

For each of the n iterations of replicate, randomly sample (with replacement) a new data point from cond_probs, out of all rows where the value of total is similar to the current value of k (0.9k to 1.1k).
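The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the answer: the toy `cond_probs` values are made up, `simulate_k` and its `tol` argument are hypothetical names, and the 0.9k to 1.1k window is treated as inclusive at both ends.

```r
# Toy cond_probs (made-up values): column "total" plus one column per value 0.1 ... 1.0
values <- (1:10) / 10
cond_probs <- rbind(
  c(9,  0.9, 0.1, rep(0, 8)),
  c(10, 0.5, 0.5, rep(0, 8)),
  c(11, 0.0, 1.0, rep(0, 8)),
  c(50, 0.2, 0.2, 0.6, rep(0, 7))
)
colnames(cond_probs) <- c("total", values)

# Hypothetical helper: simulate n means for data points with k observations
simulate_k <- function(k, n, cond_probs, values, tol = 0.1) {
  total <- cond_probs[, "total"]
  # rows whose total lies within 0.9k .. 1.1k (inclusive, an assumption)
  candidates <- which(total >= (1 - tol) * k & total <= (1 + tol) * k)
  if (length(candidates) == 0) return(NULL)   # no similar row: skip this k
  vapply(seq_len(n), function(i) {
    # pick one row per simulated data point, then draw k observations from it
    row <- candidates[sample.int(length(candidates), 1)]
    mean(sample(values, k, replace = TRUE, prob = cond_probs[row, -1]))
  }, numeric(1))
}

set.seed(42)
# lapply keeps one list element per k, even when some k have no similar row
sims <- lapply(1:60, simulate_k, n = 200, cond_probs = cond_probs, values = values)
```

Indexing `candidates` through `sample.int` avoids the pitfall that `sample(x, 1)` treats a single number `x` as `1:x`.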

The idea is that, for this dataset, the choice of probability distribution should be conditioned on the number of observations underlying a data point. This is because in this dataset the probability of an observation's value is influenced by the number of observations (genes with more SNPs tend to have a lower score per observation due to genetic linkage and background selection).

UPDATE USING ANSWER BELOW:

I tried using the answer below, and it works for the simulated cond_probs data in the example but not for my real cond_probs file. I imported my cond_probs file and converted it to a matrix with

cond_probs <- read.table("cond_probs.txt", header = TRUE, check.names = FALSE)
cond_probs <- as.matrix(cond_probs)

and the first rows (out of ~20,000 rows) look like this:

>cond_probs
       total   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
[1,]     109 0.404 0.174 0.064 0.183 0.165 0.009 0.000 0.000 0.000 0.000
[2,]     181 0.564 0.221 0.144 0.066 0.006 0.000 0.000 0.000 0.000 0.000
[3,]     289 0.388 0.166 0.118 0.114 0.090 0.093 0.028 0.003 0.000 0.000
[4,]     388 0.601 0.214 0.139 0.039 0.008 0.000 0.000 0.000 0.000 0.000
[5,]     133 0.541 0.331 0.113 0.000 0.008 0.008 0.000 0.000 0.000 0.000
[6,]     221 0.525 0.376 0.068 0.032 0.000 0.000 0.000 0.000 0.000 0.000
[7,]     147 0.517 0.190 0.150 0.054 0.034 0.048 0.007 0.000 0.000 0.000
[8,]     107 0.458 0.196 0.252 0.084 0.009 0.000 0.000 0.000 0.000 0.000
[9,]      13 0.846 0.154 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000

If I run:

sampleSize <- 20
set.seed(42)
#replace 1:80 with 1: max number of SNPs in gene in dataset
sims_test <- sapply( 1:50, simulateData, sampleSize )

and look at the means from the sampling with x observations, I get only a single result where there should be 20.

For example:

> sims_test[[31]]
[1] 0.1

And sims_test is not ordered in the same way as sims:

>sims_test
   [,1] [,2]      [,3]  [,4] [,5]      [,6]      [,7]   [,8]      [,9]
 [1,]  0.1  0.1 0.1666667 0.200 0.14 0.2666667 0.2000000 0.2375 0.1888889
 [2,]  0.1  0.1 0.1333333 0.200 0.14 0.2333333 0.1571429 0.2625 0.1222222
 [3,]  0.1  0.1 0.3333333 0.225 0.14 0.1833333 0.2285714 0.2125 0.1555556
 [4,]  0.1  0.1 0.2666667 0.250 0.10 0.1500000 0.2000000 0.2625 0.2777778
 [5,]  0.1  0.1 0.3000000 0.200 0.16 0.2000000 0.2428571 0.1750 0.1000000
 [6,]  0.1  0.1 0.3666667 0.250 0.16 0.1666667 0.2142857 0.2500 0.2000000
 [7,]  0.1  0.1 0.4000000 0.300 0.12 0.2166667 0.1857143 0.2375 0.1666667
 [8,]  0.1  0.1 0.4000000 0.250 0.10 0.2500000 0.2714286 0.2375 0.2888889
 [9,]  0.1  0.1 0.1333333 0.300 0.14 0.1666667 0.1714286 0.2750 0.2888889

UPDATE 2

Using cond_probs <- head(cond_probs,n) I have determined that the code works up to n = 517, and that for all sizes greater than this it produces the same output as above. I am not sure whether this is an issue with the file itself or a memory issue. I found that if I remove line 518 and duplicate the preceding lines several times to make a larger file, it works, suggesting that the line itself is causing the problem. Line 518 looks like this:

9.000   0.889   0.000   0.000   0.000   0.111   0.000   0.000   0.000   0.000   0.000

I found another 4 offending lines:

9.000   0.444   0.333   0.111   0.111   0.000   0.000   0.000   0.000   0.000   0.000

9.000   0.444   0.333   0.111   0.111   0.000   0.000   0.000   0.000   0.000   0.000

9.000   0.111   0.222   0.222   0.111   0.111   0.222   0.000   0.000   0.000   0.000

9.000   0.667   0.111   0.000   0.000   0.000   0.222   0.000   0.000   0.000   0.000

I don't notice anything unusual about them. They all have a 'total' of 9 sites. If I remove these lines and run the code on a 'cond_probs' file containing only the lines BEFORE them, it works. But there must be other problematic lines, as the entire 'cond_probs' file still doesn't work.

I tried putting these problematic lines back into a smaller 'cond_probs' file, and that file then works, so I am very confused: the lines don't seem to be inherently problematic. On the other hand, the fact that they all have a total of 9 sites suggests some kind of causative pattern.

I would be happy to share the entire file privately if that helps, as I don't know what to try next for troubleshooting.

One further issue is that I'm not sure the code is working as expected. I made a dummy cond_probs file containing two data points with a 'total' of '1' observation:

total   0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
1.000   0.000   0.000   0.000   0.000   0.000   1.000   0.000   0.000   0.000   0.000
1.000   0.000   1.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000

So I would expect both of them to be sampled for data points with '1' observation, giving roughly 50% of simulated points with a mean of '0.2' and 50% with a mean of '0.6'. However, the mean is always 0.2:

sims_test[[1]]
 [1] 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2

Even if I sample 10000 times, all results are 0.2 and never 0.6. My understanding of the code is that it should randomly select a new row of similar size from cond_probs for each observation, but in this case it seems not to be doing so. Do I misunderstand the code, or is there still a problem with my input?
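A quick way to check that expectation independently of the answer's functions is to re-draw the row for every simulated point on the dummy table. This sketch only reproduces the intended behaviour; it does not diagnose why the run above returned 0.2 throughout. The names `similar` and `means` are chosen for the example.

```r
values <- (1:10) / 10
# Dummy table from above: two data points, each with a total of 1 observation
cond_probs <- rbind(
  c(1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0),   # single observation with value 0.6
  c(1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0)    # single observation with value 0.2
)
colnames(cond_probs) <- c("total", values)

k <- 1
# same similarity rule as in the answer: relative difference below 0.1
similar <- which(abs(cond_probs[, "total"] - k) / k < 0.1)   # both rows qualify

set.seed(42)
means <- replicate(10000, {
  row <- similar[sample.int(length(similar), 1)]   # new row for every simulated point
  mean(sample(values, k, replace = TRUE, prob = cond_probs[row, -1]))
})
table(means)   # both 0.2 and 0.6 should appear, in roughly equal shares
```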

The entire cond_probs file can be found at the following address:

cond_probs

Changing sapply to lapply when running the simulations fixed this issue.
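The reason this helps is that sapply simplifies its result to a matrix whenever every call returns a vector of the same length, whereas lapply always returns a list. With a matrix, `[[31]]` picks a single cell rather than the vector for k = 31. One plausible reading of the "offending" rows (an inference, not confirmed in the thread) is that a row with total = 9 closes the last gap among 1:50, so no call returns NA any more and sapply switches from a list to a matrix. The sketch below uses a hypothetical stand-in `f` for simulateData to show the difference.

```r
# Hypothetical stand-in for simulateData: a vector of length 3,
# or a single NA when no similar row exists (here: only for k = 2)
f <- function(k, has_gap) if (has_gap && k == 2) NA else rep(k / 10, 3)

with_gap    <- sapply(1:4, f, has_gap = TRUE)    # lengths differ: result stays a list
without_gap <- sapply(1:4, f, has_gap = FALSE)   # equal lengths: simplified to a 3 x 4 matrix
as_list     <- lapply(1:4, f, has_gap = FALSE)   # always a list

is.list(with_gap)       # list, one element per k
is.matrix(without_gap)  # matrix, one column per k
without_gap[[3]]        # a single number: the third cell of the matrix
as_list[[3]]            # the full vector for k = 3
without_gap[, 3]        # with a matrix, the column holds the results for k = 3
```

Passing `simplify = FALSE` to sapply would give the same list-shaped result as lapply.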

Another reason I think leaving cond_probs as it is and choosing a distribution sampleSize times might be the best solution: the probability of choosing a distribution should be related to its frequency in cond_probs. If we combine distributions, the odds of picking a distribution with total 9 or 10 will no longer depend on the number of data points with these totals. Example: if there are 90 distributions with total=10 and 10 with total=9, there should be a 90% chance of choosing a distribution with total=10. If we combine distributions, wouldn't the odds of choosing a distribution with 'total' = 9 or 10 become 50/50 (which would not be ideal)?
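The 90/10 example can be checked numerically. The sketch below assumes that "combining" means keeping one merged distribution per distinct total and then choosing uniformly among the merged ones; that reading, and all variable names, are assumptions made for the illustration.

```r
set.seed(42)
# Hypothetical pool of similar rows: 90 with total = 10 and 10 with total = 9
totals <- c(rep(10, 90), rep(9, 10))

# Picking a row uniformly at random keeps the 90/10 weighting
picked_rows <- totals[sample.int(length(totals), 10000, replace = TRUE)]
share_rows <- mean(picked_rows == 10)

# Picking uniformly among the distinct totals (one merged distribution each)
# loses the weighting and gives roughly 50/50
distinct <- unique(totals)
picked_merged <- distinct[sample.int(length(distinct), 10000, replace = TRUE)]
share_merged <- mean(picked_merged == 10)

c(rows = share_rows, merged = share_merged)
```

A merged scheme could restore the weighting by sampling the merged distributions with probabilities proportional to their row counts, but choosing rows directly gets that for free.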

Recommended answer

I simply wrote a function ps that chooses an appropriate distribution from cond_probs:

N <- 10  # The sampled values are 0.1, 0.2, ... , N/10
M <- 8   # number of distributions in "cond_probs"

#-------------------------------------------------------------------
# Example data:

set.seed(1)

cond_probs <- matrix(0,M,N)

is.numeric(cond_probs)

for(i in 1:nrow(cond_probs)){ cond_probs[i,] <- dnorm((1:N)/M,i/M,0.01*N) }

is.numeric(cond_probs)

total <- sort( sample(1:80,nrow(cond_probs)) )
cond_probs <- cbind( total, cond_probs/rowSums(cond_probs) )

colnames(cond_probs) <- c( "total", paste("P",1:N,sep="") )

#---------------------------------------------------------------------
# A function that chooses an appropriate distribution from "cond_probs",
# depending on the number of observations "numObs":

ps <- function( numObs,
                similarityLimit = 0.1 )
{
  similar <- which( abs(cond_probs[,"total"] - numObs) / numObs < similarityLimit )

  if ( length(similar) == 0 )
  { 
    return(NA)
  }
  else
  {
    return( cond_probs[similar[sample(1:length(similar),1)],-1] )
  }
}

#-----------------------------------------------------------------
# A function that simulates data using a distribution that is
# appropriate to the number of observations, if possible:

simulateData <- function( numObs, sampleSize )
{
  if (any(is.na(ps(numObs))))
  {
    return (NA)
  }
  else
  {
    return( rowSums(
              replicate(
                numObs,
                replicate( sampleSize, sample((1:N)/10, 1, prob = ps(numObs))))
            ) / numObs )
  }
}

#-----------------------------------------------------------------
# Test:

sampleSize <- 30
set.seed(42)
sims <- lapply( 1:80, simulateData, sampleSize )

The distributions in cond_probs:

    total           P1           P2           P3           P4           P5           P6           P7           P8           P9          P10
[1,]    16 6.654875e-01 3.046824e-01 2.923948e-02 5.881753e-04 2.480041e-06 2.191926e-09 4.060763e-13 1.576900e-17 1.283559e-22 2.189990e-28
[2,]    22 2.335299e-01 5.100762e-01 2.335299e-01 2.241119e-02 4.508188e-04 1.900877e-06 1.680045e-09 3.112453e-13 1.208647e-17 9.838095e-23
[3,]    30 2.191993e-02 2.284110e-01 4.988954e-01 2.284110e-01 2.191993e-02 4.409369e-04 1.859210e-06 1.643219e-09 3.044228e-13 1.182153e-17
[4,]    45 4.407425e-04 2.191027e-02 2.283103e-01 4.986755e-01 2.283103e-01 2.191027e-02 4.407425e-04 1.858391e-06 1.642495e-09 3.042886e-13
[5,]    49 1.858387e-06 4.407417e-04 2.191023e-02 2.283099e-01 4.986746e-01 2.283099e-01 2.191023e-02 4.407417e-04 1.858387e-06 1.642492e-09
[6,]    68 1.642492e-09 1.858387e-06 4.407417e-04 2.191023e-02 2.283099e-01 4.986746e-01 2.283099e-01 2.191023e-02 4.407417e-04 1.858387e-06
[7,]    70 3.042886e-13 1.642495e-09 1.858391e-06 4.407425e-04 2.191027e-02 2.283103e-01 4.986755e-01 2.283103e-01 2.191027e-02 4.407425e-04
[8,]    77 1.182153e-17 3.044228e-13 1.643219e-09 1.859210e-06 4.409369e-04 2.191993e-02 2.284110e-01 4.988954e-01 2.284110e-01 2.191993e-02

Means of the distributions:

> cond_probs[,-1] %*% (1:10)/10
          [,1]
[1,] 0.1364936
[2,] 0.2046182
[3,] 0.3001330
[4,] 0.4000007
[5,] 0.5000000
[6,] 0.6000000
[7,] 0.6999993
[8,] 0.7998670

Means of the simulated data for 31 observations:

> sims[[31]]
 [1] 0.2838710 0.3000000 0.2935484 0.3193548 0.3064516 0.2903226 0.3096774 0.2741935 0.3161290 0.3193548 0.3032258 0.2967742 0.2903226 0.3032258 0.2967742
[16] 0.3129032 0.2967742 0.2806452 0.3129032 0.3032258 0.2935484 0.2935484 0.2903226 0.3096774 0.3161290 0.2741935 0.3161290 0.3193548 0.2935484 0.3032258

The appropriate distribution is the third one:

> ps(31)
          P1           P2           P3           P4           P5           P6           P7           P8           P9          P10 
2.191993e-02 2.284110e-01 4.988954e-01 2.284110e-01 2.191993e-02 4.409369e-04 1.859210e-06 1.643219e-09 3.044228e-13 1.182153e-17 
