Generating two series with a certain correlation and a specific condition in R
Problem description
I want to generate two data series of size 100 in R. One is the remission time, tr, drawn from an Exp(mean=1) distribution; the other is the survival time, t, drawn from an Exp(mean=2.5) distribution. I want them to be negatively correlated (say, with correlation -0.5). At the same time, I want R to avoid any data point i where t[i] is less than tr[i], because survival times should be greater than remission times. I have been able to produce some correlation between the two variables (although the target correlation is not reproduced exactly) using the following code:
rho <- -0.5
mu <- rep(0,2)
Sigma <- matrix(rho, nrow=2, ncol=2) + diag(2)*(1 - rho)
library(MASS)
rawvars <- mvrnorm(100, mu=mu, Sigma=Sigma)
pvars <- pnorm(rawvars)
tr <- rep(0,100)
for(i in 1:100){
  tr[i] <- qexp(pvars[,1][i], 1/1)
}
t <- rep(0,100)
for(i in 1:100){
  repeat {
    t[i] <- qexp(pvars[,2][i], 1/2)
    if (t[i]>tr[i]) break
  }
}
cor(tr,t)
sum(tr>t) # shows number of invalid cases
But I don't understand how to impose the condition efficiently so that R only generates values of t that are greater than the corresponding tr. Is there also a better (faster) way to do the whole thing in R? Thanks in advance for your response.
Recommended answer
The issue here is that qexp is the quantile function: it is deterministic and returns the same value for the same probability pvars[,2][i]. As a result, your repeat loop never terminates whenever some row pvars[i,] yields t[i] <= tr[i], because each retry recomputes exactly the same t[i]. To avoid that, you must regenerate rawvars for each (t[i], tr[i]) pair that fails your condition. In addition, looping over pvars is unnecessary, since qexp and the > operator are both vectorized. The following code does what you want:
rho <- -0.5
mu <- rep(0,2)
Sigma <- matrix(rho, nrow=2, ncol=2) + diag(2)*(1 - rho)
library(MASS)
set.seed(1) ## so that results are repeatable
compute.tr.t <- function(n, paccept) {
  n <- round(n / paccept)   ## oversample to allow for rejections
  rawvars <- mvrnorm(n, mu=mu, Sigma=Sigma)
  pvars <- pnorm(rawvars)   ## correlated uniform probabilities
  tr <- qexp(pvars[,1], 1/1)
  t <- qexp(pvars[,2], 1/2)
  keep <- which(t > tr)     ## keep only pairs with t > tr
  return(data.frame(t=t[keep],tr=tr[keep]))
}
n <- 10000 ## generating 10000 instead of 100, this can now be large
paccept <- 1
res <- data.frame()
while (n > 0) {
  new.res <- compute.tr.t(n, paccept)
  res <- rbind(res, new.res)
  paccept <- nrow(new.res) / n
  n <- n - nrow(res)
}
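As a side illustration of the two points made before the code (qexp is deterministic, and qexp and > are vectorized), here is a minimal sketch. The probabilities p, p_tr and p_t are made-up values chosen only for the demonstration; they do not come from the question.

```r
# Same probability in, same quantile out: qexp() involves no randomness,
# so retrying inside a repeat loop can never change the result.
p <- 0.3
first_try  <- qexp(p, rate = 1/2)
second_try <- qexp(p, rate = 1/2)
same_value <- identical(first_try, second_try)

# qexp() and `>` both work on whole vectors, so no explicit loop is needed.
p_tr <- c(0.9, 0.2, 0.6)   # hypothetical probabilities for tr
p_t  <- c(0.1, 0.7, 0.5)   # hypothetical probabilities for t
tr <- qexp(p_tr, rate = 1)
t  <- qexp(p_t,  rate = 1/2)
valid <- t > tr            # one logical value per pair
print(same_value)
print(valid)
```

The only way to get a different t[i] for a failing pair is to draw a new probability, which is why the answer regenerates rawvars instead of retrying qexp.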
Notes:
The function compute.tr.t borrows a technique from rejection sampling. Its input arguments are the number of samples we want and the expected probability of acceptance. With these:
- It generates round(n / paccept) exponential variates for both tr and t, to account for the probability of acceptance.
- It keeps only those pairs satisfying the condition t > tr.
compute.tr.t may return fewer than the requested n samples. We can then use this information to compute how many more samples we need and what the updated expected probability of acceptance is.
We generate the samples satisfying our condition in a while loop. In this loop:
- We call compute.tr.t with the requested number of samples and the expected acceptance rate. Initially, these are set to the total number of samples we want and 1, respectively.
- The result of compute.tr.t is then appended to the result data frame res.
- The updated acceptance probability is the ratio of how many samples were returned to how many were requested.
- The updated requested number of samples is how many more we still need out of the total we want.
- We stop when the next requested number of samples is less than or equal to 0 (i.e., we have enough samples).
The resulting data frame may contain more samples than the total number we want.
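If an exact sample count matters, the same rejection idea can be written so that the loop tracks a fixed target and trims any surplus at the end. The following is only a sketch of such a variant, not the answer's code: the names draw_valid_pairs, target, needed and n_draw are hypothetical, and the acceptance rate here is measured against the number of pairs actually drawn rather than the number requested.

```r
library(MASS)  # provides mvrnorm()

rho   <- -0.5
Sigma <- matrix(rho, nrow = 2, ncol = 2) + diag(2) * (1 - rho)

# Draw n_draw correlated pairs and keep only those with t > tr.
draw_valid_pairs <- function(n_draw) {
  # matrix() guards against mvrnorm() returning a plain vector when n_draw is 1
  rawvars <- matrix(mvrnorm(n_draw, mu = c(0, 0), Sigma = Sigma), ncol = 2)
  pvars   <- pnorm(rawvars)
  tr <- qexp(pvars[, 1], rate = 1)
  t  <- qexp(pvars[, 2], rate = 1/2)
  keep <- t > tr
  data.frame(t = t[keep], tr = tr[keep])
}

set.seed(1)
target  <- 10000   # total number of valid pairs wanted
paccept <- 1       # initial guess, refined after every batch
res <- data.frame(t = numeric(0), tr = numeric(0))

while (nrow(res) < target) {
  needed  <- target - nrow(res)           # measured against the fixed target
  n_draw  <- ceiling(needed / paccept)    # oversample to allow for rejections
  new.res <- draw_valid_pairs(n_draw)
  res     <- rbind(res, new.res)
  # accepted / drawn; max(..., 1) keeps the estimate strictly positive
  paccept <- max(nrow(new.res), 1) / n_draw
}

res <- res[seq_len(target), ]             # drop any surplus rows
```

Trimming the surplus is safe here because the rows are independent draws, so discarding the last few does not bias the sample.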
Running this code, we get:
print(cor(res$tr,res$t))
##[1] -0.09128498
print(sum(res$tr>res$t)) # shows number of invalid cases
##[1] 0
We note that the anti-correlation is significantly weaker than expected. This is due to your condition. If we remove this condition by modifying compute.tr.t as:
compute.tr.t <- function(n, paccept) {
  n <- round(n / paccept)
  rawvars <- mvrnorm(n, mu=mu, Sigma=Sigma)
  pvars <- pnorm(rawvars)
  tr <- qexp(pvars[,1], 1/1)
  t <- qexp(pvars[,2], 1/2)
  return(data.frame(t=t,tr=tr))   ## no filtering: every pair is returned
}
then we get:
print(cor(res$tr,res$t))
##[1] -0.3814602
print(sum(res$tr>res$t)) # shows number of invalid cases
##[1] 3676
The correlation is now much more reasonable, but the number of invalid cases is large: 3676 of the 10000 pairs violate t > tr.
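To explore this trade-off further, one could compare the correlation before and after enforcing t > tr for several values of the correlation of the underlying bivariate normal. The sketch below is an illustrative assumption rather than part of the original answer: the helper cor_before_after, the sample size of 50000 and the grid of rho values are all hypothetical choices. As a sanity check, for independent exponentials (rho = 0) with rates 1 and 1/2 the acceptance probability P(t > tr) is 1 / (1 + 1/2) = 2/3.

```r
library(MASS)  # provides mvrnorm()

# Correlation of (tr, t) before and after enforcing t > tr, for a given
# correlation rho of the underlying bivariate normal.
cor_before_after <- function(rho, n = 50000) {
  Sigma <- matrix(c(1, rho, rho, 1), nrow = 2)
  pvars <- pnorm(mvrnorm(n, mu = c(0, 0), Sigma = Sigma))
  tr <- qexp(pvars[, 1], rate = 1)
  t  <- qexp(pvars[, 2], rate = 1/2)
  keep <- t > tr
  c(rho       = rho,
    cor_all   = cor(tr, t),               # no condition applied
    cor_valid = cor(tr[keep], t[keep]),   # only pairs with t > tr
    p_accept  = mean(keep))               # share of pairs that are valid
}

set.seed(1)
rho_grid <- c(-0.9, -0.5, 0)              # hypothetical grid of values to try
summary_table <- do.call(rbind, lapply(rho_grid, cor_before_after))
print(round(summary_table, 3))
```

Each row of summary_table shows how much of the unconditional correlation survives the condition, which helps judge whether a target such as -0.5 is reachable at all once t > tr is enforced.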