Probability heatmap in ggplot


I asked this question a year ago and got code for this "probability heatmap":

numbet <- 32
numtri <- 1e5
prob=5/6
#Fill a matrix 
xcum <- matrix(NA, nrow=numtri, ncol=numbet+1)
for (i in 1:numtri) {
x <- sample(c(0,1), numbet, prob=c(prob, 1-prob), replace = TRUE)
xcum[i, ] <- c(i, cumsum(x)/cumsum(1:numbet))
}
colnames(xcum) <- c("trial", paste("bet", 1:numbet, sep=""))

mxcum <- reshape(data.frame(xcum), varying=1+1:numbet, 
idvar="trial", v.names="outcome", direction="long", timevar="bet")


library(plyr)
mxcum2 <- ddply(mxcum, .(bet, outcome), nrow)
mxcum3 <- ddply(mxcum2, .(bet), summarize, 
            ymin=c(0, head(seq_along(V1)/length(V1), -1)), 
            ymax=seq_along(V1)/length(V1),
            fill=(V1/sum(V1)))
head(mxcum3)

library(ggplot2)

p <- ggplot(mxcum3, aes(xmin=bet-0.5, xmax=bet+0.5, ymin=ymin, ymax=ymax)) + 
geom_rect(aes(fill=fill), colour="grey80") + 
scale_fill_gradient("Outcome", formatter="percent", low="red", high="blue") +
scale_y_continuous(formatter="percent") +
xlab("Bet")

print(p)

(This code needed slight changes to run on newer ggplot2 releases: the original formatter="percent" arguments were removed from ggplot2, so labels = scales::percent is used above instead.)

This is almost exactly what I want, except each vertical shaft should have a different number of bins, i.e. the first should have 2, the second 3, the third 4 (N+1). In the graph, shafts 6 and 7 have the same number of bins (7), where shaft 7 should have 8 (N+1).

If I'm right, the code does this because it plots the observed data, and if I ran more trials we would get more bins. I don't want to rely on the number of trials to get the correct number of bins.

How can I adapt this code to give the correct number of bins?

Solution

I have used R's dbinom to generate the frequency of heads for n = 1:32 trials and plotted the graph; it should be what you expect. I have read some of your earlier posts here on SO and on math.stackexchange. Still, I don't understand why you'd want to simulate the experiment rather than generate from a binomial R.V. If you could explain it, that would be great! I'll try to work on the simulated solution from @Andrie to check whether I can match the output shown below. For now, here's something you might be interested in.

set.seed(42)
numbet <- 32
numtri <- 1e5
prob=5/6

require(plyr)
out <- ldply(1:numbet, function(idx) {
    # exact probabilities of idx:0 tails (i.e. 0:idx heads) out of idx bets
    outcome <- dbinom(idx:0, size=idx, prob=prob)
    bet     <- rep(idx, length(outcome))
    # expected counts; only its length (idx + 1) is used for the bin edges below
    N       <- round(outcome * numtri)
    ymin    <- c(0, head(seq_along(N)/length(N), -1))
    ymax    <- seq_along(N)/length(N)
    data.frame(bet, fill=outcome, ymin, ymax)
})

require(ggplot2)
p <- ggplot(out, aes(xmin=bet-0.5, xmax=bet+0.5, ymin=ymin, ymax=ymax)) + 
geom_rect(aes(fill=fill), colour="grey80") + 
scale_fill_gradient("Outcome", low="red", high="blue") +
xlab("Bet")
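
To display the plot object, print it as in the question's code:

print(p)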

The plot:

Edit: Explanation of how your old code from Andrie works and why it doesn't give what you intend.

Basically, what Andrie did (or rather, one way to look at it) is to use the fact that if you have two independent binomial distributions, X ~ B(n, p) and Y ~ B(m, p), where n, m = size and p = probability of success, then their sum X + Y ~ B(n + m, p) (1). So, the purpose of xcum is to obtain the outcome for all n = 1:32 tosses, but to explain it better, let me construct the code step by step. Along with the explanation, the code for xcum will also become obvious, and it can be constructed in no time (without any need for a for-loop or for building a cumsum every time).
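
As a quick sanity check of (1) (my addition, not part of the original answer), you can compare the distribution of a sum of two independent binomial samples against the corresponding single binomial distribution:

# Empirical check of (1): X ~ B(2, p) and Y ~ B(3, p) independent => X + Y ~ B(5, p)
set.seed(1)
p <- 1/6
x <- rbinom(1e5, size = 2, prob = p)
y <- rbinom(1e5, size = 3, prob = p)
round(table(x + y) / 1e5, 3)               # observed proportions of the sum
round(dbinom(0:5, size = 5, prob = p), 3)  # theoretical B(5, p) probabilities
# (a very rare top bin could be absent from the table; that's exactly the
# sampling issue discussed further below)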

If you have followed me so far, then our idea is first to create a numtri * numbet matrix, with each column (length = numtri) having 0's and 1's with probability 5/6 and 1/6 respectively. That is, if you have numtri = 1000, then you'll have ~834 0's and ~166 1's in each of the numbet columns (= 32 here). Let's construct this and test it first.

numtri <- 1e3
numbet <- 32
set.seed(45)
xcum <- t(replicate(numtri, sample(0:1, numbet, prob=c(5/6,1/6), replace = TRUE)))

# check for count of 1's
> apply(xcum, 2, sum)
[1] 169 158 166 166 160 182 164 181 168 140 154 142 169 168 159 187 176 155 151 151 166 
163 164 176 162 160 177 157 163 166 146 170

# So, the count of 1's are "approximately" what we expect (around 166).

Now, each of these columns holds numtri draws from a binomial distribution with n = 1. If we were to add the first two columns and replace the second column with this sum, then, from (1), since the probabilities are equal, we'd end up with a binomial distribution with n = 2. Similarly, if you instead added the first three columns and replaced the 3rd column with this sum, you would obtain a binomial distribution with n = 3, and so on. The concept is that if you cumulatively add each column, then you end up with numbet binomial distributions (n = 1 to 32 here). So, let's do that.

xcum <- t(apply(xcum, 1, cumsum))

# you can verify that the second column has similar probabilities by this:
# calculate the frequency of all values in 2nd column.
> table(xcum[,2])
  0   1   2 
694 285  21 

> round(numtri * dbinom(2:0, 2, prob=5/6))
[1] 694 278  28
# more or less identical, good!

If you divide the xcum we've generated thus far by cumsum(1:numbet) along each row, in this manner:

xcum <- xcum/matrix(rep(cumsum(1:numbet), each=numtri), ncol = numbet)

this will be identical to the xcum matrix that comes out of the for-loop (if you generate it with the same seed). However, I don't quite understand the reason for this division by Andrie, as it is not necessary to generate the graph you require; I suppose it has something to do with the frequency values you talked about in an earlier post on math.stackexchange.
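
Here's a minimal sketch (my addition) of that equivalence check; it assumes both constructions consume the random stream identically, which they do when run from the same seed:

# Vectorized construction
numtri <- 1e3; numbet <- 32
set.seed(45)
xvec <- t(replicate(numtri, sample(0:1, numbet, prob = c(5/6, 1/6), replace = TRUE)))
xvec <- t(apply(xvec, 1, cumsum))
xvec <- xvec / matrix(rep(cumsum(1:numbet), each = numtri), ncol = numbet)

# Original for-loop construction (trial-id column dropped)
set.seed(45)
xloop <- matrix(NA, nrow = numtri, ncol = numbet)
for (i in 1:numtri) {
  x <- sample(0:1, numbet, prob = c(5/6, 1/6), replace = TRUE)
  xloop[i, ] <- cumsum(x) / cumsum(1:numbet)
}

all.equal(xvec, xloop)  # should be TRUE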

Now on to why you have difficulties obtaining the graph I had attached (with n+1 bins):

For a binomial distribution with n=1:32 trials, 5/6 as probability of tails (failures) and 1/6 as the probability of heads (successes), the probability of k heads is given by:

nCk * (1/6)^k * (5/6)^(n-k) # where nCk is n choose k
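
As a quick check (my addition) that this closed form matches R's dbinom:

# closed-form pmf vs dbinom, n = 7, k = 0:7 heads with P(head) = 1/6
n <- 7; k <- 0:7
manual <- choose(n, k) * (1/6)^k * (5/6)^(n - k)
all.equal(manual, dbinom(k, size = n, prob = 1/6))  # TRUE
# note dbinom(k, n, 1/6) == dbinom(n - k, n, 5/6), matching the calls below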

For the test data we've generated (numtri = 1000), the observed proportions of k heads for n = 7 and n = 8 (trials) come out as:

# n=7
   0    1    2     3     4     5 
.278 .394 .233  .077  .016  .002 

# n=8
   0    1    2    3     4      5 
.229 .375 .254 .111  .025   .006 

Why do they both have 6 bins, and not 8 and 9 bins? Of course this has to do with the value of numtri = 1000. To understand why this happens, let's look at the probabilities of each of these 8 and 9 bins by generating them directly from the binomial distribution using dbinom.

# n = 7
dbinom(7:0, 7, prob=5/6)
# output rounded to 3 decimal places
[1] 0.279 0.391 0.234 0.078 0.016 0.002 0.000 0.000

# n = 8
dbinom(8:0, 8, prob=5/6)
# output rounded to 3 decimal places
[1] 0.233 0.372 0.260 0.104 0.026 0.004 0.000 0.000 0.000

You see that the probabilities corresponding to k = 6,7 (for n = 7) and k = 6,7,8 (for n = 8) are ~0; their values are very low. The minimum here is actually (1/6)^8 ≈ 5.95 * 1e-7 (n = 8, k = 8). This means that you'd have a chance of getting 1 such value only if you simulated roughly 1/(5.95 * 1e-7) ≈ 1.7e6 times. If you check the same for n = 32 and k = 32, the value is about 1.256493 * 1e-25, so you'd have to simulate that many values to get at least 1 result where all 32 outcomes are heads.
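
To make those numbers concrete (my addition):

dbinom(8, 8, prob = 1/6)        # (1/6)^8  ~ 5.95e-07  (n = 8, all heads)
1 / dbinom(8, 8, prob = 1/6)    # ~ 1.7e6 simulations on average to see it once
dbinom(32, 32, prob = 1/6)      # ~ 1.26e-25 (n = 32, all heads)
1 / dbinom(32, 32, prob = 1/6)  # ~ 8e24 simulations on average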

This is why your results had no values for certain bins: the probability of landing in them is very low for the given numtri. For the same reason, generating the probabilities directly from the binomial distribution overcomes this problem/limitation.

I hope I've managed to write with enough clarity for you to follow. Let me know if you have trouble going through it.

Edit 2: When I simulated the code I've just edited above with numtri = 1e6 and counted the number of heads for k = 0:7 and k = 0:8, I get this for n = 7 and n = 8:

# n = 7
     0      1      2      3      4      5      6      7 
279347 391386 233771  77698  15763   1915    117      3 

# n = 8
     0      1      2      3      4      5      6      7      8 
232835 372466 259856 104116  26041   4271    392     22      1 

Note that there are now values for k = 6 and k = 7 for n = 7 and n = 8. Also, for n = 8, you have a value of 1 for k = 8. With increasing numtri you'll obtain more of the other missing bins, but it'll require a huge amount of time/memory (if it's feasible at all).
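
For reference, a sketch (my addition) of how such counts can be reproduced with the simulation approach above; exact counts depend on the seed, so they will only be close to the tables shown, and the replicate call is slow at numtri = 1e6:

numtri <- 1e6
set.seed(45)
# column j of 'heads' = number of heads after j bets
heads <- t(replicate(numtri, cumsum(sample(0:1, 8, prob = c(5/6, 1/6), replace = TRUE))))
table(heads[, 7])   # counts of k = 0, 1, ... heads for n = 7
table(heads[, 8])   # counts for n = 8
# equivalently (and much faster): table(rbinom(numtri, 7, prob = 1/6))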
