Calculate probability of value based on 2D density plot in R
Question
I'm trying to work out a function to calculate the likelihood of a given combination of B and R. The data currently look like this:
ggplot(df, aes(R,B)) +
geom_bin2d(binwidth = c(1,1))
Is there a way to calculate the probability of each combination (e.g. R = 23, B = 30) based on these two discrete, correlated variables, which are positively skewed?
Would it be possible to use stat_density_2d to solve this, or is there a better way?
Thanks.
Recommended answer
stat_density_2d uses MASS::kde2d under the hood. I imagine there are slicker ways to do this, but we can feed the data into that function and convert the result into tidy data to get a smoothed version for that type of estimate.
First, some data like yours:
library(tidyverse)
set.seed(42)
df <- tibble(
R = rlnorm(1E4, 0, 0.2) * 100,
B = R * rnorm(1E4, 1, 0.2)
)
ggplot(df, aes(R,B)) +
geom_bin2d(binwidth = c(1,1))
Here the density is estimated and converted into a tibble with the same coordinates as the data. (Are there better ways to do this?)
n = 201 # arbitrary grid size, chosen to be 1 more than the range below
# so the breaks are at integers
smooth <- MASS::kde2d(df$R, df$B, lims = c(0, 200, 0, 200),
# h = c(20,20), # could tweak bandwidth here
n = n)
df_smoothed <- smooth$z %>%
as_tibble() %>%
pivot_longer(cols = everything(), names_to = "col", values_to = "val") %>%
mutate(R = rep(smooth$x, each = n), # EDIT: fixed, these were swapped
B = rep(smooth$y, n))
df_smoothed
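On the parenthetical question above, one possible alternative is sketched here using only base R and MASS. It relies on kde2d() returning z with rows indexed by x and columns indexed by y, and on as.vector() unrolling a matrix column by column, so that x varies fastest, which is the same ordering expand.grid() produces. The name grid_df is made up for this sketch.

```r
library(MASS)

# Same simulated data as above, built with base R vectors
set.seed(42)
R <- rlnorm(1e4, 0, 0.2) * 100
B <- R * rnorm(1e4, 1, 0.2)

n <- 201
smooth <- kde2d(R, B, n = n, lims = c(0, 200, 0, 200))

# z[i, j] is the density at (x[i], y[j]); as.vector() walks down columns,
# so x varies fastest, matching the row order of expand.grid()
grid_df <- expand.grid(R = smooth$x, B = smooth$y)
grid_df$val <- as.vector(smooth$z)

# Spot check against the matrix: x[71] is 70 and y[101] is 100
subset(grid_df, R == 70 & B == 100)
smooth$z[71, 101]
```

Building the coordinates this way avoids having to track which of R and B needs each = n.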
df_smoothed now holds all the coordinates from 0:200 in the R and B dimensions, with the estimated density of each combination in the val column. Because the grid spacing is 1 in both directions, each density value can be read as the approximate probability of its 1 x 1 cell. These add up to 1, or nearly so (99.6% in this case). I think the remaining smidgen is the probability of coordinates outside the specified range.
sum(df_smoothed$val)
#[1] 0.9960702
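The values sum to roughly 1 only because the grid cells here are 1 x 1. kde2d() returns densities, and a density has to be multiplied by the cell area to approximate a probability. A minimal sketch of that scaling on a coarser grid follows; the n = 101 grid and the names coarse, cell_area and prob are choices made for this illustration.

```r
library(MASS)

set.seed(42)
R <- rlnorm(1e4, 0, 0.2) * 100
B <- R * rnorm(1e4, 1, 0.2)

# 101 points over 0..200 gives a spacing of 2 in each direction
coarse <- kde2d(R, B, n = 101, lims = c(0, 200, 0, 200))

# Area of one grid cell: x spacing times y spacing
cell_area <- diff(coarse$x)[1] * diff(coarse$y)[1]

sum(coarse$z)                 # raw densities: well short of 1
prob <- coarse$z * cell_area  # density * area ~ probability per cell
sum(prob)                     # close to 1 again
```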
The chance of any particular combination is just the density value at that point. So the chance of R = 70 and B = 100 would be about 0.0035%.
df_smoothed %>%
filter(R == 70, B == 100)
## A tibble: 1 x 4
# col val R B
# <chr> <dbl> <int> <int>
#1 V101 0.0000345 70 100
The chance of R between 50 and 100 and B between 50 and 100 would be 36.9%:
df_smoothed %>%
filter(R %>% between(50, 100),
B %>% between(50, 100)) %>%
summarize(total_val = sum(val))
## A tibble: 1 x 1
#total_val
#<dbl>
# 1 0.369
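As a rough sanity check on a smoothed figure like this, the same region can be compared with the raw share of simulated points that fall inside it. This is a sketch using base R and MASS only, under the same simulated data; the names kde_prob and empirical_prob are illustrative.

```r
library(MASS)

set.seed(42)
R <- rlnorm(1e4, 0, 0.2) * 100
B <- R * rnorm(1e4, 1, 0.2)

smooth <- kde2d(R, B, n = 201, lims = c(0, 200, 0, 200))

# Grid points inside the region (spacing is 1, so density ~ probability)
in_x <- smooth$x >= 50 & smooth$x <= 100
in_y <- smooth$y >= 50 & smooth$y <= 100
kde_prob <- sum(smooth$z[in_x, in_y])

# Share of the raw points that land in the same region
empirical_prob <- mean(R >= 50 & R <= 100 & B >= 50 & B <= 100)

c(kde = kde_prob, empirical = empirical_prob)
```

The two numbers will not match exactly, since the kernel estimate spreads mass across the region's edges, but a large gap would suggest a problem such as swapped coordinates.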
Here's how the smoothed and the original data look together:
ggplot() +
geom_tile(data = df_smoothed, aes(R, B, alpha = val), fill = "red") +
geom_point(data = df %>% sample_n(500), aes(R, B), size = 0.2, alpha = 1/5)
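If R and B really are discrete, a simpler alternative to kernel smoothing is the empirical joint frequency of each observed combination. This is a different technique from the kde2d approach above, not part of it. The sketch below rounds the simulated data to integers as an assumption, to mimic discrete variables; joint and lookup are hypothetical names.

```r
set.seed(42)
R <- round(rlnorm(1e4, 0, 0.2) * 100)
B <- round(R * rnorm(1e4, 1, 0.2))

# Joint relative frequencies: each cell is count / total
joint <- prop.table(table(R = R, B = B))

# Look up one combination by label; values never observed get 0
lookup <- function(r, b) {
  r <- as.character(r)
  b <- as.character(b)
  if (r %in% rownames(joint) && b %in% colnames(joint)) joint[r, b] else 0
}

lookup(100, 100)
lookup(23, 30)   # R = 23 never occurs in this sample, so this is 0
sum(joint)       # relative frequencies sum to 1
```

The trade-off is that unobserved combinations get probability exactly 0, whereas the smoothed estimate assigns them a small positive value.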