Error using geom_density_2d() in R: Computation failed in `stat_density2d()`: bandwidths must be strictly positive

Problem description

In an attempt to make a test 2D density plot with ggplot2, I used this code snippet:

ggplot(df, aes(x = S1.x, y = S1.y)) + geom_point() + geom_density_2d()

and I got the error: "Computation failed in stat_density2d(): bandwidths must be strictly positive"

My dataframe looks like this:

> df

transcriptID S1.x      S1.y      S2.x       S2.y    
DQ459412     0.000000  0.000000  0.000000   0.000000
DQ459413     1.584963  2.358379  4.392317   3.085722    
DQ459415     0.000000  0.000000  0.000000   0.000000    
DQ459418     0.000000  0.000000  0.000000   0.000000    
DQ459419     0.000000  0.000000  4.000000   2.891544    
DQ459420     0.000000  0.000000  0.000000   0.000000      

Also, var(df[,"S1.x"]) > 0 and var(df[,"S1.y"]) > 0.

Fig 1 - 2d density plot with error

However, I got a density plot without error by running:

ggplot(df, aes(x = S2.x, y = S2.y)) + geom_point() + geom_density_2d()

Fig 2 - density plot without error

How do I address the error in Fig 1?

Solution

So the real problem is that the S1.x and S1.y values only have one non-zero value in their columns. And it turns out that geom_density_2d can't really estimate a density with only a value or two. But read on...

Update:

This question has been asked before, and the answers are usually that you need to have non-zero variance in your data columns. But you do have non-zero variance, so why isn't it working?

  • Looking at the internals of geom_density_2d we see that it uses the kde2d function from the MASS package to calculate the distribution.
  • Looking at kde2d we see that, when no bandwidth is supplied, it uses MASS::bandwidth.nrd() on the x and y data to estimate one.
  • Looking at the help (which has the code) for bandwidth.nrd we see it uses a rule of thumb that takes the 25% and 75% quantiles of the data and subtracts the first from the second (the interquartile range) to get a bandwidth estimate.
  • Calling quantile() on your original data shows that the 25% and 75% quantiles are both zero, so that difference is zero.
  • And running MASS::kde2d on your original data with that bandwidth.nrd estimate of the bandwidth gives you the same error:

library(MASS)
nn <- c("DQ459412","DQ459413","DQ459415","DQ459418","DQ459419","DQ459420")
s1x <- c(0,1.584963,0,0,0,0)
s1y <- c(0,2.358379,0,0,0,0) 
s2x <- c(0,4.392317,0,0,4,0)
s2y <- c(0,3.085722,0,0,2.891544,0) 
df <- data.frame(transcriptID=nn,S1.x=s1x,S1.y=s1y,S2.x=s2x,S2.y=s2y)

> quantile(df$S1.x)
      0%      25%      50%      75%     100% 
0.000000 0.000000 0.000000 0.000000 1.584963 
> quantile(df$S1.y)
      0%      25%      50%      75%     100% 
0.000000 0.000000 0.000000 0.000000 2.358379 

# Both bandwidth estimates come out as 0 for this data
h <- c(MASS::bandwidth.nrd(df$S1.x), MASS::bandwidth.nrd(df$S1.y))
# n = 100 is an assumed grid size; any positive value triggers the same error
dens <- MASS::kde2d(df$S1.x, df$S1.y, h = h, n = 100, lims = c(0, 1, 0, 1))

Error in MASS::kde2d(df$S1.x, df$S1.y, h = h, n = 100, lims = c(0, 1, 0, 1)) : bandwidths must be strictly positive

So the real criterion for using geom_density_2d is that both the x- and the y-data need a non-zero gap between their 25% and 75% quantiles, that is, a non-zero interquartile range. Non-zero variance alone is not enough.
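
To make that criterion concrete, here is a minimal sketch of the rule of thumb as it appears in the help page for bandwidth.nrd. The function name nrd_sketch is hypothetical and exists only for illustration; the real function is MASS::bandwidth.nrd.

```r
library(MASS)

# Hypothetical re-implementation of the rule of thumb, for illustration only.
# The real function is MASS::bandwidth.nrd (see ?bandwidth.nrd).
nrd_sketch <- function(x) {
  q <- quantile(x, c(0.25, 0.75))        # 25% and 75% quantiles
  iqr_scaled <- (q[2] - q[1]) / 1.34     # scaled interquartile range
  # The smaller of the standard deviation and the scaled IQR drives the result
  4 * 1.06 * min(sd(x), iqr_scaled) * length(x)^(-1 / 5)
}

s1x <- c(0, 1.584963, 0, 0, 0, 0)   # S1.x from the question
s2x <- c(0, 4.392317, 0, 0, 4, 0)   # S2.x from the question

bw_s1 <- nrd_sketch(s1x)   # interquartile range is 0, so the bandwidth is 0
bw_s2 <- nrd_sketch(s2x)   # interquartile range is positive, so is the bandwidth
```

With a single non-zero value among six, the 75% quantile of S1.x is still zero, so the estimate collapses to zero even though the variance is positive. S2.x has two non-zero values, which is enough to push the 75% quantile above zero, and that is why the second plot worked.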

Now to fix it, if I make a small modification - replacing one of the zeros with 0.1, like this:

nn <- c("DQ459412","DQ459413","DQ459415","DQ459418","DQ459419","DQ459420")
s1x <- c(0,1.584963,0,0,0.1,0)
s1y <- c(0,2.358379,0,0,0.1,0) 
s2x <- c(0,4.392317,0,0,4,0)
s2y <- c(0,3.085722,0,0,2.891544,0) 
df <- data.frame(transcriptID=nn,S1.x=s1x,S1.y=s1y,S2.x=s2x,S2.y=s2y)
print(df)

yielding:

  transcriptID     S1.x     S1.y     S2.x     S2.y
1     DQ459412 0.000000 0.000000 0.000000 0.000000
2     DQ459413 1.584963 2.358379 4.392317 3.085722
3     DQ459415 0.000000 0.000000 0.000000 0.000000
4     DQ459418 0.000000 0.000000 0.000000 0.000000
5     DQ459419 0.100000 0.100000 4.000000 2.891544
6     DQ459420 0.000000 0.000000 0.000000 0.000000

Then I get this plot instead of your error.

You can let that 0.1 value approach zero, but eventually a distribution can no longer be calculated and you will get your error again.
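
As a rough illustration of that limit, the sketch below recomputes the bandwidth estimate while the replaced value shrinks. The vector eps_values is an arbitrary choice made for this example, not something from the original data.

```r
library(MASS)

# Arbitrary, illustrative values for the entry that replaced one of the zeros
eps_values <- c(0.1, 0.01, 1e-4, 1e-8, 0)

# Bandwidth estimate for S1.x with the fifth entry set to each eps in turn
bw <- sapply(eps_values, function(eps) {
  bandwidth.nrd(c(0, 1.584963, 0, 0, eps, 0))
})

print(data.frame(eps = eps_values, bandwidth = bw))
```

Under this rule of thumb the estimate shrinks together with the replaced value and is exactly zero once that value is zero, which is the case kde2d rejects.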

One general way to deal with this situation is to add a very small amount of noise to your data. The idea is that any meaningful calculation based on real measurements from a continuous distribution should be insensitive to noise that small.
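
A minimal sketch of the noise approach, assuming Gaussian noise with a standard deviation of 0.01. Both the noise model and the value of noise_sd are assumptions made for this example; choose a scale well below the precision of your own measurements.

```r
library(MASS)

set.seed(1)  # make the added noise reproducible

df <- data.frame(
  S1.x = c(0, 1.584963, 0, 0, 0, 0),
  S1.y = c(0, 2.358379, 0, 0, 0, 0)
)

noise_sd <- 0.01  # assumed noise scale, not taken from the original answer

df_noisy <- df
df_noisy$S1.x <- df$S1.x + rnorm(nrow(df), sd = noise_sd)
df_noisy$S1.y <- df$S1.y + rnorm(nrow(df), sd = noise_sd)

# Bandwidth estimates before and after adding noise
h_before <- c(bandwidth.nrd(df$S1.x), bandwidth.nrd(df$S1.y))
h_after  <- c(bandwidth.nrd(df_noisy$S1.x), bandwidth.nrd(df_noisy$S1.y))

# Build the plot only if ggplot2 is installed; call print(p) to draw it
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(df_noisy, ggplot2::aes(x = S1.x, y = S1.y)) +
    ggplot2::geom_point() +
    ggplot2::geom_density_2d()
}
```

Because the noisy values are all distinct, the 25% and 75% quantiles differ and both bandwidth estimates become positive. Base R's jitter() is another way to add this kind of noise.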

Hope that helps.
