Why do stat_density (R; ggplot2) and gaussian_kde (Python; scipy) differ?

Question

I am attempting to produce a KDE based PDF estimate on a series of distributions that may not be normally distributed.

I like the way ggplot's stat_density in R seems to recognize every incremental bump in frequency, but cannot replicate this via Python's scipy-stats-gaussian_kde method, which seems to oversmooth.

I have set up my R code as follows:

ggplot(test, aes(x = Val, color = as.factor(Class), group = as.factor(Class))) +
  stat_density(geom = 'line', kernel = 'gaussian', bw = 'nrd0',  # nrd0 = 'Silverman'
               size = 1, position = 'identity')

And my python code is:

from scipy import stats

kde = stats.gaussian_kde(data.ravel())
kde.set_bandwidth(bw_method='silverman')

The stats docs show here that nrd0 is the Silverman method for bandwidth adjustment.

Based on the code above, I am using the same kernels (gaussian) and bandwidth methods (Silverman).

Can anyone explain why the results are so different?

Answer

There seems to be disagreement about what Silverman's Rule is. TL;DR - scipy uses a worse version of the rule that only works well on unimodal data that is normally distributed. R uses a better version that is "the best of both worlds" and works "for a wide range of densities".

The scipy docs say that Silverman's rule is implemented as:

def silverman_factor(self):
    return power(self.neff*(self.d+2.0)/4.0, -1./(self.d+4))

Where d is the number of dimensions (1, in your case) and neff is the effective sample size (number of points, assuming no weights). So the scipy bandwidth factor is (n * 3 / 4) ^ (-1 / 5), and the kernel bandwidth is that factor times the standard deviation, which is computed separately.
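
As a quick sanity check, that formula can be reproduced directly (a minimal sketch; the sample data here is made up purely for illustration):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(size=500)              # made-up 1-D sample, for illustration only

kde = stats.gaussian_kde(data)
kde.set_bandwidth(bw_method='silverman')

n = kde.neff                             # effective sample size (== len(data) with no weights)
factor = (n * 3.0 / 4.0) ** (-1.0 / 5.0)

print(kde.silverman_factor())            # scipy's Silverman factor
print(factor)                            # (n * 3/4) ^ (-1/5): the same number
print(factor * data.std(ddof=1))         # the kernel bandwidth scipy actually applies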

By contrast, R's stats package docs describe Silverman's method as "0.9 times the minimum of the standard deviation and the interquartile range divided by 1.34 times the sample size to the negative one-fifth power", which can also be verified in the R source; typing bw.nrd0 in the console gives:

function (x) 
{
    if (length(x) < 2L) 
        stop("need at least 2 data points")
    hi <- sd(x)
    if (!(lo <- min(hi, IQR(x)/1.34))) 
        (lo <- hi) || (lo <- abs(x[1L])) || (lo <- 1)
    0.9 * lo * length(x)^(-0.2)
}
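
For a side-by-side comparison in Python, here is a minimal sketch of that rule (the function name nrd0 is mine, and it assumes np.percentile's default interpolation is an acceptable stand-in for R's default quantile method):

import numpy as np

def nrd0(x):
    """Approximate R's bw.nrd0: 0.9 * min(sd, IQR/1.34) * n^(-1/5)."""
    x = np.asarray(x, dtype=float).ravel()
    if x.size < 2:
        raise ValueError("need at least 2 data points")
    hi = x.std(ddof=1)                              # R's sd() uses the n-1 denominator
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    lo = min(hi, iqr / 1.34)
    if lo == 0:                                     # fallbacks mirroring the R source
        lo = hi or abs(x[0]) or 1.0
    return 0.9 * lo * x.size ** (-0.2)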

Wikipedia, on the other hand, gives "Silverman's Rule of Thumb" as one of many possible names for the estimator:

1.06 * sigma * n ^ (-1 / 5)

The Wikipedia version is equivalent to the scipy version.
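
The equivalence is easy to check numerically; the constant in the scipy form works out to the ~1.06 in the Wikipedia form:

print((3 / 4) ** (-1 / 5))   # ~1.0592, i.e. the 1.06 constant above
print((4 / 3) ** (1 / 5))    # the same constant, written the other way around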

All three sources (scipy docs, Wikipedia, and R docs) cite the same original reference: Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. London: Chapman & Hall/CRC. p. 48. ISBN 978-0-412-24620-3. Wikipedia and R specifically cite page 48, whereas scipy's docs do not mention a page number. (I have submitted an edit to Wikipedia to update its page reference to p.45, see below.)

Reading the Silverman paper, on page 45, equation 3.28 is what is used in the Wikipedia article: (4 / 3) ^ (1 / 5) * sigma * n ^ (-1 / 5) ~= 1.06 * sigma * n ^ (-1 / 5). Scipy uses the same method, rewriting (4 / 3) ^ (1 / 5) as the equivalent (3 / 4) ^ (-1 / 5). Silverman describes this method:

While (3.28) will work well if the population really is normally distributed, it may oversmooth somewhat if the population is multimodal... as the mixture becomes more strongly bimodal the formula (3.28) will oversmooth more and more, relative to the optimal choice of smoothing parameter.

The scipy docs reference this weakness, stating:

It includes automatic bandwidth determination. The estimation works best for a unimodal distribution; bimodal or multi-modal distributions tend to be oversmoothed.

However, Silverman's text continues, improving on the method scipy uses and arriving at the method used by R and Stata. On page 48, we get equation 3.31:

h = 0.9 * A * n ^ (-1 / 5)
# A defined on previous page, eqn 3.30
A = min(standard deviation, interquartile range / 1.34)

Silverman describes this method as:

The best of both possible worlds... In summary, the choice ([eqn] 3.31) for the smoothing parameter will do very well for a wide range of densities and is trivial to evaluate. For many purposes it will certainly be an adequate choice of window width, and for others it will be a good starting point for subsequent fine-tuning.

So, it seems Wikipedia and Scipy use a simple version of an estimator proposed by Silverman with known weaknesses. R and Stata use a better version.
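
If the goal is to reproduce R's behaviour from Python, one workable approach (a sketch, not a scipy built-in; the helper name kde_nrd0 is mine) is to compute the h from equation 3.31 yourself and convert it into the scale factor gaussian_kde expects, i.e. divide it by the sample standard deviation:

import numpy as np
from scipy import stats

def kde_nrd0(x):
    """gaussian_kde with an R-style nrd0 bandwidth (assumes 1-D data with sd > 0)."""
    x = np.asarray(x, dtype=float).ravel()
    sd = x.std(ddof=1)
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    h = 0.9 * min(sd, iqr / 1.34) * x.size ** (-0.2)   # Silverman eqn 3.31
    # scipy's kernel std is factor * sample std, so convert h into that factor.
    return stats.gaussian_kde(x, bw_method=h / sd)

The resulting density should track ggplot2's stat_density(bw = 'nrd0') more closely on multimodal data than scipy's 'silverman' setting does.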
