Differing quantiles: Boxplot vs. Violinplot


Question

library(ggplot2)
library(cowplot)  # provides background_grid()
d <- iris

ggplot2::ggplot(d, aes(factor(0), Sepal.Length)) + 
    geom_violin(fill="black", alpha=0.2, draw_quantiles = c(0.25, 0.5, 0.75)
                , colour = "red", size = 1.5) +
    stat_boxplot(geom ='errorbar', width = 0.1)+
    geom_boxplot(width = 0.2)+
    facet_grid(. ~ Species, scales = "free_x") +
    xlab("") + 
    ylab("Value") +
    coord_cartesian(ylim = c(3.5,9.5)) + 
    scale_y_continuous(breaks = seq(4, 9, 1)) + 
    theme(axis.text.x=element_blank(),
          axis.text.y = element_text(size = rel(1.5)),
          axis.ticks.x = element_blank(),
          strip.background=element_rect(fill="black"),
          strip.text=element_text(color="white", face="bold"),
          legend.position = "none") +
    background_grid(major = "xy", minor = "none") 

To my knowledge, the box ends in a boxplot represent the 25% and 75% quantiles, and the median is the 50% quantile. So they should be equal to the 0.25/0.5/0.75 quantiles that geom_violin draws when given the argument draw_quantiles = c(0.25, 0.5, 0.75).

The median and the 50% quantile match. However, neither the 0.25 nor the 0.75 quantile matches the box ends of the boxplot (see the figure, especially the 'virginica' facet).

References:

  1. http://docs.ggplot2.org/current/geom_violin.html
  2. http://docs.ggplot2.org/current/geom_boxplot.html

Recommended answer

This is too long for a comment, so I am posting it as an answer. I see two potential sources for the divergence. First, my understanding is that the boxplot refers to boxplot.stats, which uses hinges that are very similar, but not necessarily identical, to the quantiles. ?boxplot.stats says:

The two ‘hinges’ are versions of the first and third quartile, i.e., close to quantile(x, c(1,3)/4). The hinges equal the quartiles for odd n (where n <- length(x)) and differ for even n. Whereas the quartiles only equal observations for n %% 4 == 1 (n = 1 mod 4), the hinges do so additionally for n %% 4 == 2 (n = 2 mod 4), and are in the middle of two observations otherwise.

The hinge vs. quantile distinction could thus be one source of the difference.
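To make the hinge vs. quantile distinction concrete, here is a minimal sketch in base R. It relies on boxplot.stats taking its hinges from fivenum(), and it uses the default quantile() method (type 7). The variable names are only illustrative. Note that the geom_boxplot help page describes its hinges as the 25th and 75th percentiles and says this differs slightly from boxplot(), so this sketch illustrates the base R distinction rather than what ggplot2 necessarily computes.

```r
# Compare Tukey's hinges with the default quantiles for one group.
x <- iris$Sepal.Length[iris$Species == "virginica"]  # n = 50, so n %% 4 == 2

hinges    <- fivenum(x)[c(2, 4)]         # lower and upper hinge
quartiles <- quantile(x, c(0.25, 0.75))  # type 7 interpolation (the default)

print(hinges)
print(quartiles)

# A tiny case that is easy to check by hand: for 1:10 the lower hinge is the
# observation 3, while the default lower quartile interpolates to 3.25.
toy_hinge    <- fivenum(1:10)[2]
toy_quartile <- unname(quantile(1:10, 0.25))
print(c(hinge = toy_hinge, quartile = toy_quartile))
```

For even n the two definitions can disagree by a fraction of the gap between neighbouring observations, which is small compared with the offsets one might see against a density-based quantile.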

Second, geom_violin refers to a density estimate. The geom_violin source code points to StatYdensity, which in turn leads to a function compute_density. I could not find that function, but I think (also based on some pointers in the help files) that it is essentially density, which by default uses a Gaussian kernel to estimate the density. This may (or may not) explain the differences, but the following two calls

by(d$Sepal.Length, d$Species, function(x) boxplot.stats(x, coef=5)$stats )
by(d$Sepal.Length, d$Species, function(v) quantile(density(v)$x))

do indeed show differing values. So I would guess that the difference comes down to whether we look at quantiles based on the empirical distribution function of the observations or on a kernel density estimate, though I admit that I have not shown this conclusively.
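One way to probe this guess more directly is to read the quantiles off the cumulative version of the kernel density estimate and put them next to the empirical quantiles. The sketch below is only an approximation under stated assumptions: kde_quantile is a hypothetical helper, the cumulative distribution is a crude running sum over the density grid, and restricting the grid to the data range is an assumption about how the violin is trimmed, not a confirmed description of ggplot2 internals.

```r
# Quantiles read off a kernel density estimate instead of the raw observations.
kde_quantile <- function(v, probs) {
  # Assumption: evaluate the density only over the range of the data.
  dens <- density(v, from = min(v), to = max(v))  # Gaussian kernel by default
  cdf  <- cumsum(dens$y) / sum(dens$y)            # crude cumulative distribution on the grid
  # Invert the cumulative distribution by linear interpolation.
  approx(cdf, dens$x, xout = probs, ties = "ordered")$y
}

probs      <- c(0.25, 0.5, 0.75)
by_species <- split(iris$Sepal.Length, iris$Species)

empirical <- sapply(by_species, quantile, probs = probs)      # from the observations
from_kde  <- sapply(by_species, kde_quantile, probs = probs)  # from the density estimate

print(empirical)
print(from_kde)
```

If the two tables differ mainly in the 25% and 75% rows while the medians stay close, that would be consistent with the explanation above, although it still does not prove what the plotting code does.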
