Coloring boxplot outlier points in ggplot2?


Problem description


How can I color the outlier points in ggplot2? I want them to be the same color as the boxplot itself. colour= is not enough to do this.

Example:

p <- ggplot(mtcars, aes(factor(cyl), mpg))
p + geom_boxplot(aes(colour=factor(cyl)))

I want to color the outliers by factor(cyl) as well. This does not work:

> p <- ggplot(mtcars, aes(factor(cyl), mpg))
> p + geom_boxplot(aes(colour=factor(cyl), outlier.colour=factor(cyl)))

Solution

In order to color the outlier points the same as your boxplots, you're going to need to calculate the outliers and plot them separately. As far as I know, the built-in option for coloring outliers colors all outliers the same color.
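
For readers on a newer ggplot2 release, the situation may be simpler than described above: the 'geom_boxplot' documentation states that setting the outlier aesthetics to NULL makes them inherit from the aesthetics used for the box. A minimal sketch, assuming a ggplot2 version where that behaviour is documented (the name 'p_inherit' is only illustrative):

```r
library(ggplot2)

# Assumption: a ggplot2 version whose geom_boxplot() documents
# outlier.colour = NULL as "inherit from the box".
p_inherit <- ggplot(mtcars, aes(x = factor(cyl), y = mpg, colour = factor(cyl))) +
  geom_boxplot(outlier.colour = NULL)
```

Printing 'p_inherit' draws the plot. On versions where the outliers still come out in a single color, the manual approach in the rest of this answer applies.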

The help file example

Using the same data as the 'geom_boxplot' help file:

ggplot(mtcars, aes(x=factor(cyl), y=mpg, col=factor(cyl))) +
    geom_boxplot()

Coloring the outlier points

There may be a more streamlined way to do this, but I prefer to calculate things by hand so I don't have to guess what's going on under the hood. Using the 'plyr' package, we can quickly get the upper and lower limits used by the default (Tukey) method, which treats any point outside the range [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] as an outlier. Q1 and Q3 are the 1/4 and 3/4 quantiles of the data, and IQR = Q3 - Q1. We could write this all as one huge statement, but since the 'plyr' package's 'mutate' function lets us reference newly created columns, we might as well split it up for easier reading and debugging, like so:

library(plyr)
plot_Data <- ddply(mtcars, .(cyl), mutate, Q1=quantile(mpg, 1/4), Q3=quantile(mpg, 3/4), IQR=Q3-Q1, upper.limit=Q3+1.5*IQR, lower.limit=Q1-1.5*IQR)

We use the 'ddply' function because we are inputting a data frame and want a data frame as output ("d->d" ply). The 'mutate' function in the above 'ddply' statement preserves the original data frame and adds the additional columns, and the specification .(cyl) tells 'ddply' to perform the calculations separately for each group of 'cyl' values.
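
To make the Tukey rule concrete, here is the same calculation done by hand for a single group, using only base R. This is a sketch for checking the numbers, not part of the plotting code; the variable names ('mpg4', 'q', 'iqr', 'lower', 'upper') are illustrative.

```r
# mpg values for the 4-cylinder cars only
mpg4 <- mtcars$mpg[mtcars$cyl == 4]

# Q1 and Q3 with quantile()'s default method
q <- quantile(mpg4, c(1/4, 3/4), names = FALSE)
iqr <- q[2] - q[1]

# Tukey fences: anything outside [lower, upper] counts as an outlier
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Logical vector flagging the outliers in this group
is_outlier <- mpg4 < lower | mpg4 > upper
```

The resulting 'lower' and 'upper' should agree with the 'lower.limit' and 'upper.limit' values that 'ddply' produces for the 'cyl' == 4 rows.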

At this point, we can now plot the boxplot and then overwrite the outliers with new, colored points.

ggplot() +
    geom_boxplot(data=plot_Data, aes(x=factor(cyl), y=mpg, col=factor(cyl))) + 
    geom_point(data=plot_Data[plot_Data$mpg > plot_Data$upper.limit | plot_Data$mpg < plot_Data$lower.limit,], aes(x=factor(cyl), y=mpg, col=factor(cyl)))

What the code does is start from an empty 'ggplot' object and then add the boxplot and point geometries, each with its own data. The boxplot geometry could use the original data frame, but I am using our new 'plot_Data' to be consistent. The point geometry then plots only the outlier points, using our new 'lower.limit' and 'upper.limit' columns to determine outlier status. Since we use the same specification for the 'x' and 'col' aesthetic arguments, the colors are magically matched between the boxplots and the corresponding outlier points.
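
The long subsetting expression inside 'geom_point' can be easier to read if the outliers are pulled into their own data frame first. The sketch below is one possible variant, not the original answer's code: it computes the limits with base R's 'ave' only to stay self-contained, and it passes 'outlier.shape = NA' (documented in 'geom_boxplot' as a way to hide the built-in outlier points) so each outlier is drawn once rather than twice. The names 'outliers' and 'p_manual' are illustrative.

```r
library(ggplot2)

# Per-group limits with base R's ave(), standing in for ddply/mutate
plot_Data <- mtcars
plot_Data$Q1 <- ave(plot_Data$mpg, plot_Data$cyl,
                    FUN = function(v) quantile(v, 1/4))
plot_Data$Q3 <- ave(plot_Data$mpg, plot_Data$cyl,
                    FUN = function(v) quantile(v, 3/4))
plot_Data$IQR <- plot_Data$Q3 - plot_Data$Q1
plot_Data$upper.limit <- plot_Data$Q3 + 1.5 * plot_Data$IQR
plot_Data$lower.limit <- plot_Data$Q1 - 1.5 * plot_Data$IQR

# Keep only the rows that fall outside their group's limits
outliers <- subset(plot_Data, mpg > upper.limit | mpg < lower.limit)

# Boxplots without their default outlier points, plus our colored points
p_manual <- ggplot() +
  geom_boxplot(data = plot_Data,
               aes(x = factor(cyl), y = mpg, col = factor(cyl)),
               outlier.shape = NA) +
  geom_point(data = outliers,
             aes(x = factor(cyl), y = mpg, col = factor(cyl)))
```

Inspecting 'outliers' before plotting is also a quick way to confirm which rows are being treated as outliers.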

Update: The OP requested a more complete explanation of the 'ddply' function used in this code. Here it is:

The 'plyr' family of functions is basically a way of subsetting data and performing a function on each subset. In this particular case, we have the statement:

ddply(mtcars, .(cyl), mutate, Q1=quantile(mpg, 1/4), Q3=quantile(mpg, 3/4), IQR=Q3-Q1, upper.limit=Q3+1.5*IQR, lower.limit=Q1-1.5*IQR)
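
Conceptually, that statement splits the data by 'cyl', applies the same calculation to each piece, and stacks the pieces back together. The following base R sketch spells out those three steps by hand. It is meant only to illustrate the idea, not to describe how 'plyr' is implemented; the names 'pieces' and 'by_hand' are illustrative.

```r
# 1. Split: one data frame per value of cyl
pieces <- split(mtcars, mtcars$cyl)

# 2. Apply: add the limit columns to each piece
pieces <- lapply(pieces, function(g) {
  g$Q1 <- quantile(g$mpg, 1/4, names = FALSE)
  g$Q3 <- quantile(g$mpg, 3/4, names = FALSE)
  g$IQR <- g$Q3 - g$Q1
  g$upper.limit <- g$Q3 + 1.5 * g$IQR
  g$lower.limit <- g$Q1 - 1.5 * g$IQR
  g
})

# 3. Combine: stack the pieces into a single data frame again
by_hand <- do.call(rbind, pieces)
```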

Let's break this down in the order the statement would be written. First, the selection of the 'ddply' function. We want to calculate the lower and upper limits for each value of 'cyl' in the 'mtcars' data. We could write a 'for' loop or other statement to calculate these values, but then we would have to write another logic block later to assess outlier status. Instead, we want to use 'ddply' to calculate the lower and upper limits and add those values to every row. We choose 'ddply' (as opposed to 'dlply', 'd_ply', etc.) because we are inputting a data frame and want a data frame as output. This gives us:

ddply(

We want to perform the statement on the 'mtcars' data frame, so we add that.

ddply(mtcars, 

Now, we want to perform our calculations using the 'cyl' values as a grouping variable. We use the 'plyr' function .() to refer to the variable itself rather than to the variable's value, like so:

ddply(mtcars, .(cyl),

The next argument specifies the function to apply to every group. We want our calculation to add new columns to the old data, so we choose the 'mutate' function. This preserves the old data and adds the new calculations as new columns. This is in contrast to other functions like 'summarize', which removes all of the old columns except the grouping variable(s).

ddply(mtcars, .(cyl), mutate, 

The final series of arguments are all of the new columns of data we want to create. We define these by specifying a name (unquoted) and an expression. First, we create the 'Q1' column.

ddply(mtcars, .(cyl), mutate, Q1=quantile(mpg, 1/4), 

The 'Q3' column is calculated similarly.

ddply(mtcars, .(cyl), mutate, Q1=quantile(mpg, 1/4), Q3=quantile(mpg, 3/4), 

Luckily, with the 'mutate' function, we can use newly created columns as part of the definition of other columns. This saves us from having to write one giant function or from having to run multiple functions. We need to use 'Q1' and 'Q3' in the calculation of the inter-quartile range for the 'IQR' variable, and that's easy with the 'mutate' function.

ddply(mtcars, .(cyl), mutate, Q1=quantile(mpg, 1/4), Q3=quantile(mpg, 3/4), IQR=Q3-Q1, 
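
The difference is easiest to see next to base R's 'transform', which evaluates all of its arguments against the original columns only. A small sketch of the contrast on the ungrouped 'mtcars' data (the names 'with_mutate', 'with_transform', and 'transform_failed' are illustrative):

```r
library(plyr)

# plyr::mutate evaluates its arguments one after another,
# so IQR can refer to the Q1 and Q3 columns created just before it.
with_mutate <- mutate(mtcars,
                      Q1 = quantile(mpg, 1/4),
                      Q3 = quantile(mpg, 3/4),
                      IQR = Q3 - Q1)

# base::transform does not see Q1 and Q3 while computing IQR,
# so the same call is expected to stop with an "object not found" error.
with_transform <- try(
  transform(mtcars,
            Q1 = quantile(mpg, 1/4),
            Q3 = quantile(mpg, 3/4),
            IQR = Q3 - Q1),
  silent = TRUE
)
transform_failed <- inherits(with_transform, "try-error")
```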

We're finally where we want to be now. We technically don't need the 'Q1', 'Q3', and 'IQR' columns, but they make our lower limit and upper limit equations a lot easier to read and debug. We can write our expressions just like the theoretical formulas: lower limit = Q1 - 1.5 * IQR and upper limit = Q3 + 1.5 * IQR.

ddply(mtcars, .(cyl), mutate, Q1=quantile(mpg, 1/4), Q3=quantile(mpg, 3/4), IQR=Q3-Q1, upper.limit=Q3+1.5*IQR, lower.limit=Q1-1.5*IQR)

Cutting out the middle columns for readability, this is what the new data frame looks like:

plot_Data[, c(-3:-11)]
#     mpg cyl    Q1    Q3  IQR upper.limit lower.limit
# 1  22.8   4 22.80 30.40 7.60      41.800      11.400
# 2  24.4   4 22.80 30.40 7.60      41.800      11.400
# 3  22.8   4 22.80 30.40 7.60      41.800      11.400
# 4  32.4   4 22.80 30.40 7.60      41.800      11.400
# 5  30.4   4 22.80 30.40 7.60      41.800      11.400
# 6  33.9   4 22.80 30.40 7.60      41.800      11.400
# 7  21.5   4 22.80 30.40 7.60      41.800      11.400
# 8  27.3   4 22.80 30.40 7.60      41.800      11.400
# 9  26.0   4 22.80 30.40 7.60      41.800      11.400
# 10 30.4   4 22.80 30.40 7.60      41.800      11.400
# 11 21.4   4 22.80 30.40 7.60      41.800      11.400
# 12 21.0   6 18.65 21.00 2.35      24.525      15.125
# 13 21.0   6 18.65 21.00 2.35      24.525      15.125
# 14 21.4   6 18.65 21.00 2.35      24.525      15.125
# 15 18.1   6 18.65 21.00 2.35      24.525      15.125
# 16 19.2   6 18.65 21.00 2.35      24.525      15.125
# 17 17.8   6 18.65 21.00 2.35      24.525      15.125
# 18 19.7   6 18.65 21.00 2.35      24.525      15.125
# 19 18.7   8 14.40 16.25 1.85      19.025      11.625
# 20 14.3   8 14.40 16.25 1.85      19.025      11.625
# 21 16.4   8 14.40 16.25 1.85      19.025      11.625
# 22 17.3   8 14.40 16.25 1.85      19.025      11.625
# 23 15.2   8 14.40 16.25 1.85      19.025      11.625
# 24 10.4   8 14.40 16.25 1.85      19.025      11.625
# 25 10.4   8 14.40 16.25 1.85      19.025      11.625
# 26 14.7   8 14.40 16.25 1.85      19.025      11.625
# 27 15.5   8 14.40 16.25 1.85      19.025      11.625
# 28 15.2   8 14.40 16.25 1.85      19.025      11.625
# 29 13.3   8 14.40 16.25 1.85      19.025      11.625
# 30 19.2   8 14.40 16.25 1.85      19.025      11.625
# 31 15.8   8 14.40 16.25 1.85      19.025      11.625
# 32 15.0   8 14.40 16.25 1.85      19.025      11.625

For contrast, if we ran the same 'ddply' statement with the 'summarize' function instead, we would get all of the same answers, one row per group, but without the columns of the other data.

ddply(mtcars, .(cyl), summarize, Q1=quantile(mpg, 1/4), Q3=quantile(mpg, 3/4), IQR=Q3-Q1, upper.limit=Q3+1.5*IQR, lower.limit=Q1-1.5*IQR)
#   cyl    Q1    Q3  IQR upper.limit lower.limit
# 1   4 22.80 30.40 7.60      41.800      11.400
# 2   6 18.65 21.00 2.35      24.525      15.125
# 3   8 14.40 16.25 1.85      19.025      11.625
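
If only the summarized table is available, the limits have to be joined back onto the original rows before outliers can be identified, which is the extra step that 'mutate' avoids. A possible sketch using base R's 'merge' (the names 'limits', 'merged', and 'outliers' are illustrative):

```r
library(plyr)

# One row of limits per cyl group
limits <- ddply(mtcars, .(cyl), summarize,
                Q1 = quantile(mpg, 1/4),
                Q3 = quantile(mpg, 3/4),
                IQR = Q3 - Q1,
                upper.limit = Q3 + 1.5 * IQR,
                lower.limit = Q1 - 1.5 * IQR)

# Attach each group's limits to every row of that group
merged <- merge(mtcars, limits, by = "cyl")

# Rows outside their group's limits
outliers <- merged[merged$mpg > merged$upper.limit |
                   merged$mpg < merged$lower.limit, ]
```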
