Coloring boxplot outlier points in ggplot2?

Question

How can I color the outlier points in ggplot2? I want them to be the same color as the boxplot itself. colour= is not enough to do this.

Example:

library(ggplot2)
p <- ggplot(mtcars, aes(factor(cyl), mpg))
p + geom_boxplot(aes(colour=factor(cyl)))

I want to color the outliers by factor(cyl) as well. This does not work:

p <- ggplot(mtcars, aes(factor(cyl), mpg))
p + geom_boxplot(aes(colour=factor(cyl), outlier.colour=factor(cyl)))

Recommended answer

In order to color the outlier points the same as your boxplots, you're going to need to calculate the outliers and plot them separately. As far as I know, the built-in option for coloring outliers colors all outliers the same color.
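A note for readers on a more recent ggplot2: as far as I can tell from the geom_boxplot documentation, the outlier.* arguments default to NULL in current releases, which makes the outliers inherit the aesthetics of the box. Under that assumption (it may not hold on an older installation), a sketch like the following colors the outliers per group with no extra work. The manual approach still applies when you need full control over the points.

```r
library(ggplot2)

# Assumes a ggplot2 version in which outlier.colour = NULL means
# "inherit the colour used for the box" (the documented default in recent releases).
p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg, colour = factor(cyl))) +
    geom_boxplot(outlier.colour = NULL)
p
```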

Help file example

Using the same data as the 'geom_boxplot' help file:

ggplot(mtcars, aes(x=factor(cyl), y=mpg, col=factor(cyl))) +
    geom_boxplot()

Coloring the outlier points

There may be a more streamlined way to do this now, but I prefer to calculate things by hand so I don't have to guess what's going on under the hood. Using the 'plyr' package, we can quickly get the upper and lower limits under the default (Tukey) rule, which treats any point outside the range [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] as an outlier. Q1 and Q3 are the 1/4 and 3/4 quantiles of the data, and IQR = Q3 - Q1. We could write this all as one huge statement, but since the 'plyr' package's 'mutate' function lets us reference newly created columns, we might as well split it up for easier reading and debugging, like so:
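Before bringing in 'plyr', it may help to see the rule applied to a single group in base R. This is only a sketch: 'tukey_fences' is a helper name I made up for illustration, and it relies on the default quantile() method.

```r
# Tukey fences for a single group, using base R only.
# 'tukey_fences' is a hypothetical helper name used for illustration.
tukey_fences <- function(x) {
    q <- quantile(x, c(1/4, 3/4), names = FALSE)   # Q1 and Q3
    iqr <- q[2] - q[1]
    c(lower = q[1] - 1.5 * iqr, upper = q[2] + 1.5 * iqr)
}

mpg4 <- mtcars$mpg[mtcars$cyl == 4]
fences4 <- tukey_fences(mpg4)
fences4   # lower 11.4, upper 41.8

# Points outside the fences; the 4-cylinder group has none
mpg4[mpg4 < fences4["lower"] | mpg4 > fences4["upper"]]
```

The two limits agree with the 'lower.limit' and 'upper.limit' values for 'cyl' = 4 in the data frame printed further down.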

library(plyr)
# Per-cylinder quartiles and Tukey limits, added as new columns to every row of mtcars
plot_Data <- ddply(mtcars, .(cyl), mutate,
                   Q1=quantile(mpg, 1/4), Q3=quantile(mpg, 3/4), IQR=Q3-Q1,
                   upper.limit=Q3+1.5*IQR, lower.limit=Q1-1.5*IQR)

We use the 'ddply' function because we are passing in a data frame and want a data frame as output (a "d->d" ply). The 'mutate' function in the 'ddply' statement above preserves the original data frame and adds columns to it, and .(cyl) tells 'ddply' to run the calculations separately for each group of 'cyl' values.
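For readers who want to see the same "compute per group, keep every row" idea without 'plyr', here is a rough base R equivalent built on ave(), which applies a function within each group and returns a vector as long as its input. The name 'base_Data' is hypothetical, chosen so it does not clash with 'plot_Data'. Unlike 'ddply', this keeps the original row order rather than sorting by 'cyl'.

```r
# Base R sketch of the same per-group calculation; 'base_Data' is a
# hypothetical name chosen so it does not clash with 'plot_Data'.
base_Data <- mtcars
base_Data$Q1 <- ave(base_Data$mpg, base_Data$cyl, FUN = function(x) quantile(x, 1/4))
base_Data$Q3 <- ave(base_Data$mpg, base_Data$cyl, FUN = function(x) quantile(x, 3/4))
base_Data$IQR <- base_Data$Q3 - base_Data$Q1
base_Data$upper.limit <- base_Data$Q3 + 1.5 * base_Data$IQR
base_Data$lower.limit <- base_Data$Q1 - 1.5 * base_Data$IQR

# One row per cylinder count, to check the limits
unique(base_Data[, c("cyl", "upper.limit", "lower.limit")])
```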

At this point, we can plot the boxplot and then overlay the outliers with new, colored points.

ggplot() +
    geom_boxplot(data=plot_Data, aes(x=factor(cyl), y=mpg, col=factor(cyl))) + 
    geom_point(data=plot_Data[plot_Data$mpg > plot_Data$upper.limit | plot_Data$mpg < plot_Data$lower.limit,], aes(x=factor(cyl), y=mpg, col=factor(cyl)))

What the code does is start from an empty 'ggplot' call and then add the boxplot and point geometries, each with its own data. The boxplot geometry could use the original data frame, but I am using our new 'plot_Data' to be consistent. The point geometry then plots only the outlier points, using our new 'lower.limit' and 'upper.limit' columns to determine outlier status. Since we use the same specification for the 'x' and 'col' aesthetic arguments, the colors match automatically between the boxplots and the corresponding outlier points.
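One detail the plotting code leaves implicit: 'geom_boxplot' still draws its own outlier points, and ours are painted on top of them. A possible variation, sketched here, hides the built-in outliers with outlier.shape = NA (a documented 'geom_boxplot' argument) and keeps the outlier rows in their own data frame. The name 'outlier_Data' is my own and not part of the original answer.

```r
library(plyr)
library(ggplot2)

plot_Data <- ddply(mtcars, .(cyl), mutate,
                   Q1 = quantile(mpg, 1/4), Q3 = quantile(mpg, 3/4), IQR = Q3 - Q1,
                   upper.limit = Q3 + 1.5 * IQR, lower.limit = Q1 - 1.5 * IQR)

# Rows outside the Tukey limits; 'outlier_Data' is a hypothetical name
outlier_Data <- subset(plot_Data, mpg > upper.limit | mpg < lower.limit)

p <- ggplot(plot_Data, aes(x = factor(cyl), y = mpg, colour = factor(cyl))) +
    geom_boxplot(outlier.shape = NA) +   # suppress the built-in outlier points
    geom_point(data = outlier_Data)      # draw our own, inheriting x, y and colour
p
```

With 'mtcars', 'outlier_Data' ends up with three rows, all in the 8-cylinder group (mpg of 10.4, 10.4 and 19.2), which can be checked against the limits in the printed data frame.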

Update: The OP requested a more complete explanation of the 'ddply' function used in this code. Here it is:

The 'plyr' family of functions is basically a way of splitting data into subsets and applying a function to each subset. In this particular case, we have the statement:

ddply(mtcars, .(cyl), mutate, Q1=quantile(mpg, 1/4), Q3=quantile(mpg, 3/4), IQR=Q3-Q1, upper.limit=Q3+1.5*IQR, lower.limit=Q1-1.5*IQR)

Let's break this down in the order the statement would be written. First, the selection of the 'ddply' function. We want to calculate the lower and upper limits for each value of 'cyl' in the 'mtcars' data. We could write a 'for' loop or other statement to calculate these values, but then we would have to write another logic block later to assess outlier status. Instead, we want to use 'ddply' to calculate the lower and upper limits and add those values to every row. We choose 'ddply' (as opposed to 'dlply', 'd_ply', etc.) because we are passing in a data frame and want a data frame as output. This gives us:

ddply(

We want to perform the statement on the 'mtcars' data frame, so we add that.

ddply(mtcars, 

Now, we want to perform our calculations using the 'cyl' values as a grouping variable. We use the 'plyr' function .() to refer to the variable itself rather than to the variable's value, like so:
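As a small illustration of what .() buys us, the sketch below groups by 'cyl' twice: once with the quoted variable and once with a character vector, which the 'ddply' documentation also lists as an accepted form for the grouping argument. The names 'by_quoted' and 'by_string' are hypothetical.

```r
library(plyr)

# .(cyl) captures the unevaluated column name; a character vector is an
# alternative way to name the grouping variable.
by_quoted <- ddply(mtcars, .(cyl), summarize, n = length(mpg))
by_string <- ddply(mtcars, "cyl", summarize, n = length(mpg))

by_quoted$n   # 11 7 14 (rows for 4, 6 and 8 cylinders)
by_string$n   # same counts
```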

ddply(mtcars, .(cyl),

The next argument specifies the function to apply to every group. We want our calculation to add new columns to the old data, so we choose the 'mutate' function. This preserves the old data and adds the new calculations as new columns. This is in contrast to other functions like 'summarize', which removes all of the old columns except the grouping variable(s).

ddply(mtcars, .(cyl), mutate, 

The final series of arguments are all of the new columns of data we want to create. We define these by specifying a name (unquoted) and an expression. First, we create the 'Q1' column.

ddply(mtcars, .(cyl), mutate, Q1=quantile(mpg, 1/4), 

The 'Q3' column is calculated similarly.

ddply(mtcars, .(cyl), mutate, Q1=quantile(mpg, 1/4), Q3=quantile(mpg, 3/4), 

Luckily, with the 'mutate' function, we can use newly created columns as part of the definition of other columns. This saves us from having to write one giant function or from having to run multiple functions. We need to use 'Q1' and 'Q3' in the calculation of the inter-quartile range for the 'IQR' variable, and that's easy with the 'mutate' function.
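To see why this matters, compare 'mutate' with base R's transform(), which evaluates all of its arguments against the original columns. The toy data frame and the names 'toy', 'res' and 'failed' are made up for this sketch.

```r
library(plyr)

toy <- data.frame(x = 1:3)   # hypothetical toy data

# plyr::mutate evaluates its arguments one after another, so 'z' can use 'y'
res <- plyr::mutate(toy, y = x * 2, z = y + 1)
res$z   # 3 5 7

# base transform() evaluates every argument against the original columns,
# so the same call fails because 'y' does not exist yet
failed <- inherits(try(transform(toy, y = x * 2, z = y + 1), silent = TRUE), "try-error")
failed  # TRUE, provided no object named 'y' exists in the calling environment
```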

ddply(mtcars, .(cyl), mutate, Q1=quantile(mpg, 1/4), Q3=quantile(mpg, 3/4), IQR=Q3-Q1, 

We're finally where we want to be. We technically don't need the 'Q1', 'Q3', and 'IQR' columns, but they make our lower limit and upper limit equations a lot easier to read and debug. We can write our expressions just like the theoretical formulas: lower.limit = Q1 - 1.5 * IQR and upper.limit = Q3 + 1.5 * IQR.

ddply(mtcars, .(cyl), mutate, Q1=quantile(mpg, 1/4), Q3=quantile(mpg, 3/4), IQR=Q3-Q1, upper.limit=Q3+1.5*IQR, lower.limit=Q1-1.5*IQR)

Cutting out the middle columns for readability, this is what the new data frame looks like:

plot_Data[, c(-3:-11)]
#     mpg cyl    Q1    Q3  IQR upper.limit lower.limit
# 1  22.8   4 22.80 30.40 7.60      41.800      11.400
# 2  24.4   4 22.80 30.40 7.60      41.800      11.400
# 3  22.8   4 22.80 30.40 7.60      41.800      11.400
# 4  32.4   4 22.80 30.40 7.60      41.800      11.400
# 5  30.4   4 22.80 30.40 7.60      41.800      11.400
# 6  33.9   4 22.80 30.40 7.60      41.800      11.400
# 7  21.5   4 22.80 30.40 7.60      41.800      11.400
# 8  27.3   4 22.80 30.40 7.60      41.800      11.400
# 9  26.0   4 22.80 30.40 7.60      41.800      11.400
# 10 30.4   4 22.80 30.40 7.60      41.800      11.400
# 11 21.4   4 22.80 30.40 7.60      41.800      11.400
# 12 21.0   6 18.65 21.00 2.35      24.525      15.125
# 13 21.0   6 18.65 21.00 2.35      24.525      15.125
# 14 21.4   6 18.65 21.00 2.35      24.525      15.125
# 15 18.1   6 18.65 21.00 2.35      24.525      15.125
# 16 19.2   6 18.65 21.00 2.35      24.525      15.125
# 17 17.8   6 18.65 21.00 2.35      24.525      15.125
# 18 19.7   6 18.65 21.00 2.35      24.525      15.125
# 19 18.7   8 14.40 16.25 1.85      19.025      11.625
# 20 14.3   8 14.40 16.25 1.85      19.025      11.625
# 21 16.4   8 14.40 16.25 1.85      19.025      11.625
# 22 17.3   8 14.40 16.25 1.85      19.025      11.625
# 23 15.2   8 14.40 16.25 1.85      19.025      11.625
# 24 10.4   8 14.40 16.25 1.85      19.025      11.625
# 25 10.4   8 14.40 16.25 1.85      19.025      11.625
# 26 14.7   8 14.40 16.25 1.85      19.025      11.625
# 27 15.5   8 14.40 16.25 1.85      19.025      11.625
# 28 15.2   8 14.40 16.25 1.85      19.025      11.625
# 29 13.3   8 14.40 16.25 1.85      19.025      11.625
# 30 19.2   8 14.40 16.25 1.85      19.025      11.625
# 31 15.8   8 14.40 16.25 1.85      19.025      11.625
# 32 15.0   8 14.40 16.25 1.85      19.025      11.625

For contrast, if we ran the same 'ddply' statement with the 'summarize' function instead, we would get all of the same answers, but with one row per group and without the other data columns.

ddply(mtcars, .(cyl), summarize, Q1=quantile(mpg, 1/4), Q3=quantile(mpg, 3/4), IQR=Q3-Q1, upper.limit=Q3+1.5*IQR, lower.limit=Q1-1.5*IQR)
#   cyl    Q1    Q3  IQR upper.limit lower.limit
# 1   4 22.80 30.40 7.60      41.800      11.400
# 2   6 18.65 21.00 2.35      24.525      15.125
# 3   8 14.40 16.25 1.85      19.025      11.625
