Integrating ggplot2 with user-defined stat_function()


Problem description

I'm trying to superimpose plots of the identified component distributions on a plot of a mixture distribution, using the ggplot2 package and a user-defined function for its stat_function(). I have tried two approaches. The component identification works as expected in both cases:

number of iterations= 11 
summary of normalmixEM object:
         comp 1  comp 2
lambda 0.348900 0.65110
mu     2.019878 4.27454
sigma  0.237472 0.43542
loglik at estimate:  -276.3643 

A) However, in the first approach, the output contains the following error:

Error in eval(expr, envir, enclos) : object 'comp.number' not found

The reproducible example for this approach follows (faithful is a built-in R dataset):

library(ggplot2)
library(mixtools)

DISTRIB_COLORS <- c("green", "red")
NUM_COMPONENTS <- 2

set.seed(12345)

mix.info <- normalmixEM(faithful$eruptions, k = NUM_COMPONENTS,
                        maxit = 100, epsilon = 0.01)
summary(mix.info)

plot.components <- function(mix, comp.number) {
  g <- stat_function(fun = function(mix, comp.number) 
  {mix$lambda[comp.number] *
     dnorm(x, mean = mix$mu[comp.number],
           sd = mix$sigma[comp.number])}, 
  geom = "line", aes(colour = DISTRIB_COLORS[comp.number]))
  return (g)
}

g <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram(binwidth = 0.5)

distComps <- lapply(seq(NUM_COMPONENTS),
                    function(i) plot.components(mix.info, i))
print(g + distComps)
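
A note on approach A, offered as a sketch rather than a definitive diagnosis: stat_function() calls fun with a vector of x values as its first argument, so an inner function declared as function(mix, comp.number) never receives x as a parameter. One possible workaround is a closure that captures the mixture parameters and exposes only x. The helper name make_component and the hard-coded parameter list (copied from the summary output above) are assumptions for illustration; only base R is used here.

```r
# Hypothetical helper: returns a function of x only, which is the shape
# that stat_function(fun = ...) expects.
make_component <- function(mix, comp.number) {
  force(mix)
  force(comp.number)  # evaluate now, so each closure keeps its own index
  function(x) {
    mix$lambda[comp.number] *
      dnorm(x, mean = mix$mu[comp.number], sd = mix$sigma[comp.number])
  }
}

# Stand-in for the normalmixEM result, using the values printed above
mix <- list(lambda = c(0.348900, 0.65110),
            mu     = c(2.019878, 4.27454),
            sigma  = c(0.237472, 0.43542))

f1 <- make_component(mix, 1)
f1(c(2, 4))  # weighted density of component 1 at x = 2 and x = 4

# The weighted component integrates to its mixing proportion lambda[1]
area <- integrate(f1, lower = 0, upper = 4)$value
```

A closure like f1 could then be passed as stat_function(fun = f1), with the colour given as a constant outside aes().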

B) The second approach doesn't produce any errors. However, the only plot visible is that of the mixed distribution. Plots of its component distributions are either not produced or not visible (I believe a straight horizontal line at y = 0 is also visible, but I'm not 100% sure):

The following is a reproducible example for this approach:

library(ggplot2)
library(mixtools)

DISTRIB_COLORS <- c("green", "red")
NUM_COMPONENTS <- 2

set.seed(12345)

mix.info <- normalmixEM(faithful$eruptions, k = NUM_COMPONENTS,
                        maxit = 100, epsilon = 0.01)
summary(mix.info)

plot.components <- function(x, mix, comp.number, ...) {
  mix$lambda[comp.number] *
    dnorm(x, mean = mix$mu[comp.number],
          sd = mix$sigma[comp.number], ...)
}

g <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram(binwidth = 0.5)

distComps <- lapply(seq(NUM_COMPONENTS), function(i)
  stat_function(fun = plot.components,
                args = list(mix = mix.info, comp.number = i)))
print(g + distComps)
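
One likely explanation for the flat line in approach B, judging only from the code above: the mixture is fitted to faithful$eruptions (component means near 2 and 4.3), while the plot's x axis is faithful$waiting (roughly 43 to 96). Evaluated over that range, the fitted components are numerically zero. The following base-R check is a sketch; the parameter values are copied from the summary output and the variable names are made up for illustration.

```r
# Grid similar to what stat_function() would evaluate over the waiting axis
x_grid <- seq(min(faithful$waiting), max(faithful$waiting), length.out = 101)

# Second fitted component (lambda, mu, sigma taken from the summary output)
vals <- 0.65110 * dnorm(x_grid, mean = 4.27454, sd = 0.43542)

max_val <- max(vals)
max_val  # effectively zero: the curve collapses onto y = 0
```

Fitting the mixture to the same variable that is plotted (as the reworked solution does with faithful$waiting) avoids this mismatch.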

Question: What are the problems in each of the approaches and which one is (more) correct?

UPDATE: Just minutes after posting, I realized that I forgot to include the line-drawing part of the stat_function() call for the second approach, so the corresponding lines should read as follows:

distComps <- lapply(seq(NUM_COMPONENTS), function(i)
  stat_function(fun = plot.components,
                args = list(mix = mix.info, comp.number = i)),
  geom = "line", aes(colour = DISTRIB_COLORS[i]))

However, this update produces an error, the source of which I don't quite understand:

Error in FUN(1:2[[1L]], ...) : 
  unused arguments (geom = "line", list(colour = DISTRIB_COLORS[i]))
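
A plausible reading of this error, based on the parenthesis placement in the snippet above: the stat_function() call is closed before geom and aes(), so those two become extra arguments to lapply(), which forwards them to the anonymous function(i). That function accepts only i, hence "unused arguments". The minimal base-R sketch below reproduces the same mechanism; the function f is a made-up stand-in.

```r
# Stand-in for the anonymous function(i) in the lapply() call
f <- function(i) i * 2

# Extra arguments given to lapply() are passed on to f, which rejects them
msg <- tryCatch(
  lapply(1:2, f, geom = "line"),
  error = function(e) conditionMessage(e)
)
msg

# With the extra argument moved inside the function body, the call succeeds
ok <- lapply(1:2, function(i) f(i))
```

Moving geom and the colour setting inside the stat_function() parentheses, as the reworked solution does, keeps them away from lapply().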

Solution

I finally figured out how to do what I wanted and reworked my solution. I adapted parts of the answers by @Spacedman and @jlhoward to this question (which I hadn't seen at the time of posting mine): Any suggestions for how I can plot mixEM type data using ggplot2. My solution is a little different, however. On one hand, I used @Spacedman's approach of using stat_function(), the same idea I tried in my original version; I like it better than the alternative, which seems a bit too complex (though more flexible). On the other hand, similarly to @jlhoward's approach, I simplified the parameter passing. I also introduced some visual improvements, such as automatic selection of well-differentiated colors to make the component distributions easier to identify. For my EDA, I refactored this code into an R module. However, there is still one issue that I'm trying to figure out: why the component distribution plots are located below the expected density plots, as shown below. Any advice on this issue will be much appreciated!

UPDATE: Finally, I've figured out the issue with scaling and updated the code and the figure accordingly - the y values need to be multiplied by the value of binwidth (in this case, it's 0.5) to account for the number of observations per bin.
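
The scaling relationship can be checked without ggplot2. In a histogram, density is the count divided by (n * binwidth), so multiplying density by binwidth gives the proportion of observations in each bin. The sketch below uses base R's hist() as a stand-in for geom_histogram(); the break positions are an assumption chosen for illustration and need not match ggplot2's bin alignment.

```r
x <- faithful$waiting
binwidth <- 0.5

# Illustrative breaks covering the data range in steps of binwidth
breaks <- seq(min(x) - binwidth, max(x) + binwidth, by = binwidth)
h <- hist(x, breaks = breaks, plot = FALSE)

# The density histogram integrates to 1
total_area <- sum(h$density * binwidth)

# binwidth * density equals the per-bin proportion of observations
scaled <- h$density * binwidth
proportion <- h$counts / length(x)
all.equal(scaled, proportion)
```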

Here's the complete reworked reproducible solution:

library(ggplot2)
library(RColorBrewer)
library(mixtools)

NUM_COMPONENTS <- 2

set.seed(12345) # for reproducibility

data <- faithful$waiting # use R built-in data

# extract 'k' components from mixed distribution 'data'
mix.info <- normalmixEM(data, k = NUM_COMPONENTS,
                        maxit = 100, epsilon = 0.01)
summary(mix.info)

numComponents <- length(mix.info$sigma)
message("Extracted number of component distributions: ",
        numComponents)

calc.components <- function(x, mix, comp.number) {
  mix$lambda[comp.number] *
    dnorm(x, mean = mix$mu[comp.number], sd = mix$sigma[comp.number])
}

g <- ggplot(data.frame(x = data)) +
  geom_histogram(aes(x = x, y = 0.5 * ..density..),
                 fill = "white", color = "black", binwidth = 0.5)

# we could select needed number of colors randomly:
#DISTRIB_COLORS <- sample(colors(), numComponents)

# or, better, use a palette with more color differentiation:
DISTRIB_COLORS <- brewer.pal(numComponents, "Set1")

distComps <- lapply(seq(numComponents), function(i)
  stat_function(fun = calc.components,
                args = list(mix = mix.info, comp.number = i),
                geom = "line", # use alpha=.5 for "polygon"
                size = 2,
                color = DISTRIB_COLORS[i]))
print(g + distComps)
