带有R中公式的箱线图 [英] boxplots with formulas in r

查看:63
本文介绍了带有R中公式的箱线图的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我不明白公式在R中的工作原理,什么是公式。
那么,如果我有一个包含每月时间序列的向量,该如何创建一个箱形图,将数据按季节逻辑划分?我希望每个月有12个盒子。

解决方案

在存储级别,公式是



注意公式的两边如何由文字表达式组成:

  1:9; 
## [1] 1 2 3 4 5 6 7 8 9
1:9 %% 3L;
## [1] 1 2 0 1 2 0 1 2 0

内部, boxplot()必须独立评估公式的每一边才能生成数据和分组向量,就像您将两个表达式作为单独的参数传递一样。






因此,让我们创建一个每月时间序列箱图的简单演示:

  N<-36L; df<-data.frame(date = seq(as.Date('2016-01-01'),by ='month',len = N),y = rnorm(N)); 
df;
##日期y
## 1 2016-01-01 -1.56004488
## 2 2016-02-01 0.65699747
## 3 2016-03-01 0.05729631
## 4 2016-04-01 -0.02092276
## 5 2016-05-01 0.46673530
## 6 2016-06-01 -0.18652580
## 7 2016-07- 01 0.06228650
## 8 2016-08-01 1.54452267
## 9 2016-09-01 1.06643594
## 2016年10月1日-1.51178160
## 2016年11月-11-01 0.82904673
## 12 2016-12-01 0.37667201
## 13 2017-01-01 -0.10135801
## 14 2017-02-01 0.94692462
# #15 2017-03-01 -1.60781946
## 16 2017-04-01 0.47189753
## 17 2017-05-01 -1.32869317
## 18 2017-06-01 -0.49821455
## 19 2017-07-01 0.54474606
## 20 2017-08-01 0.47565264
## 21 2017-09-01 -0.97494730
## 22 2017-10 -01 -1.22781588
## 23 2017-11-01 -0.34919086
## 24 2017-12-01 -0.78153843
## 25 2018-01-01 -0.59355220
## 26 2018-02-01 -2.58287605
## 27 2018-03-01 1.42148186
## 28 2018-04-01 -1.01278176
## 29201 8-05-01 -0.80961662
## 30 2018-06-01 0.19793126
## 31 2018-07-01 -1.03072915
## 32 2018-08-01 -0.87896416
## 33 2018-09-01 -2.36216655
## 34 2018-10-01 1.82708221
## 35 2018-11-01 0.05579195
## 36 2018-12-01 1.28612246
箱线图(y〜months(date),df);






如果需要,您可以研究源代码,这需要跟踪S3查找过程:

  boxplot; 
##函数(x,...)
## UseMethod( boxplot)
##<字节码:0x600b50760>
##<环境:namespace:graphics>
graphics ::: boxplot.formula;
##函数(公式,数据= NULL,...,子集,na.action = NULL)
## {
## if(missing(formula)||(length(公式)!= 3L))
## stop('formula'丢失或不正确)
## m<-match.call(expand.dots = FALSE)
## if(is.matrix(eval(m $ data,parent.frame())))
## m $ data<-as.data.frame(data)
## m $ .. 。--NULL
## m $ na.action<--na.action
## m [[1L]]<-quote(stats :: model.frame)
## mf<-eval(m,parent.frame())
## response<-attr(attr(mf, terms), response)
## boxplot(split (mf [[response]],mf [-response]),...)
##}
##<字节码:0x6035c67f8>
##<环境:namespace:graphics>

这几乎是令人讨厌的环形交叉路口,非常复杂,但是 graphics ::: boxplot。 Formula()有效地检索(通过 match.call() )导致自身调用的解析树,对其进行一些按摩,最显着的是替换其自身的功能符号 boxplot.formula stats :: model.frame ,然后评估该新的分析树,从而调用 stats :: model。 frame() 。该函数本身非常复杂,并且需要进一步的S3查找,但这是最相关的代码:

  model.frame; 
##函数(公式,...)
## UseMethod( model.frame)
##<字节码:0x601464b18>
##<环境:namespace:stats>
model.frame.default;
##函数(公式,数据= NULL,子集= NULL,na.action = na.fail,
## drop.unused.levels = FALSE,xlev = NULL,...)
## {
##
## ... snip ...
##
## if(!inherits(formula, terms))
##公式<-项(公式,数据=数据)
## env<-环境(公式)
##行名<-.row_names_info(数据,0L)
## vars<-attr(公式,变量)
## predvars<-attr(公式, predvars)
## if(is.null(predvars))
## predvars<-vars
## varnames<-sapply(vars,function(x)paste(deparse(x,width.cutoff = 500),
##塌陷= ))[-1L]
##变量<-eval(predvars,data,env)
##
## ... snip ...
##

因此,最终,它从公式对象中检索单个表达式,并使用 eval() ,并使用给定的data.frame和公式的封闭环境作为上下文,从而给出结果向量:

  attr(terms(y〜months(date),data = df),'variables'); 
## list(y,months(date))
eval(attr(terms(y〜months(date),data = df),'variables'),df);
## [[1]]
## [1] -1.56004488 0.65699747 0.05729631 -0.02092276 0.46673530 -0.18652580 0.06228650 1.54452267 1.06643594 -1.51178160 0.82904673 0.37667201 -0.10135801 0.94692462 -1.60781946 0.47189753 -1.328690.5447 b ## [20] 0.47565264 -0.97494730 -1.22781588 -0.34919086 -0.78153843 -0.59355220 -2.58287605 1.42148186 -1.01278176 -0.80961662 0.19793126 -1.03072915 -0.87896416 -2.36216655 1.82708221 0.05579195 1.28612246
##
## [ ]
## [1] 1月 2月 3月 4月 5月 6月 7月 8月 9月 10月 11月 12月 1月二月三月四月五月六月七月
## [20]八月九月十月十一月十二月一月二月三月 四月五月六月七月八月九月十月十一月十二月
##

重申在前面提到的一点上,请注意在评估过程中如何找不到`〜`()函数。 R公式的任意实现细节是,它们将`〜`()函数编码为公式对象的解析树存储表示形式中的顶级函数符号。公式边的实际求值(如果出现)不涉及对该函数的求值。



最后,让我们考虑一下如果您实际执行该操作会发生什么评估包含公式存储表示形式的整个分析树。回想一下`〜`()函数所做的只是从其参数创建公式。因此,对一个公式求值会产生一个有趣的效果,即吐出与刚刚求出的公式完全相同的公式:

  f1; 
## a + b〜c / d
eval(f1);
## a + b〜c / d
eval(eval(f1));
## a + b〜c / d






我还对解析树和公式写了一些其他的答案,您可能会对它们感兴趣:




I can't understand how formulas work in R, what formulas are. So, if I have a vector containing a monthly time series, how could I create a boxplot of it, where datas are divided in a seasonal logic? I would like to have 12 boxes, one per month.

解决方案

At the storage level, a formula is a parse tree. The parse tree encodes a call to the `~`() function taking one or two arguments. For a one-sided formula, it takes a single argument which represents the RHS, while for a two-sided formula it takes two arguments which represent the LHS and RHS of the formula.

It should be mentioned that the call to `~`() which is embedded in a formula's parse tree storage representation doesn't really mean anything. In general, the `~`() function doesn't actually do anything, other than allowing for the creation of formula objects, either explicitly (e.g. `~`(a+b,c/d)) or using the syntactic sugar feature provided by the R language (e.g. a+b~c/d). The use of the `~`() function in the encoding of the top level of the formula's parse tree storage representation is a fairly arbitrary and inconsequential implementation detail. I'll expand on this later on.


The R language makes it possible to break down parse trees into recursive list structures, which can help us to inspect and understand the structure of these parse trees.

I've written a short recursive function that can do this:

ptunwrap <- function(x) if (typeof(x)=='language') lapply(as.list(x),ptunwrap) else x;

So let's look at an example:

f1 <- a+b~c/d;
f1;
## a + b ~ c/d
ptunwrap(f1);
## [[1]]
## `~`
##
## [[2]]
## [[2]][[1]]
## `+`
##
## [[2]][[2]]
## a
##
## [[2]][[3]]
## b
##
##
## [[3]]
## [[3]][[1]]
## `/`
##
## [[3]][[2]]
## c
##
## [[3]][[3]]
## d
##

When a parse tree contains a function call, the function call is represented as a single list node within the recursive list structure. The symbol of the function is embedded as the first list component of that list, and its arguments are embedded as the subsequent list components of the list.

There are three function calls in the above parse tree. The top-level list represents the call to the `~`() function, as I described earlier. The second top-level list component further branches into another list, which consists of a call to the `+`() function, which itself takes two arguments, the symbols a and b. The third component is similar, representing a call to the `/`() function and again taking two arguments, symbols c and d.


It is important to understand that although a parse tree always represents syntactically valid R code, and has the ability to be evaluated to produce a single result value, it is not necessary to evaluate a parse tree. It is perfectly possible to create a parse tree and never evaluate it.

What, then, is the purpose of creating a parse tree that you never evaluate? In R this is often done to facilitate communication of certain pieces of information to functions using a convenient syntax.

As a random example, the data.table package allows for the following syntactic sugar to add a column to an existing data.table, using the := operator:

library(data.table);
dt <- data.table(a=1:3);
dt[,b:=a*2L];
dt;
##    a b
## 1: 1 2
## 2: 2 4
## 3: 3 6

Internally, data.table uses non-standard argument evaluation to retrieve the parse tree of the second argument (technically third in the function definition; second in syntactic-sugar form) to the `[.data.table`() function, often referred to as the "j argument" due to the parameter name being j. If you want, you can actually examine the source itself directly in R. Here's a snippet of the most relevant piece of code:

data.table:::`[.data.table`;
## function (x, i, j, by, keyby, with = TRUE, nomatch = getOption("datatable.nomatch"),
##     mult = "all", roll = FALSE, rollends = if (roll == "nearest") c(TRUE,
##         TRUE) else if (roll >= 0) c(FALSE, TRUE) else c(TRUE,
##         FALSE), which = FALSE, .SDcols, verbose = getOption("datatable.verbose"),
##     allow.cartesian = getOption("datatable.allow.cartesian"),
##     drop = NULL, on = NULL)
## {
##
## ... snip ...
##
##     if (!missing(j)) {
##         jsub = substitute(j)
##         if (is.call(jsub))
##             jsub = construct(deconstruct_and_eval(replace_dot(jsub),
##                 parent.frame(), parent.frame()))
##     }
##
## ... snip ...
##

We can see they're using substitute(j) to retrieve the parse tree of the j argument. For the demo above, this is what they would get:

ptunwrap(substitute(b:=a*2L));
## [[1]]
## `:=`
##
## [[2]]
## b
##
## [[3]]
## [[3]][[1]]
## `*`
##
## [[3]][[2]]
## a
##
## [[3]][[3]]
## [1] 2
##

Later on in the code, they test if the top-level function symbol is :=, which is the operator supported for adding (or modifying, or removing with a RHS of NULL) columns to the data.table. If so, they test if the LHS consists of a single bareword, which is taken as the name of the column to add (or modify or remove). Observe that it is, in fact, not possible for them to actually evaluate the LHS of the parse tree in this case, because it consists of a symbol that does not yet exist in the data.table. However, the RHS does end up being evaluated to produce the column vector to be added to the data.table under the new name.

So it should be clear that formulas can be used in a variety of contexts in R, and they are not always evaluated. Sometimes the parse tree is simply inspected to retrieve information passed from the caller to the callee. Even in contexts where they are evaluated, sometimes only the LHS or RHS (or both) will be evaluated independently, ignoring the top-level function symbol that was embedded in the parse tree at the time it was created.


Turning to the boxplot() function, let's look at the documentation on the formula argument:

a formula, such as y ~ grp, where y is a numeric vector of data values to be split into groups according to the grouping variable grp (usually a factor).

In this case, both sides of the formula end up being evaluated independently, the LHS providing the data vector, and the RHS providing the grouping definition.

A good way to demonstrate this is as follows:

boxplot(1:9~1:9%%3L);

Notice how both sides of the formula consist of literal expressions:

1:9;
## [1] 1 2 3 4 5 6 7 8 9
1:9%%3L;
## [1] 1 2 0 1 2 0 1 2 0

Internally, boxplot() had to independently evaluate each side of the formula to produce the data and grouping vectors, almost as if you had passed the two expressions as separate arguments.


So, let's create a simple demonstration of a monthly time series boxplot:

N <- 36L; df <- data.frame(date=seq(as.Date('2016-01-01'),by='month',len=N),y=rnorm(N));
df;
##          date           y
## 1  2016-01-01 -1.56004488
## 2  2016-02-01  0.65699747
## 3  2016-03-01  0.05729631
## 4  2016-04-01 -0.02092276
## 5  2016-05-01  0.46673530
## 6  2016-06-01 -0.18652580
## 7  2016-07-01  0.06228650
## 8  2016-08-01  1.54452267
## 9  2016-09-01  1.06643594
## 10 2016-10-01 -1.51178160
## 11 2016-11-01  0.82904673
## 12 2016-12-01  0.37667201
## 13 2017-01-01 -0.10135801
## 14 2017-02-01  0.94692462
## 15 2017-03-01 -1.60781946
## 16 2017-04-01  0.47189753
## 17 2017-05-01 -1.32869317
## 18 2017-06-01 -0.49821455
## 19 2017-07-01  0.54474606
## 20 2017-08-01  0.47565264
## 21 2017-09-01 -0.97494730
## 22 2017-10-01 -1.22781588
## 23 2017-11-01 -0.34919086
## 24 2017-12-01 -0.78153843
## 25 2018-01-01 -0.59355220
## 26 2018-02-01 -2.58287605
## 27 2018-03-01  1.42148186
## 28 2018-04-01 -1.01278176
## 29 2018-05-01 -0.80961662
## 30 2018-06-01  0.19793126
## 31 2018-07-01 -1.03072915
## 32 2018-08-01 -0.87896416
## 33 2018-09-01 -2.36216655
## 34 2018-10-01  1.82708221
## 35 2018-11-01  0.05579195
## 36 2018-12-01  1.28612246
boxplot(y~months(date),df);


If you want, you can study the source code, which requires tracing the S3 lookup process:

boxplot;
## function (x, ...)
## UseMethod("boxplot")
## <bytecode: 0x600b50760>
## <environment: namespace:graphics>
graphics:::boxplot.formula;
## function (formula, data = NULL, ..., subset, na.action = NULL)
## {
##     if (missing(formula) || (length(formula) != 3L))
##         stop("'formula' missing or incorrect")
##     m <- match.call(expand.dots = FALSE)
##     if (is.matrix(eval(m$data, parent.frame())))
##         m$data <- as.data.frame(data)
##     m$... <- NULL
##     m$na.action <- na.action
##     m[[1L]] <- quote(stats::model.frame)
##     mf <- eval(m, parent.frame())
##     response <- attr(attr(mf, "terms"), "response")
##     boxplot(split(mf[[response]], mf[-response]), ...)
## }
## <bytecode: 0x6035c67f8>
## <environment: namespace:graphics>

It is almost exasperatingly roundabout and complex, but graphics:::boxplot.formula() effectively retrieves (via match.call()) the parse tree that caused its own invocation, massages it a little, most notably replacing its own function symbol boxplot.formula with stats::model.frame, and then evaluates that new parse tree, thereby calling stats::model.frame(). That function is itself very complex, and involves further S3 lookup, but here's the most relevant code:

model.frame;
## function (formula, ...)
## UseMethod("model.frame")
## <bytecode: 0x601464b18>
## <environment: namespace:stats>
model.frame.default;
## function (formula, data = NULL, subset = NULL, na.action = na.fail,
##     drop.unused.levels = FALSE, xlev = NULL, ...)
## {
##
## ... snip ...
##
##     if (!inherits(formula, "terms"))
##         formula <- terms(formula, data = data)
##     env <- environment(formula)
##     rownames <- .row_names_info(data, 0L)
##     vars <- attr(formula, "variables")
##     predvars <- attr(formula, "predvars")
##     if (is.null(predvars))
##         predvars <- vars
##     varnames <- sapply(vars, function(x) paste(deparse(x, width.cutoff = 500),
##         collapse = " "))[-1L]
##     variables <- eval(predvars, data, env)
##
## ... snip ...
##

So, eventually, it retrieves the individual expressions from the formula object, and evaluates them using eval() with the given data.frame and closure environment of the formula as context, which gives the result vectors:

attr(terms(y~months(date),data=df),'variables');
## list(y, months(date))
eval(attr(terms(y~months(date),data=df),'variables'),df);
## [[1]]
##  [1] -1.56004488  0.65699747  0.05729631 -0.02092276  0.46673530 -0.18652580  0.06228650  1.54452267  1.06643594 -1.51178160  0.82904673  0.37667201 -0.10135801  0.94692462 -1.60781946  0.47189753 -1.32869317 -0.49821455  0.54474606
## [20]  0.47565264 -0.97494730 -1.22781588 -0.34919086 -0.78153843 -0.59355220 -2.58287605  1.42148186 -1.01278176 -0.80961662  0.19793126 -1.03072915 -0.87896416 -2.36216655  1.82708221  0.05579195  1.28612246
##
## [[2]]
##  [1] "January"   "February"  "March"     "April"     "May"       "June"      "July"      "August"    "September" "October"   "November"  "December"  "January"   "February"  "March"     "April"     "May"       "June"      "July"
## [20] "August"    "September" "October"   "November"  "December"  "January"   "February"  "March"     "April"     "May"       "June"      "July"      "August"    "September" "October"   "November"  "December"
##

To reiterate a point made earlier, notice how the `~`() function is nowhere to be found in the evaluation process. It is an arbitrary implementation detail of R formulas that they encode the `~`() function as the top-level function symbol in the parse tree storage representation of formula objects. The actual evaluation of the side(s) of the formula, if it occurs, does not involve evaluation of that function.

Finally, let's consider what happens if you do actually evaluate the entire parse tree that comprises the storage representation of a formula. Recall that the `~`() function does nothing more than create a formula from its arguments. Therefore, evaluating a formula has the interesting effect of spitting out the very same formula that was just evaluated:

f1;
## a + b ~ c/d
eval(f1);
## a + b ~ c/d
eval(eval(f1));
## a + b ~ c/d


I've written a couple of other answers on parse trees and formulas which may be of interest to you:

这篇关于带有R中公式的箱线图的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆