Grouping functions (tapply, by, aggregate) and the *apply family
Whenever I want to do something "map"py in R, I usually try to use a function in the apply family.
However, I've never quite understood the differences between them -- how {sapply, lapply, etc.} apply the function to the input/grouped input, what the output will look like, or even what the input can be -- so I often just go through them all until I get what I want.
Can someone explain how to use which one when?
My current (probably incorrect/incomplete) understanding is...
sapply(vec, f): input is a vector. output is a vector/matrix, where element i is f(vec[i]), giving you a matrix if f has a multi-element output
lapply(vec, f): same as sapply, but output is a list?
apply(matrix, 1/2, f): input is a matrix. output is a vector, where element i is f(row/col i of the matrix)
tapply(vector, grouping, f): output is a matrix/array, where an element in the matrix/array is the value of f at a grouping g of the vector, and g gets pushed to the row/col names
by(dataframe, grouping, f): let g be a grouping. apply f to each column of the group/dataframe. pretty print the grouping and the value of f at each column.
aggregate(matrix, grouping, f): similar to by, but instead of pretty printing the output, aggregate sticks everything into a dataframe.
Side question: I still haven't learned plyr or reshape -- would plyr or reshape replace all of these entirely?
R has many *apply functions which are ably described in the help files (e.g. ?apply). There are enough of them, though, that beginning useRs may have difficulty deciding which one is appropriate for their situation or even remembering them all. They may have a general sense that "I should be using an *apply function here", but it can be tough to keep them all straight at first.
Despite the fact (noted in other answers) that much of the functionality of the *apply family is covered by the extremely popular plyr package, the base functions remain useful and worth knowing.
This answer is intended to act as a sort of signpost for new useRs to help direct them to the correct *apply function for their particular problem. Note, this is not intended to simply regurgitate or replace the R documentation! The hope is that this answer helps you to decide which *apply function suits your situation and then it is up to you to research it further. With one exception, performance differences will not be addressed.
apply - When you want to apply a function to the rows or columns of a matrix (and higher-dimensional analogues); not generally advisable for data frames as it will coerce to a matrix first.

    # Two dimensional matrix
    M <- matrix(seq(1,16), 4, 4)

    # apply min to rows
    apply(M, 1, min)
    [1] 1 2 3 4

    # apply max to columns
    apply(M, 2, max)
    [1]  4  8 12 16

    # 3 dimensional array
    M <- array(seq(32), dim = c(4,4,2))

    # Apply sum across each M[*, , ] - i.e. sum across 2nd and 3rd dimension
    apply(M, 1, sum)
    # Result is one-dimensional
    [1] 120 128 136 144

    # Apply sum across each M[*, *, ] - i.e. sum across 3rd dimension
    apply(M, c(1,2), sum)
    # Result is two-dimensional
         [,1] [,2] [,3] [,4]
    [1,]   18   26   34   42
    [2,]   20   28   36   44
    [3,]   22   30   38   46
    [4,]   24   32   40   48

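To make the data frame caveat above concrete, here is a minimal sketch. The data frame df and its columns id and label are invented for illustration; the only point is that apply() converts its input with as.matrix() first, so columns of mixed types end up as a single type.

```r
# Hypothetical data frame with one integer and one character column
df <- data.frame(id = 1:3, label = c("a", "b", "c"))

# apply() works on as.matrix(df), which here is a character matrix
m <- as.matrix(df)
is_char_matrix <- is.character(m)

# Every row therefore reaches the function as a character vector
row_classes <- apply(df, 1, class)

# sapply() treats the data frame as a list of columns,
# so each column keeps its own type
col_classes <- sapply(df, class)
```

For column-wise work on a data frame, lapply or sapply is usually the safer choice for this reason.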
If you want row/column means or sums for a 2D matrix, be sure to investigate the highly optimized, lightning-quick colMeans, rowMeans, colSums, rowSums.
lapply - When you want to apply a function to each element of a list in turn and get a list back.
This is the workhorse of many of the other *apply functions. Peel back their code and you will often find lapply underneath.

    x <- list(a = 1, b = 1:3, c = 10:100)
    lapply(x, FUN = length)
    $a
    [1] 1
    $b
    [1] 3
    $c
    [1] 91

    lapply(x, FUN = sum)
    $a
    [1] 1
    $b
    [1] 6
    $c
    [1] 5005

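Because a data frame is itself a list of columns, lapply also works column by column on one. The following is only a sketch; the data frame df and its columns height and weight are hypothetical.

```r
# Hypothetical data frame; each column is one element of the underlying list
df <- data.frame(height = c(1.5, 1.8, 1.7), weight = c(60, 80, 72))

# lapply() visits each column and always returns a list, one entry per column
col_means <- lapply(df, mean)

# Extra arguments after FUN are passed on to every call
col_medians <- lapply(df, quantile, probs = 0.5)

is.list(col_means)   # TRUE
names(col_means)     # "height" "weight"
```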
sapply - When you want to apply a function to each element of a list in turn, but you want a vector back, rather than a list.
If you find yourself typing unlist(lapply(...)), stop and consider sapply.

    x <- list(a = 1, b = 1:3, c = 10:100)
    # Compare with above; a named vector, not a list
    sapply(x, FUN = length)
     a  b  c
     1  3 91

    sapply(x, FUN = sum)
       a    b    c
       1    6 5005

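A small check of the unlist(lapply(...)) remark, plus one caveat worth knowing: sapply only simplifies when it can. This is a sketch reusing the list x from the example; the filtering function is made up for illustration.

```r
x <- list(a = 1, b = 1:3, c = 10:100)

# When every result has length 1, sapply() matches unlist(lapply())
same <- identical(sapply(x, length), unlist(lapply(x, length)))

# When result lengths differ, sapply() cannot simplify and returns a list
ragged <- sapply(x, function(v) v[v > 2])

same             # TRUE
is.list(ragged)  # TRUE
```

The second case is why the return type of sapply can be hard to predict inside functions.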
In more advanced uses of sapply it will attempt to coerce the result to a multi-dimensional array, if appropriate. For example, if our function returns vectors of the same length, sapply will use them as columns of a matrix:

    sapply(1:5, function(x) rnorm(3, x))

If our function returns a 2 dimensional matrix, sapply will do essentially the same thing, treating each returned matrix as a single long vector:

    sapply(1:5, function(x) matrix(x, 2, 2))

Unless we specify simplify = "array", in which case it will use the individual matrices to build a multi-dimensional array:

    sapply(1:5, function(x) matrix(x, 2, 2), simplify = "array")

Each of these behaviors is of course contingent on our function returning vectors or matrices of the same length or dimension.
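The three calls above differ mainly in the shape of what comes back, which is easiest to see by checking dim(). A short sketch (set.seed is added here only to make the random draws reproducible):

```r
set.seed(1)  # only so the random draws are reproducible

cols  <- sapply(1:5, function(x) rnorm(3, x))      # five length-3 vectors
flat  <- sapply(1:5, function(x) matrix(x, 2, 2))  # five 2x2 matrices, flattened
stack <- sapply(1:5, function(x) matrix(x, 2, 2), simplify = "array")

dim(cols)   # 3 5
dim(flat)   # 4 5
dim(stack)  # 2 2 5
```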
vapply - When you want to use sapply but perhaps need to squeeze some more speed out of your code or want more type safety.
For vapply, you basically give R an example of what sort of thing your function will return, which can save some time coercing returned values to fit in a single atomic vector.

    x <- list(a = 1, b = 1:3, c = 10:100)
    # Note that since the advantage here is mainly speed, this
    # example is only for illustration. We're telling R that
    # everything returned by length() should be an integer of
    # length 1.
    vapply(x, FUN = length, FUN.VALUE = 0L)
     a  b  c
     1  3 91

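The type-safety side of vapply can be illustrated with two cases where sapply and vapply behave differently. This is a sketch under the assumption that x is the list from the example above; the identity function is used only to provoke a mismatch.

```r
x <- list(a = 1, b = 1:3, c = 10:100)

# On empty input, sapply() falls back to an empty list, while vapply()
# still returns a vector of the declared type
empty_s <- sapply(list(), length)
empty_v <- vapply(list(), length, FUN.VALUE = integer(1))

# vapply() stops with an error if a result does not match FUN.VALUE;
# here the elements of x have lengths 1, 3 and 91, not the declared 1
mismatch <- tryCatch(
  vapply(x, function(v) v, FUN.VALUE = numeric(1)),
  error = function(e) "error"
)
```

Failing early like this is often more useful in package or pipeline code than silently getting a list back.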
mapply - For when you have several data structures (e.g. vectors, lists) and you want to apply a function to the 1st elements of each, and then the 2nd elements of each, etc., coercing the result to a vector/array as in sapply.
This is multivariate in the sense that your function must accept multiple arguments.

    # Sums the 1st elements, the 2nd elements, etc.
    mapply(sum, 1:5, 1:5, 1:5)
    [1]  3  6  9 12 15

    # To do rep(1,4), rep(2,3), etc.
    mapply(rep, 1:4, 4:1)
    [[1]]
    [1] 1 1 1 1

    [[2]]
    [1] 2 2 2

    [[3]]
    [1] 3 3

    [[4]]
    [1] 4

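Two details of mapply that the examples above do not show are fixed arguments and recycling. The sketch below uses small anonymous functions invented for illustration; MoreArgs is the documented way to pass arguments that should not be iterated over.

```r
# Element-wise over two vectors: 1 + 4, 2 + 5, 3 + 6
sums <- mapply(function(x, y) x + y, 1:3, 4:6)

# MoreArgs holds arguments that stay fixed for every call
squares <- mapply(function(x, p) x^p, 1:3, MoreArgs = list(p = 2))

# Shorter arguments are recycled, as in ordinary vectorised arithmetic
recycled <- mapply(function(x, y) x * y, 1:4, 10)

sums      # 5 7 9
squares   # 1 4 9
recycled  # 10 20 30 40
```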
Map - A wrapper to mapply with SIMPLIFY = FALSE, so it is guaranteed to return a list.

    Map(sum, 1:5, 1:5, 1:5)
    [[1]]
    [1] 3

    [[2]]
    [1] 6

    [[3]]
    [1] 9

    [[4]]
    [1] 12

    [[5]]
    [1] 15

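The wrapper relationship can be checked directly. A brief sketch comparing Map with the equivalent mapply call:

```r
by_map    <- Map(sum, 1:5, 1:5, 1:5)
by_mapply <- mapply(sum, 1:5, 1:5, 1:5, SIMPLIFY = FALSE)

# Both calls produce the same list
same <- identical(by_map, by_mapply)

# The result stays a list even when the pieces have different lengths
reps <- Map(rep, 1:3, 3:1)
lengths(reps)  # 3 2 1
```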
rapply - For when you want to apply a function to each element of a nested list structure, recursively.
To give you some idea of how uncommon rapply is, I forgot about it when first posting this answer! Obviously, I'm sure many people use it, but YMMV. rapply is best illustrated with a user-defined function to apply:

    # Append ! to string, otherwise increment
    myFun <- function(x){
        if(is.character(x)){
            return(paste(x, "!", sep = ""))
        } else {
            return(x + 1)
        }
    }

    # A nested list structure
    l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"),
              b = 3, c = "Yikes",
              d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))

    # Result is named vector, coerced to character
    rapply(l, myFun)

    # Result is a nested list like l, with values altered
    rapply(l, myFun, how = "replace")

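As an alternative to testing the type inside the function, rapply also has a classes argument that restricts which leaves the function sees. The sketch below reuses the nested list l from above; the multiply-by-ten function is just an illustration.

```r
l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"),
          b = 3, c = "Yikes",
          d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))

# Only leaves of class "numeric" are passed to the function;
# with how = "replace" every other leaf is kept as it was
scaled <- rapply(l, function(x) x * 10, classes = "numeric", how = "replace")

scaled$a$b1      # 20
scaled$a$a1      # "Boo"
scaled$d$b2$b3   # 50
```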
tapply - For when you want to apply a function to subsets of a vector and the subsets are defined by some other vector, usually a factor.
The black sheep of the *apply family, of sorts. The help file's use of the phrase "ragged array" can be a bit confusing, but it is actually quite simple.
A vector:

    x <- 1:20

A factor (of the same length!) defining groups:

    y <- factor(rep(letters[1:5], each = 4))

Add up the values in x within each subgroup defined by y:

    tapply(x, y, sum)
     a  b  c  d  e
    10 26 42 58 74

More complex examples can be handled where the subgroups are defined by the unique combinations of a list of several factors.
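For instance, passing a list of two grouping vectors gives one cell per combination of levels. This is a minimal sketch; the vectors g1 and g2 are hypothetical groupings made up for illustration.

```r
x  <- 1:8
g1 <- rep(c("a", "b"), each = 4)   # a a a a b b b b
g2 <- rep(c("u", "v"), times = 4)  # u v u v u v u v

# One cell per (g1, g2) combination; the result is a 2 x 2 matrix
totals <- tapply(x, list(g1, g2), sum)

totals["a", "u"]  # 1 + 3 = 4
totals["b", "v"]  # 6 + 8 = 14
```

The levels of each grouping vector become the row and column names of the result.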
tapply is similar in spirit to the split-apply-combine functions that are common in R (aggregate, by, ave, ddply, etc.). Hence its black sheep status.
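To relate this back to the by and aggregate part of the question, here is a rough side-by-side of the base split-apply-combine functions on the same data. The data frame df is hypothetical, and ddply is left out because it lives in the plyr package rather than base R.

```r
# Hypothetical data: four values in two groups
df <- data.frame(g = c("a", "a", "b", "b"), x = c(1, 2, 3, 4))

# tapply: named array with one entry per group (a = 3, b = 7)
by_tapply <- tapply(df$x, df$g, sum)

# aggregate: data frame with one row per group and columns g and x
by_agg <- aggregate(x ~ g, data = df, FUN = sum)

# ave: vector as long as x, each value replaced by its group result
by_ave <- ave(df$x, df$g, FUN = sum)

# by: applies the function to each group's sub-data-frame and
# returns an object of class "by" that prints group by group
by_by <- by(df, df$g, function(d) sum(d$x))
```

The computation is the same in all four cases; what differs is the shape of the result, which is usually the deciding factor when choosing between them.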