Grouping functions (tapply, by, aggregate) and the *apply family
Problem description
Whenever I want to do something "map"py in R, I usually try to use a function in the `apply` family.
However, I've never quite understood the differences between them: how {`sapply`, `lapply`, etc.} apply the function to the input/grouped input, what the output will look like, or even what the input can be. So I often just go through them all until I get what I want.
Can someone explain how to use which one when?
My current (probably incorrect/incomplete) understanding is...
- `sapply(vec, f)`: input is a vector. Output is a vector/matrix, where element `i` is `f(vec[i])`, giving you a matrix if `f` has a multi-element output
- `lapply(vec, f)`: same as `sapply`, but output is a list?
- `apply(matrix, 1/2, f)`: input is a matrix. Output is a vector, where element `i` is f(row/col i of the matrix)
- `tapply(vector, grouping, f)`: output is a matrix/array, where an element in the matrix/array is the value of `f` at a grouping `g` of the vector, and `g` gets pushed to the row/col names
- `by(dataframe, grouping, f)`: let `g` be a grouping. Apply `f` to each column of the group/dataframe. Pretty print the grouping and the value of `f` at each column.
- `aggregate(matrix, grouping, f)`: similar to `by`, but instead of pretty printing the output, aggregate sticks everything into a dataframe.
Side question: I still haven't learned plyr or reshape. Would `plyr` or `reshape` replace all of these entirely?
R has many *apply functions which are ably described in the help files (e.g. `?apply`). There are enough of them, though, that beginning useRs may have difficulty deciding which one is appropriate for their situation or even remembering them all. They may have a general sense that "I should be using an *apply function here", but it can be tough to keep them all straight at first.
Despite the fact (noted in other answers) that much of the functionality of the *apply family is covered by the extremely popular `plyr` package, the base functions remain useful and worth knowing.
This answer is intended to act as a sort of signpost for new useRs to help direct them to the correct *apply function for their particular problem. Note, this is not intended to simply regurgitate or replace the R documentation! The hope is that this answer helps you to decide which *apply function suits your situation and then it is up to you to research it further. With one exception, performance differences will not be addressed.
apply - When you want to apply a function to the rows or columns of a matrix (and higher-dimensional analogues); not generally advisable for data frames as it will coerce to a matrix first.
    # Two dimensional matrix
    M <- matrix(seq(1,16), 4, 4)

    # apply min to rows
    apply(M, 1, min)
    [1] 1 2 3 4

    # apply max to columns
    apply(M, 2, max)
    [1]  4  8 12 16

    # 3 dimensional array
    M <- array( seq(32), dim = c(4,4,2))

    # Apply sum across each M[*, , ] - i.e. sum across 2nd and 3rd dimension
    apply(M, 1, sum)
    # Result is one-dimensional
    [1] 120 128 136 144

    # Apply sum across each M[*, *, ] - i.e. sum across 3rd dimension
    apply(M, c(1,2), sum)
    # Result is two-dimensional
         [,1] [,2] [,3] [,4]
    [1,]   18   26   34   42
    [2,]   20   28   36   44
    [3,]   22   30   38   46
    [4,]   24   32   40   48
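The warning about data frames can be made concrete with a small sketch. The data frame `df` and its column names are made up for illustration; the point is only that `apply` converts the data frame to a matrix first, so mixed column types collapse to a single type.

```r
# Hypothetical data frame with one numeric and one character column
df <- data.frame(id = 1:3, label = c("a", "b", "c"))

# apply() works on as.matrix(df); a matrix holds one type only,
# so the mixed columns become character
m <- as.matrix(df)
matrix_type <- typeof(m)

# Every row therefore reaches the function as a character vector
row_types <- apply(df, 1, function(row) class(row))

# Working column by column (a data frame is a list of columns) keeps the types
col_types <- sapply(df, class)
```

Here `row_types` reports `"character"` for every row, while `col_types` still shows that `id` is an integer column.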
If you want row/column means or sums for a 2D matrix, be sure to investigate the highly optimized, lightning-quick `colMeans`, `rowMeans`, `colSums`, `rowSums`.

lapply - When you want to apply a function to each element of a list in turn and get a list back.
This is the workhorse of many of the other *apply functions. Peel back their code and you will often find `lapply` underneath.

    x <- list(a = 1, b = 1:3, c = 10:100)
    lapply(x, FUN = length)
    $a
    [1] 1
    $b
    [1] 3
    $c
    [1] 91

    lapply(x, FUN = sum)
    $a
    [1] 1
    $b
    [1] 6
    $c
    [1] 5005
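Because a data frame is itself a list of columns, the same pattern applies to it directly. The following is a small sketch with a made-up data frame (`df`, `height`, `weight` are illustrative names only):

```r
# Hypothetical measurements; each column is one element of the list
df <- data.frame(height = c(1.5, 1.8, 1.7), weight = c(50, 80, 65))

# lapply() visits each column in turn and returns a named list
col_means <- lapply(df, mean)
```

The result is a list with one element per column, named after the columns.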
sapply - When you want to apply a function to each element of a list in turn, but you want a vector back, rather than a list.
If you find yourself typing `unlist(lapply(...))`, stop and consider `sapply`.

    x <- list(a = 1, b = 1:3, c = 10:100)
    # Compare with above; a named vector, not a list
    sapply(x, FUN = length)
    a  b  c
    1  3 91

    sapply(x, FUN = sum)
       a    b    c
       1    6 5005
In more advanced uses of `sapply` it will attempt to coerce the result to a multi-dimensional array, if appropriate. For example, if our function returns vectors of the same length, `sapply` will use them as columns of a matrix:

    sapply(1:5,function(x) rnorm(3,x))
If our function returns a 2 dimensional matrix, `sapply` will do essentially the same thing, treating each returned matrix as a single long vector:

    sapply(1:5,function(x) matrix(x,2,2))
Unless we specify `simplify = "array"`, in which case it will use the individual matrices to build a multi-dimensional array:

    sapply(1:5,function(x) matrix(x,2,2), simplify = "array")
Each of these behaviors is of course contingent on our function returning vectors or matrices of the same length or dimension.
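The shapes described above can be checked by inspecting the dimensions of each result. This is only a sketch; the variable names are mine, and the last call shows what happens when the lengths differ, in which case `sapply` cannot simplify and returns a list.

```r
# Equal-length vectors become the columns of a 3 x 5 matrix
m1 <- sapply(1:5, function(x) rnorm(3, x))

# Each 2 x 2 matrix is flattened into a column of length 4: a 4 x 5 matrix
m2 <- sapply(1:5, function(x) matrix(x, 2, 2))

# simplify = "array" keeps the matrices intact: a 2 x 2 x 5 array
a3 <- sapply(1:5, function(x) matrix(x, 2, 2), simplify = "array")

# Results of different lengths cannot be simplified, so a list comes back
ragged <- sapply(1:3, function(i) seq_len(i))

dim(m1)
dim(m2)
dim(a3)
is.list(ragged)
```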
vapply - When you want to use `sapply` but perhaps need to squeeze some more speed out of your code.

For `vapply`, you basically give R an example of what sort of thing your function will return, which can save some time coercing returned values to fit in a single atomic vector.

    x <- list(a = 1, b = 1:3, c = 10:100)
    # Note that since the advantage here is mainly speed, this
    # example is only for illustration. We're telling R that
    # everything returned by length() should be an integer of
    # length 1.
    vapply(x, FUN = length, FUN.VALUE = 0L)
    a  b  c
    1  3 91
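A side effect of supplying that template is that `vapply` also checks each result against it. The sketch below (variable names are illustrative) shows a call that matches the template, one that is rejected with an error because the function returns two values instead of one, and the corrected template.

```r
x <- list(a = 1, b = 1:3, c = 10:100)

# Template matches: every result is a single integer
lens <- vapply(x, FUN = length, FUN.VALUE = integer(1))

# range() returns two values, so a length-1 template raises an error
# rather than quietly changing the shape of the result
outcome <- tryCatch(
  vapply(x, FUN = range, FUN.VALUE = numeric(1)),
  error = function(e) "error"
)

# With a length-2 template the result is a 2 x 3 matrix, one column per element
rng <- vapply(x, FUN = range, FUN.VALUE = numeric(2))
```

This makes `vapply` a reasonable choice inside functions where the shape of the return value needs to be predictable.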
mapply - For when you have several data structures (e.g. vectors, lists) and you want to apply a function to the 1st elements of each, and then the 2nd elements of each, etc., coercing the result to a vector/array as in `sapply`.

This is multivariate in the sense that your function must accept multiple arguments.
    # Sums the 1st elements, the 2nd elements, etc.
    mapply(sum, 1:5, 1:5, 1:5)
    [1]  3  6  9 12 15

    # To do rep(1,4), rep(2,3), etc.
    mapply(rep, 1:4, 4:1)
    [[1]]
    [1] 1 1 1 1

    [[2]]
    [1] 2 2 2

    [[3]]
    [1] 3 3

    [[4]]
    [1] 4
Map - A wrapper to `mapply` with `SIMPLIFY = FALSE`, so it is guaranteed to return a list.

    Map(sum, 1:5, 1:5, 1:5)
    [[1]]
    [1] 3

    [[2]]
    [1] 6

    [[3]]
    [1] 9

    [[4]]
    [1] 12

    [[5]]
    [1] 15
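The relationship between the two functions can be verified directly, and `mapply` also accepts arguments that should stay fixed across calls through `MoreArgs`. A short sketch follows; the anonymous function and the `scale` argument are made up for illustration.

```r
# Map() should give the same list as mapply() with SIMPLIFY = FALSE
same <- identical(
  Map(sum, 1:5, 1:5, 1:5),
  mapply(sum, 1:5, 1:5, 1:5, SIMPLIFY = FALSE)
)

# x and y are walked in parallel; scale is passed unchanged to every call
scaled <- mapply(function(x, y, scale) (x + y) * scale,
                 1:3, 4:6,
                 MoreArgs = list(scale = 10))
```

`scaled` contains `(1 + 4) * 10`, `(2 + 5) * 10` and `(3 + 6) * 10`.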
rapply - For when you want to apply a function to each element of a nested list structure, recursively.
To give you some idea of how uncommon `rapply` is, I forgot about it when first posting this answer! Obviously, I'm sure many people use it, but YMMV. `rapply` is best illustrated with a user-defined function to apply:

    # Append ! to string, otherwise increment
    myFun <- function(x){
        if(is.character(x)){
          return(paste(x,"!",sep=""))
        }
        else{
          return(x + 1)
        }
    }

    # A nested list structure
    l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"),
              b = 3, c = "Yikes",
              d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))

    # Result is named vector, coerced to character
    rapply(l, myFun)

    # Result is a nested list like l, with values altered
    rapply(l, myFun, how="replace")
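As an alternative to testing the type inside the function, `rapply` has a `classes` argument that restricts which leaves the function is applied to. The sketch below reuses the nested list from the example above; the result names `incremented` and `nums` are mine.

```r
l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"),
          b = 3, c = "Yikes",
          d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))

# Only numeric leaves are incremented; with how = "replace" the
# character leaves are left exactly as they were
incremented <- rapply(l, function(x) x + 1,
                      classes = "numeric", how = "replace")

# With the default how = "unlist", leaves of other classes are dropped
nums <- rapply(l, function(x) x + 1, classes = "numeric")
```

`incremented` keeps the nested shape of `l`, while `nums` is a flat named numeric vector of the four incremented values.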
tapply - For when you want to apply a function to subsets of a vector and the subsets are defined by some other vector, usually a factor.
The black sheep of the *apply family, of sorts. The help file's use of the phrase "ragged array" can be a bit confusing, but it is actually quite simple.
A vector:

    x <- 1:20

A factor (of the same length!) defining groups:

    y <- factor(rep(letters[1:5], each = 4))

Add up the values in `x` within each subgroup defined by `y`:

    tapply(x, y, sum)
     a  b  c  d  e
    10 26 42 58 74
More complex examples can be handled where the subgroups are defined by the unique combinations of a list of several factors.
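A minimal sketch of the several-factors case, reusing `x` and `y` from above and adding a second, made-up factor `z` that alternates between two levels:

```r
x <- 1:20
y <- factor(rep(letters[1:5], each = 4))

# Hypothetical second grouping factor, alternating along x
z <- factor(rep(c("odd", "even"), times = 10))

# One cell per combination of levels: rows follow y, columns follow z
sums <- tapply(x, list(y, z), sum)
```

The result is a 5 x 2 matrix. For example, group `a` holds the values 1 to 4, so its `"odd"` cell is 1 + 3 and its `"even"` cell is 2 + 4.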
`tapply` is similar in spirit to the split-apply-combine functions that are common in R (`aggregate`, `by`, `ave`, `ddply`, etc.). Hence its black sheep status.
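For comparison, here is a rough sketch of the same grouped sum written with the base functions named above. The data frame `df` is made up from the earlier `x` and `y`. The main difference is the shape of what comes back: `aggregate` returns a data frame, `by` returns a printable "by" object after calling the function on each data-frame subset, and `ave` returns a vector as long as the input.

```r
df <- data.frame(x = 1:20,
                 y = factor(rep(letters[1:5], each = 4)))

# tapply: named vector, one entry per group
t_sum <- tapply(df$x, df$y, sum)

# aggregate: data frame with one row per group
agg <- aggregate(x ~ y, data = df, FUN = sum)

# by: the function receives each subset of rows as a data frame
by_sum <- by(df, df$y, function(d) sum(d$x))

# ave: group result repeated for every original row
ave_sum <- ave(df$x, df$y, FUN = sum)
```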