Grouping functions (tapply, by, aggregate) and the *apply family


Problem description


Whenever I want to do something "map"py in R, I usually try to use a function in the apply family.

However, I've never quite understood the differences between them -- how {sapply, lapply, etc.} apply the function to the input/grouped input, what the output will look like, or even what the input can be -- so I often just go through them all until I get what I want.

Can someone explain how to use which one when?

My current (probably incorrect/incomplete) understanding is...

  1. sapply(vec, f): input is a vector. output is a vector/matrix, where element i is f(vec[i]), giving you a matrix if f has a multi-element output

  2. lapply(vec, f): same as sapply, but output is a list?

  3. apply(matrix, 1/2, f): input is a matrix. output is a vector, where element i is f(row/col i of the matrix)
  4. tapply(vector, grouping, f): output is a matrix/array, where an element in the matrix/array is the value of f at a grouping g of the vector, and g gets pushed to the row/col names
  5. by(dataframe, grouping, f): let g be a grouping. apply f to each column of the group/dataframe. pretty print the grouping and the value of f at each column.
  6. aggregate(matrix, grouping, f): similar to by, but instead of pretty printing the output, aggregate sticks everything into a dataframe.

Side question: I still haven't learned plyr or reshape -- would plyr or reshape replace all of these entirely?

Solution

R has many *apply functions which are ably described in the help files (e.g. ?apply). There are enough of them, though, that beginning useRs may have difficulty deciding which one is appropriate for their situation or even remembering them all. They may have a general sense that "I should be using an *apply function here", but it can be tough to keep them all straight at first.

Despite the fact (noted in other answers) that much of the functionality of the *apply family is covered by the extremely popular plyr package, the base functions remain useful and worth knowing.

This answer is intended to act as a sort of signpost for new useRs to help direct them to the correct *apply function for their particular problem. Note, this is not intended to simply regurgitate or replace the R documentation! The hope is that this answer helps you to decide which *apply function suits your situation and then it is up to you to research it further. With one exception, performance differences will not be addressed.

  • apply - When you want to apply a function to the rows or columns of a matrix (and higher-dimensional analogues); not generally advisable for data frames as it will coerce to a matrix first.

     # Two dimensional matrix
     M <- matrix(seq(1,16), 4, 4)
    
     # apply min to rows
     apply(M, 1, min)
     [1] 1 2 3 4
    
     # apply max to columns
     apply(M, 2, max)
     [1]  4  8 12 16
    
     # 3 dimensional array
     M <- array( seq(32), dim = c(4,4,2))
    
     # Apply sum across each M[*, , ] - i.e. sum across the 2nd and 3rd dimensions
     apply(M, 1, sum)
     # Result is one-dimensional
     [1] 120 128 136 144
    
     # Apply sum across each M[*, *, ] - i.e. sum across the 3rd dimension
     apply(M, c(1,2), sum)
     # Result is two-dimensional
          [,1] [,2] [,3] [,4]
     [1,]   18   26   34   42
     [2,]   20   28   36   44
     [3,]   22   30   38   46
     [4,]   24   32   40   48
    

    If you want row/column means or sums for a 2D matrix, be sure to investigate the highly optimized, lightning-quick colMeans, rowMeans, colSums, rowSums.
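
As a rough illustration of the two points above, the sketch below compares apply with the dedicated row/column helpers and shows what happens when apply is handed a data frame with mixed column types. The data frame `df` and its column names are made up for the example.

```r
M <- matrix(seq(1, 16), 4, 4)

# The dedicated helpers give the same numbers as the apply() equivalents
row_totals_apply <- apply(M, 1, sum)
row_totals_fast  <- rowSums(M)    # 28 32 36 40
col_means_apply  <- apply(M, 2, mean)
col_means_fast   <- colMeans(M)   # 2.5 6.5 10.5 14.5

all.equal(row_totals_apply, row_totals_fast)
all.equal(col_means_apply, col_means_fast)

# A data frame is coerced with as.matrix() first, so mixed column types
# all become character before the function ever sees a row
df <- data.frame(id = 1:2, label = c("a", "b"))   # hypothetical data
row_classes <- apply(df, 1, function(row) class(row))
row_classes
```

Every row arrives as a character vector here, which is usually not what you want when some of the columns are numeric.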

  • lapply - When you want to apply a function to each element of a list in turn and get a list back.

    This is the workhorse of many of the other *apply functions. Peel back their code and you will often find lapply underneath.

     x <- list(a = 1, b = 1:3, c = 10:100) 
     lapply(x, FUN = length) 
     $a 
     [1] 1
     $b 
     [1] 3
     $c 
     [1] 91
     lapply(x, FUN = sum) 
     $a 
     [1] 1
     $b 
     [1] 6
     $c 
     [1] 5005
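
The input does not have to be a list. A minimal sketch, assuming a small made-up data frame `df`: an atomic vector is treated element by element, and a data frame is treated as a list of its columns. Either way a list comes back.

```r
# An atomic vector goes in, a list comes out
squares <- lapply(1:3, function(i) i^2)
str(squares)

# A data frame is a list of columns, so lapply() visits each column
df <- data.frame(height = c(1.6, 1.8, 1.7), weight = c(60, 80, 70))  # hypothetical data
col_means <- lapply(df, mean)
str(col_means)
```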
    

  • sapply - When you want to apply a function to each element of a list in turn, but you want a vector back, rather than a list.

    If you find yourself typing unlist(lapply(...)), stop and consider sapply.

     x <- list(a = 1, b = 1:3, c = 10:100)
     # Compare with above; a named vector, not a list 
     sapply(x, FUN = length)  
     a  b  c   
     1  3 91
    
     sapply(x, FUN = sum)   
     a    b    c    
     1    6 5005 
    

    In more advanced uses of sapply it will attempt to coerce the result to a multi-dimensional array, if appropriate. For example, if our function returns vectors of the same length, sapply will use them as columns of a matrix:

     sapply(1:5,function(x) rnorm(3,x))
    

    If our function returns a 2 dimensional matrix, sapply will do essentially the same thing, treating each returned matrix as a single long vector:

     sapply(1:5,function(x) matrix(x,2,2))
    

    Unless we specify simplify = "array", in which case it will use the individual matrices to build a multi-dimensional array:

     sapply(1:5,function(x) matrix(x,2,2), simplify = "array")
    

    Each of these behaviors is of course contingent on our function returning vectors or matrices of the same length or dimension.
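
To make that contingency concrete, here is a small sketch with deterministic functions in place of rnorm. When the lengths differ, sapply quietly gives up on simplifying and returns a list.

```r
# Same-length results: simplified to a matrix, one column per input element
same_len <- sapply(1:3, function(i) c(i, i^2))
dim(same_len)      # 2 3

# Results of different lengths: no simplification, a plain list comes back
ragged <- sapply(1:3, function(i) seq_len(i))
class(ragged)      # "list"

# Same-dimension matrices with simplify = "array": stacked into a 3-d array
arr <- sapply(1:3, function(i) matrix(i, 2, 2), simplify = "array")
dim(arr)           # 2 2 3
```

Because the return type depends on the data, it is worth checking the result (or using vapply) when sapply sits inside a function.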

  • vapply - When you want to use sapply but perhaps need to squeeze some more speed out of your code or want more type safety.

    For vapply, you basically give R an example of what sort of thing your function will return, which can save some time coercing returned values to fit in a single atomic vector.

     x <- list(a = 1, b = 1:3, c = 10:100)
     # Note that since the advantage here is mainly speed, this
     # example is only for illustration. We're telling R that
     # everything returned by length() should be an integer of 
     # length 1. 
     vapply(x, FUN = length, FUN.VALUE = 0L) 
     a  b  c  
     1  3 91
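
The type-safety side can be sketched as follows. The `tryCatch` wrapper and the message string are only there so the example runs to completion. In normal use you would let the error surface.

```r
x <- list(a = 1, b = 1:3, c = 10:100)

# FUN.VALUE is a template: one integer per element is expected
lens <- vapply(x, length, FUN.VALUE = integer(1))

# If the function returns something that does not match the template,
# vapply() stops with an error instead of silently changing the result type
wrong_type <- tryCatch(
  vapply(x, length, FUN.VALUE = character(1)),
  error = function(e) "vapply refused the result"
)

# On empty input sapply() returns an empty list, vapply() keeps the type
empty_sapply <- sapply(list(), length)               # list()
empty_vapply <- vapply(list(), length, integer(1))   # integer(0)
```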
    

  • mapply - For when you have several data structures (e.g. vectors, lists) and you want to apply a function to the 1st elements of each, and then the 2nd elements of each, etc., coercing the result to a vector/array as in sapply.

    This is multivariate in the sense that your function must accept multiple arguments.

     #Sums the 1st elements, the 2nd elements, etc. 
     mapply(sum, 1:5, 1:5, 1:5) 
     [1]  3  6  9 12 15
     #To do rep(1,4), rep(2,3), etc.
     mapply(rep, 1:4, 4:1)   
     [[1]]
     [1] 1 1 1 1
    
     [[2]]
     [1] 2 2 2
    
     [[3]]
     [1] 3 3
    
     [[4]]
     [1] 4
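
A few more behaviours are worth a quick sketch: names come from the first argument, shorter arguments are recycled, equal-length results are simplified to a matrix, and `MoreArgs` carries arguments that should be the same in every call. The anonymous functions are illustrative only.

```r
# Names are taken from the first argument; the scalar 10 is recycled
named_sums <- mapply(function(x, y) x + y, c(a = 1, b = 2, c = 3), 10)
named_sums

# Equal-length results are simplified to a matrix, one column per call
pairs <- mapply(function(x, y) c(x, y), 1:3, 4:6)
dim(pairs)   # 2 3

# MoreArgs holds arguments passed unchanged to every call:
# this does rep(42, 1), rep(42, 2), rep(42, 3)
reps <- mapply(rep, times = 1:3, MoreArgs = list(x = 42))
str(reps)
```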
    

  • Map - A wrapper to mapply with SIMPLIFY = FALSE, so it is guaranteed to return a list.

     Map(sum, 1:5, 1:5, 1:5)
     [[1]]
     [1] 3
    
     [[2]]
     [1] 6
    
     [[3]]
     [1] 9
    
     [[4]]
     [1] 12
    
     [[5]]
     [1] 15
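
A short sketch of the relationship described above, plus a case where the list return is what you want because the results have different lengths:

```r
# Map() and mapply(..., SIMPLIFY = FALSE) give the same list
via_map    <- Map(sum, 1:5, 1:5, 1:5)
via_mapply <- mapply(sum, 1:5, 1:5, 1:5, SIMPLIFY = FALSE)
identical(via_map, via_mapply)   # TRUE

# Results of different lengths fit naturally in a list
ragged_map <- Map(function(n, ch) rep(ch, n), 1:3, c("a", "b", "c"))
str(ragged_map)
```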
    

  • rapply - For when you want to apply a function to each element of a nested list structure, recursively.

    To give you some idea of how uncommon rapply is, I forgot about it when first posting this answer! Obviously, I'm sure many people use it, but YMMV. rapply is best illustrated with a user-defined function to apply:

     # Append ! to string, otherwise increment
     myFun <- function(x){
         if(is.character(x)){
           return(paste(x,"!",sep=""))
         }
         else{
           return(x + 1)
         }
     }
    
     # A nested list structure
     l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"), 
               b = 3, c = "Yikes", 
               d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))
    
    
     # Result is named vector, coerced to character          
     rapply(l, myFun)
    
     # Result is a nested list like l, with values altered
     rapply(l, myFun, how="replace")
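
For reference, here is a compact, self-contained version of the same example that also inspects the results. It additionally uses the `classes` argument of rapply to touch only the numeric leaves. That part is an extra illustration, not something the original answer covers.

```r
myFun <- function(x) if (is.character(x)) paste0(x, "!") else x + 1

l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"),
          b = 3, c = "Yikes",
          d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))

# Default how = "unlist": a flat, named character vector
flat <- rapply(l, myFun)
flat[c("a.a1", "a.b1", "d.b2.b3")]   # "Boo!" "3" "6"

# classes = "numeric" applies the function only to numeric leaves;
# with how = "replace" the other leaves are left as they were
scaled <- rapply(l, function(x) x * 10, classes = "numeric", how = "replace")
str(scaled)
```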
    

  • tapply - For when you want to apply a function to subsets of a vector and the subsets are defined by some other vector, usually a factor.

    The black sheep of the *apply family, of sorts. The help file's use of the phrase "ragged array" can be a bit confusing, but it is actually quite simple.

    A vector:

     x <- 1:20
    

    A factor (of the same length!) defining groups:

     y <- factor(rep(letters[1:5], each = 4))
    

    Add up the values in x within each subgroup defined by y:

     tapply(x, y, sum)  
      a  b  c  d  e  
     10 26 42 58 74 
    

    More complex examples can be handled where the subgroups are defined by the unique combinations of a list of several factors. tapply is similar in spirit to the split-apply-combine functions that are common in R (aggregate, by, ave, ddply, etc.), hence its black sheep status.
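
Since the question also asks about by and aggregate, here is a minimal sketch that runs the same grouped sum through several base split-apply-combine functions. The second factor `z` and the data frame `df` are invented for the example. Note that by passes each group's sub-data-frame to the function rather than applying it column by column.

```r
x <- 1:20
y <- factor(rep(letters[1:5], each = 4))
z <- factor(rep(c("odd", "even"), times = 10))   # hypothetical second factor
df <- data.frame(x = x, y = y)

# tapply with a list of factors: one cell per combination of levels
two_way <- tapply(x, list(y, z), sum)
two_way["a", ]           # even = 6, odd = 4

# aggregate: the same group sums, returned as a data frame
agg <- aggregate(x ~ y, data = df, FUN = sum)
agg$x                    # 10 26 42 58 74

# by: the function receives each group's sub-data-frame
by_res <- by(df, df$y, function(g) sum(g$x))

# ave: group sums repeated so the result lines up with x
group_sum <- ave(df$x, df$y, FUN = sum)
```

Roughly speaking, the choice between them comes down to the shape of output you need: an array (tapply), a data frame (aggregate), a printable per-group object (by), or a vector aligned with the input (ave).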

