Grouping functions (tapply, by, aggregate) and the *apply family

Question

Whenever I want to do something "map"py in R, I usually try to use a function in the apply family.

However, I've never quite understood the differences between them -- how {sapply, lapply, etc.} apply the function to the input/grouped input, what the output will look like, or even what the input can be -- so I often just go through them all until I get what I want.

Can someone explain when to use which one?

My current (probably incorrect/incomplete) understanding is...

  1. sapply(vec, f): input is a vector. output is a vector/matrix, where element i is f(vec[i]), giving you a matrix if f has a multi-element output

  2. lapply(vec, f): same as sapply, but output is a list?

  3. apply(matrix, 1/2, f): input is a matrix. output is a vector, where element i is f(row/col i of the matrix)
  4. tapply(vector, grouping, f): output is a matrix/array, where an element in the matrix/array is the value of f at a grouping g of the vector, and g gets pushed to the row/col names
  5. by(dataframe, grouping, f): let g be a grouping. apply f to each column of the group/dataframe. pretty print the grouping and the value of f at each column.
  6. aggregate(matrix, grouping, f): similar to by, but instead of pretty printing the output, aggregate sticks everything into a dataframe.

Side question: I still haven't learned plyr or reshape -- would plyr or reshape replace all of these entirely?

Solution

R has many *apply functions which are ably described in the help files (e.g. ?apply). There are enough of them, though, that beginning useRs may have difficulty deciding which one is appropriate for their situation or even remembering them all. They may have a general sense that "I should be using an *apply function here", but it can be tough to keep them all straight at first.

Despite the fact (noted in other answers) that much of the functionality of the *apply family is covered by the extremely popular plyr package, the base functions remain useful and worth knowing.

This answer is intended to act as a sort of signpost for new useRs to help direct them to the correct *apply function for their particular problem. Note, this is not intended to simply regurgitate or replace the R documentation! The hope is that this answer helps you to decide which *apply function suits your situation and then it is up to you to research it further. With one exception, performance differences will not be addressed.

  • apply - When you want to apply a function to the rows or columns of a matrix (and higher-dimensional analogues); not generally advisable for data frames as it will coerce to a matrix first.

    # Two dimensional matrix
    M <- matrix(seq(1,16), 4, 4)
    
    # apply min to rows
    apply(M, 1, min)
    [1] 1 2 3 4
    
    # apply max to columns
    apply(M, 2, max)
    [1]  4  8 12 16
    
    # 3 dimensional array
    M <- array( seq(32), dim = c(4,4,2))
    
    # Apply sum across each M[*, , ] - i.e. sum across the 2nd and 3rd dimensions
    apply(M, 1, sum)
    # Result is one-dimensional
    [1] 120 128 136 144
    
    # Apply sum across each M[*, *, ] - i.e. sum across the 3rd dimension
    apply(M, c(1,2), sum)
    # Result is two-dimensional
         [,1] [,2] [,3] [,4]
    [1,]   18   26   34   42
    [2,]   20   28   36   44
    [3,]   22   30   38   46
    [4,]   24   32   40   48
    

    If you want row/column means or sums for a 2D matrix, be sure to investigate the highly optimized, lightning-quick colMeans, rowMeans, colSums, rowSums.
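As a rough illustration of the two points above (the specialised row/column helpers and the matrix coercion of data frames), here is a minimal sketch. The data frame `df` and its column names are made up for the example and are not part of the original answer.

```r
M <- matrix(seq(1, 16), 4, 4)

# rowSums()/colMeans() give the same numbers as the apply() equivalents
same_row_sums  <- isTRUE(all.equal(rowSums(M), apply(M, 1, sum)))
same_col_means <- isTRUE(all.equal(colMeans(M), apply(M, 2, mean)))

# A data frame with mixed column types is coerced to a character matrix
# before the function is applied, so every row arrives as character
df <- data.frame(id = 1:3, label = c("a", "b", "c"))  # hypothetical data
row_classes <- apply(df, 1, class)
all(row_classes == "character")
```

The last line shows why apply is usually a poor fit for data frames: the numeric `id` column no longer reaches the function as a number.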

  • lapply - When you want to apply a function to each element of a list in turn and get a list back.

    This is the workhorse of many of the other *apply functions. Peel back their code and you will often find lapply underneath.

    x <- list(a = 1, b = 1:3, c = 10:100) 
    lapply(x, FUN = length) 
    $a 
    [1] 1
    $b 
    [1] 3
    $c 
    [1] 91
    lapply(x, FUN = sum) 
    $a 
    [1] 1
    $b 
    [1] 6
    $c 
    [1] 5005
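Two further patterns that tend to come up with lapply are sketched below, under the assumption that the same list `x` is in scope: passing extra arguments through to the function, and using an anonymous function. The names `q` and `rng` are only illustrative.

```r
x <- list(a = 1, b = 1:3, c = 10:100)

# Arguments after the function are forwarded to every call
q <- lapply(x, quantile, probs = 0.5)

# An anonymous function works too; the result is still a list
rng <- lapply(x, function(v) max(v) - min(v))

is.list(q)
rng$c
```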
    

  • sapply - When you want to apply a function to each element of a list in turn, but you want a vector back, rather than a list.

    If you find yourself typing unlist(lapply(...)), stop and consider sapply.

    x <- list(a = 1, b = 1:3, c = 10:100)
    # Compare with above; a named vector, not a list 
    sapply(x, FUN = length)  
    a  b  c   
    1  3 91
    
    sapply(x, FUN = sum)   
    a    b    c    
    1    6 5005 
    

    In more advanced uses of sapply it will attempt to coerce the result to a multi-dimensional array, if appropriate. For example, if our function returns vectors of the same length, sapply will use them as columns of a matrix:

    sapply(1:5,function(x) rnorm(3,x))
    

    If our function returns a 2 dimensional matrix, sapply will do essentially the same thing, treating each returned matrix as a single long vector:

    sapply(1:5,function(x) matrix(x,2,2))
    

    Unless we specify simplify = "array", in which case it will use the individual matrices to build a multi-dimensional array:

    sapply(1:5,function(x) matrix(x,2,2), simplify = "array")
    

    Each of these behaviors is of course contingent on our function returning vectors or matrices of the same length or dimension.
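To make the shapes concrete, the sketch below checks the dimensions of the three results above and adds one case the examples do not show: when the returned vectors differ in length, sapply cannot simplify and hands back a list. The variable names are illustrative only.

```r
m1 <- sapply(1:5, function(x) rnorm(3, x))
dim(m1)   # 3 rows, 5 columns: one column per input element

m2 <- sapply(1:5, function(x) matrix(x, 2, 2))
dim(m2)   # 4 rows, 5 columns: each 2 x 2 matrix flattened to length 4

a3 <- sapply(1:5, function(x) matrix(x, 2, 2), simplify = "array")
dim(a3)   # 2 x 2 x 5 array

# Results of differing lengths cannot be simplified, so a list comes back
r <- sapply(1:3, function(x) seq_len(x))
is.list(r)
```

This silent change of return type is one reason to reach for vapply when the output shape matters.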

  • vapply - When you want to use sapply but perhaps need to squeeze some more speed out of your code.

    For vapply, you basically give R an example of what sort of thing your function will return, which can save some time coercing returned values to fit in a single atomic vector. R also raises an error if a returned value does not match that example in type or length.

    x <- list(a = 1, b = 1:3, c = 10:100)
    #Note that since the advantage here is mainly speed, this
    # example is only for illustration. We're telling R that
    # everything returned by length() should be an integer of 
    # length 1. 
    vapply(x, FUN = length, FUN.VALUE = 0L) 
    a  b  c  
    1  3 91
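Beyond speed, the FUN.VALUE template also acts as a check on what the function returns. The following is a small sketch of that behaviour, not part of the original answer; `res`, `empty_v` and `empty_s` are illustrative names.

```r
x <- list(a = 1, b = 1:3, c = 10:100)

# Template: one double per element; mean() satisfies it
means <- vapply(x, FUN = mean, FUN.VALUE = numeric(1))

# range() returns two values, which violates the length-1 template,
# so vapply signals an error instead of quietly changing shape
res <- tryCatch(
  vapply(x, FUN = range, FUN.VALUE = numeric(1)),
  error = function(e) "error"
)

# With empty input, vapply still returns the declared type,
# whereas sapply returns an empty list
empty_v <- vapply(list(), FUN = length, FUN.VALUE = integer(1))
empty_s <- sapply(list(), FUN = length)
```

The predictable return type is often the more practical reason to prefer vapply inside functions and packages.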
    

  • mapply - For when you have several data structures (e.g. vectors, lists) and you want to apply a function to the 1st elements of each, and then the 2nd elements of each, etc., coercing the result to a vector/array as in sapply.

    This is multivariate in the sense that your function must accept multiple arguments.

    #Sums the 1st elements, the 2nd elements, etc. 
    mapply(sum, 1:5, 1:5, 1:5) 
    [1]  3  6  9 12 15
    #To do rep(1,4), rep(2,3), etc.
    mapply(rep, 1:4, 4:1)   
    [[1]]
    [1] 1 1 1 1
    
    [[2]]
    [1] 2 2 2
    
    [[3]]
    [1] 3 3
    
    [[4]]
    [1] 4
    

  • Map - A wrapper to mapply with SIMPLIFY = FALSE, so it is guaranteed to return a list.

    Map(sum, 1:5, 1:5, 1:5)
    [[1]]
    [1] 3
    
    [[2]]
    [1] 6
    
    [[3]]
    [1] 9
    
    [[4]]
    [1] 12
    
    [[5]]
    [1] 15
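The relationship between Map and mapply can be checked directly. The sketch below also shows MoreArgs, which holds arguments that stay fixed across all calls; the values used are arbitrary examples.

```r
# Map() gives the same result as mapply() with SIMPLIFY = FALSE
same <- identical(Map(sum, 1:5, 1:5, 1:5),
                  mapply(sum, 1:5, 1:5, 1:5, SIMPLIFY = FALSE))

# MoreArgs supplies arguments that are not vectorised over:
# here x = "z" is reused while times varies
reps <- mapply(rep, times = 1:3, MoreArgs = list(x = "z"))
lengths(reps)
```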
    

  • rapply - For when you want to apply a function to each element of a nested list structure, recursively.

    To give you some idea of how uncommon rapply is, I forgot about it when first posting this answer! Obviously, I'm sure many people use it, but YMMV. rapply is best illustrated with a user-defined function to apply:

    # Append ! to string, otherwise increment
    myFun <- function(x){
        if(is.character(x)){
          return(paste(x,"!",sep=""))
        }
        else{
          return(x + 1)
        }
    }
    
    #A nested list structure
    l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"), 
              b = 3, c = "Yikes", 
              d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))
    
    
    # Result is named vector, coerced to character          
    rapply(l, myFun)
    
    # Result is a nested list like l, with values altered
    rapply(l, myFun, how="replace")
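rapply can also restrict the function to leaves of a given class through its classes argument, which avoids the type test inside myFun. This is a minimal sketch reusing the nested list from above; `bumped` and `nums` are illustrative names.

```r
l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"),
          b = 3, c = "Yikes",
          d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))

# Only numeric leaves are passed to the function; with how = "replace"
# the remaining leaves are left untouched
bumped <- rapply(l, function(x) x + 1, classes = "numeric", how = "replace")
bumped$d$b2$b3
bumped$a$a1

# With how = "unlist", leaves that do not match are dropped,
# because the default value for them (deflt) is NULL
nums <- rapply(l, function(x) x + 1, classes = "numeric", how = "unlist")
```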
    

  • tapply - For when you want to apply a function to subsets of a vector and the subsets are defined by some other vector, usually a factor.

    The black sheep of the *apply family, of sorts. The help file's use of the phrase "ragged array" can be a bit confusing, but it is actually quite simple.

    A vector:

    x <- 1:20
    

    A factor (of the same length!) defining groups:

    y <- factor(rep(letters[1:5], each = 4))
    

    Add up the values in x within each subgroup defined by y:

    tapply(x, y, sum)  
     a  b  c  d  e  
    10 26 42 58 74 
    

    More complex examples can be handled where the subgroups are defined by the unique combinations of a list of several factors. tapply is similar in spirit to the split-apply-combine functions that are common in R (aggregate, by, ave, ddply, etc.). Hence its black sheep status.
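As a rough sketch of the multi-factor case and of how tapply relates to some of its split-apply-combine relatives, the example below adds a second, made-up grouping factor `z`. It is an illustration under those assumptions rather than a full comparison.

```r
x <- 1:20
y <- factor(rep(letters[1:5], each = 4))
z <- factor(rep(c("odd", "even"), times = 10))  # hypothetical second factor

# Two grouping factors give a 5 x 2 table of sums (levels of y by levels of z)
tab <- tapply(x, list(y, z), sum)
dim(tab)

# The single-factor sums again, written as an explicit split + sapply
s <- sapply(split(x, y), sum)

# aggregate() returns the group sums as a data frame;
# ave() returns a vector as long as x, with each value replaced by its group sum
agg <- aggregate(x, by = list(group = y), FUN = sum)
grp_sum <- ave(x, y, FUN = sum)
```

The main practical difference is the shape of the result: an array from tapply, a data frame from aggregate, and a vector aligned with the input from ave.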
