Efficient functional programming (using mapply) in R for a "naturally" procedural problem


Problem description

A common use case in R (at least for me) is identifying observations in a data frame that have some characteristic that depends on the values in some subset of other observations.

To make this more concrete, suppose I have a number of workers (indexed by WorkerId), each with an associated "Iteration":

    raw <- data.frame(WorkerId  = c(1,1,1,1,2,2,2,2,3,3,3,3),
                      Iteration = c(1,2,3,4,1,2,3,4,1,2,3,4))

and I want to eventually subset the data frame to exclude the "last" iteration (by creating a "remove" boolean) for each worker. I can write a function to do this:

    raw$remove <- mapply(function(wid, iter) {
                             iter == max(raw$Iteration[raw$WorkerId == wid])},
                         raw$WorkerId, raw$Iteration)

    > raw$remove
     [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE

but this gets very slow as the data frame gets larger (presumably because I'm needlessly computing the max for every observation).

My question is: what is a more efficient (and idiomatic) way of doing this in the functional programming style? Is it to first create a WorkerId-to-max-value dictionary and then use that as a parameter in another function that operates on each observation?
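
A minimal base-R sketch of the lookup-table idea described above, not taken from the original post. It assumes the same `raw` data frame; the names `max_iter` and `row_max` are illustrative only. The maximum is computed once per worker with `tapply()` rather than once per row:

```r
raw <- data.frame(WorkerId  = c(1,1,1,1,2,2,2,2,3,3,3,3),
                  Iteration = c(1,2,3,4,1,2,3,4,1,2,3,4))

# One pass over the data: per-worker maxima, named by WorkerId ("1", "2", "3").
max_iter <- tapply(raw$Iteration, raw$WorkerId, max)

# Look up each row's worker by name; as.vector() strips the array attributes
# so the result is a plain numeric vector aligned with the rows of raw.
row_max <- as.vector(max_iter[as.character(raw$WorkerId)])

# A single vectorized comparison replaces the per-row mapply() call.
raw$remove <- raw$Iteration == row_max
```

The lookup by `as.character(raw$WorkerId)` relies on the names that `tapply()` attaches to its result, so it does not matter how the ids are numbered.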

Solution

The "most natural way" IMO is the split-lapply-rbind method. You start by split()-ting the data frame into a list of groups, then lapply() the processing rule to each group (in this case, removing the last row), and then rbind() the pieces back together. It's all doable as a nested set of function calls. The inner two steps are illustrated here and the final one-liner is presented at the bottom:

> lapply( split(raw, raw$WorkerId), function(x) x[-NROW(x),] )
$`1`
  WorkerId Iteration
1        1         1
2        1         2
3        1         3

$`2`
  WorkerId Iteration
5        2         1
6        2         2
7        2         3

$`3`
   WorkerId Iteration
9         3         1
10        3         2
11        3         3

do.call(rbind,  lapply( split(raw, raw$WorkerId), function(x) x[-NROW(x),] ) ) 
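
Two hedged variants of the one-liner above, as a sketch rather than part of the original answer. Removing row `NROW(x)` assumes each worker's rows are already sorted by Iteration; the helper `drop_last` (a name introduced here for illustration) drops the row holding the maximum instead. The second variant uses base R's `ave()` to build the "remove" flag from the question without splitting the data frame:

```r
raw <- data.frame(WorkerId  = c(1,1,1,1,2,2,2,2,3,3,3,3),
                  Iteration = c(1,2,3,4,1,2,3,4,1,2,3,4))

# Per-group rule that does not depend on row order within a worker:
# keep every row whose Iteration is not the group's maximum.
drop_last <- function(x) x[x$Iteration != max(x$Iteration), ]
trimmed <- do.call(rbind, lapply(split(raw, raw$WorkerId), drop_last))

# Vectorized flag: ave() applies max() within each WorkerId group and
# returns a vector aligned with the original rows.
remove <- raw$Iteration == ave(raw$Iteration, raw$WorkerId, FUN = max)
trimmed_ave <- raw[!remove, ]
```

Both results contain the same nine rows; `trimmed_ave` keeps the original row order of `raw`, whereas `trimmed` is ordered by group.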

Hadley Wickham has developed a wide set of tools, the plyr package, that extend this strategy to a wider variety of tasks.
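
For comparison, a sketch of the same operation with plyr's `ddply()`, which splits a data frame by a variable, applies a function to each piece, and combines the results into a data frame. This example is not from the original answer; it assumes plyr is installed, and the guard simply skips the call if it is not:

```r
raw <- data.frame(WorkerId  = c(1,1,1,1,2,2,2,2,3,3,3,3),
                  Iteration = c(1,2,3,4,1,2,3,4,1,2,3,4))

# Assumption: the plyr package is available; otherwise nothing is computed.
if (requireNamespace("plyr", quietly = TRUE)) {
  # Split by WorkerId, drop each group's last row, and recombine.
  trimmed_plyr <- plyr::ddply(raw, "WorkerId", function(x) x[-nrow(x), ])
}
```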
