Working with unique values at scale (for loops, apply, or plyr)

Question

I'm not sure if this is possible, but if it is, it would make life oh so much more efficient.

The general problem that would be interesting to the wider SO community: for loops (and base functions like apply) are applicable for general/consistent operations, like adding X to every column or row of a data frame. I have a general/consistent operation I want to carry out, but with unique values for each element of the data frame.

Is there a way to do this more efficiently than subsetting my data frame for every grouping, applying the function with specific numbers relative to that grouping, then recombining? I don't care if it's a for loop or apply, but bonus points if it makes use of plyr functionality.

Here's the more specific problem I'm working on: I've got the data below. Ultimately what I want is a dataframe for time-series that has the date, and each column represents a region's relation to some benchmark.

The problem: the measure of interest for each region is different, and so is the benchmark. Here's the data:

library(dplyr)
library(reshape2)

data <- data.frame(
    region = sample(c("northeast","midwest","west"), 100, replace = TRUE),
    date = rep(seq(as.Date("2010-02-01"), length=10, by = "1 day"),10),
    population = sample(50000:100000, 10, replace = TRUE),
    skiers = sample(1:100),
    bearsfans = sample(1:100),
    dudes = sample(1:100)
)

and the summary frame that I'm working off:

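# Note: %.% was dplyr's original chaining operator and has since been removed; %>% is its replacement.
data2 <- data %>%
    group_by(date, region) %>%
    # total each measure per date and region
    summarise(skiers = sum(skiers), 
            bearsfans= sum(bearsfans), 
            dudes = sum(dudes), 
            population = sum(population)) %>%
    # people per member of each group
    mutate(ppl_per_skier = population/skiers,
            ppl_per_bearsfan = population/bearsfans,
            ppl_per_dude = population/dudes) %>%
    select(date, region, ppl_per_skier, ppl_per_bearsfan , ppl_per_dude)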

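Here's the tricky part:

  • For the northeast, I only care about ppl_per_skier, and the benchmark is 3500
  • For the midwest, I only care about ppl_per_bearsfan, and the benchmark is 1200
  • For the west, I only care about ppl_per_dude, and the benchmark is 5000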

Any of the ways I've come up with to tackle this problem involve creating subsets for each measure, but doing this at scale with hundreds of measures and different benchmarks is... not ideal. For example:

midwest <- data2 %>% 
            filter(region == "midwest") %>%
            select(date, region, ppl_per_bearsfan) %>%
            mutate(bmark = 1200, against_bmk = bmark/ppl_per_bearsfan-1) %>%
            select(date, against_bmk)

and likewise for each region, its respective measure, and its respective benchmark, then recombining them all together by date. Ultimately, I want something like this, where each region's performance against its specific benchmark and measure is laid out by date (this is fake data, of course):

        date midwest_againstbmk northeast_againstbmk west_againstbmk
1 2010-02-10          0.9617402            0.6008032       0.3403260
2 2010-02-11          0.5808621            0.5119942       0.7787559
3 2010-02-12          0.4828346            0.6560053       0.3747920
4 2010-02-13          0.6499841            0.7567194       0.8387461
5 2010-02-14          0.6367520            0.4564254       0.7269161

Is there a way to get to this sort of data and structure without having to do X number of subsets for each grouping, when I have unique measures and benchmark values for each group?

Recommended answer

Seems like an obvious use case for mapply:

> mapply(function(d,y,b) {(b/d[[y]])-1},   # d[[y]] pulls the column out as a plain vector, also when d is a tibble
         split(data2,data2$region), 
         c('ppl_per_bearsfan','ppl_per_skier','ppl_per_dude'), 
         c(1200,3500,5000))
          midwest   northeast      west
 [1,] -0.26625428 -0.02752186 3.5881957
 [2,]  0.48715638  1.89169295 2.6928546
 [3,] -0.94222992  1.26065537 4.0388343
 [4,] -0.38116663  0.79572184 1.4118364
 [5,] -0.05937874  2.05459482 1.8822015
 [6,] -0.41463925  1.60668461 1.5914408
 [7,] -0.31211391  1.21093777 2.7517886
 [8,] -0.88923466  0.44917981 1.2251965
 [9,] -0.02781965 -0.24637182 2.7143103
[10,] -0.46643682  1.28944776 0.6246315
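
Note that the call above relies on the measure and benchmark vectors being listed in the same alphabetical order that split() gives the regions (midwest, northeast, west). With hundreds of measures, one way to avoid that positional matching might be to keep the region, measure, and benchmark together in a small lookup table and let mapply walk its rows. The following is only a sketch under assumptions: the names spec, pieces, against, and result are made up for illustration, the data2 here is a tiny hand-made stand-in (not the data from the question), and it assumes every region has the same dates in the same order.

```r
# Toy stand-in for data2: one row per date and region (values are made up).
dates <- seq(as.Date("2010-02-01"), by = "1 day", length.out = 3)
data2 <- data.frame(
  date             = rep(dates, times = 3),
  region           = rep(c("northeast", "midwest", "west"), each = 3),
  ppl_per_skier    = c(1750, 3500, 7000, rep(1000, 6)),
  ppl_per_bearsfan = c(rep(1000, 3), 600, 1200, 2400, rep(1000, 3)),
  ppl_per_dude     = c(rep(1000, 6), 2500, 5000, 10000),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table: one row per region with its measure and benchmark.
spec <- data.frame(
  region  = c("northeast", "midwest", "west"),
  measure = c("ppl_per_skier", "ppl_per_bearsfan", "ppl_per_dude"),
  bmark   = c(3500, 1200, 5000),
  stringsAsFactors = FALSE
)

# Split once by region, then look each piece up by name rather than by position.
pieces <- split(data2, data2$region)

# One column per row of spec: benchmark / measure - 1, as in the answer above.
against <- mapply(
  function(r, m, b) b / pieces[[r]][[m]] - 1,
  spec$region, spec$measure, spec$bmark
)
colnames(against) <- paste0(colnames(against), "_againstbmk")

# Attach the dates (assumes all regions share the same dates, in the same order).
result <- data.frame(date = pieces[[1]]$date, against)
print(result)
```

Adding another region then only means adding a row to spec. If some region is missing a date, the columns no longer line up, and merging the per-region results on date would be the safer route.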
