Working with unique values at scale (for loops, apply, or plyr)

Question

I'm not sure if this is possible, but if it is, it would make life oh so much more efficient.

The general problem that would be interesting to the wider SO community: for loops (and base functions like apply) are applicable for general/consistent operations, like adding X to every column or row of a data frame. I have a general/consistent operation I want to carry out, but with unique values for each element of the data frame.

Is there a way to do this more efficiently than subsetting my data frame for every grouping, applying the function with specific numbers relative to that grouping, then recombining? I don't care if it's a for loop or apply, but bonus points if it makes use of plyr functionality.

Here's the more specific problem I'm working on: I've got the data below. Ultimately what I want is a dataframe for time-series that has the date, and each column represents a region's relation to some benchmark.

The problem: the measure of interest for each region is different, and so is the benchmark. Here's the data:

library(dplyr)
library(reshape2)

data <- data.frame(
    region = sample(c("northeast","midwest","west"), 100, replace = TRUE),
    date = rep(seq(as.Date("2010-02-01"), length=10, by = "1 day"),10),
    population = sample(50000:100000, 10, replace = T),
    skiers = sample(1:100),
    bearsfans = sample(1:100),
    dudes = sample(1:100)
)

and the summary frame that I'm working off:

# Aggregate per date and region, then convert the counts to people-per-fan ratios
data2 <- data %>%
    group_by(date, region) %>%
    summarise(skiers = sum(skiers), 
            bearsfans= sum(bearsfans), 
            dudes = sum(dudes), 
            population = sum(population)) %>%
    mutate(ppl_per_skier = population/skiers,
            ppl_per_bearsfan = population/bearsfans,
            ppl_per_dude = population/dudes) %>%
    select(date, region, ppl_per_skier, ppl_per_bearsfan , ppl_per_dude)

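Here's the tricky part:

  • For the northeast, I only care about ppl_per_skier, and the benchmark is 3500

  • For the midwest, I only care about ppl_per_bearsfan, and the benchmark is 1200

  • For the west, I only care about ppl_per_dude, and the benchmark is 5000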

Any of the ways I've come up with to tackle this problem involve creating subsets for each measure, but doing this at scale with hundreds of measures and different benchmarks is... not ideal. For example:

# One region, one measure, one hard-coded benchmark
midwest <- data2 %>% 
            filter(region == "midwest") %>%
            select(date, region, ppl_per_bearsfan) %>%
            mutate(bmark = 1200, against_bmk = bmark/ppl_per_bearsfan-1) %>%
            select(date, against_bmk)

and likewise for each region, its respective measure, and its respective benchmark, then recombining them all together by date. Ultimately, I want something like this, where each region's performance against its specific benchmark and measure is laid out by date (this is fake data, of course):

        date midwest_againstbmk northeast_againstbmk west_againstbmk
1 2010-02-10          0.9617402            0.6008032       0.3403260
2 2010-02-11          0.5808621            0.5119942       0.7787559
3 2010-02-12          0.4828346            0.6560053       0.3747920
4 2010-02-13          0.6499841            0.7567194       0.8387461
5 2010-02-14          0.6367520            0.4564254       0.7269161

Is there a way to get to this sort of data and structure without having to do X number of subsets for each grouping, when I have unique measures and benchmark values for each group?

Recommended answer

Seems like an obvious use case for mapply:

> mapply(function(d,y,b) {(b/d[[y]])-1},   # d[[y]] pulls the column as a plain vector
         split(data2,data2$region),        # groups come back in alphabetical order
         c('ppl_per_bearsfan','ppl_per_skier','ppl_per_dude'), 
         c(1200,3500,5000))
          midwest   northeast      west
 [1,] -0.26625428 -0.02752186 3.5881957
 [2,]  0.48715638  1.89169295 2.6928546
 [3,] -0.94222992  1.26065537 4.0388343
 [4,] -0.38116663  0.79572184 1.4118364
 [5,] -0.05937874  2.05459482 1.8822015
 [6,] -0.41463925  1.60668461 1.5914408
 [7,] -0.31211391  1.21093777 2.7517886
 [8,] -0.88923466  0.44917981 1.2251965
 [9,] -0.02781965 -0.24637182 2.7143103
[10,] -0.46643682  1.28944776 0.6246315
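
The matrix above drops the date column, and it relies on every region having the same number of rows. If it helps, here is a rough base R sketch of a different technique (not the mapply approach itself): keep the region/measure/benchmark rules in a lookup table, pick each row's measure with matrix indexing, and reshape to one column per region. The names `benchmarks`, `idx`, `long` and `wide` are made up for illustration, and the toy `data2` below is a stand-in that assumes every region appears on every date.

```r
# Toy stand-in for data2 (assumption: every region appears on every date)
set.seed(1)
dates   <- seq(as.Date("2010-02-01"), by = "1 day", length.out = 5)
regions <- c("northeast", "midwest", "west")
data2 <- data.frame(date   = rep(dates, times = length(regions)),
                    region = rep(regions, each = length(dates)),
                    stringsAsFactors = FALSE)
n <- nrow(data2)
data2$ppl_per_skier    <- runif(n, 1000, 6000)
data2$ppl_per_bearsfan <- runif(n, 1000, 6000)
data2$ppl_per_dude     <- runif(n, 1000, 6000)

# Hypothetical lookup table: one row per region with its measure and benchmark
benchmarks <- data.frame(
    region  = c("northeast", "midwest", "west"),
    measure = c("ppl_per_skier", "ppl_per_bearsfan", "ppl_per_dude"),
    bmark   = c(3500, 1200, 5000),
    stringsAsFactors = FALSE)

# For each row of data2, the row of benchmarks that applies to its region
idx <- match(data2$region, benchmarks$region)

# Measure columns arranged in lookup-table order, so column idx[i] is the
# measure that row i cares about; a two-column index picks one cell per row
measures <- as.matrix(data2[benchmarks$measure])
value    <- measures[cbind(seq_len(n), idx)]

# Same formula as in the question: benchmark / measure - 1
long <- data.frame(date        = data2$date,
                   region      = data2$region,
                   against_bmk = benchmarks$bmark[idx] / value - 1,
                   stringsAsFactors = FALSE)

# One row per date, one column per region
wide <- reshape(long, idvar = "date", timevar = "region", direction = "wide")
names(wide) <- sub("^against_bmk\\.(.*)$", "\\1_againstbmk", names(wide))
print(wide)
```

Adding a region or changing a benchmark then only means editing a row of the lookup table. If a region is missing on some date, reshape() should leave an NA in that cell, whereas mapply returns a list instead of a matrix when the groups differ in length.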
