Error when calculating values greater than 95% quantile using plyr


Problem description

My data is structured as follows:

Individ <- data.frame(Participant = c("Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", 
                                      "Harry", "Harry", "Harry", "Harry","Harry", "Harry", "Harry", "Harry", "Paul", "Paul", "Paul", "Paul"),
                      Time = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
                      Condition = c("Placebo", "Placebo", "Placebo", "Placebo", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", 
                                    "Placebo", "Placebo", "Placebo", "Placebo", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr"),
                      Power = c(400, 250, 180, 500, 300, 450, 600, 512, 300, 500, 450, 200, 402, 210, 130, 520, 310, 451, 608, 582, 390, 570, NA, NA))

Using dplyr (with rollapply from zoo), I apply a rolling average over windows of 2 to 4 seconds via the following code:

for (summaryFunction in c("mean")) {
  for ( i in seq(2, 4, by = 1)) {
    tempColumn <- Individ %>%
      group_by(Participant) %>%
      transmute(rollapply(Power,
                          width = i, 
                          FUN = summaryFunction, 
                          align = "right", 
                          fill = NA, 
                          na.rm = T))
    colnames(tempColumn)[2] <- paste("Rolling", summaryFunction, as.character(i), sep = ".")
    Individ <- bind_cols(Individ, tempColumn[2])
  }
}

I now wish to calculate the top 5% of Power for each Participant across each of the rolling averages. To compute this, I use:

Output = ddply(Individ, .(Participant, Condition), summarise,
           TwoSec <- Rolling.mean.2 > quantile(Rolling.mean.2 , 0.95, na.rm = TRUE))

However, I end up with a column that states TRUE or FALSE. Instead, I am after the actual values that are in the top 5%. How do I do this? Is there also an easier way to iterate over each rolling mean column, by participant and condition, to find the top 5% in each?

Thanks!

Recommended answer

It's good that you already have your rolling data table; that makes calculating the quantiles a lot easier.

Step 1: Group by Participant, Condition and Location

Individ <- data.frame(Participant = c("Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", 
                                      "Harry", "Harry", "Harry", "Harry","Harry", "Harry", "Harry", "Harry", "Paul", "Paul", "Paul", "Paul"),
                      Time = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
                      Condition = c("Placebo", "Placebo", "Placebo", "Placebo", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", 
                                    "Placebo", "Placebo", "Placebo", "Placebo", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr"),
                      Location = c("Home", "Home", "Home", "Home", "Away", "Away", "Away", "Away", "Home", "Home", "Home", "Home", 
                                   "Home", "Home", "Home", "Home", "Away", "Away", "Away", "Away", "Home", "Home", "Home", "Home"),
                      Power = c(400, 250, 180, 500, 300, 450, 600, 512, 300, 500, 450, 200, 402, 210, 130, 520, 310, 451, 608, 582, 390, 570, NA, NA))


library(dplyr)
library(zoo)
for (summaryFunction in c("mean")) {
  for ( i in seq(2, 4, by = 1)) {
    tempColumn <- Individ %>%
      group_by(Participant) %>%
      transmute(rollapply(Power,
                          width = i, 
                          FUN = summaryFunction, 
                          align = "right", 
                          fill = NA, 
                          na.rm = T))
    colnames(tempColumn)[2] <- paste("Rolling", summaryFunction, as.character(i), sep = ".")
    Individ <- bind_cols(Individ, tempColumn[2])
  }
}


Individ


     Participant  Time Condition Location Power Rolling.mean.2 Rolling.mean.3 Rolling.mean.4
        (fctr) (dbl)    (fctr)   (fctr) (dbl)          (dbl)          (dbl)          (dbl)
1         Bill     1   Placebo     Home   400             NA             NA             NA
2         Bill     2   Placebo     Home   250            325             NA             NA
3         Bill     3   Placebo     Home   180            215       276.6667             NA
4         Bill     4   Placebo     Home   500            340       310.0000          332.5
5         Bill     1      Expr     Away   300            400       326.6667          307.5
6         Bill     2      Expr     Away   450            375       416.6667          357.5
7         Bill     3      Expr     Away   600            525       450.0000          462.5
8         Bill     4      Expr     Away   512            556       520.6667          465.5
9         Bill     1      Expr     Home   300            406       470.6667          465.5
10        Bill     2      Expr     Home   500            400       437.3333          478.0
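
To make the rolling columns easier to read, here is a minimal base R sketch of a right-aligned rolling mean. It is not the zoo implementation used above; `roll_mean_right` is a hypothetical helper written only for illustration, and it assumes the window width is no larger than the length of the input.

```r
# Right-aligned rolling mean: the first (width - 1) positions have no
# complete window, so they stay NA (like fill = NA, align = "right").
roll_mean_right <- function(x, width) {
  windows <- embed(x, width)  # one row per complete window
  c(rep(NA_real_, width - 1), rowMeans(windows, na.rm = TRUE))
}

# Bill's first five Power readings from the data above
power <- c(400, 250, 180, 500, 300)

r2 <- roll_mean_right(power, 2)
r3 <- roll_mean_right(power, 3)
print(r2)
print(r3)
```

The values for width 2 line up with the Rolling.mean.2 column in rows 1 to 5 of the table above (NA, 325, 215, 340, 400).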

Once you have all 7 columns (8 here, because this dataset also includes Location, which covers your other question as well), here is what I did with the new Individ dataset to solve your problem. I'm 100% sure there is a cleaner and more efficient way to do this, but the logic is there and it should produce the output you need.

Step 2: Get the quantiles for each group

library(plyr)
Individ[is.na(Individ)]<- 0
Top_percentiles <- ddply(Individ, 
                         c("Participant", "Condition", "Location"), 
                         summarise, 
                         Power2 = quantile(Rolling.mean.2, .95),
                         Power3 = quantile(Rolling.mean.3, .95),
                         Power4 = quantile(Rolling.mean.4, .95)
                         )

Top_percentiles

  Participant Condition Location  Power2   Power3  Power4
1        Bill      Expr     Away 551.350 510.0667 465.050
2        Bill      Expr     Home 464.650 465.6667 476.125
3        Bill   Placebo     Home 337.750 305.0000 282.625
4       Harry      Expr     Away 585.175 533.4000 485.425
5       Harry   Placebo     Home 322.150 280.7667 268.175
6        Paul      Expr     Home 556.500 556.5000 408.000

These are the thresholds for the top 5% of each rolling average, one row per group.
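
As a check on where these numbers come from, the sketch below reproduces one threshold by hand. It assumes R's default quantile method (type 7), which interpolates linearly between sorted values. It also shows how replacing NA with 0 pulls a threshold down compared with dropping the missing values via `na.rm = TRUE`. The variable names are illustrative only.

```r
# Rolling.mean.2 for the Bill / Expr / Away group (from the Step 1 table)
x <- c(400, 375, 525, 556)

# Type 7 quantile: interpolate at position 1 + (n - 1) * p in the sorted data
s <- sort(x)
pos <- 1 + (length(s) - 1) * 0.95
lo <- floor(pos)
by_hand <- s[lo] + (pos - lo) * (s[lo + 1] - s[lo])
built_in <- unname(quantile(x, 0.95))

print(by_hand)
print(built_in)

# Paul's Rolling.mean.4 has three missing values and one real value (480).
q_zero <- unname(quantile(c(0, 0, 0, 480), 0.95))                   # NA replaced by 0
q_narm <- unname(quantile(c(NA, NA, NA, 480), 0.95, na.rm = TRUE))  # NA dropped
print(c(q_zero, q_narm))
```

Both `by_hand` and `built_in` give 551.35, the Power2 value for Bill / Expr / Away. With zeros the Paul threshold is 408, as in the table, while dropping the missing values would give 480, so the choice of NA handling matters for small groups.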

Now the only thing left to do is find the observations in your dataset that are above each threshold.
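
The TRUE/FALSE column in the question comes from the comparison itself: `x > quantile(x, 0.95)` returns a logical vector. A minimal sketch, using one group's values from the table above, of how subsetting with that logical vector yields the actual values:

```r
# Rolling.mean.2 for the Bill / Expr / Away group
x <- c(400, 375, 525, 556)

threshold <- quantile(x, 0.95, na.rm = TRUE)  # 95th percentile of this group
is_top <- x > threshold                       # logical vector (what the question's code returned)
top_values <- x[is_top]                       # subsetting by it gives the actual values

print(is_top)
print(top_values)
```

Here only 556 lies above the threshold of 551.35, so `top_values` holds that single value.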

Step 3: Match the quantile columns onto the original dataset

Something like this is roughly what I have been tinkering with.

# Build one combined key per row so that each observation picks up the
# threshold of its own Participant/Condition/Location group.
# (Chaining match() calls with && only tests a single value and does not
# look up each row's group; it is an error for longer vectors in R >= 4.3.0.)
row_key <- paste(Individ$Participant, Individ$Condition, Individ$Location)
top_key <- paste(Top_percentiles$Participant, Top_percentiles$Condition, Top_percentiles$Location)
idx <- match(row_key, top_key)

Individ$Power2 <- Top_percentiles$Power2[idx]
Individ$Power3 <- Top_percentiles$Power3[idx]
Individ$Power4 <- Top_percentiles$Power4[idx]


Individ


   Participant  Time Condition Location Power Rolling.mean.2 Rolling.mean.3 Rolling.mean.4 Power2   Power3
        (fctr) (dbl)    (fctr)   (fctr) (dbl)          (dbl)          (dbl)          (dbl)  (dbl)    (dbl)
1         Bill     1   Placebo     Home   400              0         0.0000            0.0 337.75 305.0000
2         Bill     2   Placebo     Home   250            325         0.0000            0.0 337.75 305.0000
3         Bill     3   Placebo     Home   180            215       276.6667            0.0 337.75 305.0000
4         Bill     4   Placebo     Home   500            340       310.0000          332.5 337.75 305.0000
5         Bill     1      Expr     Away   300            400       326.6667          307.5 551.35 510.0667
6         Bill     2      Expr     Away   450            375       416.6667          357.5 551.35 510.0667
7         Bill     3      Expr     Away   600            525       450.0000          462.5 551.35 510.0667
8         Bill     4      Expr     Away   512            556       520.6667          465.5 551.35 510.0667
9         Bill     1      Expr     Home   300            406       470.6667          465.5 464.65 465.6667
10        Bill     2      Expr     Home   500            400       437.3333          478.0 464.65 465.6667

My idea here was to match the quantile columns onto the Individ dataset.

Step 4: Filter the dataset

This should get you what you want.

Option 1: Three separate datasets

top_percentile_2sec <- Individ %>% filter(Rolling.mean.2 >= Power2)
top_percentile_3sec <- Individ %>% filter(Rolling.mean.3 >= Power3)
top_percentile_4sec <- Individ %>% filter(Rolling.mean.4 >= Power4)

Option 2: One big combined dataset

top_percentile_all_times <- Individ %>% filter(Rolling.mean.2 >= Power2 | Rolling.mean.3 >= Power3 | Rolling.mean.4 >= Power4)


top_percentile_all_times

  Participant  Time Condition Location Power Rolling.mean.2 Rolling.mean.3 Rolling.mean.4  Power2   Power3
       (fctr) (dbl)    (fctr)   (fctr) (dbl)          (dbl)          (dbl)          (dbl)   (dbl)    (dbl)
1        Bill     4   Placebo     Home   500            340       310.0000         332.50 337.750 305.0000
2        Bill     4      Expr     Away   512            556       520.6667         465.50 551.350 510.0667
3        Bill     1      Expr     Home   300            406       470.6667         465.50 464.650 465.6667
4        Bill     2      Expr     Home   500            400       437.3333         478.00 464.650 465.6667
5        Bill     3      Expr     Home   450            475       416.6667         440.50 464.650 465.6667
6       Harry     4   Placebo     Home   520            325       286.6667         315.50 322.150 280.7667
7       Harry     4      Expr     Away   582            595       547.0000         487.75 585.175 533.4000
8        Paul     3      Expr     Home     0            570       480.0000           0.00 556.500 556.5000
9        Paul     4      Expr     Home     0              0       570.0000         480.00 556.500 556.5000
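
On the question of iterating over every rolling mean column, one possible base R sketch is shown below. It is an alternative to the plyr/dplyr steps above, not part of the original answer. It uses `ave()` to repeat each group's 95th percentile on the rows of that group, so no separate matching step is needed. The small `df` is a stand-in with values rounded from the Expr / Away rows, and the names `roll_cols` and `top_rows` are illustrative.

```r
# Stand-in data: Rolling.mean.2 and Rolling.mean.3 for the Expr / Away rows
# of Bill and Harry (rounded from the tables above)
df <- data.frame(
  Participant    = rep(c("Bill", "Harry"), each = 4),
  Rolling.mean.2 = c(400, 375, 525, 556, 415, 380.5, 529.5, 595),
  Rolling.mean.3 = c(326.7, 416.7, 450, 520.7, 320, 427, 456.3, 547)
)

# Every column whose name starts with "Rolling.mean."
roll_cols <- grep("^Rolling\\.mean\\.", names(df), value = TRUE)

top_rows <- lapply(setNames(roll_cols, roll_cols), function(col) {
  # ave() returns the group's threshold repeated on every row of that group
  thr <- ave(df[[col]], df$Participant,
             FUN = function(v) quantile(v, 0.95, na.rm = TRUE))
  keep <- !is.na(df[[col]]) & df[[col]] >= thr
  df[keep, c("Participant", col)]
})

print(top_rows)
```

The result is a list with one data frame per rolling column. To group by Condition and Location as well, pass them to `ave()` as additional grouping arguments after `df$Participant`.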

Below is a link that greatly helped me out.

how to calculate 95th percentile of values with grouping variable in R or Excel

Does this also answer your question from the other post?
