Subsetting based on observations in a month

Question

I'm trying to subset some data and am stuck on the last part of cleaning.

What I need to do is calculate, for each individual (indivID), the percentage of days in each month (June, July, and August) that have an observation, and then keep only the data where that coverage is over 75%.

I was able to create a nested for loop, but it took about 6 hours to run today. I would like to take advantage of parallel computing by using ddply or another function, but I am very lost.

Here's the data (Note this is a very small subset that only includes individuals from 1:5): https://www.dropbox.com/s/fmk8900622klsgt/data.csv?dl=0

Here is the for loop:

epa.d <- read.csv("/.../data.csv")

# Helper for the loops: number of days in June, July, or August
days <- function(month) {
  if (month == 6) return(30)
  if (month == 7) return(31)
  if (month == 8) return(31)
}

# Subset data for 75% coverage in June, July, and August
for (i in unique(epa.d$indivID)) {
  for (j in unique(epa.d$year)) {
    for (k in unique(epa.d$month)) {
      monthsum <- sum(epa.d$indivID == i & epa.d$year == j & epa.d$month == k)
      monthperc <- (monthsum / days(k)) * 100
      if (monthperc < 75) {
        # Drop every row for this individual in this year
        epa.d <- epa.d[!(epa.d$indivID == i & epa.d$year == j), ]
      }
    }
  }
}


Recommended answer

If I understand you correctly, you want to keep daily observations for each combination of indivID-month-year in which at least 75% of days have ozone measurements. Here's a way to do it that should be pretty fast:

library(dplyr)  

# For each indivID, calculate percent of days in each month with 
# ozone observations, and keep those with pctCoverage >= 0.75
epa.d_75 = epa.d %>% 
  group_by(indivID, year, month) %>%
  summarise(count=n()) %>% 
  mutate(pctCoverage = ifelse(month==6, count/30, count/31)) %>%
  filter(pctCoverage >= 0.75)

We now have a data frame epa.d_75 that has one row for each indivID-month-year with at least 75% coverage. Next, we'll merge the daily data into this data frame, resulting in one row for each daily observation for each unique indivID-month-year.

# Merge in daily data for each combination of indivID-month-year that meets
# the 75% coverage criterion
epa.d_75 = merge(epa.d_75, epa.d, by=c("indivID","month","year"),
                 all.x=TRUE)
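
As a quick way to check the two steps above, here is a minimal sketch on made-up data. The `toy` data frame, its `day` column, and the object names are hypothetical and are not taken from the asker's file. For the second step this sketch uses dplyr's `inner_join()` in place of `merge()`, so that the whole example stays within dplyr; the idea is the same.

```r
library(dplyr)

# Hypothetical toy data: individual 1 has 25 of 30 June days (about 83%),
# individual 2 has only 10 of 30 June days (about 33%).
toy <- data.frame(
  indivID = c(rep(1, 25), rep(2, 10)),
  year    = 2010,
  month   = 6,
  day     = c(1:25, 1:10)
)

# Step 1: coverage per indivID-year-month, keeping groups at or above 75%
toy_75 <- toy %>%
  group_by(indivID, year, month) %>%
  summarise(count = n()) %>%
  mutate(pctCoverage = ifelse(month == 6, count / 30, count / 31)) %>%
  filter(pctCoverage >= 0.75)

# Step 2: bring back the daily rows for the groups that survived
# (inner_join() is swapped in here for the merge() call used above)
toy_daily <- toy_75 %>%
  inner_join(toy, by = c("indivID", "year", "month")) %>%
  ungroup()

unique(toy_daily$indivID)  # which individuals remain
nrow(toy_daily)            # how many daily rows remain
```

With this toy input, only the daily rows for individual 1 should remain, since individual 2 falls below the 75% threshold.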

Update: To answer the questions:

  1. Can you explain what the %>% is doing and, if possible, give a breakdown of how you logically thought about this?

The %>% is a "chaining" operator that allows you to chain functions one after the other without having to store the result of the previous function before running the next one. Take a look at the dplyr Vignette to learn more about how to use it. Here's how the logic works in this case:

group_by splits the data set by the grouping variables, then runs the next functions separately on each group. In this case, summarise counts the number of rows in the data frame for each unique combination of indivID, month, and year, then mutate adds a column with the fractional coverage for that indivID for that month and year. filter then gets rid of any combination of indivID, month, and year with less than 75% coverage. You can stop the chain at any point to see what it's doing. For example, run the following code to see what epa.d_75 looks like before the filtering operation:




epa.d_75 = epa.d %>% 
  group_by(indivID, year, month) %>%
  summarise(count=n()) %>% 
  mutate(pctCoverage = ifelse(month==6, count/30, count/31))
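
To make the chaining idea concrete, here is a small sketch showing that a piped chain and the equivalent nested call compute the same thing. The `toy` data frame is hypothetical and only stands in for the real data.

```r
library(dplyr)

# Hypothetical toy data, not the asker's file
toy <- data.frame(
  indivID = c(1, 1, 1, 2),
  year    = 2010,
  month   = c(6, 6, 7, 6)
)

# Piped form: read top to bottom, each result feeds the next function
piped <- toy %>%
  group_by(indivID, year, month) %>%
  summarise(count = n())

# Same computation as nested calls: read from the inside out
nested <- summarise(group_by(toy, indivID, year, month), count = n())

# Compare the two results as plain data frames
all.equal(as.data.frame(piped), as.data.frame(nested))
```

The pipe simply passes the left-hand result in as the first argument of the function on the right, which is why the chain can be cut off after any step to inspect the intermediate result.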





  2. Why the hell is this so much faster than running for loops? I don't know the answer in detail, but dplyr does most of its magic in compiled code under the hood, which is faster than interpreted R code. Hopefully someone else can give a more precise and detailed answer.
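
One likely contributor, beyond compiled code, is the amount of work per combination: the nested loop compares every row of the data frame against each indivID-year-month combination in turn, while a grouped count handles all combinations in a single call. The sketch below only illustrates that difference in shape and is not a benchmark; the `toy` data frame is hypothetical.

```r
library(dplyr)

# Hypothetical toy data, not the asker's file
toy <- data.frame(
  indivID = c(1, 1, 1, 2),
  year    = 2010,
  month   = c(6, 6, 7, 6)
)

# Loop style: each combination needs its own full scan of every row
loop_count <- sum(toy$indivID == 1 & toy$year == 2010 & toy$month == 6)

# Grouped style: one call counts every combination at once (column n)
counts <- count(toy, indivID, year, month)

loop_count
counts
```

The loop in the question also re-subsets `epa.d` inside the innermost loop, which probably adds further copying on a large data set.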
