Subsetting based on observations in a month
Question
I'm trying to subset some data and am stuck on the last part of cleaning.
What I need to do is count the observations for each individual (indivID) in each month (June, July, and August), express that count as a percentage of the days in the month (that is, the share of days without missing data), and then keep only the observations where that percentage is over 75%.
I was able to write a nested for loop, but it took roughly 6 hours to run today. I would like to take advantage of parallel computing by using ddply or another function, but I am very lost.
Here's the data (Note this is a very small subset that only includes individuals from 1:5): https://www.dropbox.com/s/fmk8900622klsgt/data.csv?dl=0
Here is the for loop:
epa.d <- read.csv("/.../data.csv")

# Number of days in each month of interest
days <- function(month) {
  if (month == 6) return(30)
  if (month == 7) return(31)
  if (month == 8) return(31)
}

# Drop every indivID-year combination that has a month
# (June, July, or August) with less than 75% coverage
for (i in unique(epa.d$indivID)) {
  for (j in unique(epa.d$year)) {
    for (k in unique(epa.d$month)) {
      monthsum <- sum(epa.d$indivID == i & epa.d$year == j & epa.d$month == k)
      monthperc <- (monthsum / days(k)) * 100
      if (monthperc < 75) {
        epa.d <- epa.d[!(epa.d$indivID == i & epa.d$year == j), ]
      }
    }
  }
}
Recommended answer
If I understand you correctly, you want to keep daily observations for each combination of indivID-month-year in which at least 75% of days have ozone measurements. Here's a way to do it that should be pretty fast:
library(dplyr)

# For each indivID, calculate percent of days in each month with
# ozone observations, and keep those with pctCoverage >= 0.75
epa.d_75 = epa.d %>%
  group_by(indivID, year, month) %>%
  summarise(count=n()) %>%
  mutate(pctCoverage = ifelse(month==6, count/30, count/31)) %>%
  filter(pctCoverage >= 0.75)
We now have a data frame epa.d_75 that has one row for each indivID-month-year with at least 75% coverage. Next, we'll merge the daily data into this data frame, resulting in one row for each daily observation for each unique indivID-month-year.
# Merge in daily data for each combination of indivID-month-year that meets
# the 75% coverage criterion
epa.d_75 = merge(epa.d_75, epa.d, by=c("indivID","month","year"),
                 all.x=TRUE)
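As a hedged alternative to the summarise-then-merge steps above, the same filtering can be sketched as a single grouped filter, which keeps the daily rows directly. The toy data frame and the days_in_month lookup below are hypothetical names introduced only for illustration; the real data would come from epa.d.

```r
library(dplyr)

# Toy data (hypothetical): one row per daily observation
toy <- data.frame(
  indivID = c(rep(1, 25), rep(1, 10), rep(2, 31)),
  year    = 2010,
  month   = c(rep(6, 25), rep(7, 10), rep(8, 31)),
  day     = c(1:25, 1:10, 1:31)
)

# Days in each month of interest, looked up by month number (assumed helper)
days_in_month <- c("6" = 30, "7" = 31, "8" = 31)

toy_75 <- toy %>%
  group_by(indivID, year, month) %>%
  # n() is the number of rows (daily observations) in the current group
  filter(n() / days_in_month[[as.character(first(month))]] >= 0.75) %>%
  ungroup()

# indivID 1 keeps June (25 of 30 days) and loses July (10 of 31 days);
# indivID 2 keeps August (31 of 31 days)
nrow(toy_75)
```

Unlike the merge, this version does not add the count and pctCoverage columns; it only drops the rows belonging to low-coverage indivID-month-year groups.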
Update: To answer your questions:
- Can you explain what the %>% is doing and, if possible, break down how you logically thought about this?
The %>% is a "chaining" operator that lets you chain functions one after the other without having to store the result of the previous function before running the next one. Take a look at the dplyr vignette to learn more about how to use it. Here's how the logic works in this case:
group_by splits the data set by the grouping variables, then runs the next functions separately on each group. In this case, summarise counts the number of rows in the data frame for each unique combination of indivID, month, and year, then mutate adds a column with the fractional coverage for that indivID for that month and year. filter then gets rid of any combination of indivID, month, and year with less than 75% coverage. You can stop the chain at any point to see what it's doing. For example, run the following code to see what epa.d_75 looks like before the filtering operation:
epa.d_75 = epa.d %>%
  group_by(indivID, year, month) %>%
  summarise(count=n()) %>%
  mutate(pctCoverage = ifelse(month==6, count/30, count/31))
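To make the chaining idea concrete, here is a minimal sketch on a hypothetical toy data frame (not the real epa.d) showing that a piped chain is just another way of writing nested function calls: x %>% f(y) passes x as the first argument of f.

```r
library(dplyr)

# Toy data (hypothetical): five daily observations
toy <- data.frame(
  indivID = c(1, 1, 1, 2, 2),
  year    = 2010,
  month   = c(6, 6, 7, 6, 6)
)

# Nested calls: read from the inside out
nested <- summarise(group_by(toy, indivID, year, month), count = n())

# The same steps with %>%: read from top to bottom
piped <- toy %>%
  group_by(indivID, year, month) %>%
  summarise(count = n())

# Both give one row per indivID-year-month with the same counts
identical(nested$count, piped$count)
```

Cutting the piped version off after any step (for example, after group_by) and printing the result is an easy way to inspect what each stage contributes.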
- Why the hell is this so much faster than running for loops? I don't know the answer in detail, but dplyr does most of its magic in compiled (C/C++) code under the hood, which is faster than native R. Hopefully someone else can give a more precise and detailed answer.
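One contributing factor that can be read straight off the loop in the question: every iteration compares all rows of the data frame against a single indivID-year-month combination, so the data are scanned once per combination, whereas a grouped count tallies every combination in one grouped operation. The sketch below, on hypothetical simulated data, only shows that the two approaches produce the same totals; it is not a benchmark, and wrapping each part in system.time() is left to the reader.

```r
library(dplyr)

# Simulated data (hypothetical): 5000 daily rows across 50 individuals
set.seed(1)
toy <- data.frame(
  indivID = sample(1:50, 5000, replace = TRUE),
  year    = sample(2005:2010, 5000, replace = TRUE),
  month   = sample(6:8, 5000, replace = TRUE)
)

# Loop style: one full pass over all rows for every combination
combos <- expand.grid(indivID = 1:50, year = 2005:2010, month = 6:8)
combos$count <- vapply(seq_len(nrow(combos)), function(r) {
  sum(toy$indivID == combos$indivID[r] &
      toy$year    == combos$year[r] &
      toy$month   == combos$month[r])
}, integer(1))

# Grouped style: all combinations are counted in a single grouped operation
grouped <- toy %>%
  group_by(indivID, year, month) %>%
  summarise(count = n())

# Every row is counted exactly once by both approaches
c(loop = sum(combos$count), grouped = sum(grouped$count), rows = nrow(toy))
```

The loop in the question also rebuilds epa.d each time it drops a group, which adds further copying on top of the repeated scans.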