Using the result of summarise (dplyr) to mutate the original dataframe
Question
I have a rather large data frame with a column of POSIXct datetimes (about 10 years of hourly data). I would like to flag every row whose day falls within a daylight saving period. For example, if daylight saving time starts at '2000-04-02 03:00:00' (DOY=93), the two earlier hours of DOY=93 should be flagged as well. I am new to dplyr, but I would like to use this package as much as possible and avoid for-loops.
For example:
library(lubridate)
sd = ymd('2000-01-01',tz="America/Denver")
ed = ymd('2005-12-31',tz="America/Denver")
span = data.frame(date=seq(from=sd,to=ed, by="hour"))
span$YEAR = year(span$date)
span$DOY = yday(span$date)
span$DLS = dst(span$date)
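To see what the DLS column holds around a transition, here is a minimal sketch that looks only at the start of 2000-04-02, the day mentioned in the question. The variable name x is just for illustration.

```r
library(lubridate)

# Three consecutive hourly timestamps at the start of 2000-04-02 in
# America/Denver. 02:00 local time does not exist on that day, so the
# sequence jumps from 01:00 to 03:00.
x <- seq(ymd_hms("2000-04-02 00:00:00", tz = "America/Denver"),
         by = "hour", length.out = 3)

# dst() is evaluated per timestamp, not per day.
data.frame(date = x, DOY = yday(x), DLS = dst(x))
```

All three rows share DOY 93, but DLS should be TRUE only for the last one. This is why the first two hours of that day need a separate, day-level flag.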
To find the first and last day of each year on which daylight saving time applies, I use dplyr:
library(dplyr)
limits = span %>% group_by(YEAR) %>% summarise(minDOY=min(DOY[DLS]),maxDOY=max(DOY[DLS]))
This gives:
YEAR minDOY maxDOY
1 2000 93 303
2 2001 91 301
3 2002 97 300
4 2003 96 299
5 2004 95 305
6 2005 93 303
Now I would like to 'pipe' the above result back into the span data frame without using an inefficient for-loop.
With the help of @aosmith, the problem can be tackled with just two commands (avoiding the inner_join used in 'Solution 2'):
limits = span %>% group_by(YEAR) %>% mutate(minDOY=min(DOY[DLS]),maxDOY=max(DOY[DLS]),CHECK=FALSE)
limits$CHECK[(limits$DOY >= limits$minDOY) & (limits$DOY <= limits$maxDOY) ] = TRUE
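The two commands above can probably be folded into a single pipeline, because a grouped mutate evaluates min() and max() once per YEAR and recycles the result to every row of that group. The following is a sketch under that assumption, not the approach from the original thread. The name flagged is made up for illustration, and the series is shortened to two years to keep it quick.

```r
library(dplyr)
library(lubridate)

# Same kind of hourly series as in the question, shortened to two years.
span <- data.frame(
  date = seq(ymd("2000-01-01", tz = "America/Denver"),
             ymd("2001-12-31", tz = "America/Denver"),
             by = "hour")
)

flagged <- span %>%
  mutate(YEAR = year(date), DOY = yday(date), DLS = dst(date)) %>%
  group_by(YEAR) %>%
  # Per-year limits are computed inline, so no helper columns or join
  # are needed. CHECK is TRUE for every hour of a day between the first
  # and last DST day of that year, inclusive.
  mutate(CHECK = DOY >= min(DOY[DLS]) & DOY <= max(DOY[DLS])) %>%
  ungroup()

# Hours of 2000-04-02 (DOY 93): DLS differs within the day, CHECK does not.
flagged %>% filter(YEAR == 2000, DOY == 93) %>% head(4)
```

The ungroup() at the end returns a plain data frame, so later operations are not silently applied per year.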
SOLUTION 2
With the help of @beetroot and @matthew-plourde, the problem has been solved: an inner join between the summary and span was missing:
limits = span %>% group_by(YEAR) %>% summarise(minDOY=min(DOY[DLS]),maxDOY=max(DOY[DLS])) %>% inner_join(span, by='YEAR')
Then I just added a new column (CHECK) and set it to TRUE for the days within the daylight saving period:
limits$CHECK = FALSE
limits$CHECK[(limits$DOY >= limits$minDOY) & (limits$DOY <= limits$maxDOY) ] = TRUE
Recommended answer
As @beetroot points out in the comments, you can accomplish this with a join:
limits = span %>%
  group_by(YEAR) %>%
  summarise(minDOY=min(DOY[DLS]),maxDOY=max(DOY[DLS])) %>%
  inner_join(span, by='YEAR')
# YEAR minDOY maxDOY date DOY DLS
# 1 2000 93 303 2000-01-01 00:00:00 1 FALSE
# 2 2000 93 303 2000-01-01 01:00:00 1 FALSE
# 3 2000 93 303 2000-01-01 02:00:00 1 FALSE
# 4 2000 93 303 2000-01-01 03:00:00 1 FALSE
# 5 2000 93 303 2000-01-01 04:00:00 1 FALSE
# 6 2000 93 303 2000-01-01 05:00:00 1 FALSE
# 7 2000 93 303 2000-01-01 06:00:00 1 FALSE
# 8 2000 93 303 2000-01-01 07:00:00 1 FALSE
# 9 2000 93 303 2000-01-01 08:00:00 1 FALSE
# 10 2000 93 303 2000-01-01 09:00:00 1 FALSE
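Once the per-year limits sit next to every row, the CHECK flag from the question can be added in the same pipeline. The sketch below assumes that dropping the helper columns afterwards is wanted. The name result is hypothetical, and only the year 2000 is generated to keep the example small.

```r
library(dplyr)
library(lubridate)

# One year of hourly data, built the same way as in the question.
span <- data.frame(
  date = seq(ymd("2000-01-01", tz = "America/Denver"),
             ymd("2000-12-31", tz = "America/Denver"),
             by = "hour")
) %>%
  mutate(YEAR = year(date), DOY = yday(date), DLS = dst(date))

# First and last day of year on which DST is in effect, per year.
limits <- span %>%
  group_by(YEAR) %>%
  summarise(minDOY = min(DOY[DLS]), maxDOY = max(DOY[DLS]))

result <- limits %>%
  inner_join(span, by = "YEAR") %>%
  # Vectorised comparison replaces the two-step FALSE/TRUE assignment.
  mutate(CHECK = DOY >= minDOY & DOY <= maxDOY) %>%
  # Keep the original columns plus the new flag.
  select(date, YEAR, DOY, DLS, CHECK)
```

Because every YEAR in span also appears in limits, the inner join keeps all rows of span. A left_join starting from span would behave the same here and preserves the original column order.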