Subsetting df with repeated sequences

Question

I have searched high and low for a solution to this, but I cannot find one.....

My dataframe (essentially a table of the no. 1 sports team by date) has numerous occasions where one or various teams would "reappear" in the data. I want to pull out the start (or end) date of each period at no. 1 per team.

An example of the data might be:

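library(lubridate)  # provides year()

# teams1 is not defined in the original post; this definition is an assumption
# chosen to match the 2014 rows in the ddply output further down
teams1 <- c(rep("c", 8), rep("k", 4))
x1 <- as.Date("2013-12-31")
adddate1 <- 1:length(teams1)
dates1 <- x1 + adddate1
teams2 <- c(rep("w", 3), rep("c", 8), rep("w", 4))
x2 <- as.Date("2012-12-31")
adddate2 <- 1:length(teams2)
dates2 <- x2 + adddate2
dates <- c(dates2, dates1)
teams <- c(teams2, teams1)
df <- data.frame(dates, teams)
df$year <- year(df$dates)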

For 2013 it looks like this:

        dates teams year
1  2013-01-01     w 2013
2  2013-01-02     w 2013
3  2013-01-03     w 2013
4  2013-01-04     c 2013
5  2013-01-05     c 2013
6  2013-01-06     c 2013
7  2013-01-07     c 2013
8  2013-01-08     c 2013
9  2013-01-09     c 2013
10 2013-01-10     c 2013
11 2013-01-11     c 2013
12 2013-01-12     w 2013
13 2013-01-13     w 2013
14 2013-01-14     w 2013
15 2013-01-15     w 2013

However, using ddply aggregates the identically-named teams and returns the following:

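library(plyr)  # provides ddply() and .()
# First row of each (year, teams) combination, ordered by date
split <- ddply(df, .(year, teams), head, 1)
split <- split[order(split[, 1]), ]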

       dates teams year
2 2013-01-01     w 2013
1 2013-01-04     c 2013
3 2014-01-01     c 2014
4 2014-01-09     k 2014

Is there a more elegant way to do this than creating a function which would go through the original df and return a unique value for each subset, add this to the df and then use ddply incorporating the new unique value to return what I want?

Recommended answer

You say some teams "reappear", and at that point I thought the little intergroup helper function from this answer might be just the right tool here. It is useful when, as in your case, a team such as "w" reappears in the same year (2013) after another team, such as "c", has held the spot for some time. If you want to treat each sequence of occurrences of a team as a separate group, in order to get the first or last date of that sequence, this function helps. Note that if you group only by "teams" and "year", as you normally would, each team can have only one first/last date per year (for example when using summarise in dplyr).

Define the function:

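# Returns an index that increases every time the value of var changes,
# so each consecutive run of the same value gets its own index
intergroup <- function(var, start = 1) {
  cumsum(abs(c(start, diff(as.numeric(as.factor(var))))))
}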
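
To see what the helper returns, here is a small sketch that applies it to the 2013 team column from the question. The comparison with rle() is my own addition, not part of the linked answer; it is a base R way to number consecutive runs that should give the same index on this data.

```r
# intergroup() as defined in the answer, repeated so this snippet runs on its own
intergroup <- function(var, start = 1) {
  cumsum(abs(c(start, diff(as.numeric(as.factor(var))))))
}

# Team column for 2013, as in the question
teams <- c(rep("w", 3), rep("c", 8), rep("w", 4))

# Index that steps up at each change of team
idx <- intergroup(teams)
print(idx)

# Base R alternative (an assumption, not from the answer):
# number the consecutive runs found by rle()
runs <- rle(teams)
idx_rle <- rep(seq_along(runs$lengths), runs$lengths)
print(idx_rle)
```

Both vectors are 1 for the first three rows, 2 for the next eight, and 3 for the last four, so the second run of "w" gets its own index.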

Now group your data first by year and then additionally by using the intergroup function on the teams column:

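library(dplyr)
df %>%
  group_by(year) %>%
  # dplyr >= 1.0.0 spells this argument .add; older versions use add = TRUE
  group_by(teamindex = intergroup(teams), .add = TRUE) %>%
  # keep the earliest date within each (year, teamindex) group
  filter(dense_rank(dates) == 1)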

Finally, you can filter according to your needs. Here, for example, I keep the earliest date in each group. The result would be:

#Source: local data frame [3 x 4]
#Groups: year, teamindex
#
#       dates teams year teamindex
#1 2013-01-01     w 2013         1
#2 2013-01-04     c 2013         2
#3 2013-01-12     w 2013         3

Note that team "w" appears twice because we grouped by "teamindex", which we created with the intergroup function.
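
Since the question asks for the start (or end) date of each period, here is a minimal base R sketch that returns both per run. It assumes the rows are already sorted by date and rebuilds only the 2013 part of the example data; run_id and periods are hypothetical names used for illustration, and rle() stands in for the intergroup helper.

```r
# Rebuild the 2013 example data from the question
teams <- c(rep("w", 3), rep("c", 8), rep("w", 4))
dates <- as.Date("2012-12-31") + seq_along(teams)
df <- data.frame(dates, teams, stringsAsFactors = FALSE)
df$year <- as.integer(format(df$dates, "%Y"))

# run_id (hypothetical name) numbers each consecutive run of the same team
runs <- rle(df$teams)
df$run_id <- rep(seq_along(runs$lengths), runs$lengths)

# First and last row of every (year, run_id) group; rows must be sorted by date
keys <- df[c("year", "run_id")]
first_rows <- df[!duplicated(keys), ]
last_rows <- df[!duplicated(keys, fromLast = TRUE), ]

# One row per period at no. 1, with its start and end date
periods <- data.frame(
  teams = first_rows$teams,
  year = first_rows$year,
  start = first_rows$dates,
  end = last_rows$dates
)
print(periods)
```

On this data, periods has three rows: "w" from 2013-01-01 to 2013-01-03, "c" from 2013-01-04 to 2013-01-11, and "w" again from 2013-01-12 to 2013-01-15.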

Another option to do the filtering is like this (using arrange and then slice):

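df %>%
  group_by(year) %>%
  # dplyr >= 1.0.0 spells this argument .add; older versions use add = TRUE
  group_by(teamindex = intergroup(teams), .add = TRUE) %>%
  # sort by date, then take the first row of each group
  arrange(dates) %>%
  slice(1)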

The data I used is from akrun's answer.
