Subset where there are at least five consecutive years in a data.frame column
Question
I have a data.frame / data.table in R as follows:
df <- data.frame(
ID = c(rep("A", 20)),
year = c(1968, 1971, 1972, 1973, 1974, 1976, 1978, 1980, 1982, 1984, 1985,
1986, 1987, 1988, 1990, 1991, 1992, 1993, 1994, 1995)
)
I'd like to subset df so that it keeps only those entries that belong to a run of at least five consecutive years. In this example that is the case for two periods (1984:1988 and 1990:1995). How can I do this in R?
Recommended answer
Using diff and cumsum with data.table:
library(data.table)
# grp increases by one wherever the gap to the previous year exceeds 1
setDT(df)[, grp := cumsum(c(0, diff(year)) > 1), by = ID
          ][, if (.N > 4) .SD, by = .(ID, grp)][, grp := NULL][]
which gives the desired result:
ID year
1: A 1984
2: A 1985
3: A 1986
4: A 1987
5: A 1988
6: A 1990
7: A 1991
8: A 1992
9: A 1993
10: A 1994
11: A 1995
Explanation:
- With grp := cumsum(c(0, diff(year)) > 1), by = ID you create a (temporary) grouping variable for consecutive years for each ID. A gap of more than one year starts a new group.
- With if (.N > 4) .SD, by = .(ID, grp) you select only groups with 5 or more consecutive years for each ID.
- With grp := NULL you remove the (temporary) grouping variable.
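To see what the temporary grouping variable looks like, the steps above can be traced on the example years using base R only. This is a minimal sketch; the intermediate names gap, grp, run_len and kept are introduced here for illustration and are not part of the original answer.

```r
year <- c(1968, 1971, 1972, 1973, 1974, 1976, 1978, 1980, 1982, 1984, 1985,
          1986, 1987, 1988, 1990, 1991, 1992, 1993, 1994, 1995)

# TRUE wherever the distance to the previous year is more than 1,
# i.e. wherever a new run of consecutive years starts
gap <- c(0, diff(year)) > 1

# Running count of run starts: every year in the same run gets the same id
grp <- cumsum(gap)

# Length of the run each year belongs to (the base R counterpart of .N)
run_len <- ave(year, grp, FUN = length)

# Keep only the years that sit in a run of five or more
kept <- year[run_len > 4]

print(data.frame(year, grp, run_len))
print(kept)
```

The years 1971 to 1974 share one group id but form a run of only four, so they are dropped along with the isolated years.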
A comparable approach in base R:
i <- with(df, ave(year, ID, FUN = function(x) {
  # use x (the years of the current ID), not the full year column
  r <- rle(cumsum(c(0, diff(x)) > 1))
  rep(r$lengths, r$lengths)
}))
df[i > 4, ]  # or df[which(i > 4), ]
which gives the same result.
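The example data contains a single ID, so it does not show why the run detection has to happen per ID. The sketch below wraps the base R idea in a helper and applies it to a small two-ID data set. The function name keep_runs, its min_len argument and the data frame df2 are hypothetical and made up for this illustration. The helper also sorts by ID and year first, on the assumption that the input may not be ordered and that each ID has no duplicate years.

```r
# Hypothetical helper: keep rows that belong to a run of at least
# `min_len` consecutive years within each ID.
keep_runs <- function(d, min_len = 5) {
  # diff() only detects gaps correctly when years are sorted within each ID
  d <- d[order(d$ID, d$year), ]
  run_len <- ave(d$year, d$ID, FUN = function(x) {
    grp <- cumsum(c(0, diff(x)) > 1)  # run id within the current ID
    ave(x, grp, FUN = length)         # size of the run each year belongs to
  })
  d[run_len >= min_len, ]
}

# Made-up data: A has one run of five years plus an isolated year,
# B has runs of three and four years only
df2 <- data.frame(
  ID   = c(rep("A", 6), rep("B", 7)),
  year = c(2000:2004, 2006, 2001:2003, 2010:2013)
)

res <- keep_runs(df2)
print(res)
```

Only the five rows for ID A from 2000 to 2004 remain. Without the grouping by ID, the step from A's 2006 to B's 2001 would not count as a gap (the difference is negative), and rows from different IDs would be merged into one run.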