Group variable based on continuous values


Question

I have a data frame with annual data, but some years are missing (here: 1956 and 1961-1964).

dat <- data.frame(Year = c(1950:1955, 1957:1960, 1965:1970),
                  Val = 1:16)
> dat
   Year Val
1  1950   1
2  1951   2
3  1952   3
4  1953   4
5  1954   5
6  1955   6
7  1957   7
8  1958   8
9  1959   9
10 1960  10
11 1965  11
12 1966  12
13 1967  13
14 1968  14
15 1969  15
16 1970  16

I'd like to add a variable "Period" holding the min and max years of each period, where a period is defined as a run of consecutive years (i.e. 1950-1955, 1957-1960 and 1965-1970). Building the label itself is not the problem; I am stuck on how to do the grouping. Any ideas?

dat %>%
  ...???... %>%
  mutate(Period = paste(min(Year), max(Year), sep = "-"))


Recommended answer

You can create an ID for each run of consecutive years:

dat$cont_per <- cumsum(!c(TRUE, diff(dat$Year)==1))
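
To see why this one-liner yields a period ID, it may help to unpack it. The following is a minimal sketch in base R; the intermediate names (gaps, consecutive, new_period) are introduced here for illustration only and are not part of the original answer. It assumes Year is sorted in ascending order.

```r
# Step-by-step view of the grouping expression (base R only).
Year <- c(1950:1955, 1957:1960, 1965:1970)

gaps        <- diff(Year)          # distance to the previous year; length is n - 1
consecutive <- c(TRUE, gaps == 1)  # pad with TRUE so the first row never opens a new period
new_period  <- !consecutive        # TRUE exactly on rows that follow a gap
cont_per    <- cumsum(new_period)  # running count of gaps so far = period ID, starting at 0

cont_per
```

Each gap in the sequence bumps the running count by one, so all rows between two gaps share the same ID.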

Then compute the min/max values within each ID. For example, with data.table:

library(data.table)
setDT(dat)
dat[, Period := paste(min(Year), max(Year), sep="-"), by=cont_per]
dat
#     Year Val cont_per    Period
#  1: 1950   1        0 1950-1955
#  2: 1951   2        0 1950-1955
#  3: 1952   3        0 1950-1955
#  4: 1953   4        0 1950-1955
#  5: 1954   5        0 1950-1955
#  6: 1955   6        0 1950-1955
#  7: 1957   7        1 1957-1960
#  8: 1958   8        1 1957-1960
#  9: 1959   9        1 1957-1960
# 10: 1960  10        1 1957-1960
# 11: 1965  11        2 1965-1970
# 12: 1966  12        2 1965-1970
# 13: 1967  13        2 1965-1970
# 14: 1968  14        2 1965-1970
# 15: 1969  15        2 1965-1970
# 16: 1970  16        2 1965-1970






N.B.: You can also compute Period directly, without creating the variable cont_per:

setDT(dat)[, Period := paste(min(Year), max(Year), sep="-"), by=cumsum(!c(TRUE, diff(Year)==1))]
head(dat)
#    Year Val    Period
# 1: 1950   1 1950-1955
# 2: 1951   2 1950-1955
# 3: 1952   3 1950-1955
# 4: 1953   4 1950-1955
# 5: 1954   5 1950-1955
# 6: 1955   6 1950-1955






Similarly, with dplyr:

library(dplyr)

dat %>%
  group_by(cont_per = cumsum(!c(TRUE, diff(Year) == 1))) %>%  # refer to Year, not dat$Year, inside the pipe
  mutate(Period = paste(min(Year), max(Year), sep = "-")) %>%
  ungroup()
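
If neither package is available, the same grouping idea can probably be expressed in base R as well. This is only a sketch and not part of the original answer: the helper names grp and labels are made up for illustration, and the code assumes Year is sorted ascending with no duplicates.

```r
# Base R sketch: label each run of consecutive years with "min-max".
dat <- data.frame(Year = c(1950:1955, 1957:1960, 1965:1970),
                  Val  = 1:16)

# Same period ID as in the answer above
grp <- cumsum(!c(TRUE, diff(dat$Year) == 1))

# One label per period ID; tapply names the result by ID ("0", "1", "2")
labels <- tapply(dat$Year, grp, function(y) paste(min(y), max(y), sep = "-"))

# Look up each row's label by its period ID and store it as a plain character column
dat$Period <- as.vector(labels[as.character(grp)])

unique(dat$Period)
```

The lookup by as.character(grp) relies on tapply naming its result after the group IDs; computing the labels once per period and then indexing avoids recomputing min and max for every row.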
