Create sequential counter that restarts on a condition within panel data groups
Question
I have a panel data set for which I would like to create a counter that increases with each step in the panel but restarts whenever some condition occurs. In my case, I'm using country-year data and want to count the years that pass between events. Here's a toy data set with the key features of my real one:
df <- data.frame(country = rep(c("A","B"), each=5), year=rep(2000:2004, times=2), event=c(0,0,1,0,0,1,0,0,1,0), stringsAsFactors=FALSE)
What I'm looking to do is create a counter that is keyed to df$event within each country's series of observations. The clock starts at 1 when we start observing each country; it increases by 1 with each passing year; and it restarts at 1 whenever df$event == 1. The desired output is this:
country year event clock
1 A 2000 0 1
2 A 2001 0 2
3 A 2002 1 1
4 A 2003 0 2
5 A 2004 0 3
6 B 2000 1 1
7 B 2001 0 2
8 B 2002 0 3
9 B 2003 1 1
10 B 2004 0 2
I have tried using getanID from splitstackshape and a few variations of if and ifelse, but have failed so far to get the desired result.
I'm already using dplyr in the scripts where I need to do this, so I would prefer a solution that uses it or base R, but I would be grateful for anything that works. My data sets are not massive, so speed is not critical, but efficiency is always a plus.
Recommended answer
Using dplyr, it would be:
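library(dplyr)

df %>%
  group_by(country, idx = cumsum(event == 1L)) %>%  # idx steps up by 1 at every event row
  mutate(counter = row_number()) %>%                # number the rows within each country/idx group
  ungroup() %>%
  select(-idx)                                      # drop the helper column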
#Source: local data frame [10 x 4]
#
# country year event counter
#1 A 2000 0 1
#2 A 2001 0 2
#3 A 2002 1 1
#4 A 2003 0 2
#5 A 2004 0 3
#6 B 2000 1 1
#7 B 2001 0 2
#8 B 2002 0 3
#9 B 2003 1 1
#10 B 2004 0 2
Or using data.table:
library(data.table)
setDT(df)[, counter := seq_len(.N), by = list(country, cumsum(event == 1L))]
group_by(country, idx = cumsum(event == 1L)) groups by country and by a new grouping index, idx. The event == 1L part creates a logical vector that tells us whether each value of event equals 1 (TRUE/FALSE). cumsum(...) then turns that vector into a running count of events: 0 for the first 2 rows, 1 for the next 3, 2 for the next 3, and 3 for the last 2. Because the count steps up at every event row, each event starts a new group, so grouping by idx together with country restarts the counter at each event and at each new country. You can inspect idx by removing the last two pipe parts (ungroup and select(-idx)) from the dplyr code.
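To make the grouping index concrete, here is a minimal base R sketch that rebuilds the toy data from the question, prints the running event count, and then numbers the rows within each country/index combination. The ave() call is not part of the answer above; it is only one possible base R way to apply the same idea, and the column name clock is borrowed from the question's desired output.

```r
# Toy data from the question
df <- data.frame(country = rep(c("A", "B"), each = 5),
                 year = rep(2000:2004, times = 2),
                 event = c(0, 0, 1, 0, 0, 1, 0, 0, 1, 0),
                 stringsAsFactors = FALSE)

# Running count of events over the whole (ungrouped) data frame
idx <- cumsum(df$event == 1)
idx
# 0 0 1 1 1 2 2 2 3 3

# Assumed base R equivalent: number the rows within each country/idx group
df$clock <- ave(df$event, df$country, idx, FUN = seq_along)
df$clock
# 1 2 1 2 3 1 2 3 1 2
```

Note that idx keeps counting across the boundary between countries, which is why country has to be part of the grouping as well. All of these approaches also rely on the rows already being sorted by country and year.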