Counting the number of changes of a categorical variable during repeated measurements within a category
Question
I'm working with a dataset about migration across the country with the following columns:
i birth gender race region urban wage year educ
1 58 2 3 1 1 4620 1979 12
1 58 2 3 1 1 4620 1980 12
1 58 2 3 2 1 4620 1981 12
1 58 2 3 2 1 4700 1982 12
.....
i birth gender race region urban wage year educ
45 65 2 3 3 1 NA 1979 10
45 65 2 3 3 1 NA 1980 10
45 65 2 3 4 2 11500 1981 10
45 65 2 3 1 1 11500 1982 10
i = individual ID. A large group of people is followed for 25 years, and changes in 'region' (categorical, 1-4), 'urban' (dummy), 'wage' and 'educ' are recorded.
How do I count the total number of times 'region' or 'urban' changed (e.g. from region 1 to region 3, or from urban 0 to 1) within each subject over the 25-year observation period? The data also contain some NAs, which should be ignored.
A simplified version of the expected output:
i changes in region
1 1
...
45 2
i changes in urban
1 0
...
45 2
I would then like to sum up the number of changes for region and urban.
I came across these answers: "Count number of changes in categorical variables during repeated measurements" and "Identify change in categorical data across datapoints in R", but I still don't get it.
Here's a part of the data for i=4.
i birth gender race region urban wage year educ
4 62 2 3 1 1 NA 1979 9
4 62 2 3 NA NA NA 1980 9
4 62 2 3 4 1 0 1981 9
4 62 2 3 4 1 1086 1982 9
4 62 2 3 1 1 70 1983 9
4 62 2 3 1 1 0 1984 9
4 62 2 3 1 1 0 1985 9
4 62 2 3 1 1 7000 1986 9
4 62 2 3 1 1 17500 1987 9
4 62 2 3 1 1 21320 1988 9
4 62 2 3 1 1 21760 1989 9
4 62 2 3 1 1 0 1990 9
4 62 2 3 1 1 0 1991 9
4 62 2 3 1 1 30500 1992 9
4 62 2 3 1 1 33000 1993 9
4 62 2 3 NA NA NA 1994 9
4 62 2 3 4 1 35000 1996 9
Here, the output should be:
i change_reg change_urban
4 3 0
Recommended answer
Here is something I hope will get you closer to what you need.
First, group by i. Then create a column that holds a 1 for each change in region, by comparing the current value of region with the previous one (using lag). Note that if the previous value is NA (as it is for the first row of a given i), the row is treated as no change.
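The comparison described above can be tried on a single vector. This is only a sketch: it uses the region codes shown for i = 45 in the question, and it imitates lag with a base-R shift (an assumption made so the snippet runs without dplyr loaded). The second half uses a made-up vector to show what happens when an NA sits in the middle of the series.

```r
# Region codes for one person, taken from the i = 45 rows in the question
region <- c(3, 3, 4, 1)

# Base-R stand-in for dplyr::lag(region): shift by one, pad the front with NA
prev <- c(NA, head(region, -1))

# Same flag as in the answer: 0 if unchanged or there is no previous value, else 1
reg_change <- ifelse(region == prev | is.na(prev), 0, 1)

reg_change       # 0 0 1 1
sum(reg_change)  # 2

# Illustrative vector with a missing observation in the middle
with_gap <- c(1, NA, 4, 4, 1)
prev_gap <- c(NA, head(with_gap, -1))
gap_change <- ifelse(with_gap == prev_gap | is.na(prev_gap), 0, 1)

gap_change       # 0 NA 0 0 1
sum(gap_change)  # NA, because one flag is NA
```

The NA in the total, and the fact that the move from 1 to 4 across the gap is not flagged, are reasons to deal with missing rows before computing the flags.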
The same approach is taken for urban. Then summarize, totaling all the changes for each i. I left the temporary variables in so you can check whether you are getting the desired results.
Edit: If you wish to remove rows that have NA for region or urban, you can add drop_na first.
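library(dplyr)
library(tidyr)

df_tot <- df %>%
  drop_na(region, urban) %>%   # drop rows where region or urban is NA
  group_by(i) %>%
  # flag a change (1) when the value differs from the previous row;
  # the first row of each i has no previous value, so it counts as 0
  mutate(reg_change = ifelse(region == lag(region) | is.na(lag(region)), 0, 1),
         urban_change = ifelse(urban == lag(urban) | is.na(lag(urban)), 0, 1)) %>%
  summarize(tot_region = sum(reg_change),
            tot_urban = sum(urban_change))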
# A tibble: 3 x 3
i tot_region tot_urban
<int> <dbl> <dbl>
1 1 1 0
2 4 3 0
3 45 2 2
Edit: Afterwards, to get a grand total for both the tot_region and tot_urban columns, you can use colSums. (Store your earlier result as df_tot, as above.)
colSums(df_tot[-1])
tot_region tot_urban
6 2
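As a cross-check, the same counts can be reproduced in base R on the rows shown in the question. This is a sketch, not the answer's method: count_changes is a hypothetical helper name, and the data frame holds only the region and urban values visible above for i = 1, 4 and 45. One difference to keep in mind is that the helper drops NAs per column, whereas drop_na(region, urban) drops a whole row if either column is missing; the two agree here because region and urban are missing in the same rows.

```r
# Region and urban values copied from the rows shown in the question
df <- data.frame(
  i      = c(rep(1, 4), rep(4, 17), rep(45, 4)),
  region = c(1, 1, 2, 2,
             1, NA, 4, 4, rep(1, 11), NA, 4,
             3, 3, 4, 1),
  urban  = c(1, 1, 1, 1,
             1, NA, rep(1, 13), NA, 1,
             1, 1, 2, 1)
)

# Hypothetical helper: number of changes between consecutive non-missing values
count_changes <- function(x) {
  x <- x[!is.na(x)]                  # ignore NA observations
  if (length(x) < 2) return(0L)      # nothing to compare
  sum(x[-1] != x[-length(x)])        # compare each value with the one before it
}

# One row per person; tapply orders groups by sorted i
changes <- data.frame(
  i          = sort(unique(df$i)),
  tot_region = as.vector(tapply(df$region, df$i, count_changes)),
  tot_urban  = as.vector(tapply(df$urban,  df$i, count_changes))
)

changes              # i = 1: 1 and 0; i = 4: 3 and 0; i = 45: 2 and 2
colSums(changes[-1]) # tot_region 6, tot_urban 2
```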