Counting the number of changes of a categorical variable during repeated measurements within a category


Question

I'm working with a dataset about migration across the country with the following columns:

i    birth  gender  race  region  urban  wage   year  educ
1    58     2       3     1       1      4620   1979  12
1    58     2       3     1       1      4620   1980  12
1    58     2       3     2       1      4620   1981  12
1    58     2       3     2       1      4700   1982  12

.....

i    birth  gender  race  region  urban  wage   year  educ
45   65     2       3     3       1      NA     1979  10
45   65     2       3     3       1      NA     1980  10
45   65     2       3     4       2      11500  1981  10
45   65     2       3     1       1      11500  1982  10

i = individual id. A large group of people is followed for 25 years, and changes in 'region' (categorical variable, 1-4), 'urban' (dummy), 'wage' and 'educ' are recorded.

How do I count the total number of times 'region' or 'urban' has changed (e.g. from region 1 to region 3, or from urban 0 to 1) during the 25-year observation period within each subject? I also have some NAs in the data, which should be ignored.

A simplified version of the expected output:

i  changes in region
1   1
...
45  2

i  changes in urban
1   0
...
45  2

I would then like to sum up the number of changes for region and urban.

I came across these answers: "Count number of changes in categorical variables during repeated measurements" and "Identify change in categorical data across datapoints in R", but I still don't get it.

Here's a part of the data for i = 4.

i   birth  gender  race  region  urban  wage   year  educ
4   62     2       3     1       1      NA     1979  9
4   62     2       3     NA      NA     NA     1980  9
4   62     2       3     4       1      0      1981  9
4   62     2       3     4       1      1086   1982  9
4   62     2       3     1       1      70     1983  9
4   62     2       3     1       1      0      1984  9
4   62     2       3     1       1      0      1985  9
4   62     2       3     1       1      7000   1986  9
4   62     2       3     1       1      17500  1987  9
4   62     2       3     1       1      21320  1988  9
4   62     2       3     1       1      21760  1989  9
4   62     2       3     1       1      0      1990  9
4   62     2       3     1       1      0      1991  9
4   62     2       3     1       1      30500  1992  9
4   62     2       3     1       1      33000  1993  9
4   62     2       3     NA      NA     NA     1994  9
4   62     2       3     4       1      35000  1996  9

Here, the output should be:

i  change_reg  change_urban
4  3           0

Answer

Here is something I hope will get you closer to what you need.

First, group by i. Then create a column that holds a 1 for each change in region, by comparing the current value of region with the previous value (using lag). Note that if the previous value is NA (as it is for the first row of a given i), the row is treated as no change.
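To make the comparison concrete, here is a minimal base R sketch on a single vector, using the region values for i = 4 from the question. It does not use dplyr: `prev` is a hand-built stand-in for what `lag()` would return, and the variable names are only illustrative.

```r
# 'region' for i = 4 from the question, with the two NA rows already removed
region <- c(1, 4, 4, rep(1, 11), 4)

# Previous value for each position; the first one has no predecessor
prev <- c(NA, head(region, -1))

# 0 when unchanged or when there is no previous value, 1 otherwise
reg_change <- ifelse(region == prev | is.na(prev), 0, 1)

sum(reg_change)  # 3, matching the expected output for i = 4
```

The three flagged positions are the moves 1 to 4, 4 to 1, and 1 to 4.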

The same approach is taken for urban. Then summarize, totaling up all the changes for each i. The intermediate columns reg_change and urban_change are kept as separate variables in the mutate step, so you can inspect them (by running the pipeline without the final summarize) to check that you are getting the desired results.

Edit: If you wish to remove rows that have NA for region or urban, you can add drop_na first.

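library(dplyr)
library(tidyr)

df_tot <- df %>%
  # drop rows where region or urban is missing
  drop_na(region, urban) %>%
  group_by(i) %>%
  # flag a change (1) when the value differs from the previous row;
  # the first row of each i has no previous value and counts as 0
  mutate(reg_change = ifelse(region == lag(region) | is.na(lag(region)), 0, 1),
         urban_change = ifelse(urban == lag(urban) | is.na(lag(urban)), 0, 1)) %>%
  # total number of changes per individual
  summarize(tot_region = sum(reg_change),
            tot_urban = sum(urban_change))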

# A tibble: 3 x 3
      i tot_region tot_urban
  <int>      <dbl>     <dbl>
1     1          1         0
2     4          3         0
3    45          2         2
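Why the NA rows are dropped first can be seen on a toy vector. This is a base R sketch, not part of the code above: `flag_changes` is a hypothetical helper that mirrors the ifelse() rule, and `prev` imitates `lag()` by hand.

```r
# Hypothetical helper mirroring the ifelse() rule used in the pipeline
flag_changes <- function(x) {
  prev <- c(NA, head(x, -1))            # previous value, NA for the first row
  ifelse(x == prev | is.na(prev), 0, 1)
}

x <- c(1, NA, 4)   # a move from 1 to 4 with a missing year in between

with_na    <- flag_changes(x)             # 0 NA 0
without_na <- flag_changes(x[!is.na(x)])  # 0 1

sum(with_na)                # NA: the undecidable comparison propagates
sum(with_na, na.rm = TRUE)  # 0: the 1 -> 4 move is lost
sum(without_na)             # 1: the move is counted once the NA is dropped
```

With the NA row kept, the middle comparison is neither clearly equal nor clearly different, so the sum is NA; skipping it with na.rm = TRUE would silently miss the move.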

Edit: Afterwards, to get a grand total for both the tot_region and tot_urban columns, you can use colSums. (Store your earlier result as df_tot, as above.)

colSums(df_tot[-1])

tot_region  tot_urban 
         6          2 
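For comparison, the same per-subject counts and grand totals can be sketched in base R without dplyr or tidyr. This is an alternative under stated assumptions, not the method used above: `count_changes` is a hypothetical helper, and the data frame holds only the region and urban values for i = 1 and i = 45 as shown in the question.

```r
# Rows for i = 1 and i = 45 as shown in the question (region and urban only)
df <- data.frame(
  i      = c(1, 1, 1, 1, 45, 45, 45, 45),
  region = c(1, 1, 2, 2, 3, 3, 4, 1),
  urban  = c(1, 1, 1, 1, 1, 1, 2, 1)
)

# Hypothetical helper: number of changes between consecutive non-NA values
count_changes <- function(x) {
  x <- x[!is.na(x)]
  sum(x[-1] != x[-length(x)])
}

# One row per subject; tapply() returns results in sorted order of i
df_tot <- data.frame(
  i          = sort(unique(df$i)),
  tot_region = as.vector(tapply(df$region, df$i, count_changes)),
  tot_urban  = as.vector(tapply(df$urban,  df$i, count_changes))
)

df_tot
colSums(df_tot[-1])  # grand totals across subjects
```

For these two subjects this gives 1 and 0 changes for i = 1, and 2 and 2 for i = 45. One difference from the dplyr version: the helper drops NAs separately for each column, whereas drop_na(region, urban) removes a row when either column is missing. In the data shown, region and urban are missing in the same rows, so the two agree.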
