Retain attributes when using gather from tidyr (attributes are not identical)

Question

I have a data frame that needs to be split into two tables to satisfy Codd's 3rd normal form. In a simple case, the original data frame looks something like this:

library(lubridate)
(df <- data.frame(hh_id = 1:2,
                  income = c(55000, 94000),
                  bday_01 = ymd(c(20150309, 19890211)),
                  bday_02 = ymd(c(19850911, 20000815)),
                  gender_01 = factor(c("M", "F")),
                  gender_02 = factor(c("F", "F"))))

    hh_id income    bday_01    bday_02 gender_01 gender_02
  1     1  55000 2015-03-09 1985-09-11         M         F
  2     2  94000 1989-02-11 2000-08-15         F         F

When I use the gather function, it warns that the attributes are not identical, and it drops the factor class of the gender columns and the date class of the bday columns (or other attributes in the real-world example). Is there a nice tidyr solution that avoids losing each column's data type?

library(dplyr)  # provides %>% and select()
library(tidyr)
(person <- df %>% 
      select(hh_id, bday_01:gender_02) %>% 
      gather(key, value, -hh_id) %>%
      separate(key, c("key", "per_num"), sep = "_") %>%
      spread(key, value))

     hh_id per_num       bday gender
   1     1      01 1425859200      M
   2     1      02  495244800      F
   3     2      01  603158400      F
   4     2      02  966297600      F

   Warning message:
   attributes are not identical across measure variables; they will be dropped

lapply(person, class)

  $hh_id
  [1] "integer"

  $per_num
  [1] "character"

  $bday
  [1] "character"

  $gender
  [1] "character"

I can imagine a way to do it by gathering each set of variables with the same data type separately and then joining all the tables, but there must be a more elegant solution that I'm missing.

Recommended answer

You could just convert your dates to character then convert them back to dates at the end:

(person <- df %>% 
      select(hh_id, bday_01:gender_02) %>% 
      mutate(across(contains('bday'), as.character)) %>%  # across() replaces the deprecated mutate_each()/funs()
      gather(key, value, -hh_id) %>%
      separate(key, c("key", "per_num"), sep = "_") %>%
      spread(key, value) %>%
      mutate(bday=ymd(bday)))

  hh_id per_num       bday gender
1     1      01 2015-03-09      M
2     1      02 1985-09-11      F
3     2      01 1989-02-11      F
4     2      02 2000-08-15      F

Alternatively, if you use Date instead of POSIXct, you could do something like this:

(person <- df %>% 
      select(hh_id, bday_01:gender_02) %>% 
      gather(per_num1, gender, contains('gender'), convert=TRUE) %>%
      gather(per_num2, bday, contains('bday'), convert=TRUE) %>%
      mutate(bday=as.Date(bday)) %>%
      mutate(across(c(per_num1, per_num2), ~ stringr::str_extract(.x, '\\d+'))) %>%
      filter(per_num1 == per_num2) %>%
      rename(per_num=per_num1) %>%
      select(-per_num2))
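
As a side note for readers on newer package versions: in tidyr 1.0.0 and later, gather() is superseded by pivot_longer(), whose ".value" sentinel in names_to splits each column name into an output column name and a key. Because every output column is then built only from source columns of the same kind, this should keep bday as a date and gender as a factor. The sketch below is not part of the original answer; it rebuilds the example data with base as.Date() instead of lubridate::ymd() (an assumption made to keep it self-contained) and drops income with subset().

```r
library(tidyr)  # assumes tidyr >= 1.0.0 for pivot_longer()

# Example data from the question; as.Date() stands in for lubridate::ymd()
df <- data.frame(hh_id = 1:2,
                 income = c(55000, 94000),
                 bday_01 = as.Date(c("2015-03-09", "1989-02-11")),
                 bday_02 = as.Date(c("1985-09-11", "2000-08-15")),
                 gender_01 = factor(c("M", "F")),
                 gender_02 = factor(c("F", "F")))

# "bday_01" is split at "_": the first piece ("bday") names the output
# column via ".value", the second piece ("01") goes into per_num
person <- pivot_longer(subset(df, select = -income),
                       cols = -hh_id,
                       names_to = c(".value", "per_num"),
                       names_sep = "_")

# Inspect the resulting column classes
sapply(person, class)
```

The two gender factors have different level sets; pivot_longer() combines them into a single factor whose levels are the union of both, so no manual level synchronization is needed here.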

Edit

The warning you are seeing,

Warning: attributes are not identical across measure variables; they will be dropped

arises from gathering the gender columns, which are factors with different level vectors (see str(df)). If you convert the gender columns to character, or synchronize their levels with something like

df <- mutate(df, gender_02 = factor(gender_02, levels=levels(gender_01)))

then you will see that the warning goes away when you execute

person <- df %>% 
        select(hh_id, bday_01:gender_02) %>% 
        gather(key, value, contains('gender'))
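
The level mismatch behind the warning can be checked directly in base R. The following is a minimal sketch using standalone vectors that mirror the gender columns of the example; the names same_before, gender_02_synced and same_after are illustrative only.

```r
# Standalone copies of the two gender columns from the example
gender_01 <- factor(c("M", "F"))
gender_02 <- factor(c("F", "F"))

levels(gender_01)  # "F" "M"
levels(gender_02)  # "F"

# The attributes (levels and class) differ, which is what gather() complains about
same_before <- identical(attributes(gender_01), attributes(gender_02))  # FALSE

# Re-create gender_02 with the levels of gender_01; the values are unchanged
gender_02_synced <- factor(gender_02, levels = levels(gender_01))
same_after <- identical(attributes(gender_01), attributes(gender_02_synced))  # TRUE
```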
