Tidying data with several repeating variables in R

Problem description

I have a dataframe that looks like the following. There are several variables (like "c" and "z") with measurements for health, animals, enviro, and money. In the actual dataframe, there are many other columns that do not follow this pattern and are interspersed throughout.

id  c_health  c_animals  c_enviro  c_money  z_health  z_animals  z_enviro  z_money
1   3         2          4         5        7         9          6         8
2   2         3          5         4        8         7          6         9
3   4         1          2         3        9         6          8         7

I am trying to rearrange the data to make it "tidy". I am not sure what to do when there are several variables like in my current dataset. This is the kind of result I would eventually like to end up with:

id  c  z  message
1   3  7  health
1   2  9  animals
1   4  6  enviro
1   5  8  money
2   2  8  health
2   3  7  animals
2   5  6  enviro
2   4  9  money
3   4  9  health
3   1  6  animals
3   2  8  enviro
3   3  7  money

If the dataframe just included the following columns, I could make it tidy in the following way:

id  c_health  c_animals  c_enviro  c_money
1   3         2          4         5
2   2         3          5         4
3   4         1          2         3

df <- df %>%
   gather(., key = "question", value = "response", 2:5)

Recommended answer

You are on the right track with using gather, but need some additional steps to split the prefix off the column names. Try the following:

library(dplyr)
library(tidyr)

df = data.frame(
  id = c(1,2,3),
  c_health = c(3,2,4),
  c_animals = c(2,3,1),
  z_health = c(7,8,9),
  z_animals = c(9,7,6),
  stringsAsFactors = FALSE
)

output = df %>%
  # gather on all columns other than id
  gather(key = "question", value = "response", -all_of("id")) %>%
  # split off prefix and rest of column name
  mutate(prefix = substr(question,1,1),
         desc = substr(question,3,nchar(question))) %>%
  # keep just the columns of interest
  select(id, prefix, desc, response) %>%
  # reshape wider
  spread(prefix, response)

Update: the approach I suggested in a comment for prefixes of differing lengths does not return the correct answer, because [] indexing does not work that way inside mutate. The same idea with the correct syntax is as follows:

output = df %>%
  # gather on all columns other than id
  gather(key = "question", value = "response", -all_of("id")) %>%
  # split off prefix and rest of column name
  mutate(split = strsplit(question, "_")) %>%
  mutate(prefix = sapply(split, function(x){x[1]}),
         desc = sapply(split, function(x){x[2]})) %>%
  # keep just the columns of interest
  select(id, prefix, desc, response) %>%
  # reshape wider
  spread(prefix, response)
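
As an alternative sketch (not part of the original answer), newer versions of tidyr (1.0.0 and later) provide pivot_longer, which can split the column names and reshape in a single step. The special name ".value" tells it to use the part before the underscore as the output column name. The `age` column below is a hypothetical stand-in for the other columns that do not follow the pattern, and the regular expression assumes the only prefixes are "c" and "z":

```r
library(tidyr)

# Data from the question, plus a hypothetical `age` column standing in
# for the columns that do not follow the prefix_topic naming pattern
df <- data.frame(
  id = c(1, 2, 3),
  age = c(30, 41, 25),
  c_health = c(3, 2, 4),
  c_animals = c(2, 3, 1),
  c_enviro = c(4, 5, 2),
  c_money = c(5, 4, 3),
  z_health = c(7, 8, 9),
  z_animals = c(9, 7, 6),
  z_enviro = c(6, 6, 8),
  z_money = c(8, 9, 7)
)

tidy <- pivot_longer(
  df,
  # reshape only the columns that start with "c_" or "z_" (assumed prefixes)
  cols = matches("^(c|z)_"),
  # the part before "_" becomes a column name, the part after goes to `message`
  names_to = c(".value", "message"),
  names_sep = "_"
)

print(tidy)
```

Columns not matched by `cols` (here `id` and `age`) are carried along unchanged, so the pattern-free columns interspersed in the real data frame should not need to be listed by hand. If the real column names contain more than one underscore, names_sep would need to be replaced with a names_pattern that fits them.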
