Gather multiple sets of columns

Problem description

I have data from an online survey where respondents go through a loop of questions 1-3 times. The survey software (Qualtrics) records this data in multiple columns—that is, Q3.2 in the survey will have columns Q3.2.1., Q3.2.2., and Q3.2.3.:

df <- data.frame(
  id = 1:10,
  time = as.Date('2009-01-01') + 0:9,
  Q3.2.1. = rnorm(10, 0, 1),
  Q3.2.2. = rnorm(10, 0, 1),
  Q3.2.3. = rnorm(10, 0, 1),
  Q3.3.1. = rnorm(10, 0, 1),
  Q3.3.2. = rnorm(10, 0, 1),
  Q3.3.3. = rnorm(10, 0, 1)
)

# Sample data

   id       time    Q3.2.1.     Q3.2.2.    Q3.2.3.     Q3.3.1.    Q3.3.2.     Q3.3.3.
1   1 2009-01-01 -0.2059165 -0.29177677 -0.7107192  1.52718069 -0.4484351 -1.21550600
2   2 2009-01-02 -0.1981136 -1.19813815  1.1750200 -0.40380049 -1.8376094  1.03588482
3   3 2009-01-03  0.3514795 -0.27425539  1.1171712 -1.02641801 -2.0646661 -0.35353058
...

I want to combine all the QN.N* columns into tidy individual QN.N columns, ultimately ending up with something like this:

   id       time loop_number        Q3.2        Q3.3
1   1 2009-01-01           1 -0.20591649  1.52718069
2   2 2009-01-02           1 -0.19811357 -0.40380049
3   3 2009-01-03           1  0.35147949 -1.02641801
...
11  1 2009-01-01           2 -0.29177677  -0.4484351
12  2 2009-01-02           2 -1.19813815  -1.8376094
13  3 2009-01-03           2 -0.27425539  -2.0646661
...
21  1 2009-01-01           3 -0.71071921 -1.21550600
22  2 2009-01-02           3  1.17501999  1.03588482
23  3 2009-01-03           3  1.11717121 -0.35353058
...

The tidyr library has the gather() function, which works great for combining one set of columns:

library(dplyr)
library(tidyr)
library(stringr)

df %>% gather(loop_number, Q3.2, starts_with("Q3.2")) %>% 
  mutate(loop_number = str_sub(loop_number,-2,-2)) %>%
  select(id, time, loop_number, Q3.2)


   id       time loop_number        Q3.2
1   1 2009-01-01           1 -0.20591649
2   2 2009-01-02           1 -0.19811357
3   3 2009-01-03           1  0.35147949
...
29  9 2009-01-09           3 -0.58581232
30 10 2009-01-10           3 -2.33393981

The resulting data frame has 30 rows, as expected (10 individuals, 3 loops each). However, gathering a second set of columns does not work correctly. It does produce the two combined columns Q3.2 and Q3.3, but it ends up with 90 rows instead of 30: every combination of 10 individuals, 3 loops of Q3.2, and 3 loops of Q3.3. The number of combinations will grow substantially with each additional group of columns in the actual data:

df %>% gather(loop_number, Q3.2, starts_with("Q3.2")) %>% 
  gather(loop_number, Q3.3, starts_with("Q3.3")) %>%
  mutate(loop_number = str_sub(loop_number,-2,-2))


   id       time loop_number        Q3.2        Q3.3
1   1 2009-01-01           1 -0.20591649  1.52718069
2   2 2009-01-02           1 -0.19811357 -0.40380049
3   3 2009-01-03           1  0.35147949 -1.02641801
...
89  9 2009-01-09           3 -0.58581232 -0.13187024
90 10 2009-01-10           3 -2.33393981 -0.48502131
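
To make the source of the 90 rows concrete, here is a small sketch. It is not part of the original question: the key names loop_a and loop_b, the helper loop_of(), and the fixed seed are all assumptions introduced for illustration. Giving each gather() call its own key column shows that the second call pairs every Q3.2 loop with every Q3.3 loop, and that keeping only the rows where the two loop numbers agree brings the result back to 30 rows.

```r
library(dplyr)
library(tidyr)

set.seed(42)  # assumption: fixed seed so the sketch is reproducible
df <- data.frame(
  id = 1:10,
  time = as.Date("2009-01-01") + 0:9,
  Q3.2.1. = rnorm(10), Q3.2.2. = rnorm(10), Q3.2.3. = rnorm(10),
  Q3.3.1. = rnorm(10), Q3.3.2. = rnorm(10), Q3.3.3. = rnorm(10)
)

# Hypothetical helper: the loop number is the second-to-last character
# of a column name such as "Q3.2.1."
loop_of <- function(x) substr(x, nchar(x) - 1, nchar(x) - 1)

# Two separate key columns (loop_a, loop_b) expose the cross product
crossed <- df %>%
  gather(loop_a, Q3.2, starts_with("Q3.2")) %>%
  gather(loop_b, Q3.3, starts_with("Q3.3")) %>%
  mutate(loop_a = loop_of(loop_a), loop_b = loop_of(loop_b))

nrow(crossed)  # 10 ids x 3 loops x 3 loops = 90

# Keep only the rows where both values come from the same loop
matched <- crossed %>%
  filter(loop_a == loop_b) %>%
  select(id, time, loop_number = loop_a, Q3.2, Q3.3)

nrow(matched)  # 10 ids x 3 loops = 30
```

This works, but the intermediate table grows multiplicatively with every additional group of columns, so it is probably only practical for a handful of groups.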

Is there a way to use multiple calls to gather() like this, combining small subsets of columns while maintaining the correct number of rows?

Solution

This approach seems pretty natural to me:

df %>%
  gather(key, value, -id, -time) %>%
  extract(key, c("question", "loop_number"), "(Q.\\..)\\.(.)") %>%
  spread(question, value)

First gather all question columns, use extract() to split the key into question and loop_number, then spread() question back into columns.

#>    id       time loop_number         Q3.2        Q3.3
#> 1   1 2009-01-01           1  0.142259203 -0.35842736
#> 2   1 2009-01-01           2  0.061034802  0.79354061
#> 3   1 2009-01-01           3 -0.525686204 -0.67456611
#> 4   2 2009-01-02           1 -1.044461185 -1.19662936
#> 5   2 2009-01-02           2  0.393808163  0.42384717
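
For readers on a newer tidyr (1.0.0 or later), the same reshaping can likely be written in one step with pivot_longer(). The sketch below is an alternative to the answer above, not the answer's own method. It assumes the column names follow the pattern "Q<number>.<number>.<loop>." and uses a fixed seed for reproducibility. The special ".value" entry in names_to tells pivot_longer() to use the first regex group as the output column name, so no separate spread() step is needed.

```r
library(tidyr)

set.seed(42)  # assumption: fixed seed so the sketch is reproducible
df <- data.frame(
  id = 1:10,
  time = as.Date("2009-01-01") + 0:9,
  Q3.2.1. = rnorm(10), Q3.2.2. = rnorm(10), Q3.2.3. = rnorm(10),
  Q3.3.1. = rnorm(10), Q3.3.2. = rnorm(10), Q3.3.3. = rnorm(10)
)

long <- pivot_longer(
  df,
  cols = -c(id, time),
  # first group -> name of the value column, second group -> loop number
  names_to = c(".value", "loop_number"),
  # \\d+ (rather than a single ".") also handles multi-digit numbers
  names_pattern = "(Q\\d+\\.\\d+)\\.(\\d+)\\."
)

# The loop number is extracted as text; convert it to an integer
long$loop_number <- as.integer(long$loop_number)

dim(long)  # 30 rows: 10 ids x 3 loops; 5 columns
```

Compared with gather() followed by spread(), this avoids building the fully long intermediate table, which may matter when the real survey has many question groups.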
