从宽格式到长格式,结果多列 [英] from wide format to long format with results in multiple columns

查看:41
本文介绍了从宽格式到长格式,结果多列的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个类似于以下数据框的数据,但每个组合都有大约十个字段,从 name1、adress1、city1 等开始

I have a data that looks like the following dataframe, but every combo has about ten fields, starting with name1, adress1, city1, etc

   id name1  adress1 name2  adress2  name3  adress3
1  1  John street a  Burt street d  chris street 1
2  2  Jack street b   Ben street e connor street 2
3  3  Joey     <NA>   Bob street f   <NA>     <NA>  

现在我想重新排列这些数据,让它更有用一点,它看起来应该是这样,但是要包含它来自哪个条目的信息:

Now I would like to rearrange this data so it is a bit more useful and it should look like so, but with the information from which entry it came from:

      id origin  names adresses
1  1      1   John street a
2  2      1   Jack street b
3  3      1   Joey     <NA>
4  1      2   Burt street d
5  2      2    Ben street e
6  3      2    Bob street f
7  1      3  chris street 1
8  2      3 connor street 2     

使用 tidyr 我可以得到一个长格式,但是我有一个包含所有变量名称的键列,name1、name2、name3、street1 等.

Using tidyr I can get a long format, but then I have a key column that contains all the variable names, name1, name2, name3, street1, etc.

我还尝试使用单独的数据帧,每个组合一个,例如一个用于名称的数据框,一个用于街道等.但是然后将所有内容重新连接在一起会导致错误的记录,因为您只能连接 id 并且以长格式复制此 ID.我也一直在研究 Reshape2,但这导致了同样的问题.

I also tried using separate dataframes, one for each combination, e.g. one dataframe for the names, one for the streets, etc. But then joining everything back together results in the wrong records, because you can only join on id and in a long format this ID is replicated. I have also been looking into Reshape2, but that results in the same issue.

我所看到的所有宽到长的转换都是当您想要转换为一列时.我正在寻找 10 列的最终结果,或者如示例中的 2 列.

All the conversions of wide to long I have seen are when you have one column you want to convert to. I'm looking for the end result in 10 columns, or as in the example 2 columns.

是否有我忽略的功能?

#code to generete the dataframes:
df <- data.frame(id = c(1,2,3), 
                 name1 = c("John", "Jack", "Joey"), 
                 adress1 = c("street a", "street b", NA), 
                 name2 = c("Burt", "Ben", "Bob"),
                 adress2 = c("street d", "street e", "street f"),
                 name3 = c("chris", "connor", NA),
                 adress3 = c("street 1", "street 2", NA),
                 stringsAsFactors = FALSE)


expecteddf <- data.frame(id = c(1,2,3,1,2,3,1,2), 
                         origin = c(rep(1, 3), rep(2, 3), rep(3, 2)), 
                         names = c("John", "Jack", "Joey", "Burt", "Ben", "Bob", "chris", "connor"), 
                         adresses = c("street a", "street b", NA, "street d", "street e", "street f", "street 1", "street 2"),
                         stringsAsFactors = FALSE


                   )

推荐答案

我们可以使用 melt 来自 data.table 的 devel 版本,它可以接受多个 measure 列的模式.安装'data.table'开发版的说明是here

We could use melt from the devel version of data.table which can take multiple patterns for the measure columns. Instructions to install the devel version of 'data.table' is here

我们将'data.frame'转换为'data.table'(setDT(df)),melt,并指定regexmeasure 参数的 patterns 中.删除 'names' 和 'address' 列的 NA 行.

We convert the 'data.frame' to 'data.table' (setDT(df)), melt, and specify the regex in the patterns of measure argument. Remove the rows that are NA for the 'names' and 'address' column.

library(data.table)#v1.9.5+
dM <- melt(setDT(df), measure=patterns(c('^name', '^adress')),
          value.name=c('names', 'address') )
dM[!(is.na(names) & is.na(address))]
# id variable  names  address
#1:  1        1   John street a
#2:  2        1   Jack street b
#3:  3        1   Joey       NA
#4:  1        2   Burt street d
#5:  2        2    Ben street e
#6:  3        2    Bob street f
#7:  1        3  chris street 1
#8:  2        3 connor street 2

<小时>

或者我们可以使用 base R 中的 reshape.

 dM2 <- reshape(df, idvar='id', varying=list(grep('name', names(df)), 
             grep('adress', names(df))), direction='long')

NA 行可以像在 data.table 解决方案中一样,在我们使用 创建逻辑索引后,使用标准的data.frame"索引来删除.na.

The NA rows can be removed as in the data.table solution by using standard 'data.frame' indexing after we create the logical index with is.na.

这篇关于从宽格式到长格式,结果多列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆