How do I gather 2 sets of columns in tidyr

Question

I have the following structure:

key | category_x | 2009 | category_y | 2010
test

Sample data, as requested:

set.seed(24)
df <- data.frame(
  key = 1:10,
  category_x = paste0("stock_", 0:9),
  '2008' = rnorm(10, 0, 10),
  category_y = paste0("stock_", 0:9),
  '2009' = rnorm(10, 0, 10),
  category_z = paste0("stock_", 0:9),
  '2010' = rnorm(10, 0, 10),
  check.names = FALSE
)

How do I change it to:

key | category | year

I know I can use:

library(magrittr)
library(dplyr)
library(tidyr)

data %>% gather(key, category, starts_with("category_"))

but that doesn't deal with the year. I looked at "Gather multiple sets of columns", but I don't understand the extract and spread commands used there.

Recommended answer

If we are using gather, we can do this in two steps. First, we reshape from 'wide' to 'long' format for the columns whose names start with 'category'. In the next step, we do the same for the numeric column names, selecting them with matches. matches accepts regex patterns, so the pattern ^[0-9]+$ matches names made up of one or more digits ([0-9]+) from the start (^) to the end ($) of the string. Finally, we remove the columns that are not needed with select.

library(tidyr)
library(dplyr) 
gather(df, key, category, starts_with('category_')) %>%
     gather(key2, year, matches('^[0-9]+$')) %>%
     select(-starts_with('key'))
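
To see which columns the two selectors pick up, the same tests can be run on the column names alone. This is a small base R sketch rather than part of the original answer: startsWith and grepl stand in for starts_with and matches (the tidyselect helpers ignore case by default, which makes no difference for these names), and cols, cat_cols and year_cols are names chosen here for illustration.

```r
# Column names from the question's sample data
cols <- c("key", "category_x", "2008", "category_y", "2009", "category_z", "2010")

# Rough equivalent of starts_with("category_"): a literal prefix test
cat_cols <- cols[startsWith(cols, "category_")]

# Rough equivalent of matches("^[0-9]+$"): names consisting only of digits
year_cols <- cols[grepl("^[0-9]+$", cols)]

print(cat_cols)   # the three category_* columns
print(year_cols)  # the three year columns
```

Because the pattern is anchored at both ends, a name such as "2008_q1" or "y2008" would not be selected as a year column.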

Or, using the devel version of data.table, this is much easier, as melt can take multiple patterns for the measure columns. We convert the 'data.frame' to a 'data.table' (setDT(df)), use melt, and specify the patterns within the measure argument. We also have the option to change the column names of the 'value' columns. The 'variable' column is set to NULL, as it is not needed in the expected output.

library(data.table)  # v1.9.5+
melt(setDT(df), measure=patterns(c('^category', '^[0-9]+$')), 
           value.name=c('category', 'year'))[, variable:=NULL][]
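
One difference between the two approaches is worth keeping in mind. Chaining two gather calls reshapes each set of columns independently, so every category column ends up combined with every year column (10 x 3 x 3 = 90 rows for the sample data). melt with two patterns should instead pair the first category column with the first year column, the second with the second, and so on (30 rows). The base R sketch below illustrates that positional pairing without either package. It is an illustration under stated assumptions, not the answer's method: it assumes the i-th category column belongs with the i-th year column, and cat_cols, year_cols and long are names chosen here.

```r
# Sample data from the question; stringsAsFactors = FALSE keeps categories as character
set.seed(24)
df <- data.frame(
  key = 1:10,
  category_x = paste0("stock_", 0:9),
  "2008" = rnorm(10, 0, 10),
  category_y = paste0("stock_", 0:9),
  "2009" = rnorm(10, 0, 10),
  category_z = paste0("stock_", 0:9),
  "2010" = rnorm(10, 0, 10),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Pick the two sets of columns with the same patterns used for melt
cat_cols  <- grep("^category", names(df), value = TRUE)
year_cols <- grep("^[0-9]+$", names(df), value = TRUE)

# Assumption: the i-th category column pairs with the i-th year column
stopifnot(length(cat_cols) == length(year_cols))

# Stack each set column by column, repeating the key once per pair
long <- data.frame(
  key      = rep(df$key, times = length(cat_cols)),
  category = unlist(df[cat_cols], use.names = FALSE),
  year     = unlist(df[year_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

nrow(long)  # 30 rows: 10 keys x 3 column pairs
```

If the year label itself (2008, 2009, 2010) is also needed, it could be added as another column with rep(year_cols, each = nrow(df)).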
