How do I gather 2 sets of columns in tidyr
Question
I have the following structure:
key | category_x | 2009 | category_y | 2010
test
Sample data, as requested:
set.seed(24)
df <- data.frame(
key = 1:10,
category_x = paste0("stock_", 0:9),
'2008' = rnorm(10, 0, 10),
category_y = paste0("stock_", 0:9),
'2009' = rnorm(10, 0, 10),
category_z = paste0("stock_", 0:9),
'2010' = rnorm(10, 0, 10),
check.names=FALSE
)
How do I change it to:
key | category | year
I know I can use:
library(magrittr)
library(dplyr)
library(tidyr)
df %>% gather(key, category, starts_with("category_"))
but that doesn't deal with the year. I looked at "Gather multiple sets of columns", but I don't understand the extract and spread commands.
Recommended answer
If we are using gather, we can do this in two steps. First, we reshape the columns whose names start with 'category' from 'wide' to 'long' format; in the next step, we do the same for the numeric column names by selecting them with matches. matches accepts regex patterns, so the pattern ^[0-9]+$ matches names made up of one or more digits ([0-9]+) from the start (^) to the end ($) of the string. We can remove the columns that are not needed with select.
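library(tidyr)
library(dplyr)
# Step 1: stack the category_* columns; step 2: stack the all-digit columns
gather(df, key, category, starts_with('category_')) %>%
  gather(key2, year, matches('^[0-9]+$')) %>%
  select(-starts_with('key'))  # drop every column whose name starts with 'key'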
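To see which columns each selector picks up, the same tests can be run on the column names directly in base R. This is only an illustrative sketch: grepl and startsWith stand in for matches and starts_with, and the names cols, year_cols and category_cols are made up for the example.

```r
# Column names from the sample data in the question
cols <- c("key", "category_x", "2008", "category_y", "2009", "category_z", "2010")

# Keep only names that consist entirely of digits, as matches('^[0-9]+$') would
year_cols <- cols[grepl("^[0-9]+$", cols)]

# Keep only names with the given prefix, roughly what starts_with('category_') does
category_cols <- cols[startsWith(cols, "category_")]

print(year_cols)
print(category_cols)
```

Because the pattern is anchored at both ends, a name such as category_2008 would not be selected as a year column.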
Or, using the devel version of data.table, this is much easier because melt can take multiple patterns for the measure columns. We convert the 'data.frame' to a 'data.table' (setDT(df)), use melt, and specify the patterns within the measure argument. We can also set the column names of the 'value' columns with value.name. The 'variable' column is set to NULL because it is not needed in the expected output.
library(data.table)  # v1.9.5+
melt(setDT(df), measure=patterns(c('^category', '^[0-9]+$')),
value.name=c('category', 'year'))[, variable:=NULL][]
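For readers who want to see what melting two sets of columns at once amounts to, here is a rough base-R sketch. It assumes that the k-th column matching the first pattern is paired with the k-th column matching the second, which is how the multi-pattern melt is understood to behave here. The names category_cols, year_cols and long are invented for the illustration, and stringsAsFactors = FALSE is added so that the result does not depend on the R version.

```r
# Sample data from the question
set.seed(24)
df <- data.frame(
  key = 1:10,
  category_x = paste0("stock_", 0:9),
  "2008" = rnorm(10, 0, 10),
  category_y = paste0("stock_", 0:9),
  "2009" = rnorm(10, 0, 10),
  category_z = paste0("stock_", 0:9),
  "2010" = rnorm(10, 0, 10),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Find the two sets of columns with the same patterns as in the answer
category_cols <- grep("^category", names(df), value = TRUE)
year_cols <- grep("^[0-9]+$", names(df), value = TRUE)

# Pair the k-th category column with the k-th numeric column, then stack the pairs
pairs <- lapply(seq_along(category_cols), function(k) {
  data.frame(
    key = df$key,
    category = df[[category_cols[k]]],
    year = df[[year_cols[k]]],
    stringsAsFactors = FALSE
  )
})
long <- do.call(rbind, pairs)

str(long)
```

Each category value stays on the same row as the number that sat next to it in the wide table, so three pairs of columns over 10 rows give 30 rows.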