在 R 中将多列堆叠/熔化为多列 [英] stacking/melting multiple columns into multiple columns in R
问题描述
我正在尝试将数据帧的多个特定列融合/堆叠/收集到 2 列中,并保留所有其他列.我在stackoverflow上尝试了很多很多答案都没有成功(下面有一些).我基本上有一个类似于这篇文章的情况:重塑多组测量列(宽格式)成单列(长格式)只有更多的列要保留和组合.重要的是要提到我的年份列是因素,而且我的列比下面列出的示例多很多,所以我想称列名称而不是位置.
I am trying to melt/stack/gather multiple specific columns of a dataframe into 2 columns, retaining all the others. I have tried many, many answers on stackoverflow without success (some below). I basically have a situation similar to this post here: Reshaping multiple sets of measurement columns (wide format) into single columns (long format) only many more columns to retain and combine. It is important to mention my year columns are factors and I have many, many more columns than the sample listed below so I want to call column names not positions.
>df
ID Code Country year.x value.x year.y value.y year.x.x value.x.x
1 A USA 2000 34.33422 2001 35.35241 2002 42.30042
1 A Spain 2000 34.71842 2001 39.82727 2002 43.22209
3 B USA 2000 35.98180 2001 37.70768 2002 44.40232
3 B Peru 2000 33.00000 2001 37.66468 2002 41.30232
4 C Argentina 2000 37.78005 2001 39.25627 2002 45.72927
4 C Peru 2000 40.52575 2001 40.55918 2002 46.62914
我根据上面的帖子尝试在 tidyr 中使用 pivot_longer ,这看起来非常相似,这导致了各种错误,具体取决于我所做的:
I tried using the pivot_longer in tidyr based on the post above which seemed very similar, which resulted in various errors depending on what I did:
pivot_longer(df,
cols = -c(ID, Code, Country),
names_to = c(".value", "group"),
names_sep = ".")
我还以各种方式在 reshape2 中使用了melt,这些方式要么只熔化了值列,要么只熔化了年份列.如:
I also played with melt in reshape2 in various ways which either melted only the values columns or only the years columns. Such as:
new.df <- reshape2:::melt(df, id.var = c("ID", "Code", "Country"), measure.vars=c("value.x", "value.y", "value.x.x", "value.y.y", "value.x.x.x", "value.y.y.y"), value.name = "value", variable.vars=c('year.x','year.y', "year.x.x", "year.y.y", "year.x.x.x", "year.y.y.y", "value.x", variable.name = "year")
我也尝试过基于其他帖子的 dplyr gather,但我发现很难理解帮助页面和帖子.明确我希望实现的目标:
I also tried dplyr gather based on other posts but I find it extremely difficult to understand the help page and posts. To be clear what I am looking to achieve:
ID Code Country year value
1 A USA 2000 34.33422
1 A Spain 2000 34.71842
3 B USA 2000 35.98180
3 B Peru 2000 33.00000
4 C Argentina2000 37.78005
4 C Peru 2000 40.52575
1 A USA 2001 35.35241
1 A Spain 2001 39.82727
3 B USA 2001 37.70768
3 B Peru 2001 37.66468
4 C Argentina2001 39.25627
4 C Peru 2001 40.55918
1 A USA 2002 42.30042
etc.
我非常感谢这里的帮助.
I really appreciate the help here.
推荐答案
我们可以指定names_pattern
library(tidyr)
library(dplyr)
df %>%
pivot_longer(cols = -c(ID, Code, Country),
names_to = c(".value", "group"),names_pattern = "(.*)\\.(.*)")
<小时>
或者根据 ?pivot_longer
names_sep - names_sep 采用与separate() 相同的规范,可以是数字向量(指定要中断的位置),也可以是单个字符串(指定要拆分的正则表达式).
names_sep - names_sep takes the same specification as separate(), and can either be a numeric vector (specifying positions to break on), or a single string (specifying a regular expression to split on).
这意味着默认情况下正则表达式是 on 并且正则表达式中的 .
匹配任何字符而不是文字点.要获取文字值,请转义或将其放在方括号内
which implies that by default the regex is on and the .
in regex matches any character and not the literal dot. To get the literal value, either escape or place it inside square bracket
pivot_longer(df,
cols = -c(ID, Code, Country),
names_to = c(".value", "group"),
names_sep = "\\.")
# A tibble: 18 x 6
# ID Code Country group year value
# <int> <chr> <chr> <chr> <int> <dbl>
# 1 1 A USA x 2000 34.3
# 2 1 A USA y 2001 35.4
# 3 1 A USA z 2002 42.3
# 4 1 A Spain x 2000 34.7
# 5 1 A Spain y 2001 39.8
# 6 1 A Spain z 2002 43.2
# 7 3 B USA x 2000 36.0
# 8 3 B USA y 2001 37.7
# 9 3 B USA z 2002 44.4
#10 3 B Peru x 2000 33
#11 3 B Peru y 2001 37.7
#12 3 B Peru z 2002 41.3
#13 4 C Argentina x 2000 37.8
#14 4 C Argentina y 2001 39.3
#15 4 C Argentina z 2002 45.7
#16 4 C Peru x 2000 40.5
#17 4 C Peru y 2001 40.6
#18 4 C Peru z 2002 46.6
更新
对于更新后的数据集
Update
For the updated dataset
library(stringr)
df2 %>%
rename_at(vars(matches("year|value")), ~
str_replace(., "^([^.]+\\.[^.]+)\\.([^.]+)$", "\\1\\2")) %>%
pivot_longer(cols = -c(ID, Code, Country),
names_to = c(".value", "group"),names_pattern = "(.*)\\.(.*)")
或者不使用 rename
,使用正则表达式查找
Or without the rename
, use regex lookaround
df2 %>%
pivot_longer(cols = -c(ID, Code, Country),
names_to = c(".value", "group"),
names_sep = "(?<=year|value)\\.")
数据
df <- structure(list(ID = c(1L, 1L, 3L, 3L, 4L, 4L), Code = c("A",
"A", "B", "B", "C", "C"), Country = c("USA", "Spain", "USA",
"Peru", "Argentina", "Peru"), year.x = c(2000L, 2000L, 2000L,
2000L, 2000L, 2000L), value.x = c(34.33422, 34.71842, 35.9818,
33, 37.78005, 40.52575), year.y = c(2001L, 2001L, 2001L, 2001L,
2001L, 2001L), value.y = c(35.35241, 39.82727, 37.70768, 37.66468,
39.25627, 40.55918), year.z = c(2002L, 2002L, 2002L, 2002L, 2002L,
2002L), value.z = c(42.30042, 43.22209, 44.40232, 41.30232, 45.72927,
46.62914)), class = "data.frame", row.names = c(NA, -6L))
df2 <- structure(list(ID = c(1L, 1L, 3L, 3L, 4L, 4L), Code = c("A",
"A", "B", "B", "C", "C"), Country = c("USA", "Spain", "USA",
"Peru", "Argentina", "Peru"), year.x = c(2000L, 2000L, 2000L,
2000L, 2000L, 2000L), value.x = c(34.33422, 34.71842, 35.9818,
33, 37.78005, 40.52575), year.y = c(2001L, 2001L, 2001L, 2001L,
2001L, 2001L), value.y = c(35.35241, 39.82727, 37.70768, 37.66468,
39.25627, 40.55918), year.x.x = c(2002L, 2002L, 2002L, 2002L,
2002L, 2002L), value.x.x = c(42.30042, 43.22209, 44.40232, 41.30232,
45.72927, 46.62914)), class = "data.frame", row.names = c(NA,
-6L))
这篇关于在 R 中将多列堆叠/熔化为多列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!