使用 pivot_longer 将宽列收集到多个长列中 [英] Gathering wide columns into multiple long columns using pivot_longer

查看:68
本文介绍了使用 pivot_longer 将宽列收集到多个长列中的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有使用gather从宽转换为长的代码,但我必须逐列执行此操作.我想使用 pivot_longer 将宽多列收集到多个长列中,而不是逐列.

例如,列 hf_1、hf_2、hf_3、hf_4、hf_5、hf_6 需要转为 2 列(hf_com - 此列的值来自宽 hf 列,值为 1、2、3、4、5、6)和(hf_com_freq - 此列的值为 1).

同样需要对列 ac_1、ac_2、ac_3、ac_4、ac_5、ac_6 进行处理.这些列需要转换为 2 列(ac_com - 此列的值来自宽 ac 列,值为 1,2,3,4,5,6)和(ac_com_freq - 此列的值为 1).

对于 cs_1、cs_2、cs_3、cs_4、cs_5、cs_6、cs_7、cs_8 列也需要进行同样的操作.这些列需要转换为 2 列(cur_sal - 此列的值来自宽 cs 列,值为 1,2,3,4,5,6,7,8)和(curr_sal_freq - 此列的值为 1).

对于 se_1、se_2、se_3、se_4、se_5、se_6、se_7、se_8 列也需要进行同样的操作.这些列需要转换为 2 列(exp_sal - 此列的值是 1,2,3,4,5,6,7,8 来自宽 se 列)和(exp_sal_freq - 此列的值为 1)

我试过看:

收集多组列>

和:

重塑多组测量列(宽格式)转换为单列(长格式)

但我无法获得所需的多个更长的列.

这里是输入数据:

structure(list(np_id = c("81", "83", "85", "88", "1", "2"), np_look_work = c("yes","yes", "yes", "yes", "yes", "yes"), np_current_work = c("no","yes", "no", "no", "no", "no"), np_before_work = c("no", "NULL","yes", "yes", "yes", "yes"), np_keen_move = c("yes", "yes", "no","no", "no", "no"), np_city_size = c("village", "超过 50 万居民",超过 50 万居民"、村庄"、城市达 2 万居民"、"100k - 199k 居民"), np_gender = c("male", "female", "female","男", "女", "男"), np_age = c("18 - 24 岁", "18 - 24 岁",超过 50 年"、18 - 24 年"、31 - 40 年"、超过 50 年"), np_education = c("中学", "职业", "中学", "中学","次要", "次要"), hf_1 = c(NA, NA, NA, "1", "1", "1"), hf_2 = c(NA, "1", "1", NA, NA, NA), hf_3 = c(NA_character_,NA_character_、NA_character_、NA_character_、NA_character_、NA_character_), hf_4 = c(NA_character_, NA_character_, NA_character_, NA_character_,NA_character_, NA_character_), hf_5 = c(NA_character_, NA_character_,NA_character_、NA_character_、NA_character_、NA_character_)、hf_6 = c(NA_character_, NA_character_, NA_character_, NA_character_,NA_character_, NA_character_), ac_1 = c(NA, NA, NA, NA, NA,"1"), ac_2 = c(NA, "1", "1", NA, "1", NA), ac_3 = c(NA, NA,NA, NA, "1", NA), ac_4 = c(NA_character_, NA_character_,NA_character_、NA_character_、NA_character_、NA_character_), ac_5 = c(NA_character_, NA_character_, NA_character_,NA_character_, NA_character_, NA_character_), ac_6 = c(NA_character_,NA_character_、NA_character_、NA_character_、NA_character_、NA_character_), cs_1 = c(NA_character_, NA_character_, NA_character_,NA_character_, NA_character_, NA_character_), cs_2 = c(NA,"1", "1", NA, "1", NA), cs_3 = c(NA, NA, NA, NA, NA, "1"),cs_4 = c(NA, NA, NA, "1", NA, NA), cs_5 = c(NA_character_,NA_character_、NA_character_、NA_character_、NA_character_、NA_character_), cs_6 = c(NA_character_, NA_character_, NA_character_,NA_character_, NA_character_, NA_character_), cs_7 = c(NA_character_,NA_character_、NA_character_、NA_character_、NA_character_、NA_character_), cs_8 = c(NA_character_, NA_character_, NA_character_,NA_character_, NA_character_, NA_character_), se_1 = c(NA,NA, "1", "1", NA, NA), se_2 = c(NA, NA, NA, NA, "1", NA),se_3 = c(NA, "1", NA, NA, "1", "1"), se_4 = c(NA_character_,NA_character_、NA_character_、NA_character_、NA_character_、NA_character_), se_5 = c(NA_character_, NA_character_, NA_character_,NA_character_, NA_character_, NA_character_), se_6 = c(NA_character_,NA_character_、NA_character_、NA_character_、NA_character_、NA_character_), se_7 = c(NA_character_, NA_character_, NA_character_,NA_character_, NA_character_, NA_character_), se_8 = c(NA,NA, NA, "1", NA, NA)), row.names = c(NA, 6L), class = "data.frame")

这是我试过的代码:

# 将 CSV 读入 R###MyData <- read.csv(file="task.csv", header=TRUE, sep=",", stringsAsFactors = F)# str(MyData)# 设置为字符串不是因素的一般选项选项(stringsAsFactors = FALSE)# 保存所有列名old_names1 <- colnames(MyData)new_names1 = c("np_id","np_look_work", "np_current_work","hf_1", "ac_1","hf_2", "ac_2","hf_3", "ac_3","hf_4", "ac_4","hf_5", "ac_5","hf_6", "ac_6","cs_1", "se_1","cs_2", "se_2","cs_3", "se_3","cs_4", "se_4","cs_5", "se_5","cs_6", "se_6","cs_7", "se_7","cs_8", "se_8","np_before_work","np_keen_move","np_city_size","np_gender","np_age","np_education")# 使用 old_names 和 new_names 向量重命名列df <- MyData %>% rename_at(vars(old_names1), ~ new_names1)# 为列look_job 使用yes 过滤行# 代码需要 dplyr::filter 以确保不使用 bas 过滤器df <- dplyr::filter(df, np_look_work == "yes")# 匹配所有以job结尾的字符串并替换为1# .*df <-df%>%mutate_all(funs(str_replace_all(., '.*job$', '1')))# 重新排列列df <- df%>% select(np_id,ends_with('work'),np_before_work:np_education,starts_with('hf_'),starts_with('ac_'), starts_with('cs_'), starts_with('se_'))# 将空格更改为 NAdf <- mutate_all(df, funs(na_if(.,"")))# 宽变长df1 <-pivot_longer(df,cols = -(starts_with("np_")),names_to = c(".value", "group"),names_sep = "_",values_drop_na = TRUE)

预期结果如下所示,hf_com、ac_com、curr_sal 和 exp_sal 有四列,值列分别为 hf_com_freq、ac_com_freq、curr_sal_freq 和 exp_sal_freq:

structure(list(np_id = c("81", "83", "85", "88", "1", "2"), np_look_work = c("yes","yes", "yes", "yes", "yes", "yes"), np_current_work = c("no", "yes","no", "no", "no", "no"), np_before_work = c("no", "NULL", "yes","yes", "yes", "yes"), np_keen_move = c("yes", "yes", "no", "no","no", "no"), np_city_size = c("village", "超过 50 万居民",超过 50 万居民"、村庄"、城市达 2 万居民"、"100k - 199k 居民"), np_gender = c("male", "female", "female","男", "女", "男"), np_age = c("18 - 24 岁", "18 - 24 岁",超过 50 年"、18 - 24 年"、31 - 40 年"、超过 50 年"), np_education = c("中学", "职业", "中学", "中学","secondary", "secondary"), hf_com = c("hf1", "hf1", "hf1", "hf1","hf1", "hf1"), hf_com_freq = c("", "", "", "1", "1", "1"), ac_com = c("ac1","ac1", "ac1", "ac1", "ac1", "ac1"), ac_com_freq = c("", "", "","", "", "1"), curr_sal = c("cs1", "cs1", "cs1", "cs1", "cs1","cs1"), curr_sal_freq = c("", "", "", "", "", ""), exp_sal = c("se1","se1", "se1", "se1", "se1", "se1"), exp_sal_freq = c("", "","1", "1", "", "")), row.names = c(NA, 6L), class = "data.frame")

实际结果是:

结构(列表(np_id = c(83",83",85",85",88",88"),np_look_work = c("yes", "yes", "yes", "yes", "yes", "yes"), np_current_work = c("yes", "yes", "no", "no", "no", "no"), np_before_work = c("NULL", "NULL", "yes", "yes", "yes","yes"), np_keen_move = c("yes", "yes", "no", "no", "no","no"), np_city_size = c("超过 50 万居民", "超过 50 万居民",超过 50 万居民",超过 50 万居民","村庄", "村庄"), np_gender = c("female", "female","女", "女", "男", "男"), np_age = c("18 - 24 岁",18 - 24 年"、超过 50 年"、超过 50 年"、"18 - 24 岁", "18 - 24 岁"), np_education = c("职业",职业"、中学"、中学"、中学"、中学"), group = c("2", "3", "1", "2", "1", "4"), hf = c("1", NA,NA, "1", "1", NA), ac = c("1", NA, NA, "1", NA, NA), cs = c("1",NA, NA, "1", NA, "1"), se = c(NA, "1", "1", NA, "1", NA)), class = c("tbl_df","tbl", "data.frame"), row.names = c(NA, -6L))

我能看到的问题是group列没有告诉我是hfaccs还是 根据我的需要.

解决方案

我找到了问题的答案:

pivot_longer - 将以hf"和ac"开头的宽格式列转换为单独列中的长格式

names_to 参数:

.value = 包含与原始列对应的单元格值的元数据

这些值以长格式旋转并添加到新列hf"和ac"中

列组"将原始列结尾(例如数字 1-6)转为长格式

names_pattern = regex 参数指定字符_",其中列名将被分解

df3 <- df %>%tidyr::pivot_longer(cols = c(starts_with("hf"),starts_with("ac"),starts_with("cs"),starts_with("se")),names_to = c(".value", "level"),names_pattern = "(.*)_(.*)")

I have code which converts from wide to long with gather but I have to do this column by column. I want to use pivot_longer to gather wide multiple columns with into multiple long columns rather than column by column.

For example, the columns hf_1, hf_2, hf_3, hf_4, hf_5, hf_6 need to be pivoted into 2 columns (hf_com - this column with values 1,2,3,4,5,6 from wide hf columns) and (hf_com_freq - this column with value 1).

The same needs to occur for the columns ac_1, ac_2, ac_3, ac_4, ac_5, ac_6. These columns need to be pivoted into 2 columns (ac_com - this column with values 1,2,3,4,5,6 from wide ac columns) and (ac_com_freq - this column with value 1).

The same needs to occur for the columns cs_1, cs_2, cs_3, cs_4, cs_5, cs_6, cs_7, cs_8. These columns need to be pivoted into 2 columns (cur_sal - this column with values 1,2,3,4,5,6,7,8 from wide cs columns) and (curr_sal_freq - this column with value 1).

The same needs to occur for the columns se_1, se_2, se_3, se_4, se_5, se_6, se_7, se_8. These columns need to be pivoted into 2 columns (exp_sal - this column with values 1,2,3,4,5,6,7,8 from wide se columns) and (exp_sal_freq - this column with value 1)

I have tried looking at:

Gather multiple sets of columns

and:

Reshaping multiple sets of measurement columns (wide format) into single columns (long format)

but I can't get the multiple longer columns that I need.

Here is the input data:

structure(list(np_id = c("81", "83", "85", "88", "1", "2"), np_look_work = c("yes", 
"yes", "yes", "yes", "yes", "yes"), np_current_work = c("no", 
"yes", "no", "no", "no", "no"), np_before_work = c("no", "NULL", 
"yes", "yes", "yes", "yes"), np_keen_move = c("yes", "yes", "no", 
"no", "no", "no"), np_city_size = c("village", "more than 500k inhabitants", 
"more than 500k inhabitants", "village", "city up to 20k inhabitants", 
"100k - 199k inhabitants"), np_gender = c("male", "female", "female", 
"male", "female", "male"), np_age = c("18 - 24 years", "18 - 24 years", 
"more than 50 years", "18 - 24 years", "31 - 40 years", "more than 50 years"
), np_education = c("secondary", "vocational", "secondary", "secondary", 
"secondary", "secondary"), hf_1 = c(NA, NA, NA, "1", "1", "1"
), hf_2 = c(NA, "1", "1", NA, NA, NA), hf_3 = c(NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_
), hf_4 = c(NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_), hf_5 = c(NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_), 
    hf_6 = c(NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_), ac_1 = c(NA, NA, NA, NA, NA, 
    "1"), ac_2 = c(NA, "1", "1", NA, "1", NA), ac_3 = c(NA, NA, 
    NA, NA, "1", NA), ac_4 = c(NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_
    ), ac_5 = c(NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_), ac_6 = c(NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_), cs_1 = c(NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_), cs_2 = c(NA, 
    "1", "1", NA, "1", NA), cs_3 = c(NA, NA, NA, NA, NA, "1"), 
    cs_4 = c(NA, NA, NA, "1", NA, NA), cs_5 = c(NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_), cs_6 = c(NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_), cs_7 = c(NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_), cs_8 = c(NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_), se_1 = c(NA, 
    NA, "1", "1", NA, NA), se_2 = c(NA, NA, NA, NA, "1", NA), 
    se_3 = c(NA, "1", NA, NA, "1", "1"), se_4 = c(NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_), se_5 = c(NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_), se_6 = c(NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_), se_7 = c(NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_), se_8 = c(NA, 
    NA, NA, "1", NA, NA)), row.names = c(NA, 6L), class = "data.frame")

Here is the code I tried:

# Read CSV into R
###
MyData <- read.csv(file="task.csv", header=TRUE, sep=",", stringsAsFactors = F)
# str(MyData)

# set as general option that Strings are not factors
options(stringsAsFactors = FALSE)

# save all column names
old_names1 <- colnames(MyData)
new_names1 = c("np_id","np_look_work", "np_current_work",
              "hf_1", "ac_1", 
              "hf_2", "ac_2",
              "hf_3", "ac_3", 
              "hf_4", "ac_4",
              "hf_5", "ac_5",
              "hf_6", "ac_6",
              "cs_1", "se_1",
              "cs_2", "se_2",
              "cs_3", "se_3",
              "cs_4", "se_4",
              "cs_5", "se_5",
              "cs_6", "se_6",
              "cs_7", "se_7",
              "cs_8", "se_8",
              "np_before_work",
              "np_keen_move",
              "np_city_size",
              "np_gender",
              "np_age",
              "np_education")

# rename columns with old_names and new_names vectors
df <- MyData %>% rename_at(vars(old_names1), ~ new_names1)

# filter rows with yes for Column look_job
# code needs dplyr::filter to ensure bas filter not used
df <- dplyr::filter(df, np_look_work == "yes")

# match all strings ending with job and replace with 1
# .* 
df <- df %>% 
  mutate_all(funs(str_replace_all(., '.*job$', '1')))
# reorder columns
df <- df %>% select(np_id, ends_with('work'), np_before_work:np_education, starts_with('hf_'), 
                     starts_with('ac_'), starts_with('cs_'), starts_with('se_'))


# change the blanks to NA
df <- mutate_all(df, funs(na_if(.,"")))

# change wide to long
df1 <- pivot_longer(df, 
             cols = -(starts_with("np_")), 
             names_to = c(".value", "group"),
             names_sep = "_",
             values_drop_na = TRUE)

The expected results would look like this, having four columns for hf_com, ac_com, curr_sal, and exp_sal and the value columns being hf_com_freq, ac_com_freq, curr_sal_freq, and exp_sal_freq:

structure(list(np_id = c("81", "83", "85", "88", "1", "2"), np_look_work = c("yes", 
"yes", "yes", "yes", "yes", "yes"), np_current_work = c("no", "yes", 
"no", "no", "no", "no"), np_before_work = c("no", "NULL", "yes", 
"yes", "yes", "yes"), np_keen_move = c("yes", "yes", "no", "no", 
"no", "no"), np_city_size = c("village", "more than 500k inhabitants", 
"more than 500k inhabitants", "village", "city up to 20k inhabitants", 
"100k - 199k inhabitants"), np_gender = c("male", "female", "female", 
"male", "female", "male"), np_age = c("18 - 24 years", "18 - 24 years", 
"more than 50 years", "18 - 24 years", "31 - 40 years", "more than 50 years"
), np_education = c("secondary", "vocational", "secondary", "secondary", 
"secondary", "secondary"), hf_com = c("hf1", "hf1", "hf1", "hf1", 
"hf1", "hf1"), hf_com_freq = c("", "", "", "1", "1", "1"), ac_com = c("ac1", 
"ac1", "ac1", "ac1", "ac1", "ac1"), ac_com_freq = c("", "", "", 
"", "", "1"), curr_sal = c("cs1", "cs1", "cs1", "cs1", "cs1", 
"cs1"), curr_sal_freq = c("", "", "", "", "", ""), exp_sal = c("se1", 
"se1", "se1", "se1", "se1", "se1"), exp_sal_freq = c("", "", 
"1", "1", "", "")), row.names = c(NA, 6L), class = "data.frame")

The actual results are:

structure(list(np_id = c("83", "83", "85", "85", "88", "88"), 
    np_look_work = c("yes", "yes", "yes", "yes", "yes", "yes"
    ), np_current_work = c("yes", "yes", "no", "no", "no", "no"
    ), np_before_work = c("NULL", "NULL", "yes", "yes", "yes", 
    "yes"), np_keen_move = c("yes", "yes", "no", "no", "no", 
    "no"), np_city_size = c("more than 500k inhabitants", "more than 500k inhabitants", 
    "more than 500k inhabitants", "more than 500k inhabitants", 
    "village", "village"), np_gender = c("female", "female", 
    "female", "female", "male", "male"), np_age = c("18 - 24 years", 
    "18 - 24 years", "more than 50 years", "more than 50 years", 
    "18 - 24 years", "18 - 24 years"), np_education = c("vocational", 
    "vocational", "secondary", "secondary", "secondary", "secondary"
    ), group = c("2", "3", "1", "2", "1", "4"), hf = c("1", NA, 
    NA, "1", "1", NA), ac = c("1", NA, NA, "1", NA, NA), cs = c("1", 
    NA, NA, "1", NA, "1"), se = c(NA, "1", "1", NA, "1", NA)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -6L))

The problem I can see is that the group column does not tell me if it is hf, ac, cs or se as I need.

解决方案

I have found the answer to my question:

pivot_longer - transforms the columns in wide format starting with 'hf' and 'ac' to long format in separate columns

names_to parameters:

.value = contains metadata on the cell values that correspond to the original columns

these values are pivoted in long format and added in a new columns "hf" and "ac"

column "group" has the original column endings (e.g. the numbers 1-6) pivoted to long format

names_pattern = regex argument specifying character "_" where column names are to be broken up

df3 <- df %>% 
  tidyr::pivot_longer(cols = c(starts_with("hf"), starts_with("ac"), starts_with("cs"), starts_with("se")),
                      names_to = c(".value", "level"), 
                      names_pattern = "(.*)_(.*)"

  )

这篇关于使用 pivot_longer 将宽列收集到多个长列中的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆