Reconcile dataset *column types* (formats) using a dictionary/list in R/dplyr


Problem description

Following on from the renaming request #67453183, I want to do the same for column formats using the dictionary, because columns of distinct types cannot be brought together when the datasets are appended.

I have a series of data sets and a dictionary to bring these together, but I'm struggling to figure out how to automate this. Suppose this data and dictionary (the actual one is much longer, which is why I want to automate):


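library(dplyr) # provides %>%, rename(), tibble() and as_tibble()

mtcarsA <- mtcars[1:2,1:3] %>% rename(mpgA = mpg, cyl_A = cyl) %>% as_tibble()
mtcarsB <- mtcars[3:4,1:3] %>% rename(mpg_B = mpg, B_cyl = cyl) %>% as_tibble()
mtcarsB$B_cyl <- as.factor(mtcarsB$B_cyl) # deliberately give B a different type

# One row per target column: its name in each dataset and its target format
dic <- tibble(true_name  = c("mpg_true", "cyl_true"), 
              nameA = c("mpgA", "cyl_A"), 
              nameB = c("mpg_B", "B_cyl"),
              true_format = c("factor", "numeric")
)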

I want these datasets (from years A and B) appended to one another, and then to have the names changed or coalesced to the 'true_name' values.... I want to automate 'coalesce all columns with duplicate names'.

And to bring these together, the types need to be the same too. I'm giving the entire problem here because perhaps someone also has a better solution for 'using a data dictionary'.

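@ronakShah in the previous query proposed:

library(purrr)   # pmap(), flatten_chr(), map_df()
library(stringr) # str_replace_all()

# Build a named vector: regex "nameA|nameB" -> true_name
pmap(dic, ~setNames(..1, paste0(c(..2, ..3), collapse = '|'))) %>%
  flatten_chr() -> val

# Rename the columns of each dataset, then row-bind the results
mtcars_all <- list(mtcarsA,mtcarsB) %>%
  map_df(function(x) x %>% rename_with(~str_replace_all(.x, val)))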

This works great in the previous example, but not if the formats vary. Here it throws an error:

Error: Can't combine ..1$cyl_true <double> and ..2$cyl_true <factor<51fac>>.
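
The mismatch can also be spotted before binding by comparing column classes across the datasets. The following is only a minimal sketch on toy data: the tibbles `a` and `b` and the placeholder string returned on failure are illustrative assumptions, not part of the original question.

```r
library(dplyr)

# Toy stand-ins for the two renamed datasets (assumed for illustration).
a <- tibble(cyl_true = c(6, 6))
b <- tibble(cyl_true = factor(c(4, 6)))

# First class of the shared column in each dataset.
classes <- sapply(list(a = a, b = b), function(df) class(df$cyl_true)[1])
classes

# bind_rows() refuses to combine a double column with a factor column;
# catch the error and return a placeholder string instead.
result <- tryCatch(bind_rows(a, b), error = function(e) "incompatible types")
result
```

Checking the classes first makes it clear which columns need coercing before the datasets can be appended.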

This response to #56773354 offers a related solution if one has a complete list of types, but not for a type list by column name, as I have.

Desired output:

mtcars_all
# A tibble: 4 x 3
  mpg_true cyl_true   disp
  <factor> <numeric> <dbl>
1 21       6           160
2 21       6           160
3 22.8     4           108
4 21.4     6           258

Solution

Something simpler:

library(magrittr) # %<>% is cool
library(dplyr)

# The renaming is easy: build named vectors of the form new_name = old_name

renameA <- dic$nameA
renameB <- dic$nameB
names(renameA) <- dic$true_name
names(renameB) <- dic$true_name

mtcarsA %<>% rename(all_of(renameA))
mtcarsB %<>% rename(all_of(renameB))

# Formatting is a little harder:

formats <- dic$true_format
names(formats) <- dic$true_name

for (x in names(formats)) {
  # there's no nice programmatic way to do this, I think
  coercer <- switch(formats[[x]], 
                      factor = as.factor,
                      # go via character so a factor yields its labels, not its integer codes
                      numeric = function(v) as.numeric(as.character(v)),
                      # stop() rather than warning(): an unknown format leaves no usable coercer
                      stop("Unrecognized format: ", formats[[x]]) 
                    )
  mtcarsA[[x]] <- coercer(mtcarsA[[x]])
  mtcarsB[[x]] <- coercer(mtcarsB[[x]])
}

mtcars_all <- bind_rows(mtcarsA, mtcarsB)
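
For a longer dictionary or more than two datasets, the same idea can be wrapped in a function. This is only a sketch under stated assumptions: `coercers` and `harmonise()` are hypothetical names introduced here, not part of the original answer or of dplyr, and the example data is rebuilt from the question so the block runs on its own.

```r
library(dplyr)

# Rebuild the example data from the question.
mtcarsA <- mtcars[1:2, 1:3] %>% rename(mpgA = mpg, cyl_A = cyl) %>% as_tibble()
mtcarsB <- mtcars[3:4, 1:3] %>% rename(mpg_B = mpg, B_cyl = cyl) %>% as_tibble()
mtcarsB$B_cyl <- as.factor(mtcarsB$B_cyl)

dic <- tibble(true_name   = c("mpg_true", "cyl_true"),
              nameA       = c("mpgA", "cyl_A"),
              nameB       = c("mpg_B", "B_cyl"),
              true_format = c("factor", "numeric"))

# Hypothetical lookup table: one coercion function per format label.
coercers <- list(
  factor  = as.factor,
  # go via character so factor labels, not integer codes, are converted
  numeric = function(v) as.numeric(as.character(v))
)

# Hypothetical helper: rename one dataset's columns using the dictionary
# column `name_col`, then coerce each renamed column to its target format.
harmonise <- function(df, name_col, dic) {
  unknown <- setdiff(dic$true_format, names(coercers))
  if (length(unknown) > 0) {
    stop("Unrecognized format: ", paste(unknown, collapse = ", "))
  }
  new_names <- setNames(dic[[name_col]], dic$true_name)  # new_name = old_name
  df <- rename(df, all_of(new_names))
  for (i in seq_len(nrow(dic))) {
    col <- dic$true_name[i]
    df[[col]] <- coercers[[dic$true_format[i]]](df[[col]])
  }
  df
}

mtcars_all <- bind_rows(
  harmonise(mtcarsA, "nameA", dic),
  harmonise(mtcarsB, "nameB", dic)
)
mtcars_all
```

Supporting a further format (say, character) would then only require adding an entry to `coercers`, and a further dataset only another `harmonise()` call with its own dictionary column.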

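In the background, you should be aware of how base R concatenates factors: before R 4.1.0, c() on factors returned their integer codes, whereas from 4.1.0 onward it returns a factor whose levels are the union of the inputs' levels. Here it probably doesn't matter, because bind_rows uses the vctrs package to combine columns.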
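
Two factor behaviours are worth seeing concretely when coercing and combining columns like these. The snippet below is a small illustrative sketch with made-up values, not code from the original answer; it calls vctrs directly only to show the combining step in isolation.

```r
library(vctrs)

f <- factor(c("4", "6"))

# as.numeric() on a factor returns the internal integer codes, not the labels.
codes  <- as.numeric(f)                # 1 2
labels <- as.numeric(as.character(f))  # 4 6

# vctrs combines factors into a single factor whose levels are the union
# of the inputs' levels, so no values are lost.
combined <- vec_c(factor("21"), factor(c("22.8", "21.4")))
levels(combined)
as.character(combined)
```

This is why a factor that should become numeric is best converted via as.character() first, and why factor columns with different levels can still be row-bound safely.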
