Canonical tidyverse method to update some values of a vector from a look-up table
Question
I frequently need to recode some (not all!) values in a data frame column based on a look-up table. I'm not satisfied with the ways I know of to solve the problem. I'd like to be able to do it in a clear, stable, and efficient way. Before I write my own function, I want to make sure I'm not duplicating something standard that's already out there.
## Toy example
data = data.frame(
  id = 1:7,
  x = c("A", "A", "B", "C", "D", "AA", ".")
)
lookup = data.frame(
  old = c("A", "D", "."),
  new = c("a", "d", "!")
)
## desired result
#   id  x
# 1  1  a
# 2  2  a
# 3  3  B
# 4  4  C
# 5  5  d
# 6  6 AA
# 7  7  !
I can do it with a join, a coalesce, and dropping the helper column, as below, but this isn't as clear as I'd like: too many steps.
## This works, but is more steps than I want
library(dplyr)
data %>%
  left_join(lookup, by = c("x" = "old")) %>%
  mutate(x = coalesce(new, x)) %>%
  select(-new)
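For comparison, the same join-then-coalesce logic can be written in base R with match(). This is a minimal sketch, not an established function from any package: the helper name recode_from_lookup and its arguments are hypothetical, and it assumes the lookup keys are unique.

```r
# Hypothetical helper (not part of any package): replace the values of `x`
# that appear in `old` with the corresponding entries of `new`.
recode_from_lookup <- function(x, old, new) {
  stopifnot(length(old) == length(new), !anyDuplicated(old))
  idx <- match(x, old)       # position of each x in `old`, NA if absent
  hit <- !is.na(idx)
  x[hit] <- new[idx[hit]]    # values without a match are left untouched
  x
}

data <- data.frame(
  id = 1:7,
  x = c("A", "A", "B", "C", "D", "AA", "."),
  stringsAsFactors = FALSE
)
lookup <- data.frame(
  old = c("A", "D", "."),
  new = c("a", "d", "!"),
  stringsAsFactors = FALSE
)

data$x <- recode_from_lookup(data$x, lookup$old, lookup$new)
print(data$x)
```

Inside a dplyr pipeline, the same helper could be called as mutate(x = recode_from_lookup(x, lookup$old, lookup$new)).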
It can also be done with dplyr::recode, as below, converting the lookup table to a named lookup vector. I prefer lookup as a data frame, but I'm okay with the named vector solution. My concern here is that recode is in the Questioning lifecycle phase, so I'm worried that this method isn't stable.
lookup_v = pull(lookup, new) %>% setNames(lookup$old)
data %>%
  mutate(x = recode(x, !!!lookup_v))
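If the stability of recode is the concern, a named lookup vector can also be applied with plain base R indexing. The following is a sketch of one alternative, not something the tidyverse documents as the canonical approach; the variable names are illustrative.

```r
x <- c("A", "A", "B", "C", "D", "AA", ".")

# Named lookup vector: names are the old values, elements are the new ones.
lookup_v <- c(A = "a", D = "d", "." = "!")

# Only touch the elements whose value appears among the lookup names.
# Character indexing with `[` matches names exactly, so "AA" is not affected.
hit <- x %in% names(lookup_v)
x[hit] <- unname(lookup_v[x[hit]])

print(x)
```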
It could also be done with, say, stringr::str_replace, but using regex for whole-string matching isn't efficient. I suppose forcats::fct_recode is a stable version of recode, but I don't want a factor output (though mutate(x = as.character(fct_recode(x, !!!lookup_v))) is perhaps my favorite option so far...).
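As a small illustration of the regex concern (added here, and using base R's sub() rather than stringr), an unanchored pattern matches inside strings, and a lookup key such as "." is a regex metacharacter. Exact comparison avoids both problems. The variable names are illustrative.

```r
x <- c("A", "AA", ".")

# Unanchored regex: "." matches any character, so every element is changed.
regex_result <- sub(".", "!", x)

# Anchoring and escaping gives whole-string matching, but each lookup
# entry then needs its own carefully built pattern.
anchored_result <- sub("^\\.$", "!", x)

# Exact comparison needs no pattern at all.
exact_result <- x
exact_result[exact_result == "."] <- "!"

print(regex_result)     # "!"  "!A" "!"
print(anchored_result)  # "A"  "AA" "!"
print(exact_result)     # "A"  "AA" "!"
```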
I had hoped that the new-ish rows_update() family of dplyr functions would work, but it is strict about column names, and I don't think it can update the column it's joining on. (And it's Experimental, so it doesn't yet meet my stability requirement.)
Summary of my requirements:
- A single data column is updated based on a lookup data frame (preferably) or named vector (allowable)
- Not all values in the data are included in the lookup; the ones that are not present are not modified
- Must work on character class input. Working more generally is a nice-to-have.
- No dependencies outside of base R and tidyverse packages (though I'd also be interested in seeing a data.table solution)
- No functions used that are in lifecycle phases like superseded or questioning. Please note any experimental lifecycle functions, as they have future potential.
- Concise, clear code
- I don't need extreme optimization, but nothing wildly inefficient (like regex when it's not needed)
Recommended answer
A direct data.table solution, without %in%.
Depending on the length of the lookup and data tables, adding keys could improve performance substantially, but that isn't the case for this simple example.
library(data.table)
setDT(data)
setDT(lookup)
## If needed
# setkey(data, x)
# setkey(lookup, old)
data[lookup, x := new, on = .(x = old)]
data
   id  x
1:  1  a
2:  2  a
3:  3  B
4:  4  C
5:  5  d
6:  6 AA
7:  7  !
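Note that := updates the table by reference, and setDT() converts the original data frames in place. If the original table needs to stay unchanged, one option is to run the update join on a copy. This is a minimal sketch, assuming data.table is installed; the object names original and updated are illustrative.

```r
library(data.table)

original <- data.table(id = 1:7, x = c("A", "A", "B", "C", "D", "AA", "."))
lookup <- data.table(old = c("A", "D", "."), new = c("a", "d", "!"))

# copy() makes a deep copy, so the update join below leaves `original` as is.
# The i. prefix makes explicit that `new` comes from the lookup table.
updated <- copy(original)[lookup, on = .(x = old), x := i.new]

print(updated$x)
print(original$x)
```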