How to recode dataframe values to keep only those that match a certain set, and replace the others with "other"
Question
I'm looking for a concise solution, preferably using dplyr, to clean up values in a dataframe column: values that match a certain set should be kept as they are, and values that don't match should be recoded as "other".
I have a dataframe with names of animals. There are 4 legit animal names, but other rows contain gibberish rather than names. I want to clean the column up, to keep only the legit animal names: zebra, lion, cow, or cat.
library(tidyverse)
library(stringi)
real_animals_names <- sample(c("zebra", "cow", "lion", "cat"), size = 50, replace = TRUE)
gibberish <- do.call(paste0, Map(stri_rand_strings, n = 50, length = c(5, 4, 1),
                                 pattern = c('[a-z]', '[0-9]', '[A-Z]')))
# Mix the 50 real names with the 50 gibberish strings in random order
df <- tibble(animals = sample(c(real_animals_names, gibberish)))
> df
## # A tibble: 100 x 1
## animals
## <chr>
## 1 zebra
## 2 zebra
## 3 rbzal0677O
## 4 lion
## 5 cat
## 6 cfsgt0504G
## 7 cat
## 8 jhixe2566V
## 9 lion
## 10 zebra
## # ... with 90 more rows
One way to get this done, which I find tedious and not concise enough
Using dplyr 1.0.2
df %>%
  mutate(across(animals, recode,
                "lion" = "lion",
                "zebra" = "zebra",
                "cow" = "cow",
                "cat" = "cat",
                .default = "other"))
This gets it done, but this code repeats each animal name twice, and I find it clunky. Is there a cleaner solution, preferably using dplyr?
Edit, following the answers suggested below
Since I do like the readability of dplyr::recode, but dislike having to repeat each animal name twice, and since the answers below use %in%: could I incorporate %in% in my own recode solution to make it simpler and more concise?
Recommended answer
A base solution:
keep_names <- c('lion', 'zebra', 'cow', 'cat')
within(df, animals[!animals %in% keep_names] <- "other")
A dplyr option with replace():
library(tidyverse)
df %>%
mutate(animals = replace(animals, !animals %in% keep_names, "other"))
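As a quick sanity check of the replace() approach, here is a minimal sketch on a small hand-written tibble. The `toy` data and the `if_else()` variant are illustrative assumptions and not part of the original answer; `if_else()` is simply another dplyr function that expresses the same "keep if in the set, otherwise other" rule.

```r
library(dplyr)

# Hypothetical toy data, used instead of the question's random sample
toy <- tibble(animals = c("zebra", "rbzal0677O", "lion", "cfsgt0504G", "cat"))
keep_names <- c("lion", "zebra", "cow", "cat")

# Approach from the answer: overwrite the non-matching positions
replace_result <- toy %>%
  mutate(animals = replace(animals, !animals %in% keep_names, "other"))

# Alternative phrasing (an assumption, not from the answer): choose per element
if_else_result <- toy %>%
  mutate(animals = if_else(animals %in% keep_names, animals, "other"))

print(replace_result)
print(identical(replace_result$animals, if_else_result$animals))
```

Both versions leave the legitimate names untouched and turn the two gibberish rows into "other"; which one reads better is largely a matter of taste.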
With recode(), you can use a named character vector for unquote splicing with !!!.
df %>%
mutate(animals = recode(animals, !!!set_names(keep_names), .default = "other"))
Note: set_names(keep_names) is equivalent to setNames(keep_names, keep_names).
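To make the splicing step more concrete, the sketch below builds the named lookup vector and compares the spliced recode() call with the fully spelled-out one from the question. The vector `x` and the names `lookup`, `spliced`, and `spelled_out` are hypothetical, chosen only for illustration; set_names() is loaded here from purrr, which the tidyverse attaches.

```r
library(dplyr)
library(purrr)  # provides set_names()

keep_names <- c("lion", "zebra", "cow", "cat")

# Each element is named after itself: lion = "lion", zebra = "zebra", ...
lookup <- set_names(keep_names)
print(lookup)
print(identical(lookup, setNames(keep_names, keep_names)))

# Hypothetical input with one value outside the allowed set
x <- c("zebra", "rbzal0677O", "cow")

# !!! splices the named vector into recode() as individual name = value pairs
spliced <- recode(x, !!!lookup, .default = "other")

# The same call written out by hand, as in the question
spelled_out <- recode(x, lion = "lion", zebra = "zebra", cow = "cow",
                      cat = "cat", .default = "other")

print(spliced)
print(identical(spliced, spelled_out))
```

The spliced form keeps the readability of recode() while each animal name is written only once, in `keep_names`.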