How to recode dataframe values to keep only those that belong to a certain set, and replace the others with "other"

Problem description

I'm looking for a concise solution, preferably using dplyr, to clean up the values in a dataframe column: values that match a certain set should be kept as they are, and every value that doesn't match should be recoded as "other".

I have a dataframe with names of animals. There are 4 legitimate animal names, but the other rows contain gibberish rather than names. I want to clean up the column so that only the legitimate animal names are kept: zebra, lion, cow, or cat.

library(tidyverse)
library(stringi)

real_animals_names <- sample(c("zebra", "cow", "lion", "cat"), size = 50, replace = TRUE)
gibberish <- do.call(paste0, Map(stri_rand_strings, n = 50, length=c(5, 4, 1),
                                 pattern = c('[a-z]', '[0-9]', '[A-Z]')))

# Shuffle the 50 real names together with the 50 gibberish strings
df <- tibble(animals = sample(c(real_animals_names, gibberish)))

> df

## # A tibble: 100 x 1
##    animals   
##    <chr>     
##  1 zebra     
##  2 zebra     
##  3 rbzal0677O
##  4 lion      
##  5 cat       
##  6 cfsgt0504G
##  7 cat       
##  8 jhixe2566V
##  9 lion      
## 10 zebra     
## # ... with 90 more rows

One way to solve it, which I find tedious and not concise enough

Using dplyr 1.0.2

df %>%
  mutate(across(animals, recode,
                "lion" = "lion",
                "zebra" = "zebra",
                "cow" = "cow",
                "cat" = "cat",
                .default = "other"))

This gets it done, but the code repeats each animal name twice, and I find it clunky. Is there a cleaner solution, preferably using dplyr?

Follow-up, given the answers suggested below

I like the readability of dplyr::recode but dislike having to repeat each animal name twice. Since the answers below use %in%, could I incorporate %in% into my own recode solution to make it simpler and more concise?

Recommended answer

A base R solution:

keep_names <- c('lion', 'zebra', 'cow', 'cat')

within(df, animals[!animals %in% keep_names] <- "other")
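As a minimal sketch of what this call does, here is the same idea on a tiny hand-made data frame. The `toy` and `cleaned` names are hypothetical and not part of the original answer. It also illustrates that within() returns a modified copy, so the result has to be assigned if you want to keep it.

```r
keep_names <- c("lion", "zebra", "cow", "cat")

# Hypothetical toy data: two legitimate names and two gibberish strings
toy <- data.frame(animals = c("zebra", "rbzal0677O", "cat", "cfsgt0504G"),
                  stringsAsFactors = FALSE)

# Overwrite every value that is NOT in keep_names; within() returns a copy
cleaned <- within(toy, animals[!animals %in% keep_names] <- "other")

cleaned$animals  # "zebra" "other" "cat" "other"
toy$animals      # unchanged: "zebra" "rbzal0677O" "cat" "cfsgt0504G"
```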

A dplyr option with replace():

library(tidyverse)

df %>%
  mutate(animals = replace(animals, !animals %in% keep_names, "other"))
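The same condition can also be stated positively with dplyr::if_else(), which some readers may find easier to scan than the negated replace() call. This is an alternative sketch rather than part of the original answer, and the `toy` tibble is a hypothetical stand-in for `df`.

```r
library(dplyr)

keep_names <- c("lion", "zebra", "cow", "cat")

# Hypothetical toy data standing in for df
toy <- tibble(animals = c("zebra", "rbzal0677O", "cat", "cfsgt0504G"))

# Keep the value when it is in keep_names, otherwise use "other"
cleaned <- toy %>%
  mutate(animals = if_else(animals %in% keep_names, animals, "other"))

cleaned$animals  # "zebra" "other" "cat" "other"
```

Unlike base ifelse(), if_else() checks that both branches have compatible types, which is a small safeguard when the column is character.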

With recode(), you can build a named character vector and splice it into the call with !!! (unquote-splice), so each name only has to be typed once.

df %>%
  mutate(animals = recode(animals, !!!set_names(keep_names), .default = "other"))

Note: set_names(keep_names) is equivalent to setNames(keep_names, keep_names). set_names() comes from purrr, which is loaded with the tidyverse.
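To make the splicing step more concrete, the sketch below builds the lookup vector with base setNames() and applies recode() to a plain character vector. The `lookup`, `x`, and `recoded` names are hypothetical and used only for illustration.

```r
library(dplyr)

keep_names <- c("lion", "zebra", "cow", "cat")

# Each element is named after itself: c(lion = "lion", zebra = "zebra", ...)
lookup <- setNames(keep_names, keep_names)

# Hypothetical input vector mixing legitimate names and gibberish
x <- c("zebra", "rbzal0677O", "cat", "lion", "cfsgt0504G", "cow")

# !!! expands lookup into "lion" = "lion", "zebra" = "zebra", ... arguments
recoded <- recode(x, !!!lookup, .default = "other")

recoded  # "zebra" "other" "cat" "lion" "other" "cow"
```

After splicing, the call is the same as the hand-written version in the question, which is why the result matches it while each name is typed only once.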
