Cleaning up factor levels (collapsing multiple levels/labels)


Question

What is the most effective (i.e. efficient and appropriate) way to clean up a factor containing multiple levels that need to be collapsed? That is, how can two or more factor levels be combined into one?

Here's an example where the two levels "Yes" and "Y" should be collapsed to "Yes", and "No" and "N" collapsed to "No":

## Given: 
x <- c("Y", "Y", "Yes", "N", "No", "H")   # The 'H' should be treated as NA

## expectedOutput
[1] Yes  Yes  Yes  No   No   <NA>
Levels: Yes No  # <~~ NOTICE ONLY **TWO** LEVELS


One option, of course, is to clean the strings beforehand using sub and friends.
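As a minimal sketch of that string-cleaning route (the variable name `cleaned` and the anchored patterns are my own choices for illustration, not something from the question), the raw values can be normalized with sub before the factor is built:

```r
x <- c("Y", "Y", "Yes", "N", "No", "H")

# Normalize the raw strings first. The patterns are anchored with ^ and $
# so that only the exact strings "Y" and "N" are rewritten.
cleaned <- sub("^Y$", "Yes", x)
cleaned <- sub("^N$", "No", cleaned)

# Values that are not listed in `levels` (here "H") become NA
x.f <- factor(cleaned, levels = c("Yes", "No"))
x.f
```

This works, but it needs one sub call per variant spelling, which is part of why a more direct mapping is attractive.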

Another method is to allow duplicate labels, then drop them:

## Duplicate levels ==> "Warning: deprecated"
x.f <- factor(x, levels=c("Y", "Yes", "No", "N"), labels=c("Yes", "Yes", "No", "No"))

## the above line can be wrapped in either of the next two lines
factor(x.f)      
droplevels(x.f) 

However, is there a more effective way?

While I know that the levels and labels arguments should be vectors, I experimented with lists, named lists, and named vectors to see what happens. Needless to say, none of the following got me any closer to my goal.

  factor(x, levels=list(c("Yes", "Y"), c("No", "N")), labels=c("Yes", "No"))
  factor(x, levels=c("Yes", "No"), labels=list(c("Yes", "Y"), c("No", "N")))

  factor(x, levels=c("Y", "Yes", "No", "N"), labels=c(Y="Yes", Yes="Yes", No="No", N="No"))
  factor(x, levels=c("Y", "Yes", "No", "N"), labels=c(Yes="Y", Yes="Yes", No="No", No="N"))
  factor(x, levels=c("Yes", "No"), labels=c(Y="Yes", Yes="Yes", No="No", N="No"))

Recommended answer

UPDATE 2: See Uwe's answer, which shows the new "tidyverse" way of doing this, which is quickly becoming the standard.

UPDATE 1: Duplicated labels (but not levels!) are now indeed allowed (per my comment above); see Tim's answer.
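A small sketch of what that looks like in practice, assuming a reasonably recent R (to my knowledge duplicated labels have been accepted since R 3.4.0): several input values are mapped to the same label in a single factor call, and the result has only the distinct labels as levels.

```r
x <- c("Y", "Y", "Yes", "N", "No", "H")

# The levels must be unique, but the labels may repeat, so several
# input values can be mapped onto the same output level directly.
x.f <- factor(x,
              levels = c("Y", "Yes", "No", "N"),
              labels = c("Yes", "Yes", "No", "No"))

levels(x.f)  # only the distinct labels remain as levels
x.f          # "H" is not among the levels, so it becomes NA
```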

ORIGINAL ANSWER, BUT STILL USEFUL AND OF INTEREST: There is a little-known option to pass a named list to the levels function, for exactly this purpose. The names of the list should be the desired level names, and the elements should be the current names that should be renamed. Any existing level not mentioned in the list (such as "H" below) becomes NA. Some (including the OP, see Ricardo's comment on Tim's answer) prefer this for ease of reading.

x <- c("Y", "Y", "Yes", "N", "No", "H", NA)
x <- factor(x)
levels(x) <- list("Yes"=c("Y", "Yes"), "No"=c("N", "No"))
x
## [1] Yes  Yes  Yes  No   No   <NA>  <NA>
## Levels: Yes No

This is mentioned in the levels documentation; also see the examples there.

value: For the 'factor' method, a vector of character strings with length at least the number of levels of 'x', or a named list specifying how to rename the levels.

This can also be done in one line, as Marek does here: https://stackoverflow.com/a/10432263/210673; the levels<- sorcery is explained here: https://stackoverflow.com/a/10491881/210673.

> `levels<-`(factor(x), list(Yes=c("Y", "Yes"), No=c("N", "No")))
[1] Yes  Yes  Yes  No   No   <NA>
Levels: Yes No
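To make the one-liner a little less magical, here is a short sketch (the names `f` and `g` are just for illustration). Calling the replacement function `levels<-` directly, like an ordinary function, returns a modified copy rather than assigning in place, so the original factor is left untouched:

```r
x <- c("Y", "Y", "Yes", "N", "No", "H")
f <- factor(x)

# Calling the replacement function by name returns a new factor...
g <- `levels<-`(f, list(Yes = c("Y", "Yes"), No = c("N", "No")))

# ...while the original keeps all of its levels
levels(f)  # "H" "N" "No" "Y" "Yes"
levels(g)  # "Yes" "No"
```

The usual form, levels(f) <- list(...), does the same work and then assigns the result back to f.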
