Cleaning up factor levels (collapsing multiple levels/labels)


Question

What is the most effective (i.e. efficient/appropriate) way to clean up a factor containing multiple levels that need to be collapsed? That is, how can two or more factor levels be combined into one?

Here's an example where the two levels "Yes" and "Y" should be collapsed to "Yes", and "No" and "N" collapsed to "No":

## Given: 
x <- c("Y", "Y", "Yes", "N", "No", "H")   # The 'H' should be treated as NA

## expectedOutput
[1] Yes  Yes  Yes  No   No   <NA>
Levels: Yes No  # <~~ NOTICE ONLY **TWO** LEVELS

One option is, of course, to clean the strings beforehand using sub and friends.
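
As a rough illustration of the string-cleaning route, the sketch below rewrites the short forms with anchored regular expressions before building the factor. The variable names cleaned and x.f are only illustrative, and "H" becomes NA simply because it is not listed in levels.

```r
x <- c("Y", "Y", "Yes", "N", "No", "H")

# Anchor the patterns so that "Yes" and "No" are left untouched
cleaned <- sub("^Y$", "Yes", x)
cleaned <- sub("^N$", "No", cleaned)

# Values not listed in 'levels' (here "H") are turned into NA
x.f <- factor(cleaned, levels = c("Yes", "No"))
x.f
```

This works, but it needs one sub call per spelling variant, which is part of why a level-based approach is attractive.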

Another method is to allow duplicate labels, then drop them:

## Duplicate levels ==> "Warning: deprecated"
x.f <- factor(x, levels=c("Y", "Yes", "No", "N"), labels=c("Yes", "Yes", "No", "No"))

## the above line can be wrapped in either of the next two lines
factor(x.f)      
droplevels(x.f) 

However, is there a more effective way?

While I know that the levels and labels arguments should be vectors, I experimented with lists, named lists, and named vectors to see what would happen. Needless to say, none of the following got me any closer to my goal.

  factor(x, levels=list(c("Yes", "Y"), c("No", "N")), labels=c("Yes", "No"))
  factor(x, levels=c("Yes", "No"), labels=list(c("Yes", "Y"), c("No", "N")))

  factor(x, levels=c("Y", "Yes", "No", "N"), labels=c(Y="Yes", Yes="Yes", No="No", N="No"))
  factor(x, levels=c("Y", "Yes", "No", "N"), labels=c(Yes="Y", Yes="Yes", No="No", No="N"))
  factor(x, levels=c("Yes", "No"), labels=c(Y="Yes", Yes="Yes", No="No", N="No"))

Recommended answer

UPDATE 2: See Uwe's answer, which shows the new "tidyverse" way of doing this, which is quickly becoming the standard.

UPDATE 1: Duplicated labels (but not levels!) are now indeed allowed (per my comment above); see Tim's answer.
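
A minimal sketch of what UPDATE 1 refers to, assuming a version of R recent enough to accept duplicated labels (to my knowledge, 3.4.0 or later). It is the same call as in the question, but the repeated labels are merged directly, so no extra factor() or droplevels() step should be needed.

```r
x <- c("Y", "Y", "Yes", "N", "No", "H")

# Four distinct levels map onto two labels; "H" is not a level, so it becomes NA
x.f <- factor(x,
              levels = c("Y", "Yes", "No", "N"),
              labels = c("Yes", "Yes", "No", "No"))
x.f
levels(x.f)
```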

ORIGINAL ANSWER, BUT STILL USEFUL AND OF INTEREST: There is a little-known option to pass a named list to the levels function (in its replacement form, levels(x) <- value), for exactly this purpose. The names of the list should be the desired names of the levels, and the elements should be the current names that are to be renamed. Some (including the OP, see Ricardo's comment on Tim's answer) prefer this for ease of reading.

x <- c("Y", "Y", "Yes", "N", "No", "H", NA)
x <- factor(x)
levels(x) <- list("Yes"=c("Y", "Yes"), "No"=c("N", "No"))
x
## [1] Yes  Yes  Yes  No   No   <NA>  <NA>
## Levels: Yes No

As mentioned in the levels documentation; also see the examples there.

value: For the 'factor' method, a vector of character strings with length at least the number of levels of 'x', or a named list specifying how to rename the levels.
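
To make the two forms of value described in that documentation excerpt concrete, here is a small sketch that applies both to the same factor. The names f, g, and h are illustrative only, and the levels are given explicitly so the example does not depend on locale-specific sort order.

```r
f <- factor(c("Y", "Yes", "N", "No"), levels = c("Y", "Yes", "N", "No"))

# Form 1: a character vector, one new name per existing level (by position);
# repeated names merge the corresponding levels
g <- f
levels(g) <- c("Yes", "Yes", "No", "No")

# Form 2: a named list, new name = vector of old names
h <- f
levels(h) <- list(Yes = c("Y", "Yes"), No = c("N", "No"))

levels(g)
levels(h)
```

The character-vector form depends on knowing the order of the existing levels, whereas the named list states the mapping explicitly, which is probably why some find it easier to read.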

This can also be done in one line, as Marek does here: https://stackoverflow.com/a/10432263/210673; the levels<- sorcery is explained here: https://stackoverflow.com/a/10491881/210673.

> `levels<-`(factor(x), list(Yes=c("Y", "Yes"), No=c("N", "No")))
[1] Yes  Yes  Yes  No   No   <NA>
Levels: Yes No
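
One point worth illustrating about the one-liner: calling `levels<-` as an ordinary function returns a modified copy and leaves its argument alone. The sketch below (with the illustrative names f and collapsed) shows this on the data from the question.

```r
x <- c("Y", "Y", "Yes", "N", "No", "H")
f <- factor(x)

# Called as a regular function, `levels<-` returns a new factor
collapsed <- `levels<-`(f, list(Yes = c("Y", "Yes"), No = c("N", "No")))

nlevels(f)          # still 5: the original factor is not modified
levels(collapsed)   # only the two collapsed levels remain
collapsed
```

Levels that are not mentioned in the list, such as "H" here, end up as NA in the result.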
