Is there any good reason for columns to be characters instead of factors?


Question

This might seem like a silly question, but after working with R for a couple of months I realised that I often find myself converting strings to factors because, for example, the tabulate function does not work on strings.

At this point I am contemplating simply always converting any string to a factor. But that raises the question: is there any reason not to (apart from carrying out operations on the string itself)?

Recommended answer

Factors have a dual representation: the label of each level, and the underlying integer encoding of that level. Which of these representations R uses in a given operation can be subtle and confusing.
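As a minimal sketch of the two representations (the vector `f` and its values are made up for illustration and are not part of the original answer), base R can show each view of the same factor:

```r
# Hypothetical example data, chosen only to illustrate the two views.
f <- factor(c("low", "high", "low"))

levels(f)        # the set of labels, in level order
as.integer(f)    # the underlying integer codes, one per element
as.character(f)  # the labels, one per element

# The two views are linked: indexing the levels by the codes
# reproduces the labels.
identical(levels(f)[as.integer(f)], as.character(f))
```

Which view a function picks up depends on the function, which is where the confusion tends to come from.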

One illustration of where this can be confusing is subsetting. Here's a named vector, a character vector, and a factor with default (alphabetically ordered) levels:

x = c(foo = 1, bar = 2)
y = c("bar", "foo")
z = factor(y)        # default levels are "bar", "foo", i.e., alphabetical

Subsetting x by y matches each character value against the names of x, but subsetting x by z uses the underlying integer codes of the levels, so the elements are selected by position.

> x[y]
bar foo 
  2   1 
> x[z]
foo bar 
  1   2 
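If the intent is to select by label, one option is to convert the factor back to character before subsetting. The sketch below rebuilds the same `x` and `z` so that it runs on its own; the names `by_code` and `by_label` are introduced here only for illustration.

```r
x <- c(foo = 1, bar = 2)
z <- factor(c("bar", "foo"))

# Subsetting by the factor itself uses its integer codes as positions.
by_code <- x[z]

# Converting to character first makes R match the labels against names(x).
by_label <- x[as.character(z)]

names(by_code)
names(by_label)
```

Writing `x[as.character(z)]` makes the intended lookup explicit, at the cost of one extra conversion.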

This can be made even more confusing because R can work in different locales (e.g., I am using the en_US locale, US English), and the collation (sort) order can differ between locales, so the default levels might differ from one locale to another.
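One way to avoid depending on the locale's sort order is to pass the levels explicitly when creating the factor. This is a sketch under that assumption rather than something the original answer prescribes; the level order chosen below is arbitrary and the variable names are illustrative.

```r
y <- c("bar", "foo")

# Default: levels come from sorting the unique values, so their
# order follows the collation rules of the current locale.
z_default <- factor(y)

# Explicit: the level order is fixed by hand and does not depend on sorting.
z_explicit <- factor(y, levels = c("foo", "bar"))

levels(z_explicit)
as.integer(z_explicit)

# The collation setting currently in effect can be inspected with:
Sys.getlocale("LC_COLLATE")
```

With explicit levels, the integer codes stay the same wherever the code is run.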
