Are factors stored more efficiently in data.table than characters?

Question

I thought I had read somewhere (I can't remember where) that factors are not actually more efficient than character vectors in data.table. Is this true? I am debating whether to continue using factors to store various vectors in data.table. An approximate test with object.size seems to indicate otherwise.

library(data.table)

chars <- data.table(a = sample(letters, 1e5, TRUE))           # chars (not really)
string <- data.table(a = sample(state.name, 1e5, TRUE))       # strings
fact <- data.table(a = factor(sample(letters, 1e5, TRUE)))    # factor
int <- data.table(a = sample(1:26, 1e5, TRUE))                # int

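mbs <- function(...) {
    # Capture the argument names as strings, then look the objects up by name
    ns <- sapply(match.call(expand.dots=TRUE)[-1L], deparse)
    vals <- mget(ns, .GlobalEnv)
    # Report each object's size in MB
    cat('Sizes:\n',
        paste('\t', ns, ':', round(sapply(vals, object.size)/1024/1024, 3), 'MB\n'))
}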

## Get approximate sizes?
mbs(chars, string, fact, int)
# Sizes:
#    chars : 0.765 MB
#    string : 0.766 MB
#    fact : 0.384 MB
#    int : 0.382 MB

Recommended answer

You may be remembering data.table FAQ 2.17, which contains:

stringsAsFactors is by default TRUE in data.frame but FALSE in data.table, for efficiency. Since a global string cache was added to R, character items are a pointer to the single cached string and there is no longer a performance benefit of converting to factor.

(That part was added to the FAQ in v1.8.2 in July 2012.)
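
A quick check of the default described in the FAQ entry might look like the sketch below. It assumes the data.table package is installed; the column name a and the sample values are made up for illustration. stringsAsFactors = TRUE is passed explicitly to data.frame because its default changed to FALSE in R 4.0.0.

```r
library(data.table)

vals <- c("x", "y", "x")   # illustrative values, not from the original post

# data.table keeps strings as character by default
dt <- data.table(a = vals)

# data.frame converts to factor only when asked (the default depends on the R version)
df <- data.frame(a = vals, stringsAsFactors = TRUE)

class(dt$a)
class(df$a)
```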

Using character rather than factor helps a lot in tasks like stacking (rbindlist). A c() of two character vectors is just concatenation, whereas a c() of two factor columns needs to traverse and union the two sets of factor levels, which is harder to code and takes longer to execute.
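
To make the extra work concrete, here is a minimal base R sketch of what stacking two factors roughly involves compared with stacking two character vectors. It is only an illustration of the idea, not data.table's actual rbindlist implementation, and the variable names are hypothetical.

```r
a <- c("x", "y")
b <- c("y", "z")

# Character vectors: stacking is plain concatenation
stacked_chr <- c(a, b)

# Factors: each has its own levels, so the integer codes are not comparable
fa <- factor(a)   # levels: x, y
fb <- factor(b)   # levels: y, z

# Step 1: build the union of the two level sets
lv <- union(levels(fa), levels(fb))

# Step 2: remap each factor's codes onto positions in the combined levels
codes <- c(match(levels(fa), lv)[as.integer(fa)],
           match(levels(fb), lv)[as.integer(fb)])

# Step 3: rebuild a factor from the remapped codes
stacked_fct <- structure(codes, levels = lv, class = "factor")
```

The character case is a single step, while the factor case needs a pass over both level sets plus a remapping of every element.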

What you've noticed is a difference in RAM consumption on 64-bit machines. Factors are stored as an integer vector that looks up items in the levels. Type integer is 32-bit, even on 64-bit platforms. But pointers (which is what a character vector holds) are 64-bit on 64-bit machines. So a character column will use twice as much RAM as a factor column on a 64-bit machine. There is no difference on 32-bit. However, this cost will usually be outweighed by the simpler and faster instructions possible on a character vector. [Aside: since factors are integer they can't contain more than 2 billion unique strings. character columns don't have that limitation.]
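
The per-element sizes can be checked directly in base R. The sketch below is an illustration under the assumption of a low-cardinality column (26 distinct letters, as in the question); the variable names are hypothetical and the exact byte counts will vary by platform and R version.

```r
n <- 1e5
x_chr <- sample(letters, n, TRUE)   # character: one pointer per element
x_fct <- factor(x_chr)              # factor: one 32-bit integer per element, plus levels
x_int <- as.integer(x_fct)          # bare integer codes, for comparison

# Pointer width in bytes: 8 on a 64-bit build, 4 on a 32-bit build
ptr <- .Machine$sizeof.pointer

# Approximate sizes in bytes as reported by object.size
sizes <- sapply(list(chr = x_chr, fct = x_fct, int = x_int), object.size)

ptr
sizes
sizes["chr"] / sizes["fct"]   # expected to be close to ptr / 4 for a column like this
```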

It depends on what you're doing, but operations have been optimized for character in data.table, so that's what we advise. Basically it saves a hop (to the levels), and we can compare two character columns in different tables just by comparing the pointer values, without hopping at all, even to the global cache.

It depends on the cardinality of the column, too. Say the column has 1 million rows and contains 1 million unique strings. Storing it as a factor will need a 1 million item character vector for the levels, plus a 1 million item integer vector pointing to the levels' elements. That's (4+8)*1e6 bytes. A character vector, on the other hand, won't need the levels, so it's just 8*1e6 bytes. In both cases the global cache stores the 1 million unique strings in the same way, so that cost is incurred anyway. In this case, the character column will use less RAM than it would as a factor. Be careful to check that the memory tool used to calculate the RAM usage accounts for this appropriately.
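
The arithmetic for the all-unique case can be sketched in base R as follows. To keep it quick to run, this uses 1e5 rows rather than 1 million; the generated id strings and variable names are hypothetical. Note that object.size also counts the string storage itself, so the measured totals are larger than the pointer and integer estimates, but the ordering should be the same.

```r
n <- 1e5                                # scaled down from the 1 million in the text
ptr <- .Machine$sizeof.pointer          # 8 on a 64-bit build

# Back-of-the-envelope estimates, excluding the cached strings themselves
est_chr <- ptr * n                      # one pointer per row
est_fct <- (4 + ptr) * n                # one integer per row + one pointer per level

# Empirical check: every value is unique, so there are as many levels as rows
x_chr <- sprintf("id%06d", seq_len(n))
x_fct <- factor(x_chr)

measured <- sapply(list(chr = x_chr, fct = x_fct), object.size)

c(est_chr = est_chr, est_fct = est_fct)
measured
```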
