Are factors stored more efficiently in data.table than characters?


Question

I thought I had read somewhere (can't remember where) that factors are not actually more efficient than character vectors in data.table. Is this true? I was debating whether to continue using factors to store various vectors in data.table. An approximate test with object.size seems to indicate otherwise.

chars <- data.table(a = sample(letters, 1e5, TRUE))           # chars (not really)
string <- data.table(a = sample(state.name, 1e5, TRUE))       # strings
fact <- data.table(a = factor(sample(letters, 1e5, TRUE)))    # factor
int <- data.table(a = sample(1:26, 1e5, TRUE))                # int

mbs <- function(...) {
    ns <- sapply(match.call(expand.dots=TRUE)[-1L], deparse)
    vals <- mget(ns, .GlobalEnv)
    cat('Sizes:\n',
        paste('\t', ns, ':', round(sapply(vals, object.size)/1024/1024, 3), 'MB\n'))
}

## Get approximate sizes?
mbs(chars, string, fact, int)
# Sizes:
#    chars : 0.765 MB
#    string : 0.766 MB
#    fact : 0.384 MB
#    int : 0.382 MB


Recommended answer

You may be remembering data.table FAQ 2.17, which contains:


stringsAsFactors is by default TRUE in data.frame but FALSE in data.table, for efficiency. Since a global string cache was added to R, characters items are a pointer to the single cached string and there is no longer a performance benefit of converting to factor.

(That part was added to the FAQ in v1.8.2 in July 2012.)

Using character rather than factor helps a lot in tasks like stacking (rbindlist). A c() of two character vectors is just a concatenation, whereas a c() of two factor columns needs to traverse and union the two sets of factor levels, which is harder to code and takes longer to execute.
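To make the difference concrete, here is a minimal base R sketch (not data.table's actual rbindlist implementation) of what stacking involves in each case. The variable names are illustrative, and the factor path shown is one simple way to union the levels, chosen for clarity rather than speed.

```r
# Two small inputs with partially overlapping values.
x <- c("a", "b")
y <- c("b", "c")

# Character vectors: stacking is a plain concatenation.
combined_chr <- c(x, y)

# Factors: each one has its own levels and its own integer codes,
# so the levels must be unioned and the codes re-mapped onto the union.
fx <- factor(x)
fy <- factor(y)
all_levels <- union(levels(fx), levels(fy))
combined_fct <- factor(c(as.character(fx), as.character(fy)),
                       levels = all_levels)

print(combined_chr)
print(levels(combined_fct))
print(as.integer(combined_fct))
```

Note that "b" has code 2 in fx but code 1 in fy, which is why the integer codes cannot simply be appended.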

What you've noticed is a difference in RAM consumption on 64bit machines. Factors are stored as an integer vector that indexes into the levels. Type integer is 32bit, even on 64bit platforms. But pointers (which is what a character vector holds) are 64bit on 64bit machines. So a character column will use twice as much RAM as a factor column on a 64bit machine. There is no difference on 32bit. However, this cost will usually be outweighed by the simpler and faster instructions possible on a character vector. [Aside: since factors are integer they can't contain more than 2 billion unique strings. Character columns don't have that limitation.]
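The roughly two-to-one ratio for a low-cardinality column can be checked with a short base R sketch. It assumes a 64bit build of R (8-byte pointers), uses object.size only as an approximation, and the variable names are illustrative.

```r
set.seed(1)
n <- 1e5

# Low cardinality: only 26 distinct values across 100,000 elements.
chr <- sample(letters, n, TRUE)
fct <- factor(chr)

# object.size reports approximate bytes; on a 64bit build each character
# element is an 8-byte pointer and each factor element a 4-byte integer.
size_chr <- as.numeric(object.size(chr))
size_fct <- as.numeric(object.size(fct))
ratio <- size_chr / size_fct

cat("pointer size (bytes):", .Machine$sizeof.pointer, "\n")
cat("character:", size_chr, "bytes\n")
cat("factor:   ", size_fct, "bytes\n")
cat("ratio:    ", round(ratio, 2), "\n")
```

With so few distinct values, the levels add almost nothing to the factor, so the per-element width dominates the total.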

It depends on what you're doing, but operations have been optimized for character in data.table, so that's what we advise. Basically it saves a hop (to the levels), and we can compare two character columns in different tables just by comparing the pointer values, without hopping at all, even to the global cache.

It depends on the cardinality of the column, too. Say the column is 1 million rows and contains 1 million unique strings. Storing it as a factor will need a 1 million element character vector for the levels plus a 1 million element integer vector pointing to the levels' elements. That's (4+8)*1e6 bytes. A character vector, on the other hand, won't need the levels, so it's just 8*1e6 bytes. In both cases the global cache stores the 1 million unique strings in the same way, so that cost is incurred anyway. In this case, the character column will use less RAM than if it were a factor. Be careful to check that the memory tool used to calculate the RAM usage accounts for this appropriately.
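The high-cardinality case can be sketched the same way. This base R example uses 1e5 rather than 1e6 unique strings to keep it quick, and assumes a 64bit build for the back-of-the-envelope estimates. The generated ids and variable names are illustrative only.

```r
n <- 1e5

# High cardinality: every element is a distinct string.
chr <- sprintf("id%06d", seq_len(n))
fct <- factor(chr)

# Estimates from the paragraph above, excluding the strings themselves:
# factor    = 4-byte integer codes + 8-byte pointers in the levels
# character = 8-byte pointers only
est_fct <- (4 + 8) * n
est_chr <- 8 * n

# object.size also counts the string data and object headers, so the
# absolute numbers are larger than the estimates; only the ordering
# (factor larger than character) is the point here.
size_chr <- as.numeric(object.size(chr))
size_fct <- as.numeric(object.size(fct))

cat("estimated factor bytes:   ", est_fct, "\n")
cat("estimated character bytes:", est_chr, "\n")
cat("object.size factor:       ", size_fct, "\n")
cat("object.size character:    ", size_chr, "\n")
```

When every value is unique, the levels duplicate the whole pointer vector, so the factor carries the integer codes as pure overhead.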
